Przetak: fewer weeds on the Web
Przetak is a library for checking whether a text contains
abusive or vulgar speech in Polish. While it is written in Go,
it can be used by programs written in many other languages
thanks to FFI (Foreign Function Interface).
Przetak is resilient to:
- replicating letters,
- spacing out the words,
- inserting non-letters between letters,
- homograph spoofing, i.e. replacing letters with similar characters.
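
To make these obfuscations concrete, here is a minimal sketch of one way to undo them before matching. It is an illustration only, not Przetak's actual algorithm: the normalize function and the tiny lookalikes table are hypothetical. It is also deliberately crude, since it merges legitimate double letters and glues all words together.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// lookalikes is a tiny, hypothetical table of look-alike characters.
// A realistic table would be far larger.
var lookalikes = map[rune]rune{
	'0': 'o',
	'1': 'i',
	'3': 'e',
	'4': 'a',
	'@': 'a',
	'$': 's',
}

// normalize lowercases text, maps look-alike characters to letters,
// drops everything that is not a letter, and collapses runs of the
// same letter into a single letter.
func normalize(text string) string {
	var b strings.Builder
	var prev rune
	for _, r := range strings.ToLower(text) {
		if mapped, ok := lookalikes[r]; ok {
			r = mapped
		}
		if !unicode.IsLetter(r) {
			continue // skip spaces, digits, and punctuation
		}
		if r == prev {
			continue // collapse replicated letters
		}
		b.WriteRune(r)
		prev = r
	}
	return b.String()
}

func main() {
	samples := []string{"chooolera", "c h o l e r a", "c.h.o.l.e.r.a", "ch0ler@"}
	for _, s := range samples {
		fmt.Printf("%q -> %q\n", s, normalize(s))
	}
}
```

All four sample inputs normalize to the same string, which a word list could then match.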
Also, thanks to its use of character 5-grams, it handles some
frequent misspellings and out-of-vocabulary words composed of
morphemes with an abusive or vulgar meaning.
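
As a rough illustration of what character 5-grams are, the sketch below splits a word into every run of five consecutive characters. The charNGrams helper is hypothetical and says nothing about how Przetak stores or scores its 5-grams.

```go
package main

import "fmt"

// charNGrams returns every run of n consecutive characters in s.
// It works on runes, so Polish letters such as ż or ł count as
// one character each.
func charNGrams(s string, n int) []string {
	runes := []rune(s)
	if n <= 0 || len(runes) < n {
		return nil
	}
	grams := make([]string, 0, len(runes)-n+1)
	for i := 0; i+n <= len(runes); i++ {
		grams = append(grams, string(runes[i:i+n]))
	}
	return grams
}

func main() {
	fmt.Println(charNGrams("żółwik", 5)) // prints [żółwi ółwik]
}
```

A misspelled or newly coined word often shares some 5-grams with known words, which is why matching on 5-grams rather than whole words can still catch it.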
Przetak took second place in the cyberbullying detection contest
at PolEval 2019.
Here is a paper about Przetak, and here are the slides from my
presentation at AI & NLP Workshop Day 2019.
Installation
First, get the package:
$ go get github.com/MarcinCiura/przetak
Then change to the ${GOPATH}/src/github.com/MarcinCiura/przetak
directory and run make to build the shared library. Depending on
your operating system, the shared library will be called:
- libprzetak.so on Linux,
- libprzetak.dylib on macOS,
- przetak.dll on Windows.
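
A program that loads the library at run time needs to pick the right file name. The sketch below does so from Go's runtime.GOOS value. The libraryName helper is hypothetical and not part of Przetak; only the three file names come from the list above.

```go
package main

import (
	"fmt"
	"runtime"
)

// libraryName maps a GOOS value to the shared library file name
// listed above. This helper is hypothetical, not part of Przetak.
func libraryName(goos string) (string, error) {
	switch goos {
	case "linux":
		return "libprzetak.so", nil
	case "darwin": // GOOS value for macOS
		return "libprzetak.dylib", nil
	case "windows":
		return "przetak.dll", nil
	}
	return "", fmt.Errorf("no known library name for %q", goos)
}

func main() {
	name, err := libraryName(runtime.GOOS)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(name)
}
```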
Usage
Przetak's evaluate() function returns an integer. Its bits with
values 1, 2, and 4 are set, respectively, if the input UTF-8
string contains:
- abusive words,
- vulgar words with negative connotations,
- vulgar words with positive connotations.
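
The returned integer can be decoded with bitwise AND. The sketch below does not call Przetak. It decodes sample integers, and the constant names and the describe helper are hypothetical; only the bit values 1, 2, and 4 come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical names for the documented bit values.
const (
	abusive        = 1
	vulgarNegative = 2
	vulgarPositive = 4
)

// describe lists the categories whose bits are set in result.
func describe(result int) string {
	var parts []string
	if result&abusive != 0 {
		parts = append(parts, "abusive")
	}
	if result&vulgarNegative != 0 {
		parts = append(parts, "vulgar (negative)")
	}
	if result&vulgarPositive != 0 {
		parts = append(parts, "vulgar (positive)")
	}
	if len(parts) == 0 {
		return "clean"
	}
	return strings.Join(parts, ", ")
}

func main() {
	// Sample values standing in for results of evaluate().
	for _, r := range []int{0, 1, 3, 6} {
		fmt.Printf("%d: %s\n", r, describe(r))
	}
}
```

Because the categories are independent bits, a single result can report several of them at once; for example, 3 means both abusive words and vulgar words with negative connotations.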
The examples directory shows how to use Przetak directly from Go
and from several other programming languages via FFI.
Author
Marcin Ciura
License
Przetak is licensed under the Apache License, Version 2.0.