stopwords

Version: v1.0.2 | Published: May 23, 2021 | License: BSD-2-Clause | Imports: 8 | Imported by: 0

Repository

github.com/anhcraft/stopwords

Documentation ¶

Overview ¶

Package stopwords removes stop words from text and allows you to customize the list of stop words.

It also contains several text comparison algorithms: the Levenshtein distance, which evaluates the difference between two strings, and Charikar's simhash, which generates a 64-bit fingerprint of a given document.

Index ¶

func Clean(content []byte, langCode string, cleanHTML bool) []byte
func CleanString(content string, langCode string, cleanHTML bool) string
func CompareSimhash(a uint64, b uint64) uint8
func DontStripDigits()
func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int
func LoadStopWordsFromFile(filePath string, langCode string, sep string)
func LoadStopWordsFromString(wordsList string, langCode string, sep string)
func OverwriteWordSegmenter(expression string)
func Simhash(content []byte, langCode string, cleanHTML bool) uint64

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Clean ¶

func Clean(content []byte, langCode string, cleanHTML bool) []byte

Clean removes superfluous spaces and stop words from a byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.

func CleanString ¶

func CleanString(content string, langCode string, cleanHTML bool) string

CleanString removes superfluous spaces and stop words from a string. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
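The cleaning steps described for Clean and CleanString can be illustrated with the standard library alone. This is a minimal sketch, not the package's implementation: the name cleanSketch, the tiny stop list, the tag-stripping expression, and the choice to lower-case the output are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// Hypothetical expressions; the package's own patterns may differ.
	tagRe  = regexp.MustCompile(`<[^>]*>`) // naive HTML tag matcher
	wordRe = regexp.MustCompile(`\p{L}+`)  // runs of Unicode letters
)

// Tiny illustrative stop list, not the package's data.
var stopWords = map[string]bool{"the": true, "a": true, "is": true, "of": true}

// cleanSketch strips tags and unescapes entities when cleanHTML is true,
// then drops stop words and joins the remaining words with single spaces.
func cleanSketch(content string, cleanHTML bool) string {
	if cleanHTML {
		content = tagRe.ReplaceAllString(content, " ")
		content = html.UnescapeString(content)
	}
	var kept []string
	for _, w := range wordRe.FindAllString(strings.ToLower(content), -1) {
		if !stopWords[w] {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	in := "<p>The   capital of France is&nbsp;Paris</p>"
	fmt.Println(cleanSketch(in, true)) // capital france paris
}
```

With the package itself, the equivalent call according to the signature above would be stopwords.CleanString(content, "en", true).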

func CompareSimhash ¶

func CompareSimhash(a uint64, b uint64) uint8

CompareSimhash calculates the Hamming distance between two 64-bit integers using the Kernighan method.
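For reference, the Kernighan method counts set bits by repeatedly clearing the lowest one. The sketch below applies it to the XOR of two values; hammingDistance is a hypothetical name, and the package's actual code may differ in detail.

```go
package main

import "fmt"

// hammingDistance counts the bit positions in which a and b differ.
// Kernighan's trick: x &= x - 1 clears the lowest set bit, so the loop
// runs once per differing bit instead of once per bit position.
func hammingDistance(a, b uint64) uint8 {
	x := a ^ b // bits set exactly where a and b differ
	var n uint8
	for x != 0 {
		x &= x - 1
		n++
	}
	return n
}

func main() {
	fmt.Println(hammingDistance(0b1011, 0b0001)) // 2
	fmt.Println(hammingDistance(0, ^uint64(0)))  // 64
}
```

A small distance between two simhash fingerprints suggests that the underlying documents are similar; the threshold to use depends on the application.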

func DontStripDigits ¶

func DontStripDigits()

DontStripDigits changes the behaviour of the default word segmenter so that characters in the 'Number, Decimal Digit' (Nd) Unicode category are treated as part of words.
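The effect of including the Nd category can be seen with two regular expressions from the standard regexp package. Both patterns below are assumptions chosen for illustration; the package's default segmenter expression is not shown in this documentation.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical segmenter expressions, not the package's defaults.
var (
	lettersOnly      = regexp.MustCompile(`\p{L}+`)        // letters only
	lettersAndDigits = regexp.MustCompile(`[\p{L}\p{Nd}]+`) // letters and decimal digits
)

// segment returns every run of characters matched by the segmenter.
func segment(re *regexp.Regexp, text string) []string {
	return re.FindAllString(text, -1)
}

func main() {
	text := "Go 1.11 added modules"
	fmt.Println(segment(lettersOnly, text))      // [Go added modules]
	fmt.Println(segment(lettersAndDigits, text)) // [Go 1 11 added modules]
}
```

A custom expression of the same kind is what OverwriteWordSegmenter accepts as its argument.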

func LevenshteinDistance ¶

func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int

LevenshteinDistance computes the Levenshtein distance between two contents. Before comparing, it removes superfluous spaces and stop words from both byte slices. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from the contents and HTML entities are unescaped.
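As a reminder of what the distance measures, here is a minimal two-row dynamic programming sketch. It compares runes and skips the cleaning step entirely; whether the package compares bytes, runes, or words is not stated here, so treat levenshtein and min3 as hypothetical helpers rather than the package's code.

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-rune insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1) // distances for the previous row
	curr := make([]int, len(rb)+1) // distances for the current row
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	fmt.Println(levenshtein("", "abc"))           // 3
}
```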

func LoadStopWordsFromFile ¶

func LoadStopWordsFromFile(filePath string, langCode string, sep string)

LoadStopWordsFromFile loads a list of stop words from a file. filePath is the full path to the file to be loaded. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).

func LoadStopWordsFromString ¶

func LoadStopWordsFromString(wordsList string, langCode string, sep string)

LoadStopWordsFromString loads a list of stop words from a string. wordsList is the string containing the stop words. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).
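To show how a separator-delimited list becomes a lookup structure, the sketch below splits a string on sep and stores the entries in a set. parseStopWords is a hypothetical helper, and the trimming and lower-casing it performs are assumptions; the package may normalize entries differently.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStopWords splits wordsList on sep and returns the non-empty
// entries as a set, trimmed and lower-cased for uniform lookups.
func parseStopWords(wordsList, sep string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, w := range strings.Split(wordsList, sep) {
		w = strings.ToLower(strings.TrimSpace(w))
		if w != "" {
			set[w] = struct{}{}
		}
	}
	return set
}

func main() {
	set := parseStopWords("a\nThe\n\nof", "\n")
	_, hasThe := set["the"]
	fmt.Println(len(set), hasThe) // 3 true
}
```

With the package, the corresponding call according to the signature above would be stopwords.LoadStopWordsFromString("a\nthe\nof", "en", "\n").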

func OverwriteWordSegmenter ¶

func OverwriteWordSegmenter(expression string)

OverwriteWordSegmenter replaces the default word segmenter with your own regular expression.

func Simhash ¶

func Simhash(content []byte, langCode string, cleanHTML bool) uint64

Simhash returns a 64-bit simhash representing the content. Before hashing, it removes superfluous spaces and stop words from the byte slice. langCode is a BCP 47 or ISO 639-1 language code; if it is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
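The general idea behind Charikar's simhash can be sketched as follows: hash every token to 64 bits, let each token vote on each bit position, and keep the bits where the vote is positive. The FNV-1a token hash, the uniform token weight, and the whitespace tokenization below are assumptions made for illustration; the package's feature extraction and hash function may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// simhashSketch builds a 64-bit fingerprint from tokens. Each token's hash
// adds +1 to a bit position when the bit is set and -1 when it is clear;
// the fingerprint keeps the positions whose total is positive.
func simhashSketch(tokens []string) uint64 {
	var votes [64]int
	for _, tok := range tokens {
		h := fnv.New64a()
		h.Write([]byte(tok)) // Write on a hash.Hash never returns an error
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(uint64(1)<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= uint64(1) << uint(i)
		}
	}
	return fp
}

func main() {
	a := simhashSketch(strings.Fields("quick brown fox jumps over lazy dog"))
	b := simhashSketch(strings.Fields("quick brown fox leaps over lazy dog"))
	fmt.Printf("%016x\n%016x\n", a, b)
}
```

Because similar token sets produce mostly the same votes, their fingerprints tend to differ in only a few bits, which is what CompareSimhash measures.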

Types ¶

This section is empty.
