stopwords

v1.0.2
Published: May 23, 2021 | License: BSD-2-Clause | Imports: 8 | Imported by: 0

Documentation

Overview

Package stopwords allows you to customize the list of stop words.

Package stopwords implements the Levenshtein distance algorithm to evaluate the difference between two strings.

Package stopwords implements Charikar's simhash algorithm to generate a 64-bit fingerprint of a given document.

Package stopwords contains various text comparison algorithms (Simhash, Levenshtein).

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Clean

func Clean(content []byte, langCode string, cleanHTML bool) []byte

Clean removes redundant spaces and stop words from a byte slice. langCode is a BCP 47 or ISO 639-1 language code; if the language is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.

func CleanString

func CleanString(content string, langCode string, cleanHTML bool) string

CleanString removes redundant spaces and stop words from a string. langCode is a BCP 47 or ISO 639-1 language code; if the language is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
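The filtering step described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the name cleanSketch, the stop set passed as a map, and the choice to lowercase the input are all assumptions made for the example, and HTML handling is left out.

```go
package main

import "strings"

// cleanSketch lowercases content, drops every word found in stop, and
// joins the remaining words with single spaces. The stop set is a
// hypothetical stand-in for the package's per-language lists.
func cleanSketch(content string, stop map[string]bool) string {
	kept := make([]string, 0)
	// strings.Fields splits on any run of whitespace, so repeated
	// spaces, tabs, and newlines collapse automatically.
	for _, w := range strings.Fields(strings.ToLower(content)) {
		if !stop[w] {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}
```

Splitting on whitespace is the simplest possible segmentation; a regular-expression segmenter would also separate punctuation from words.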

func CompareSimhash

func CompareSimhash(a uint64, b uint64) uint8

CompareSimhash calculates the Hamming distance between two 64-bit integers using the Kernighan method.
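As an illustration of the Kernighan method mentioned above, the following sketch XORs the two values and then clears the lowest set bit until nothing remains. The name hammingDistance is chosen for the example, and the code is not taken from the package source.

```go
package main

// hammingDistance counts the bit positions in which a and b differ.
func hammingDistance(a, b uint64) uint8 {
	x := a ^ b // set bits mark the differing positions
	var n uint8
	for x != 0 {
		x &= x - 1 // Kernighan's step: clear the lowest set bit
		n++
	}
	return n
}
```

The loop runs once per differing bit rather than once per bit position. The standard library function bits.OnesCount64 from math/bits computes the same count when applied to a ^ b.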

func DontStripDigits

func DontStripDigits()

DontStripDigits changes the behaviour of the default word segmenter so that characters in the Unicode category 'Number, Decimal Digit' (Nd) are treated as part of words.

func LevenshteinDistance

func LevenshteinDistance(contentA []byte, contentB []byte, langCode string, cleanHTML bool) int

LevenshteinDistance computes the Levenshtein distance between two strings. Before comparing, it removes redundant spaces and stop words from each input. langCode is a BCP 47 or ISO 639-1 language code; if the language is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
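For reference, the distance itself can be computed with the classic two-row dynamic programming scheme. The sketch below is a generic version that compares runes; the documentation does not say whether the package compares bytes, runes, or words, so treat the unit of comparison as an assumption. The name levenshtein is chosen for the example.

```go
package main

// levenshtein returns the minimum number of single-rune insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	// prev[j] holds the distance between the first i-1 runes of a
	// and the first j runes of b; cur is the row being filled.
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := prev[j-1] + cost // substitution or match
			if d := prev[j] + 1; d < best { // deletion
				best = d
			}
			if d := cur[j-1] + 1; d < best { // insertion
				best = d
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}
```

Keeping only two rows limits memory to the length of the second input instead of the product of both lengths.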

func LoadStopWordsFromFile

func LoadStopWordsFromFile(filePath string, langCode string, sep string)

LoadStopWordsFromFile loads a list of stop words from a file. filePath is the full path to the file to be loaded. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).

func LoadStopWordsFromString

func LoadStopWordsFromString(wordsList string, langCode string, sep string)

LoadStopWordsFromString loads a list of stop words from a string. wordsList is the string containing the stop words. langCode is a BCP 47 or ISO 639-1 language code (e.g. "en" for English). sep is the string separator (e.g. "\n" for newline).
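To make the role of sep concrete, here is a small sketch of turning a separator-delimited list into a lookup set. The function parseStopWords and its trimming of blank entries are assumptions made for the example; the package's own parsing may differ.

```go
package main

import "strings"

// parseStopWords splits wordsList on sep and returns the non-empty,
// whitespace-trimmed entries as a set.
func parseStopWords(wordsList, sep string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Split(wordsList, sep) {
		if w = strings.TrimSpace(w); w != "" {
			set[w] = true
		}
	}
	return set
}
```

A map keyed by word gives constant-time membership checks while filtering text.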

func OverwriteWordSegmenter

func OverwriteWordSegmenter(expression string)

OverwriteWordSegmenter replaces the default word segmenter with your own regular expression.
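A word segmenter of this kind is a regular expression whose matches are the words. The sketch below uses Go's regexp package with two illustrative patterns: one that matches letters only and one that also accepts decimal digits. Both patterns and the helper segment are assumptions for the example; they are not the package's default expression.

```go
package main

import "regexp"

// Both patterns are illustrative assumptions, not the package's defaults.
var (
	// lettersOnly matches runs of Unicode letters.
	lettersOnly = regexp.MustCompile(`\pL+`)
	// lettersAndDigits also accepts the 'Number, Decimal Digit' category.
	lettersAndDigits = regexp.MustCompile(`[\pL\p{Nd}]+`)
)

// segment returns every substring of text matched by re.
func segment(re *regexp.Regexp, text string) []string {
	return re.FindAllString(text, -1)
}
```

With the input "route 66 west", the letters-only pattern yields "route" and "west", while the second pattern also keeps "66".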

func Simhash

func Simhash(content []byte, langCode string, cleanHTML bool) uint64

Simhash returns a 64-bit simhash representing the content. Before hashing, it removes redundant spaces and stop words from the byte slice. langCode is a BCP 47 or ISO 639-1 language code; if the language is unknown, the English filters are applied. If cleanHTML is true, HTML tags are removed from content and HTML entities are unescaped.
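The general shape of a simhash computation can be sketched as follows: hash each token to 64 bits, let every token vote for or against each bit position, and set the bits that receive a positive total. The choices here (whitespace tokens, FNV-1a as the token hash, a weight of 1 per token, the name simhashSketch) are assumptions for illustration and will not reproduce the package's fingerprints.

```go
package main

import (
	"hash/fnv"
	"strings"
)

// simhashSketch builds a 64-bit fingerprint from whitespace-separated
// tokens, each with weight 1.
func simhashSketch(text string) uint64 {
	var votes [64]int
	for _, tok := range strings.Fields(text) {
		h := fnv.New64a()
		h.Write([]byte(tok)) // Write on a hash.Hash never returns an error
		sum := h.Sum64()
		// A set bit in the token hash votes +1, a clear bit votes -1.
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	// Keep the bits whose total vote is positive.
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}
```

Because the votes are summed, token order does not affect the result, and documents that share most of their tokens tend to produce fingerprints with a small Hamming distance.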

Types

This section is empty.
