nlp

package v0.4.0
Published: Nov 24, 2023 License: MIT Imports: 8 Imported by: 2

Documentation

Overview

Package nlp provides basic NLP utilities.

Constants

This section is empty.

Variables

var LdaVerbose = false

LdaVerbose determines whether progress information is printed during LDA. Intended for debugging.

var StopWords = map[string]bool{ /* 569 elements not displayed */ }

StopWords is a map of stop words, for token filtering. Modifying this map will affect the Tokenize function.

Taken from: http://www.ranks.nl/stopwords
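
For example, a minimal sketch of adding a custom stop word before tokenizing. The import path github.com/fluhus/gostuff/nlp used here and in the sketches below is an assumption; this page does not state the module path.

	package main

	import (
		"fmt"

		"github.com/fluhus/gostuff/nlp" // assumed import path
	)

	func main() {
		// Treat "ipsum" as a stop word; Tokenize will now drop it.
		nlp.StopWords["ipsum"] = true
		fmt.Println(nlp.Tokenize("Lorem ipsum dolor", false))
	}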

var Tokenizer = regexp.MustCompile("\\w([\\w']*\\w)?")

Tokenizer splits text into tokens. This regexp represents a single word. Changing this regexp will affect the Tokenize function.
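
For instance, a sketch of widening the pattern so hyphenated words stay single tokens; the replacement pattern is illustrative, not part of the package:

	package main

	import (
		"fmt"
		"regexp"

		"github.com/fluhus/gostuff/nlp" // assumed import path
	)

	func main() {
		// Allow hyphens inside tokens, e.g. "state-of-the-art".
		nlp.Tokenizer = regexp.MustCompile(`\w([\w'-]*\w)?`)
		fmt.Println(nlp.Tokenize("A state-of-the-art tokenizer", true))
	}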

Functions

func Lda

func Lda(docTokens [][]string, k int) (map[string][]float64, [][]int)

Lda performs LDA (latent Dirichlet allocation) on the given data. docTokens should contain tokenized documents, such that docTokens[i][j] is the j'th token in the i'th document. k is the number of topics. Returns the topics and the token-topic assignments, corresponding to docTokens.

Topics are returned as a map from word to a probability vector, such that the i'th position is the probability of the i'th topic generating that word. For each i, the i'th positions across all words sum to 1.
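
A minimal usage sketch with a toy corpus (illustrative values, same assumed import path):

	package main

	import (
		"fmt"

		"github.com/fluhus/gostuff/nlp" // assumed import path
	)

	func main() {
		docTokens := [][]string{
			nlp.Tokenize("The cat sat on the mat.", false),
			nlp.Tokenize("Dogs and cats make great pets.", false),
			nlp.Tokenize("Stock markets fell sharply today.", false),
		}
		topics, assignments := nlp.Lda(docTokens, 2) // k = 2 topics
		for word, probs := range topics {
			// probs[i] is the probability of topic i generating word.
			fmt.Println(word, probs)
		}
		// assignments[0][j] is the topic of the j'th token of document 0.
		fmt.Println(assignments[0])
	}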

func LdaThreads

func LdaThreads(docTokens [][]string, k, numThreads int) (map[string][]float64,
	[][]int)

LdaThreads is like the function Lda but runs on multiple goroutines. Calling this function with 1 thread is equivalent to calling Lda.
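
A sketch of the parallel variant, using one goroutine per CPU (same assumed import path):

	package main

	import (
		"fmt"
		"runtime"

		"github.com/fluhus/gostuff/nlp" // assumed import path
	)

	func main() {
		docTokens := [][]string{
			nlp.Tokenize("The cat sat on the mat.", false),
			nlp.Tokenize("Stock markets fell sharply today.", false),
		}
		// numThreads = 1 would behave exactly like Lda.
		topics, _ := nlp.LdaThreads(docTokens, 2, runtime.NumCPU())
		fmt.Println(len(topics))
	}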

func Stem

func Stem(s string) string

Stem Porter-stems the given word.
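
For example (expected outputs are standard Porter-stemmer results from Porter's paper, not verified against this package):

	package main

	import (
		"fmt"

		"github.com/fluhus/gostuff/nlp" // assumed import path
	)

	func main() {
		fmt.Println(nlp.Stem("caresses")) // caress
		fmt.Println(nlp.Stem("ponies"))   // poni
		fmt.Println(nlp.Stem("running"))  // run
	}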

func TfIdf

func TfIdf(docTokens [][]string) []map[string]float64

TfIdf returns the TF-IDF scores of the given corpus. For each document, returns a map from token to its TF-IDF score.

TF = count(token in document) / count(all tokens in document)

IDF = log(count(documents) / count(documents with token))
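
A small worked sketch; the base of the logarithm is not stated here, so the natural log in the comments is an assumption:

	package main

	import (
		"fmt"

		"github.com/fluhus/gostuff/nlp" // assumed import path
	)

	func main() {
		docTokens := [][]string{
			{"cat", "sat", "mat"},
			{"cat", "dog"},
		}
		scores := nlp.TfIdf(docTokens)
		// "cat" appears in every document, so IDF = log(2/2) = 0.
		fmt.Println(scores[0]["cat"])
		// "mat" appears in 1 of 2 documents: TF = 1/3, IDF = log(2/1),
		// giving log(2)/3 ≈ 0.231 if the log is natural (an assumption).
		fmt.Println(scores[0]["mat"])
	}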

func Tokenize

func Tokenize(s string, keepStopWords bool) []string

Tokenize splits the given text into a slice of stemmed, lowercase words. If keepStopWords is false, stop words are dropped.
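
A short sketch of both modes; the exact output depends on the stop-word list and the stemmer:

	package main

	import (
		"fmt"

		"github.com/fluhus/gostuff/nlp" // assumed import path
	)

	func main() {
		// With stop words dropped: roughly [cat run].
		fmt.Println(nlp.Tokenize("The cats were running!", false))
		// With stop words kept: roughly [the cat were run].
		fmt.Println(nlp.Tokenize("The cats were running!", true))
	}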

Types

This section is empty.

Directories

Path	Synopsis
lda-tool	Command lda-tool performs LDA on the input documents.
wordnet	Package wordnet provides a WordNet parser and interface.
