nlptools

package

Version: v0.0.0-...-62718c5 | Published: May 15, 2021 | License: MIT | Imports: 5 | Imported by: 0

Repository

github.com/broosaction/gotext

Documentation ¶

Index ¶

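type TFIDF
- func NewTFIDF() *TFIDF
- func NewTokenizer(tokenizer tokenizers.Tokenizer) *TFIDF
- func (f *TFIDF) AddDocs(docs ...string)
- func (f *TFIDF) AddStopWords(words ...string)
- func (f *TFIDF) AddStopWordsFile(file string) (err error)
- func (f *TFIDF) Cal(doc string) (weight map[string]float64)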

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type TFIDF ¶

type TFIDF struct {
	// DocIndex gives each training document's index in TermFreqs.
	DocIndex map[string]int
	// TermFreqs holds the term frequencies of each training document.
	TermFreqs []map[string]int
	// TermDocs holds, for each term, the number of training documents containing it.
	TermDocs map[string]int
	// N is the number of documents in the training data.
	N int
	// StopWords is the set of words to be filtered out.
	StopWords map[string]struct{}
	// Tokenizer selects the tokenizer; splitting on spaces is the default.
	Tokenizer string
}

TFIDF is a Term Frequency-Inverse Document Frequency model that is created from a trained NaiveBayes model (the two are very similar, so you can train NaiveBayes and convert it into TFIDF).

This is not necessarily a probabilistic model, and it does not perform classification. It can, however, be used to determine the 'importance' of a word in a document, which is useful for tasks such as keyword tagging.

Term frequency is the adjusted (max-normalized) frequency of a word within a document or sentence:

	termFrequency(word, doc) = 0.5 + 0.5 * word.Count / max{ w.Count | w ∈ doc }
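A minimal sketch of max-normalized ("augmented") term frequency, assuming the formula above is meant as the common form 0.5 + 0.5 * count / maxCount. The helper names `termCounts` and `augmentedTF` are hypothetical, and the documentation does not confirm that the package computes exactly this.

```go
package main

import (
	"fmt"
	"strings"
)

// termCounts counts how often each whitespace-separated token occurs in doc.
func termCounts(doc string) map[string]int {
	counts := make(map[string]int)
	for _, w := range strings.Fields(doc) {
		counts[w]++
	}
	return counts
}

// augmentedTF returns 0.5 + 0.5*count(word)/maxCount, the assumed reading of
// the term-frequency formula. It returns 0 when counts is empty.
func augmentedTF(word string, counts map[string]int) float64 {
	maxCount := 0
	for _, c := range counts {
		if c > maxCount {
			maxCount = c
		}
	}
	if maxCount == 0 {
		return 0
	}
	return 0.5 + 0.5*float64(counts[word])/float64(maxCount)
}

func main() {
	counts := termCounts("the cat sat on the mat")
	fmt.Println(augmentedTF("the", counts)) // 1
	fmt.Println(augmentedTF("cat", counts)) // 0.75
}
```

Dividing by the count of the most frequent word keeps long documents from dominating purely because they contain more words.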

Inverse document frequency measures how rarely the term appears across all of your documents:

	invDocumentFrequency(word, Docs) = log( len(Docs) ) - log( 1 + |{ d ∈ Docs | word ∈ d }| )
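The inverse document frequency formula above can be sketched as follows. The function name and the corpus sizes are hypothetical, chosen only to illustrate the arithmetic.

```go
package main

import (
	"fmt"
	"math"
)

// invDocumentFrequency returns log(n) - log(1 + docsWithTerm), where n is the
// total number of documents and docsWithTerm is how many contain the term.
// It returns 0 for an empty corpus to avoid taking log(0).
func invDocumentFrequency(n, docsWithTerm int) float64 {
	if n <= 0 {
		return 0
	}
	return math.Log(float64(n)) - math.Log(1+float64(docsWithTerm))
}

func main() {
	// Hypothetical corpus of 100 documents.
	fmt.Printf("rare:   %.3f\n", invDocumentFrequency(100, 1))
	fmt.Printf("common: %.3f\n", invDocumentFrequency(100, 99))
}
```

Because of the `1 +` in the second logarithm, a term that appears in every document gets a slightly negative value under this formula.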

TFIDF is the product of those two functions, giving a value that is larger when the word is more important and smaller when it is less important.

The TFIDF struct holds the state of the tf-idf model.

func NewTFIDF ¶

func NewTFIDF() *TFIDF

NewTFIDF creates a new model with the default settings.

func NewTokenizer ¶

func NewTokenizer(tokenizer tokenizers.Tokenizer) *TFIDF

NewTokenizer creates a new model with the specified tokenizer; per the source comment, this works well in GOLD.

func (*TFIDF) AddDocs ¶

func (f *TFIDF) AddDocs(docs ...string)

AddDocs adds training documents to the model.
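To illustrate what adding training documents might involve, here is a sketch built around a hypothetical `corpus` type that mirrors the documented TFIDF fields. Keying DocIndex by the document text and skipping duplicate documents are assumptions of this sketch, not confirmed behavior of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// corpus is a hypothetical stand-in that mirrors the documented TFIDF fields.
type corpus struct {
	DocIndex  map[string]int   // document text -> position in TermFreqs (assumed key)
	TermFreqs []map[string]int // per-document term counts
	TermDocs  map[string]int   // term -> number of documents containing it
	N         int              // number of documents
}

func newCorpus() *corpus {
	return &corpus{DocIndex: map[string]int{}, TermDocs: map[string]int{}}
}

// addDocs records each new document once, splitting it on whitespace.
func (c *corpus) addDocs(docs ...string) {
	for _, doc := range docs {
		if _, seen := c.DocIndex[doc]; seen {
			continue // skip duplicates (an assumption of this sketch)
		}
		freqs := make(map[string]int)
		for _, term := range strings.Fields(doc) {
			freqs[term]++
		}
		c.DocIndex[doc] = len(c.TermFreqs)
		c.TermFreqs = append(c.TermFreqs, freqs)
		for term := range freqs {
			c.TermDocs[term]++ // each document counts once per term
		}
		c.N++
	}
}

func main() {
	c := newCorpus()
	c.addDocs("a b a", "b c", "a b a")
	fmt.Println(c.N, c.TermDocs["a"], c.TermDocs["b"], c.TermFreqs[0]["a"]) // 2 1 2 2
}
```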

func (*TFIDF) AddStopWords ¶

func (f *TFIDF) AddStopWords(words ...string)

AddStopWords adds stop words to be filtered out.
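Stop-word filtering with a `map[string]struct{}` set, the same type as the documented StopWords field, could look like the sketch below. The `filterStopWords` helper is hypothetical and splits on whitespace for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// filterStopWords returns the tokens of doc that are not in stopWords.
func filterStopWords(doc string, stopWords map[string]struct{}) []string {
	var kept []string
	for _, tok := range strings.Fields(doc) {
		if _, stop := stopWords[tok]; !stop {
			kept = append(kept, tok)
		}
	}
	return kept
}

func main() {
	stop := map[string]struct{}{"the": {}, "on": {}}
	fmt.Println(filterStopWords("the cat sat on the mat", stop)) // [cat sat mat]
}
```

An empty struct value takes no storage, which is why `map[string]struct{}` is a common way to represent a set in Go.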

func (*TFIDF) AddStopWordsFile ¶

func (f *TFIDF) AddStopWordsFile(file string) (err error)

AddStopWordsFile adds the stop words listed in a file, one word per line, to the set of words to be filtered out.
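A sketch of reading a one-word-per-line stop-word list into a set. The helpers `readStopWords` and `parseStopWords` are hypothetical; trimming whitespace and skipping blank lines are assumptions, since the documentation only specifies one word per line.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readStopWords reads one stop word per line from r, skipping blank lines.
func readStopWords(r io.Reader) (map[string]struct{}, error) {
	words := make(map[string]struct{})
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if w := strings.TrimSpace(sc.Text()); w != "" {
			words[w] = struct{}{}
		}
	}
	return words, sc.Err()
}

// parseStopWords applies readStopWords to in-memory text.
func parseStopWords(text string) (map[string]struct{}, error) {
	return readStopWords(strings.NewReader(text))
}

func main() {
	// In-memory text stands in for the contents of a stop-word file.
	words, err := parseStopWords("the\non\n\na\n")
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Println(len(words)) // 3
}
```

For a real file, the result of `os.Open` can be passed to `readStopWords` directly, since `*os.File` implements `io.Reader`.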

func (*TFIDF) Cal ¶

func (f *TFIDF) Cal(doc string) (weight map[string]float64)

Cal calculates the tf-idf weight of each term in the specified document.
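As an illustration of what a per-term weight map can contain, the sketch below multiplies an augmented term frequency (0.5 + 0.5 * count / maxCount, an assumed reading of the type description) by log(N) - log(1 + documents containing the term). The `weights` function and its `termDocs` and `n` inputs are hypothetical; this is not the package's implementation of Cal.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// weights returns a tf-idf weight for each distinct term in doc.
// termDocs maps a term to the number of training documents containing it and
// n is the number of training documents; both are hypothetical inputs here.
func weights(doc string, termDocs map[string]int, n int) map[string]float64 {
	out := make(map[string]float64)
	if n <= 0 {
		return out // no training data, nothing to weigh against
	}
	counts := make(map[string]int)
	maxCount := 0
	for _, term := range strings.Fields(doc) {
		counts[term]++
		if counts[term] > maxCount {
			maxCount = counts[term]
		}
	}
	for term, c := range counts {
		tf := 0.5 + 0.5*float64(c)/float64(maxCount)
		idf := math.Log(float64(n)) - math.Log(1+float64(termDocs[term]))
		out[term] = tf * idf
	}
	return out
}

func main() {
	termDocs := map[string]int{"the": 9, "cat": 1}
	for term, w := range weights("the cat", termDocs, 10) {
		fmt.Printf("%s: %.3f\n", term, w)
	}
}
```

In this example "the" occurs in 9 of 10 training documents and gets weight 0, while the rarer "cat" gets a positive weight.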

Source Files ¶

tfidf.go

Directories ¶

Path	Synopsis
en
