nlptools

package
v0.0.0-...-62718c5 Latest
Warning

This package is not in the latest version of its module.

Published: May 15, 2021 License: MIT Imports: 5 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type TFIDF

type TFIDF struct {
	// DocIndex maps each training document to its index in TermFreqs.
	DocIndex map[string]int
	// TermFreqs holds the term frequencies of each training document.
	TermFreqs []map[string]int
	// TermDocs holds, for each term, the number of training documents that contain it.
	TermDocs map[string]int
	// N is the number of documents in the training data.
	N int
	// StopWords is the set of words to be filtered out.
	StopWords map[string]struct{}
	// Tokenizer is the tokenizer; a space is used by default.
	Tokenizer string
}

TFIDF is a Term Frequency-Inverse Document Frequency model that is created from a trained NaiveBayes model. The two are very similar, so you can train NaiveBayes and convert the result into TFIDF.

This is not necessarily a probabilistic model, and it does not perform classification. It can, however, be used to determine the 'importance' of a word in a document, which is useful for tasks such as keyword tagging.

Term frequency is the adjusted frequency of a word within a document or sentence:

termFrequency(word, doc) = 0.5 * ( 0.5 * word.Count ) / max{ w.Count | w ∈ doc }

Inverse document frequency measures how rarely the term is mentioned across all of your documents:

invDocumentFrequency(word, Docs) = log( len(Docs) ) - log( 1 + |{ d ∈ Docs | word ∈ d }| )

TFIDF is the product of those two functions, giving a value that is larger when the word is more important and smaller when the word is less important.

TFIDF is the tf-idf model.

func NewTFIDF

func NewTFIDF() *TFIDF

NewTFIDF creates a new model with the default settings.

func NewTokenizer

func NewTokenizer(tokenizer tokenizers.Tokenizer) *TFIDF

NewTokenizer creates a new model with the specified tokenizer. It works well in GOLD.

func (*TFIDF) AddDocs

func (f *TFIDF) AddDocs(docs ...string)

AddDocs adds training documents to the model.
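To make the roles of DocIndex, TermFreqs, TermDocs and N concrete, here is a rough sketch of the bookkeeping that adding training documents could involve. Everything in it is an assumption for illustration: the model type is a hypothetical stand-in that mirrors the documented fields, documents are keyed by their full text, exact duplicates are skipped, and tokens are split on whitespace. The package's own AddDocs may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// model is a hypothetical stand-in mirroring some of the documented TFIDF fields.
type model struct {
	DocIndex  map[string]int      // document text -> index in TermFreqs (assumed key)
	TermFreqs []map[string]int    // per-document term counts
	TermDocs  map[string]int      // term -> number of documents containing it
	N         int                 // number of documents added
	StopWords map[string]struct{} // words to skip
}

func newModel() *model {
	return &model{
		DocIndex:  make(map[string]int),
		TermDocs:  make(map[string]int),
		StopWords: make(map[string]struct{}),
	}
}

// addDocs records term counts for each new document and updates the
// document-level counters. Exact duplicates are ignored (an assumption).
func (m *model) addDocs(docs ...string) {
	for _, doc := range docs {
		if _, seen := m.DocIndex[doc]; seen {
			continue
		}
		freq := make(map[string]int)
		for _, term := range strings.Fields(doc) {
			if _, stop := m.StopWords[term]; stop {
				continue // filtered stop word
			}
			freq[term]++
		}
		m.DocIndex[doc] = len(m.TermFreqs)
		m.TermFreqs = append(m.TermFreqs, freq)
		for term := range freq {
			m.TermDocs[term]++ // each document counts once per term
		}
		m.N++
	}
}

func main() {
	m := newModel()
	m.StopWords["the"] = struct{}{}
	m.addDocs("the cat sat", "the cat ran")
	fmt.Println(m.N, m.TermDocs["cat"], m.TermDocs["sat"])
}
```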

func (*TFIDF) AddStopWords

func (f *TFIDF) AddStopWords(words ...string)

AddStopWords adds stop words to be filtered.

func (*TFIDF) AddStopWordsFile

func (f *TFIDF) AddStopWordsFile(file string) (err error)

AddStopWordsFile adds the stop words listed in a file to the filter. The file contains one word per line.

func (*TFIDF) Cal

func (f *TFIDF) Cal(doc string) (weight map[string]float64)

Cal calculates the tf-idf weight of each term in the specified document.
