tfidf

package
v0.0.0-...-e90a610
Published: Jul 20, 2022 License: MIT Imports: 4 Imported by: 0

README

TF-IDF

Illustration

TF-IDF (Term Frequency-Inverse Document Frequency): The importance of a term is proportional to the number of times it appears in a document and inversely proportional to the number of documents in the corpus that contain it.

TF (Term Frequency) = # of times the term appears in the doc / total # of words in the doc

IDF (Inverse Document Frequency) = log( total # of docs / # of docs containing the term )

The denominator is adjusted by +1 if no doc in the corpus contains the term, which avoids division by zero.

TF-IDF = TF * IDF
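
As a rough illustration of the formulas above, here is a minimal standalone sketch that does not use this package. The helper names termFrequency and inverseDocFrequency are hypothetical, whitespace tokenization via strings.Fields is an assumption, and so is the choice of the natural logarithm; the package itself may tokenize and scale differently.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// termFrequency returns (# of times term appears in words) / (total # of words).
func termFrequency(term string, words []string) float64 {
	if len(words) == 0 {
		return 0
	}
	n := 0
	for _, w := range words {
		if w == term {
			n++
		}
	}
	return float64(n) / float64(len(words))
}

// inverseDocFrequency returns log(total # of docs / # of docs containing term).
// If no doc contains the term, the denominator is adjusted to 1.
// Assumption: natural logarithm.
func inverseDocFrequency(term string, corpus [][]string) float64 {
	containing := 0
	for _, doc := range corpus {
		for _, w := range doc {
			if w == term {
				containing++
				break // count each doc at most once
			}
		}
	}
	if containing == 0 {
		containing = 1
	}
	return math.Log(float64(len(corpus)) / float64(containing))
}

func main() {
	corpus := [][]string{
		strings.Fields("hi there"),
		strings.Fields("how are you"),
		strings.Fields("how do you do"),
	}
	doc := strings.Fields("where are you")
	for _, term := range doc {
		w := termFrequency(term, doc) * inverseDocFrequency(term, corpus)
		fmt.Printf("%s: %.4f\n", term, w)
	}
}
```

Under these assumptions, "you" appears once among three words (TF = 1/3) and in two of three docs (IDF = ln(3/2)), so its weight is about 0.135. A term that appears in every doc gets IDF = 0 and therefore weight 0.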

Cosine Similarity = ( A · B ) / ( |A| * |B| ), i.e. the cosine of the angle between the vectors A and B
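
The cosine similarity of two term-weight vectors can be sketched as follows. This is an illustrative standalone example, not the package's implementation: the function name cosineSimilarity is hypothetical, and representing a vector as map[string]float64 (term to weight) is an assumption based on the return type of Cal.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns (a · b) / (|a| * |b|) for two sparse vectors
// keyed by term. Terms missing from a map are treated as weight 0.
// It returns 0 if either vector has zero length.
func cosineSimilarity(a, b map[string]float64) float64 {
	var dot, normA, normB float64
	for term, wa := range a {
		dot += wa * b[term] // b[term] is 0 when the term is absent
		normA += wa * wa
	}
	for _, wb := range b {
		normB += wb * wb
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	a := map[string]float64{"how": 0.2, "you": 0.1}
	b := map[string]float64{"you": 0.3, "where": 0.5}
	fmt.Printf("similarity: %.4f\n", cosineSimilarity(a, b))
}
```

Vectors that share no terms score 0, and a vector compared with itself scores 1.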

Usage

t := tfidf.NewTFIDF()

t.AddDocs("hi there", "how are you", "how do you do")

doc := "where are you"
weight := t.Cal(doc)

fmt.Printf("weight of %s is %+v.\n", doc, weight)

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Corpus

type Corpus struct {
	// TermCount stores every term that appears in the corpus as key
	// and the # of docs containing the term as value. It is
	// used for calculating idf.
	TermCount map[string]float64
	// Documents stores all documents with their content hash as key
	// and the document as value. It is used for calculating idf.
	Documents map[string]Document
	Documents map[string]Document
}

Corpus holds the per-term document counts and the documents added so far.

type Document

type Document struct {
	ID      string
	Content string
	Terms   []string
	// contains filtered or unexported fields
}

Document holds a document's ID, its content, and its terms.

type TFIDF

type TFIDF struct {
	Corpus *Corpus
	// contains filtered or unexported fields
}

TFIDF defines TFIDF

func NewTFIDF

func NewTFIDF() *TFIDF

NewTFIDF returns a new TFIDF.

func NewTFIDFWithTokenizer

func NewTFIDFWithTokenizer(t tokenizer.Tokenizer) *TFIDF

NewTFIDFWithTokenizer returns a new TFIDF that uses the given tokenizer.

func (*TFIDF) AddDoc

func (t *TFIDF) AddDoc(doc string)

AddDoc adds doc to the corpus by: 1. Updating Corpus for later calculation of other docs' idf (as numerator) 2. Updating TermDocMap for later calculation of other docs' idf (as denominator)
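
The bookkeeping described above might look roughly like the sketch below. It is a standalone illustration, not this package's code: the type sketchCorpus and its methods are hypothetical, SHA-256 is an assumed choice of content hash, whitespace tokenization is assumed, and skipping a document whose hash is already present is also an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sketchCorpus is a hypothetical stand-in for Corpus.
type sketchCorpus struct {
	termCount map[string]float64  // term -> # of docs containing the term
	documents map[string][]string // content hash -> terms of the doc
}

func newSketchCorpus() *sketchCorpus {
	return &sketchCorpus{
		termCount: make(map[string]float64),
		documents: make(map[string][]string),
	}
}

// addDoc stores the doc under its content hash (which grows the total
// doc count, the idf numerator) and bumps the per-term doc counts
// (the idf denominator).
func (c *sketchCorpus) addDoc(doc string) {
	sum := sha256.Sum256([]byte(doc))
	id := hex.EncodeToString(sum[:])
	if _, ok := c.documents[id]; ok {
		return // assumption: identical content is indexed only once
	}
	terms := strings.Fields(doc)
	c.documents[id] = terms

	seen := make(map[string]bool)
	for _, t := range terms {
		if !seen[t] {
			seen[t] = true
			c.termCount[t]++ // count each term once per doc
		}
	}
}

func main() {
	c := newSketchCorpus()
	for _, d := range []string{"hi there", "how are you", "how do you do"} {
		c.addDoc(d)
	}
	fmt.Println(len(c.documents), c.termCount["how"], c.termCount["do"])
}
```

Counting each term once per document, rather than once per occurrence, is what makes termCount usable as the "# of docs containing the term" in the idf formula.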

func (*TFIDF) AddDocs

func (t *TFIDF) AddDocs(docs ...string)

AddDocs adds docs in batch

func (*TFIDF) Cal

func (t *TFIDF) Cal(doc string) map[string]float64

Cal calculates tfidf for the doc

func (*TFIDF) CalAll

func (t *TFIDF) CalAll()

CalAll calculates tf-idf for all documents in the corpus

func (*TFIDF) Query

func (t *TFIDF) Query(doc string) map[string]float64

Query returns the calculated similarity of each document in the corpus to the given doc.
