package analyser
v0.0.0-...-6bbbb49
Published: Nov 27, 2023 License: Apache-2.0 Imports: 13 Imported by: 0

Documentation
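
Package analyser provides the analysis stages of a text-processing pipeline: a tokenizer that splits documents into tokens, stemmers and lemmatizers that reduce tokens to root terms, and a TF-IDF analyser that weights those terms across a corpus. Every stage implements the Analyser interface, so stages can be chained.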

Index

Constants

const (
	// Separators are the characters treated as token separators.
	Separators  = " \t\n\r"
	// Puntuations are the punctuation characters recognised when splitting tokens.
	Puntuations = ".,;:!?()[]{}\"#@&*+-_\\/'`’~%^<>|="
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Analyser

type Analyser interface {
	// Analyse analyses an imported corpus and returns an analysed corpus.
	Analyse(corpus *Corpus) (*Corpus, error)
}

Analyser is the interface implemented by every analysis stage. An Analyser takes a corpus and returns the analysed result, which allows stages to be chained.
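
Because the contract is a single method, custom stages slot in alongside the built-in ones. A minimal sketch of a hypothetical stop-word filter (not part of this package):

import "strings"

// StopwordFilter is a hypothetical Analyser that removes common words
// from each document's token list before later stages run.
type StopwordFilter struct {
	Stopwords map[string]bool
}

// Analyse implements Analyser. Note that this sketch does not renumber
// Token.Index after filtering.
func (f *StopwordFilter) Analyse(corpus *Corpus) (*Corpus, error) {
	for _, doc := range corpus.Documents {
		kept := doc.Tokens[:0]
		for _, tok := range doc.Tokens {
			if !f.Stopwords[strings.ToLower(tok.Value)] {
				kept = append(kept, tok)
			}
		}
		doc.Tokens = kept
	}
	return corpus, nil
}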

func NewLemmatizer

func NewLemmatizer(language string) Analyser

func NewStemmer

func NewStemmer(language string) Analyser

func NewTFIDF

func NewTFIDF() Analyser

func NewTokenizer

func NewTokenizer() Analyser
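
The built-in analysers are typically chained, feeding each stage's output corpus into the next. A minimal sketch, assuming an already-built *Corpus and that plain language names such as "english" are the values NewStemmer expects:

func process(corpus *Corpus) (*Corpus, error) {
	// Order matters: tokenization must run before stemming, and the
	// TF-IDF stage needs the terms the stemmer produces.
	stages := []Analyser{
		NewTokenizer(),
		NewStemmer("english"), // "english" is an assumed language value
		NewTFIDF(),
	}
	for _, stage := range stages {
		var err error
		corpus, err = stage.Analyse(corpus)
		if err != nil {
			return nil, err
		}
	}
	return corpus, nil
}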

type Corpus

type Corpus struct {
	// The name of the corpus.
	Name string `json:"name"`

	// Description of the corpus.
	Description string `json:"description"`

	// Source of the corpus.
	Source string `json:"source"`

	// URI of the corpus.
	URI string `json:"uri"`

	// Statistics of the corpus.
	Statistics CorpusStatistics `json:"statistics"`

	// The documents in the corpus.
	Documents []*Document `json:"documents"`

	// The unique terms in the corpus.
	Terms map[string]*Term `json:"terms"`
}

Corpus represents a corpus: a collection of documents together with the corpus-wide terms and statistics derived from them.

func CorpusFromImport

func CorpusFromImport(imported *importer.Import) *Corpus

CorpusFromImport converts an import to a corpus.
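
A short usage sketch, assuming imported was produced by the module's importer package:

corpus := CorpusFromImport(imported) // imported is an *importer.Import
fmt.Println(corpus.Name, len(corpus.Documents))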

type CorpusStatistics

type CorpusStatistics struct {
	// The number of documents in the corpus.
	NumDocs int `json:"num_docs"`

	// The number of tokens in the corpus.
	NumTokens int `json:"num_tokens"`

	// The number of document terms in the corpus,
	// summed over all documents.
	NumDocTerms int `json:"num_doc_terms"`

	// The number of unique terms in the corpus.
	NumUniqueTerms int `json:"num_unique_terms"`
}

CorpusStatistics represents statistics of a corpus.
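
After an analysis run, the counters can be read straight off the corpus, for example:

s := corpus.Statistics
fmt.Printf("docs=%d tokens=%d doc terms=%d unique terms=%d\n",
	s.NumDocs, s.NumTokens, s.NumDocTerms, s.NumUniqueTerms)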

type DocTerm

type DocTerm struct {
	// The root value of the term.
	Value string `json:"v"`

	// The tokens that the term is derived from.
	Tokens []*Token `json:"ts"`

	// The TF of the term (calculated by the tfidf analyser).
	TF float64 `json:"tf"`

	// The TFIDF of the term (calculated by the tfidf analyser).
	TFIDF float64 `json:"tfidf"`
}

DocTerm represents a term in a document. It is the result of processing a token, and it keeps track of the tokens it is derived from.
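
The TF and TFIDF fields follow the usual term-weighting scheme. A sketch of the classic formulation, which may differ from the exact normalisation this package's tfidf analyser applies:

import "math"

// tfidf recomputes a standard TF-IDF weight for one document term:
// dt is the term within doc, t is the corpus-wide term, and numDocs
// is the total number of documents in the corpus.
func tfidf(dt *DocTerm, t *Term, doc *Document, numDocs int) float64 {
	tf := float64(len(dt.Tokens)) / float64(len(doc.Tokens))      // term frequency
	idf := math.Log(float64(numDocs) / float64(len(t.Documents))) // inverse document frequency
	return tf * idf
}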

type Document

type Document struct {
	// The ID of the document.
	DocID int `json:"doc_id"`

	// URI of the document.
	URI string `json:"uri"`

	// Content of the document.
	Content string `json:"content"`

	// Tokens of the document.
	Tokens []*Token `json:"tokens"`

	// The terms of the document, keyed by term value.
	Terms map[string]*DocTerm `json:"terms"`
}

Document represents a document in a corpus: a collection of tokens and the terms derived from them.
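
A sketch of walking a processed document's terms and their weights:

for value, term := range doc.Terms {
	fmt.Printf("%-20s tf=%.4f tfidf=%.4f occurrences=%d\n",
		value, term.TF, term.TFIDF, len(term.Tokens))
}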

type Lemmatizer

type Lemmatizer struct {
	Language string
}

func (*Lemmatizer) Analyse

func (l *Lemmatizer) Analyse(corpus *Corpus) (*Corpus, error)

type Stemmer

type Stemmer struct {
	Language string
}

func (*Stemmer) Analyse

func (s *Stemmer) Analyse(corpus *Corpus) (*Corpus, error)
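
Stemmer and Lemmatizer both reduce tokens to root terms for the configured Language; in general, stemming strips suffixes heuristically, while lemmatization maps words to dictionary lemmas, so the two can produce different roots for the same token.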

type TFIDF

type TFIDF struct {
}

func (*TFIDF) Analyse

func (t *TFIDF) Analyse(corpus *Corpus) (*Corpus, error)
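
As the field comments on DocTerm and Term note, this analyser populates TF and TFIDF on each document term and DF and IDF on each corpus term.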

type Term

type Term struct {
	// The root value of the term.
	Value string `json:"v"`

	// The tokens that the term is derived from.
	Tokens []*Token `json:"ts"`

	// The IDs of the documents that the term appears in.
	Documents []int `json:"ds"`

	// The DF of the term (calculated by the tfidf analyser).
	DF float64 `json:"df"`

	// The IDF of the term (calculated by the tfidf analyser).
	IDF float64 `json:"idf"`
}
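
Term represents a unique term across the corpus. It keeps track of the tokens and documents it is derived from.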

type Token

type Token struct {
	// The index of the word in the document.
	Index int `json:"i"`

	// The original form of the word in the document,
	// before any preprocessing.
	Value string `json:"v"`

	// The ID of the document that the word belongs to.
	DocId int `json:"d"`

	// The position of the word in the document.
	Position TokenPosition `json:"p"`
}

Token represents a word in a document.

type TokenPosition

type TokenPosition struct {
	Start int `json:"s"`
	End   int `json:"e"`
}
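
TokenPosition records where a token occurs in its document. A sketch of recovering a token's surface form, assuming Start and End are byte offsets into Document.Content:

tok := doc.Tokens[0]
surface := doc.Content[tok.Position.Start:tok.Position.End]
fmt.Println(surface) // should match tok.Value if the offsets are byte-based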

type Tokenizer

type Tokenizer struct {
}

func (*Tokenizer) Analyse

func (t *Tokenizer) Analyse(corpus *Corpus) (*Corpus, error)
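
Tokenizer splits each document's Content into Tokens, presumably using the Separators and Puntuations constants defined above.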
