package analyser
v0.0.0-...-6bbbb49
Published: Nov 27, 2023 License: Apache-2.0 Imports: 13 Imported by: 0

Documentation
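
Package analyser provides the analysis stages of a text-processing pipeline: a tokenizer that splits documents into tokens, stemmers and lemmatizers that reduce tokens to root terms, and a TF-IDF analyser that weights those terms across a corpus. Every stage implements the Analyser interface, so stages can be chained.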

Index

Constants

const (
	// Separators are the characters treated as token separators.
	Separators  = " \t\n\r"
	// Puntuations are the punctuation characters recognised when splitting tokens.
	Puntuations = ".,;:!?()[]{}\"#@&*+-_\\/'`’~%^<>|="
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Analyser

type Analyser interface {
	// Analyse analyses an imported corpus and returns an analysed corpus.
	Analyse(corpus *Corpus) (*Corpus, error)
}

Analyser is the interface implemented by every analysis stage. An Analyser takes a corpus and returns the analysed result, which allows stages to be chained.
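
Because the contract is a single method, custom stages slot in alongside the built-in ones. A minimal sketch of a hypothetical stop-word filter (not part of this package):

import "strings"

// StopwordFilter is a hypothetical Analyser that removes common words
// from each document's token list before later stages run.
type StopwordFilter struct {
	Stopwords map[string]bool
}

// Analyse implements Analyser. Note that this sketch does not renumber
// Token.Index after filtering.
func (f *StopwordFilter) Analyse(corpus *Corpus) (*Corpus, error) {
	for _, doc := range corpus.Documents {
		kept := doc.Tokens[:0]
		for _, tok := range doc.Tokens {
			if !f.Stopwords[strings.ToLower(tok.Value)] {
				kept = append(kept, tok)
			}
		}
		doc.Tokens = kept
	}
	return corpus, nil
}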

func NewLemmatizer

func NewLemmatizer(language string) Analyser

func NewStemmer

func NewStemmer(language string) Analyser

func NewTFIDF

func NewTFIDF() Analyser

func NewTokenizer

func NewTokenizer() Analyser
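
The built-in analysers are typically chained, feeding each stage's output corpus into the next. A minimal sketch, assuming an already-built *Corpus and that plain language names such as "english" are the values NewStemmer expects:

func process(corpus *Corpus) (*Corpus, error) {
	// Order matters: tokenization must run before stemming, and the
	// TF-IDF stage needs the terms the stemmer produces.
	stages := []Analyser{
		NewTokenizer(),
		NewStemmer("english"), // "english" is an assumed language value
		NewTFIDF(),
	}
	for _, stage := range stages {
		var err error
		corpus, err = stage.Analyse(corpus)
		if err != nil {
			return nil, err
		}
	}
	return corpus, nil
}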

type Corpus

type Corpus struct {
	// The name of the corpus.
	Name string `json:"name"`

	// Description of the corpus.
	Description string `json:"description"`

	// Source of the corpus.
	Source string `json:"source"`

	// URI of the corpus.
	URI string `json:"uri"`

	// Statistics of the corpus.
	Statistics CorpusStatistics `json:"statistics"`

	// The documents in the corpus.
	Documents []*Document `json:"documents"`

	// The unique terms in the corpus.
	Terms map[string]*Term `json:"terms"`
}

Corpus represents a corpus: a collection of documents together with the corpus-wide terms and statistics derived from them.

func CorpusFromImport

func CorpusFromImport(imported *importer.Import) *Corpus

CorpusFromImport converts an import to a corpus.
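
A short usage sketch, assuming imported was produced by the module's importer package:

corpus := CorpusFromImport(imported) // imported is an *importer.Import
fmt.Println(corpus.Name, len(corpus.Documents))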

type CorpusStatistics

type CorpusStatistics struct {
	// The number of documents in the corpus.
	NumDocs int `json:"num_docs"`

	// The number of tokens in the corpus.
	NumTokens int `json:"num_tokens"`

	// The number of document terms in the corpus,
	// summed over all documents.
	NumDocTerms int `json:"num_doc_terms"`

	// The number of unique terms in the corpus.
	NumUniqueTerms int `json:"num_unique_terms"`
}

CorpusStatistics represents statistics of a corpus.
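
After an analysis run, the counters can be read straight off the corpus, for example:

s := corpus.Statistics
fmt.Printf("docs=%d tokens=%d doc terms=%d unique terms=%d\n",
	s.NumDocs, s.NumTokens, s.NumDocTerms, s.NumUniqueTerms)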

type DocTerm

type DocTerm struct {
	// The root value of the term.
	Value string `json:"v"`

	// The tokens that the term is derived from.
	Tokens []*Token `json:"ts"`

	// The TF of the term (calculated by the tfidf analyser).
	TF float64 `json:"tf"`

	// The TFIDF of the term (calculated by the tfidf analyser).
	TFIDF float64 `json:"tfidf"`
}

DocTerm represents a term in a document. It is the result of processing a token, and it keeps track of the tokens it is derived from.
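
The TF and TFIDF fields follow the usual term-weighting scheme. A sketch of the classic formulation, which may differ from the exact normalisation this package's tfidf analyser applies:

import "math"

// tfidf recomputes a standard TF-IDF weight for one document term:
// dt is the term within doc, t is the corpus-wide term, and numDocs
// is the total number of documents in the corpus.
func tfidf(dt *DocTerm, t *Term, doc *Document, numDocs int) float64 {
	tf := float64(len(dt.Tokens)) / float64(len(doc.Tokens))      // term frequency
	idf := math.Log(float64(numDocs) / float64(len(t.Documents))) // inverse document frequency
	return tf * idf
}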

type Document

type Document struct {
	// The ID of the document.
	DocID int `json:"doc_id"`

	// URI of the document.
	URI string `json:"uri"`

	// Content of the document.
	Content string `json:"content"`

	// Tokens of the document.
	Tokens []*Token `json:"tokens"`

	// The terms of the document, keyed by term value.
	Terms map[string]*DocTerm `json:"terms"`
}

Document represents a document in a corpus: a collection of tokens and the terms derived from them.
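
A sketch of walking a processed document's terms and their weights:

for value, term := range doc.Terms {
	fmt.Printf("%-20s tf=%.4f tfidf=%.4f occurrences=%d\n",
		value, term.TF, term.TFIDF, len(term.Tokens))
}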

type Lemmatizer

type Lemmatizer struct {
	Language string
}

func (*Lemmatizer) Analyse

func (l *Lemmatizer) Analyse(corpus *Corpus) (*Corpus, error)

type Stemmer

type Stemmer struct {
	Language string
}

func (*Stemmer) Analyse

func (s *Stemmer) Analyse(corpus *Corpus) (*Corpus, error)
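
Stemmer and Lemmatizer both reduce tokens to root terms for the configured Language; in general, stemming strips suffixes heuristically, while lemmatization maps words to dictionary lemmas, so the two can produce different roots for the same token.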

type TFIDF

type TFIDF struct {
}

func (*TFIDF) Analyse

func (t *TFIDF) Analyse(corpus *Corpus) (*Corpus, error)
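
As the field comments on DocTerm and Term note, this analyser populates TF and TFIDF on each document term and DF and IDF on each corpus term.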

type Term

type Term struct {
	// The root value of the term.
	Value string `json:"v"`

	// The tokens that the term is derived from.
	Tokens []*Token `json:"ts"`

	// The IDs of the documents that the term appears in.
	Documents []int `json:"ds"`

	// The DF of the term (calculated by the tfidf analyser).
	DF float64 `json:"df"`

	// The IDF of the term (calculated by the tfidf analyser).
	IDF float64 `json:"idf"`
}
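
Term represents a unique term across the corpus. It keeps track of the tokens and documents it is derived from.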

type Token

type Token struct {
	// The index of the word in the document.
	Index int `json:"i"`

	// The original form of the word in the document,
	// before any preprocessing.
	Value string `json:"v"`

	// The ID of the document that the word belongs to.
	DocId int `json:"d"`

	// The position of the word in the document.
	Position TokenPosition `json:"p"`
}

Token represents a word in a document.

type TokenPosition

type TokenPosition struct {
	Start int `json:"s"`
	End   int `json:"e"`
}
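
TokenPosition records where a token occurs in its document. A sketch of recovering a token's surface form, assuming Start and End are byte offsets into Document.Content:

tok := doc.Tokens[0]
surface := doc.Content[tok.Position.Start:tok.Position.End]
fmt.Println(surface) // should match tok.Value if the offsets are byte-based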

type Tokenizer

type Tokenizer struct {
}

func (*Tokenizer) Analyse

func (t *Tokenizer) Analyse(corpus *Corpus) (*Corpus, error)
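
Tokenizer splits each document's Content into Tokens, presumably using the Separators and Puntuations constants defined above.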
