nogosari

package module
v0.0.3
Published: Sep 23, 2023 | License: MIT | Imports: 3 | Imported by: 0

README

Nogosari - NLP for Bahasa

An NLP package for Bahasa, based on go-sastrawi (https://github.com/RadhiFadlillah/go-sastrawi), with several modifications to make it procedural and additional features for advanced functionality. Note that symbol removal is dropped by default. Instead, the package only cleans up minor break and period symbols: a comma followed by a space, a semicolon followed by a space, a period followed by a space, and a period at the end of a sentence. A full tokenize function that includes symbol removal is still available.
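The light cleanup described above can be sketched roughly as follows. This is a minimal illustration and not the package's actual implementation: the function name lightTokenize is hypothetical, and splitting on whitespace after the cleanup is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// lightTokenize is a hypothetical sketch of the default cleanup:
// it drops a comma, semicolon, or period only when followed by a space,
// drops a single trailing period, and then splits on whitespace.
func lightTokenize(s string) []string {
	r := strings.NewReplacer(", ", " ", "; ", " ", ". ", " ")
	s = r.Replace(s)
	s = strings.TrimSuffix(strings.TrimSpace(s), ".")
	return strings.Fields(s)
}

func main() {
	fmt.Println(lightTokenize("Saya makan, lalu tidur. Dia pergi."))
	fmt.Println(lightTokenize("rilis v0.0.3, hari ini."))
}
```

Because only a period followed by a space (or a final period) is removed, periods inside a token such as v0.0.3 are left untouched in this sketch.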

Basic Concept

There are two things we need to understand:

  • A dictionary is a list of indexed words used for reference.
  • A word can have more than one function (also called its part or position) depending on the structure and context of a sentence or phrase. It is therefore convenient to store this information in an unsigned integer used as a bit field, where each bit marks whether the word has a given word-function (true/false).

Given these conditions, we can build a dictionary using a map, with the word (string) as the key and its word-functions (uint16) as the value.

There are 10 word-functions, which can be represented by a 10-bit binary value. In order, they are: Noun, Pronoun, Verb, Adj, Adverb, Conjunction, Preposition, Interjection, Numeric, Articula.

Note that word-functions are meant to be used for recognizing sentence or phrase structure. They are not used for basic tasks such as stemming.
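One way to encode the ten word-functions as bit flags in a uint16 is sketched below. The constant names, the bit positions (Noun as the lowest bit), and the helper hasFunction are assumptions for illustration; the package documents no exported constants, and the tagging of the sample words is illustrative only.

```go
package main

import "fmt"

// Word-function flags, one bit each, in the order of the list above.
// Names and bit positions are assumptions, not the package's API.
const (
	Noun uint16 = 1 << iota
	Pronoun
	Verb
	Adj
	Adverb
	Conjunction
	Preposition
	Interjection
	Numeric
	Articula
)

// hasFunction reports whether flags has every bit of f set.
func hasFunction(flags, f uint16) bool {
	return flags&f == f
}

func main() {
	// Hypothetical dictionary: word as key, word-functions as value.
	dict := map[string]uint16{
		"makan": Verb,
		"baru":  Adj | Adverb, // a word with two functions
	}
	fmt.Println(hasFunction(dict["baru"], Adverb))
	fmt.Println(hasFunction(dict["makan"], Noun))
	fmt.Printf("%010b\n", dict["baru"])
}
```

Combining flags with a bitwise OR keeps each dictionary entry to a single integer, and a bitwise AND is enough to test for any function.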

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FullIndex added in v0.0.3

func FullIndex(s string) map[string]uint16

Full version of index which includes symbol removal

func FullTokenize added in v0.0.3

func FullTokenize(s string) []string

Full version of tokenize which includes symbol removal

func Index added in v0.0.3

func Index(s string) map[string]uint16

Similar to Tokenize, except that it excludes symbols and returns a map[string]uint16 for indexing purposes. The `uint16` value indicates the word count.
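The shape of such a word-count index can be illustrated with a small stand-in. The function countWords below is hypothetical and is not the package's implementation; in particular, splitting on whitespace only and leaving case unchanged are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// countWords is a hypothetical stand-in that splits s on whitespace
// and counts how many times each token occurs.
func countWords(s string) map[string]uint16 {
	counts := make(map[string]uint16)
	for _, w := range strings.Fields(s) {
		counts[w]++
	}
	return counts
}

func main() {
	idx := countWords("saya makan nasi dan saya minum teh")
	fmt.Println(idx["saya"], idx["nasi"], len(idx))
}
```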

func Stem added in v0.0.3

func Stem(word string, words map[string]any) string

Stem converts a word back to its base form based on the given dictionary. The words map uses any as its value type to cover the results of both Tokenize and Index.
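Go does not implicitly convert a []string or a map[string]uint16 to a map[string]any, so a caller would likely need an explicit conversion before passing either result as the words argument. The helpers below are a sketch of that step; the names fromSlice and fromIndex are hypothetical, and the documentation does not say whether the package provides its own equivalent.

```go
package main

import "fmt"

// fromSlice builds a map[string]any from a token slice.
// Duplicate tokens collapse into a single key.
func fromSlice(tokens []string) map[string]any {
	m := make(map[string]any, len(tokens))
	for _, t := range tokens {
		m[t] = struct{}{}
	}
	return m
}

// fromIndex copies a map[string]uint16 into a map[string]any,
// keeping each uint16 value.
func fromIndex(idx map[string]uint16) map[string]any {
	m := make(map[string]any, len(idx))
	for k, v := range idx {
		m[k] = v
	}
	return m
}

func main() {
	words := fromIndex(map[string]uint16{"makan": 3, "minum": 1})
	// words now has the map[string]any type that Stem's second parameter expects.
	fmt.Println(len(words), words["makan"])
	fmt.Println(len(fromSlice([]string{"makan", "minum", "makan"})))
}
```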

func Tokenize added in v0.0.3

func Tokenize(s string) []string

Tokenize removes symbols and URLs from s, then splits it into words.

Types

This section is empty.
