nogosari

package module

v0.0.3 (latest) | Published: Sep 23, 2023 | License: MIT | Imports: 3 | Imported by: 0

Repository

github.com/karincake/nogosari

README ¶

Nogosari - NLP for Bahasa

An NLP package for Bahasa, based on go-sastrawi (https://github.com/RadhiFadlillah/go-sastrawi), with several modifications to make it procedural and additional features for advanced functionality. Note that symbol removal is dropped by default. Instead, the default tokenizer only cleans up minor break and period symbols: a comma followed by a space, a semicolon followed by a space, a period followed by a space, and a period at the end of a sentence. A full tokenize function that includes symbol removal is also available.

Basic Concept

There are two things we need to understand:

A dictionary is a list of indexed words used for reference.
A word can have more than one function (also called its part or position), depending on the structure and context of the sentence or phrase. It is therefore convenient to store this information in an unsigned integer used as a bit field, where each bit marks whether the word has a given word-function (true/false).

Given these two points, a dictionary can be built as a map, with the word (string) as the key and its word-functions (uint16) as the value.

There are 10 word-functions, which can be represented by a 10-bit binary value. In order, they are: Noun, Pronoun, Verb, Adj, Adverb, Conjunction, Preposition, Interjection, Numeric, Articula.

Note that word-functions are meant for sentence or phrase structure recognition. Basic tasks such as stemming do not use them.

Documentation ¶

Index ¶

func FullIndex(s string) map[string]uint16
func FullTokenize(s string) []string
func Index(s string) map[string]uint16
func Stem(word string, words map[string]any) string
func Tokenize(s string) []string

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func FullIndex ¶ added in v0.0.3

func FullIndex(s string) map[string]uint16

FullIndex is the full version of Index, which includes symbol removal.

func FullTokenize ¶ added in v0.0.3

func FullTokenize(s string) []string

FullTokenize is the full version of Tokenize, which includes symbol removal.

func Index ¶ added in v0.0.3

func Index(s string) map[string]uint16

Index is similar to Tokenize, except that it excludes symbols and returns a `map[string]uint16` for indexing purposes. The `uint16` value indicates the word count.

func Stem ¶ added in v0.0.3

func Stem(word string, words map[string]any) string

Stem converts a word back to its base form based on the given dictionary. The words parameter uses any as its value type to cover the result of both Tokenize and Index.

func Tokenize ¶ added in v0.0.3

func Tokenize(s string) []string

Tokenize removes symbols and URLs from s, then splits it into words.

Types ¶

This section is empty.

Directories ¶

Path	Synopsis
dictionary
