gmt

v0.0.0-...-0b198b6 Published: Mar 24, 2020 License: Apache-2.0 Imports: 4 Imported by: 0

Repository

github.com/akurniawan/GMT

README ¶

GMT

Golang port of Moses tokenizer and normalizer

For the original code, refer to the Moses scripts (https://github.com/moses-smt/mosesdecoder) and their Python port, sacremoses (https://github.com/alvations/sacremoses).

Features & Limitations

Currently, only the tokenizer and normalizer are ported, covering English and other non-Chinese languages. The original sacremoses also provides a detokenizer and a truecaser, which are not yet implemented here.

Install

go get github.com/akurniawan/GMT

Usage

Tokenizer

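tokenizer := NewTokenizer("en")
// \u00bb and \u00bf replace \xbb and \xbf, which in a Go string are raw bytes, not code points
text := "This, weird\u00bb symbols\u2026 appearing everywhere\u00bf"
expected := "This , weird \u00bb symbols \u2026 appearing everywhere \u00bf"
// Tokenize returns the tokenized string and the slice of tokens
tokenized, _ := tokenizer.Tokenize(text, false, true)
println(tokenized == expected)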

Normalizer

normalizer := NewNormalizer("en", true, true, true, false, false)
text := "12\u00A0123"
expected := "12.123"
// Normalize takes only the text; the rules are configured in NewNormalizer
normalized := normalizer.Normalize(text)
println(normalized == expected)

Documentation ¶

Index ¶

func IsAnyAlphabet(text string) bool
func IsInArray(text string, arr []string) bool
func IsLower(text string) bool
func IsNumber(text string) bool
func NonBreakingPrefixesLoader(lang string) (result []string)
func PerlPropsLoader(ext string) string
func RemoveEmptyStringFromSlice(texts []string) (result []string)
type Normalizer
- func NewNormalizer(lang string, penn bool, normQuoteCommas bool, normNumbers bool, ...) *Normalizer
- func (n Normalizer) Normalize(text string) (normalizedText string)
type Replacement
- func Flatten(r [][]Replacement) []Replacement
- func NewReplacement(rgx string, sub string) (replacement Replacement)
type Tokenizer
- func NewTokenizer(lang string) (tokenizer *Tokenizer)
- func (t Tokenizer) Tokenize(text string, aggresiveDashSplits bool, escapeXML bool) (string, []string)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func IsAnyAlphabet ¶

func IsAnyAlphabet(text string) bool

IsAnyAlphabet checks whether the string contains at least one alphabetic character
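
A check of this kind can be approximated with the standard `unicode` package. The following is a minimal sketch, not the package's actual implementation: the helper name `hasLetter` is hypothetical, and treating "alphabet character" as "any Unicode letter" is an assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasLetter is a hypothetical stand-in for IsAnyAlphabet: it reports whether
// text contains at least one Unicode letter (an assumed definition).
func hasLetter(text string) bool {
	for _, r := range text {
		if unicode.IsLetter(r) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasLetter("123a")) // true
	fmt.Println(hasLetter("123"))  // false
}
```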

func IsInArray ¶

func IsInArray(text string, arr []string) bool

IsInArray checks whether text is present in arr
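
As an illustration, a membership test like this is usually a linear scan. The sketch below uses a hypothetical helper, `containsString`, and assumes exact, case-sensitive matching; the package may behave differently.

```go
package main

import "fmt"

// containsString is a hypothetical stand-in for IsInArray: it reports whether
// text equals any element of arr (exact match is an assumption).
func containsString(text string, arr []string) bool {
	for _, s := range arr {
		if s == text {
			return true
		}
	}
	return false
}

func main() {
	prefixes := []string{"Mr", "Dr", "Prof"}
	fmt.Println(containsString("Dr", prefixes)) // true
	fmt.Println(containsString("dr", prefixes)) // false: the comparison is case-sensitive
}
```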

func IsLower ¶

func IsLower(text string) bool

IsLower checks whether all characters in a string are lowercase
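
One way to express an "all lowercase" check with the standard library is sketched below. The helper `allLower` is hypothetical, and its treatment of edge cases (digits, punctuation, and the empty string all yield false) is an assumption rather than documented behavior of IsLower.

```go
package main

import (
	"fmt"
	"unicode"
)

// allLower is a hypothetical stand-in for IsLower. Assumption: any rune that
// is not a lowercase letter, and the empty string, make the result false.
func allLower(text string) bool {
	if text == "" {
		return false
	}
	for _, r := range text {
		if !unicode.IsLower(r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allLower("moses")) // true
	fmt.Println(allLower("Moses")) // false
}
```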

func IsNumber ¶

func IsNumber(text string) bool

IsNumber checks whether all characters in a string are numeric
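
A comparable digits-only check can be sketched as follows. The helper `allDigits` is hypothetical; it assumes that "numbers" means Unicode decimal digits, so separators such as "." or "," make the result false.

```go
package main

import (
	"fmt"
	"unicode"
)

// allDigits is a hypothetical stand-in for IsNumber. Assumption: only decimal
// digits count, and the empty string yields false.
func allDigits(text string) bool {
	if text == "" {
		return false
	}
	for _, r := range text {
		if !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allDigits("2020")) // true
	fmt.Println(allDigits("12.5")) // false: the dot is not a digit
}
```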

func NonBreakingPrefixesLoader ¶

func NonBreakingPrefixesLoader(lang string) (result []string)

NonBreakingPrefixesLoader loads the list of non-breaking prefixes for the given language

func PerlPropsLoader ¶

func PerlPropsLoader(ext string) string

PerlPropsLoader is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm

func RemoveEmptyStringFromSlice ¶

func RemoveEmptyStringFromSlice(texts []string) (result []string)

RemoveEmptyStringFromSlice removes all empty strings from a slice
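
Filtering of this kind is typically needed after splitting on a single space, which leaves empty strings wherever spaces repeat. The sketch below uses a hypothetical helper, `removeEmpty`, and is not the package's source.

```go
package main

import (
	"fmt"
	"strings"
)

// removeEmpty is a hypothetical stand-in for RemoveEmptyStringFromSlice: it
// returns a new slice holding only the non-empty elements, in order.
func removeEmpty(texts []string) []string {
	result := make([]string, 0, len(texts))
	for _, s := range texts {
		if s != "" {
			result = append(result, s)
		}
	}
	return result
}

func main() {
	// Splitting on one space yields empty strings between repeated spaces.
	tokens := strings.Split("This  is   spaced", " ")
	fmt.Println(len(tokens))              // 6
	fmt.Println(len(removeEmpty(tokens))) // 3
}
```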

Types ¶

type Normalizer ¶

type Normalizer struct {
	// contains filtered or unexported fields
}

Normalizer is a Go port of the Moses punctuation normalizer from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl. The design is mostly copied from the Python version at https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py

func NewNormalizer ¶

func NewNormalizer(lang string, penn bool, normQuoteCommas bool, normNumbers bool, preReplaceUniPunct bool, postRemoveCtrlChars bool) *Normalizer

NewNormalizer creates a new Normalizer instance. The boolean parameters enable or disable specific normalization rules, such as quote normalization, number normalization, and Unicode punctuation normalization

func (Normalizer) Normalize ¶

func (n Normalizer) Normalize(text string) (normalizedText string)

Normalize the incoming text according to pre-defined rules
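
To make the idea of rule-based normalization concrete, here is a minimal sketch of an ordered list of regex substitutions in which one rule group is switched on by a flag. It is not the package's implementation: the names `rule`, `buildRules`, and `normalize` are hypothetical, and the three rules shown are only a small, assumed subset.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a compiled pattern with its substitution (hypothetical type).
type rule struct {
	rgx *regexp.Regexp
	sub string
}

// buildRules assembles the rule list; the number rule is added only when
// normNumbers is true, mirroring the idea of flags that toggle rule groups.
func buildRules(normNumbers bool) []rule {
	rules := []rule{
		{regexp.MustCompile(`\r`), ""},  // drop carriage returns
		{regexp.MustCompile(` +`), " "}, // collapse runs of spaces
	}
	if normNumbers {
		// A no-break space between two digits becomes a dot.
		rules = append(rules, rule{regexp.MustCompile(`(\d)\x{00A0}(\d)`), "${1}.${2}"})
	}
	return rules
}

// normalize applies every rule in order to the text.
func normalize(text string, rules []rule) string {
	for _, r := range rules {
		text = r.rgx.ReplaceAllString(text, r.sub)
	}
	return text
}

func main() {
	fmt.Println(normalize("12\u00A0123", buildRules(true))) // 12.123
}
```

Because the rules run in sequence, their order matters: a later rule sees the output of every earlier one.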

type Replacement ¶

type Replacement struct {
	// contains filtered or unexported fields
}

Replacement is a pair consisting of a regex and its substitution

func Flatten ¶

func Flatten(r [][]Replacement) []Replacement

Flatten reduces a two-dimensional slice of Replacement values to a one-dimensional slice
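
Flattening is useful when rules are defined in groups but applied as a single ordered list. The sketch below assumes that groups are concatenated in order; the `rule` type and `flatten` function are hypothetical stand-ins, since the fields of Replacement are unexported.

```go
package main

import "fmt"

// rule is a hypothetical stand-in for Replacement.
type rule struct {
	pattern string
	sub     string
}

// flatten concatenates the groups in order into a single slice.
func flatten(groups [][]rule) []rule {
	var flat []rule
	for _, g := range groups {
		flat = append(flat, g...)
	}
	return flat
}

func main() {
	quotes := []rule{{"„", `"`}, {"“", `"`}}
	spaces := []rule{{` +`, " "}}
	flat := flatten([][]rule{quotes, spaces})
	fmt.Println(len(flat)) // 3
}
```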

func NewReplacement ¶

func NewReplacement(rgx string, sub string) (replacement Replacement)

NewReplacement creates a Replacement object
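
A regex and substitution pair of this kind can be modeled with the standard `regexp` package, as in the sketch below. The type `replacement`, its fields, and the `apply` method are hypothetical; whether the package compiles the pattern at construction time is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// replacement is a hypothetical stand-in for Replacement.
type replacement struct {
	rgx *regexp.Regexp
	sub string
}

// newReplacement compiles the pattern once so it can be reused.
// regexp.MustCompile panics on an invalid pattern.
func newReplacement(rgx string, sub string) replacement {
	return replacement{rgx: regexp.MustCompile(rgx), sub: sub}
}

// apply substitutes every match of the pattern in text.
func (r replacement) apply(text string) string {
	return r.rgx.ReplaceAllString(text, r.sub)
}

func main() {
	r := newReplacement(` {2,}`, " ")
	fmt.Println(r.apply("too   many  spaces")) // too many spaces
}
```

Note that Go's `regexp` refers to capture groups in the substitution as `${1}`, not `\1` as in Perl or Python, so patterns ported from Moses or sacremoses need their substitutions rewritten.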

type Tokenizer ¶

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer tokenizes text. It is a Go port of the Moses tokenizer from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl. The design is mostly copied from the Python version at https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py

func NewTokenizer ¶

func NewTokenizer(lang string) (tokenizer *Tokenizer)

NewTokenizer creates new Tokenizer instance with predefined language

func (Tokenizer) Tokenize ¶

func (t Tokenizer) Tokenize(text string, aggresiveDashSplits bool, escapeXML bool) (string, []string)

Tokenize tokenizes the incoming string according to the predefined language option. Two options are available: more aggressive dash splitting (for example, "foo-bar" becomes "foo @-@ bar") and XML escaping
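
The two options can be illustrated in isolation. The sketch below is not the package's implementation: `splitDashes` and `escapeXML` are hypothetical helpers, the rule "a hyphen between two letters or digits" is an assumed reading of aggressive dash splitting, and the escape table is an assumed subset of what Moses escapes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isAlnum reports whether r is a Unicode letter or digit.
func isAlnum(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r)
}

// splitDashes rewrites a hyphen that sits between two alphanumeric runes
// as " @-@ " and leaves every other hyphen untouched.
func splitDashes(text string) string {
	runes := []rune(text)
	var b strings.Builder
	for i, r := range runes {
		if r == '-' && i > 0 && i < len(runes)-1 && isAlnum(runes[i-1]) && isAlnum(runes[i+1]) {
			b.WriteString(" @-@ ")
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

// xmlEscaper replaces each special character in a single pass, so the "&"
// inside an emitted entity is never escaped a second time.
var xmlEscaper = strings.NewReplacer(
	"&", "&amp;", "<", "&lt;", ">", "&gt;", "'", "&apos;", `"`, "&quot;",
)

// escapeXML escapes the characters listed in xmlEscaper.
func escapeXML(text string) string {
	return xmlEscaper.Replace(text)
}

func main() {
	fmt.Println(splitDashes("foo-bar")) // foo @-@ bar
	fmt.Println(escapeXML("a<b & c"))   // a&lt;b &amp; c
}
```

The dash splitter inspects neighboring runes directly because Go's `regexp` package has no lookahead, which the Perl and Python patterns rely on to handle chains such as "a-b-c".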

Directories ¶

Path	Synopsis
examples
