gmt

v0.0.0-...-0b198b6 Latest
Published: Mar 24, 2020 License: Apache-2.0 Imports: 4 Imported by: 0

README

GMT

Go port of the Moses tokenizer and normalizer

See the following repositories for the original code:

  1. Sacremoses
  2. mosesdecoder

Features & Limitations

Currently the port covers only the tokenizer and normalizer, for English and other non-Chinese languages. The original sacremoses also provides a detokenizer and a truecaser; these are not yet implemented here.

Install

go get github.com/akurniawan/GMT

Usage

Tokenizer

// Tokenize returns a string and a []string; only the string is compared here.
tokenizer := NewTokenizer("en")
text := "This, weird\xbb symbols\u2026 appearing everywhere\xbf"
expected := "This , weird \xbb symbols \u2026 appearing everywhere \xbf"
tokenized, _ := tokenizer.Tokenize(text, false, true)
println(tokenized == expected)

Normalizer

normalizer := NewNormalizer("en", true, true, true, false, false)
text := "12\u00A0123"
expected := "12.123"
normalized := normalizer.Normalize(text)
println(normalized == expected)

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func IsAnyAlphabet

func IsAnyAlphabet(text string) bool

IsAnyAlphabet reports whether text contains at least one alphabetic character

func IsInArray

func IsInArray(text string, arr []string) bool

IsInArray reports whether text is present in arr

func IsLower

func IsLower(text string) bool

IsLower reports whether all characters in text are lowercase

func IsNumber

func IsNumber(text string) bool

IsNumber reports whether all characters in text are numeric

func NonBreakingPrefixesLoader

func NonBreakingPrefixesLoader(lang string) (result []string)

NonBreakingPrefixesLoader is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm

func PerlPropsLoader

func PerlPropsLoader(ext string) string

PerlPropsLoader is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm

func RemoveEmptyStringFromSlice

func RemoveEmptyStringFromSlice(texts []string) (result []string)

RemoveEmptyStringFromSlice removes every empty string from texts and returns the remaining elements

Types

type Normalizer

type Normalizer struct {
	// contains filtered or unexported fields
}

Normalizer is a Go port of the Moses punctuation normalizer from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl. The design is mostly copied from the Python version at https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py

func NewNormalizer

func NewNormalizer(lang string, penn bool, normQuoteCommas bool, normNumbers bool, preReplaceUniPunct bool, postRemoveCtrlChars bool) *Normalizer

NewNormalizer creates a new Normalizer instance. The boolean parameters allow specific normalization rules to be disabled, such as quote normalization, number normalization, and Unicode punctuation normalization

func (Normalizer) Normalize

func (n Normalizer) Normalize(text string) (normalizedText string)

Normalize normalizes the incoming text according to predefined rules

type Replacement

type Replacement struct {
	// contains filtered or unexported fields
}

Replacement is a tuple consisting of a regex and its substitution

func Flatten

func Flatten(r [][]Replacement) []Replacement

Flatten reduces a two-dimensional slice of Replacement values to a one-dimensional slice
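Flattening is useful when rules are defined in groups but applied as one ordered list. Because Replacement has unexported fields, the sketch below uses a hypothetical pair type in its place; flattenPairs and the sample patterns are likewise illustrative assumptions, not the package's code.

```go
package main

import "fmt"

// pair is a hypothetical stand-in for Replacement:
// a regex pattern and its substitution.
type pair struct {
	pattern string
	sub     string
}

// flattenPairs concatenates the groups in order into a single slice.
func flattenPairs(groups [][]pair) []pair {
	var flat []pair
	for _, g := range groups {
		flat = append(flat, g...)
	}
	return flat
}

func main() {
	groups := [][]pair{
		{{`\s+`, " "}},
		{{`\(`, " ("}, {`\)`, ") "}},
	}
	flat := flattenPairs(groups)
	fmt.Println(len(flat))       // 3
	fmt.Println(flat[1].pattern) // \(
}
```

Order is preserved, which matters when later rules depend on the output of earlier ones.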

func NewReplacement

func NewReplacement(rgx string, sub string) (replacement Replacement)

NewReplacement creates a Replacement object
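To illustrate how a regex and substitution pair might be applied, here is a minimal sketch built on the standard regexp package. The rule type, newRule, and applyRules are hypothetical names, and the package's internals may differ. The first rule reproduces the README's Normalizer example for "en" (a no-break space between digits becomes a period); the second rule is purely illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical stand-in for Replacement, holding a compiled
// regex and its substitution template.
type rule struct {
	re  *regexp.Regexp
	sub string
}

// newRule compiles pattern and pairs it with sub.
// It panics if the pattern is invalid.
func newRule(pattern, sub string) rule {
	return rule{re: regexp.MustCompile(pattern), sub: sub}
}

// applyRules runs each rule over text in order.
func applyRules(text string, rules []rule) string {
	for _, r := range rules {
		text = r.re.ReplaceAllString(text, r.sub)
	}
	return text
}

func main() {
	rules := []rule{
		// Digit, U+00A0 (no-break space), digit -> digit, ".", digit.
		newRule(`(\d)\x{00A0}(\d)`, "${1}.${2}"),
		// Illustrative only: collapse runs of ASCII spaces.
		newRule(` {2,}`, " "),
	}
	fmt.Println(applyRules("12\u00A0123  items", rules)) // 12.123 items
}
```

Go's regexp templates use `${1}` rather than Perl's `$1` or Python's `\g<1>`, so patterns ported from the original scripts need their substitutions rewritten.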

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer tokenizes text. It is a Go port of the Moses tokenizer from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl. The design is mostly copied from the Python version at https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py

func NewTokenizer

func NewTokenizer(lang string) (tokenizer *Tokenizer)

NewTokenizer creates a new Tokenizer instance for the given language

func (Tokenizer) Tokenize

func (t Tokenizer) Tokenize(text string, aggresiveDashSplits bool, escapeXML bool) (string, []string)

Tokenize tokenizes the incoming string according to the tokenizer's language. Two options are available: more aggressive dash splitting, which turns "foo-bar" into "foo @-@ bar", and escaping of XML tags
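The two options can be sketched in isolation as follows. This is an illustration under stated assumptions, not the package's code: splitDashes and escapeXML are hypothetical names, "alphanumeric" is approximated with unicode.IsLetter and unicode.IsDigit, and the entity list is an assumed subset, since the documentation does not list which characters are escaped.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isAlnum approximates "alphanumeric" for this sketch.
func isAlnum(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r)
}

// splitDashes replaces a hyphen that sits directly between two
// alphanumeric runes with " @-@ ". Other hyphens are left untouched.
func splitDashes(text string) string {
	runes := []rune(text)
	var b strings.Builder
	for i, r := range runes {
		if r == '-' && i > 0 && i < len(runes)-1 &&
			isAlnum(runes[i-1]) && isAlnum(runes[i+1]) {
			b.WriteString(" @-@ ")
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

// xmlEscaper covers an assumed subset of entities. strings.Replacer
// works in a single pass, so the "&" inside a produced entity is not
// escaped a second time.
var xmlEscaper = strings.NewReplacer(
	"&", "&amp;",
	"<", "&lt;",
	">", "&gt;",
)

// escapeXML escapes the characters covered by xmlEscaper.
func escapeXML(text string) string {
	return xmlEscaper.Replace(text)
}

func main() {
	fmt.Println(splitDashes("foo-bar"))   // foo @-@ bar
	fmt.Println(splitDashes("a-b-c"))     // a @-@ b @-@ c
	fmt.Println(escapeXML("a < b & c"))   // a &lt; b &amp; c
}
```

Go's regexp package has no lookahead, which is why the dash split is written here as a rune scan rather than a direct port of a lookahead-based pattern.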
