tokenizer

package module
v0.0.0-...-56c1056
Published: Nov 28, 2018 License: Apache-2.0 Imports: 8 Imported by: 6

README

Multilingual Tokenizer

Introduction

Package tokenizer is a Go library for multilingual tokenization. It is built on the blevesearch segment package, whose implementation follows Unicode Standard Annex #29 (Unicode Text Segmentation).

Usage

go get github.com/liuzl/tokenizer
package main

import (
    "fmt"

    "github.com/liuzl/tokenizer"
)

func main() {
    c := `Life is like a box of chocolates. You never know what you're gonna get.`
    var ret = tokenizer.Tokenize(c)
    for _, term := range ret {
        fmt.Println(term)
    }
}

Implementation Details

  1. Segment UTF-8 text as described in Unicode Standard Annex #29.
  2. Handle English contractions.
  3. Handle English possessives.
  4. Handle numbers with units.
  5. Convert full-width (SBC case) characters to half-width (DBC case).
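Step 5 normalizes full-width characters (common in CJK text) to their half-width ASCII equivalents. A minimal sketch of such a conversion, using the fixed 0xFEE0 offset between the full-width ASCII block and ASCII; this helper is illustrative and not the package's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// toHalfWidth maps full-width ASCII variants (U+FF01..U+FF5E) to their
// half-width equivalents and the ideographic space (U+3000) to a plain
// space. Illustrative only; the package's own conversion may differ.
func toHalfWidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == 0x3000: // ideographic space
			return ' '
		case r >= 0xFF01 && r <= 0xFF5E: // full-width ASCII block
			return r - 0xFEE0
		}
		return r
	}, s)
}

func main() {
	fmt.Println(toHalfWidth("Ｈｅｌｌｏ，　１２３")) // prints "Hello, 123"
}
```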

License

This package is licensed under the Apache License 2.0.

Documentation

Index

Constants

This section is empty.

Variables

var (
	NumberWithUnitRegex = regexp.MustCompile(`^(\d*\.?\d+|\d{1,3}(?:,\d{3})+)([a-zA-Z]{1,3})$`)
	TimeFixRegex        = regexp.MustCompile(`(?i)^(?:\d|[0-3]\d)T(?:\d|[0-2]\d)$`)
)
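The NumberWithUnitRegex pattern splits a token like "3.5kg" into a numeric part and a short alphabetic unit. The following self-contained sketch recompiles the same pattern locally to show what it captures:

```go
package main

import (
	"fmt"
	"regexp"
)

// The same pattern as the package's NumberWithUnitRegex, recompiled
// here so the example is self-contained.
var numberWithUnit = regexp.MustCompile(`^(\d*\.?\d+|\d{1,3}(?:,\d{3})+)([a-zA-Z]{1,3})$`)

func main() {
	for _, s := range []string{"3.5kg", "1,000km", "42", "kg"} {
		m := numberWithUnit.FindStringSubmatch(s)
		if m == nil {
			fmt.Printf("%q: no match\n", s) // bare numbers and bare units do not match
			continue
		}
		fmt.Printf("%q -> number %q, unit %q\n", s, m[1], m[2])
	}
}
```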
var Contractions map[string]Dict = make(map[string]Dict)
var EngContractions string = `` /* 3148-byte string literal not displayed */

https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions

Functions

func Tokenize

func Tokenize(text string) []string

Types

type Dict

type Dict map[string]*Items

type Items

type Items struct {
	Terms []string
	Norms [][]string
}

type Token

type Token struct {
	Text string
	Norm string
}

func TokenizePro

func TokenizePro(text string) []*Token

func (*Token) String

func (self *Token) String() string
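TokenizePro returns tokens that carry both the surface text and a normalized form. The documentation does not show String's body; the sketch below mirrors the Token type and uses an assumed "Text(Norm)" output format, which is purely illustrative:

```go
package main

import "fmt"

// Token mirrors the package's Token type.
type Token struct {
	Text string
	Norm string
}

// String is a plausible stringer for display purposes; the real
// format used by the package may differ.
func (t *Token) String() string {
	if t.Norm != "" && t.Norm != t.Text {
		return fmt.Sprintf("%s(%s)", t.Text, t.Norm)
	}
	return t.Text
}

func main() {
	// A token whose normalized form differs from its surface text.
	fmt.Println(&Token{Text: "gonna", Norm: "going to"})
	// A token that needs no normalization.
	fmt.Println(&Token{Text: "box", Norm: "box"})
}
```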
