Documentation ¶
Overview ¶
CWSharp is a text segmentation package for Chinese.
Constants ¶
const (
	PUNCT  = iota // .,| []
	NUMBER        // 12345 12.34
	ALPHA         // [a-z]
	WORD          // abc 中文 ABC123 wi-fi
)
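When debugging a tokenizer it can help to print a token's kind by name rather than by number. The following is a minimal sketch, not part of CWSharp: `typeName` is a hypothetical helper, `Type` is a local stand-in for the type referenced by `Token.Type`, and declaring the constants with that type is an assumption.

```go
package main

import "fmt"

// Type is a local stand-in for the package's token type (assumption).
type Type int

// The constants mirror the order of the const block in this document.
const (
	PUNCT Type = iota
	NUMBER
	ALPHA
	WORD
)

// typeName is a hypothetical helper that maps a token type to a
// readable name; unknown values fall back to a numeric form.
func typeName(t Type) string {
	switch t {
	case PUNCT:
		return "PUNCT"
	case NUMBER:
		return "NUMBER"
	case ALPHA:
		return "ALPHA"
	case WORD:
		return "WORD"
	default:
		return fmt.Sprintf("Type(%d)", int(t))
	}
}

func main() {
	for _, t := range []Type{PUNCT, NUMBER, ALPHA, WORD} {
		fmt.Printf("%d %s\n", int(t), typeName(t))
	}
}
```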
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Iterator ¶
type Iterator interface {
	Next() *Token
}
Iterator iterates over a sequence of tokens.
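A typical way to consume an Iterator is to call Next in a loop until it reports the end of the stream. The sketch below assumes that Next returns nil once the tokens are exhausted, which this page does not state explicitly. The types are local stand-ins that mirror the declarations in this document, and `sliceIterator` and `collect` are hypothetical helpers.

```go
package main

import "fmt"

// Local stand-ins mirroring the declarations in this document.
type Type int

type Token struct {
	Text string
	Type Type
	Pos  int
}

type Iterator interface {
	Next() *Token
}

// sliceIterator is a hypothetical in-memory Iterator used only here.
type sliceIterator struct {
	tokens []Token
	i      int
}

// Next returns the next token, or nil when none remain
// (the nil end-of-stream signal is an assumption).
func (s *sliceIterator) Next() *Token {
	if s.i >= len(s.tokens) {
		return nil
	}
	t := &s.tokens[s.i]
	s.i++
	return t
}

// collect drains an Iterator and returns the token texts in order.
func collect(it Iterator) []string {
	var out []string
	for t := it.Next(); t != nil; t = it.Next() {
		out = append(out, t.Text)
	}
	return out
}

func main() {
	it := &sliceIterator{tokens: []Token{
		{Text: "中文", Pos: 0},
		{Text: "wi-fi", Pos: 7},
	}}
	fmt.Println(collect(it))
}
```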
func BigramTokenize ¶
BigramTokenize tokenizes the specified text reader using the N-gram (bigram) tokenization algorithm.
func WhitespaceTokenize ¶
WhitespaceTokenize tokenizes the specified text reader using the whitespace tokenization algorithm.
type IteratorFunc ¶
type IteratorFunc func() *Token
func (IteratorFunc) Next ¶
func (f IteratorFunc) Next() *Token
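IteratorFunc lets an ordinary closure act as an Iterator, in the same spirit as the standard library's `http.HandlerFunc`. The sketch below assumes that the Next method simply calls the function and that nil marks the end of the stream; only the signature appears on this page. `fromWords` is a hypothetical helper, and using the word index as Pos is an arbitrary choice for illustration.

```go
package main

import "fmt"

// Local stand-ins mirroring the declarations in this document.
type Type int

type Token struct {
	Text string
	Type Type
	Pos  int
}

type IteratorFunc func() *Token

// Next calls f itself (assumed behavior of the adapter).
func (f IteratorFunc) Next() *Token { return f() }

// fromWords is a hypothetical helper that yields one token per word
// and then nil. Pos holds the word index, chosen only for this sketch.
func fromWords(words ...string) IteratorFunc {
	i := 0
	return func() *Token {
		if i >= len(words) {
			return nil
		}
		t := &Token{Text: words[i], Pos: i}
		i++
		return t
	}
}

func main() {
	it := fromWords("abc", "ABC123")
	for t := it.Next(); t != nil; t = it.Next() {
		fmt.Println(t.Pos, t.Text)
	}
}
```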
type Token ¶
type Token struct {
	// A token text.
	Text string
	// A token type.
	Type Type
	// An arbitrary source position location.
	Pos int
}
Token represents a piece of text together with its type.
type Tokenizer ¶
type Tokenizer interface {
	// Tokenize reads a text stream and divides it into a
	// sequence of tokens.
	Tokenize(io.Reader) Iterator
}
Tokenizer is an interface that divides text into a sequence of tokens.
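Any type with a matching Tokenize method satisfies this interface. As an illustration, here is a minimal sketch of a custom implementation that splits on whitespace with `bufio.Scanner`. It is not the package's own WhitespaceTokenize: `wordTokenizer` and `texts` are hypothetical names, the types are local stand-ins, Type is left at its zero value, and Pos holds the token's ordinal rather than a byte offset.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Local stand-ins mirroring the declarations in this document.
type Type int

type Token struct {
	Text string
	Type Type
	Pos  int
}

type Iterator interface {
	Next() *Token
}

type IteratorFunc func() *Token

func (f IteratorFunc) Next() *Token { return f() }

type Tokenizer interface {
	Tokenize(io.Reader) Iterator
}

// wordTokenizer is a hypothetical Tokenizer that splits on whitespace.
type wordTokenizer struct{}

// Tokenize scans r lazily, one word per call to Next, and returns nil
// once the input is exhausted (assumed end-of-stream signal).
func (wordTokenizer) Tokenize(r io.Reader) Iterator {
	sc := bufio.NewScanner(r)
	sc.Split(bufio.ScanWords)
	n := 0
	return IteratorFunc(func() *Token {
		if !sc.Scan() {
			return nil
		}
		t := &Token{Text: sc.Text(), Pos: n}
		n++
		return t
	})
}

// texts runs a Tokenizer over s and gathers the token texts.
func texts(tk Tokenizer, s string) []string {
	var out []string
	it := tk.Tokenize(strings.NewReader(s))
	for t := it.Next(); t != nil; t = it.Next() {
		out = append(out, t.Text)
	}
	return out
}

func main() {
	fmt.Println(texts(wordTokenizer{}, "hello 中文 wi-fi"))
}
```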
type TokenizerFunc ¶
TokenizerFunc is an adapter that wraps a tokenize function so that it can be used as a Tokenizer.
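This page does not show the declaration of TokenizerFunc. By analogy with IteratorFunc, the sketch below assumes it is a function type `func(io.Reader) Iterator` with a Tokenize method that calls the function. The types are local stand-ins, and `fields` and `run` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Local stand-ins mirroring the declarations in this document.
type Token struct {
	Text string
	Pos  int
}

type Iterator interface {
	Next() *Token
}

type IteratorFunc func() *Token

func (f IteratorFunc) Next() *Token { return f() }

type Tokenizer interface {
	Tokenize(io.Reader) Iterator
}

// TokenizerFunc as assumed here; the real declaration may differ.
type TokenizerFunc func(io.Reader) Iterator

// Tokenize calls f(r), which makes a plain function a Tokenizer.
func (f TokenizerFunc) Tokenize(r io.Reader) Iterator { return f(r) }

// fields is a hypothetical tokenize function: it reads all input and
// yields whitespace-separated words, then nil. A read error produces
// an empty iterator in this sketch.
func fields(r io.Reader) Iterator {
	data, err := io.ReadAll(r)
	if err != nil {
		return IteratorFunc(func() *Token { return nil })
	}
	words := strings.Fields(string(data))
	i := 0
	return IteratorFunc(func() *Token {
		if i >= len(words) {
			return nil
		}
		t := &Token{Text: words[i], Pos: i}
		i++
		return t
	})
}

// run tokenizes s with tk and gathers the token texts.
func run(tk Tokenizer, s string) []string {
	var out []string
	it := tk.Tokenize(strings.NewReader(s))
	for t := it.Next(); t != nil; t = it.Next() {
		out = append(out, t.Text)
	}
	return out
}

func main() {
	var tk Tokenizer = TokenizerFunc(fields)
	fmt.Println(run(tk, "abc 12345"))
}
```

The conversion `TokenizerFunc(fields)` is the whole point of the adapter: code that accepts a Tokenizer can take a function without a dedicated struct type.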