
package cwsharp

import ""

CWSharp is a text segmentation package for Chinese.


Package Files

bigram.go mmseg.go readwrite.go token.go tokenize.go whitespace.go


const (
    PUNCT  = iota // .,| []
    NUMBER        // 12345 12.34
    ALPHA         // [a-z]
    WORD          // abc 中文 ABC123 wi-fi
)

type Iterator

type Iterator interface {
    Next() *Token
}

Token iterator.

func BigramTokenize

func BigramTokenize(r io.Reader) Iterator

BigramTokenize tokenizes the specified text reader using the bigram (N-gram) tokenization algorithm.
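
As a rough illustration of what bigram tokenization means, the sketch below emits every overlapping pair of adjacent runes in a string. It is not the package's implementation: the helper name `bigrams` is hypothetical, and the real tokenizer presumably treats punctuation, numbers, and Latin text separately.

```go
package main

import "fmt"

// bigrams returns every overlapping pair of adjacent runes in s.
// Hypothetical helper for illustration only; not part of cwsharp.
func bigrams(s string) []string {
	runes := []rune(s)
	switch len(runes) {
	case 0:
		return nil
	case 1:
		// A single rune cannot form a pair, so return it unchanged.
		return []string{s}
	}
	out := make([]string, 0, len(runes)-1)
	for i := 0; i+1 < len(runes); i++ {
		out = append(out, string(runes[i:i+2]))
	}
	return out
}

func main() {
	fmt.Println(bigrams("中文分词")) // [中文 文分 分词]
}
```

Working on runes rather than bytes matters here, because each Chinese character occupies several bytes in UTF-8.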

func WhitespaceTokenize

func WhitespaceTokenize(r io.Reader) Iterator

WhitespaceTokenize tokenizes the specified text reader using the whitespace tokenization algorithm.

type IteratorFunc

type IteratorFunc func() *Token

func (IteratorFunc) Next

func (f IteratorFunc) Next() *Token
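
IteratorFunc follows the familiar Go adapter pattern, in which a function type satisfies an interface by calling itself. The sketch below re-declares simplified local copies of the types so that it runs on its own. It rests on two assumptions that the documentation does not state: that Next simply invokes the function, and that a nil *Token marks the end of the stream. The helper `fromSlice` is hypothetical.

```go
package main

import "fmt"

// Simplified local copies of the documented declarations (Type omitted).
type Token struct {
	Text string
	Pos  int
}

type Iterator interface {
	Next() *Token
}

type IteratorFunc func() *Token

// Assumption: Next just calls the underlying function.
func (f IteratorFunc) Next() *Token { return f() }

// fromSlice wraps a closure over words as an Iterator.
// Assumption: a nil *Token signals that no tokens remain.
func fromSlice(words []string) Iterator {
	i := 0
	return IteratorFunc(func() *Token {
		if i >= len(words) {
			return nil
		}
		t := &Token{Text: words[i], Pos: i}
		i++
		return t
	})
}

func main() {
	it := fromSlice([]string{"中文", "abc"})
	for t := it.Next(); t != nil; t = it.Next() {
		fmt.Println(t.Pos, t.Text)
	}
}
```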

type Token

type Token struct {
    // A token text.
    Text string
    // A token type.
    Type Type
    // An arbitrary source position location.
    Pos int
}

Token represents a piece of text together with its type.

type Tokenizer

type Tokenizer interface {
    // Tokenize reads a text stream and divides it into a
    // sequence of tokens.
    Tokenize(io.Reader) Iterator
}

Tokenizer is an interface that divides text into a sequence of tokens.

func New

func New(file string) (Tokenizer, error)

New returns a standard tokenizer using a specified lexicon file.
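
The documentation does not describe the lexicon file format or the matching algorithm behind New. Purely to illustrate how a lexicon-driven segmenter can work, the sketch below applies simple forward maximum matching against an in-memory word set. The function `segment`, the map-based lexicon, and the `maxLen` parameter are all hypothetical and should not be read as the package's behavior.

```go
package main

import "fmt"

// segment splits text greedily: at each position it takes the longest
// lexicon entry of at most maxLen runes, falling back to a single rune.
// Hypothetical illustration; not part of cwsharp.
func segment(text string, lexicon map[string]bool, maxLen int) []string {
	if maxLen < 1 {
		maxLen = 1
	}
	runes := []rune(text)
	var out []string
	for i := 0; i < len(runes); {
		n := maxLen
		if rest := len(runes) - i; rest < n {
			n = rest
		}
		// Shrink the candidate until it is in the lexicon or one rune long.
		for ; n > 1; n-- {
			if lexicon[string(runes[i:i+n])] {
				break
			}
		}
		out = append(out, string(runes[i:i+n]))
		i += n
	}
	return out
}

func main() {
	lexicon := map[string]bool{"中文": true, "分词": true, "中文分词": true}
	fmt.Println(segment("中文分词很好", lexicon, 4)) // [中文分词 很 好]
}
```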

type TokenizerFunc

type TokenizerFunc func(io.Reader) Iterator

TokenizerFunc is an adapter that allows an ordinary tokenize function to be used as a Tokenizer.

func (TokenizerFunc) Tokenize

func (f TokenizerFunc) Tokenize(r io.Reader) Iterator

type Type

type Type int

A token type.

func (Type) String

func (typ Type) String() string
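
The documentation does not say what String returns. One common way to implement such a method is a name table indexed by the constant value, sketched below. The returned names, the `typeNames` table, the fallback format, and the typed constants are assumptions made for illustration.

```go
package main

import "fmt"

type Type int

// Assumption: the constants are typed as Type here for the sketch.
const (
	PUNCT Type = iota
	NUMBER
	ALPHA
	WORD
)

// typeNames maps each constant to a display name (hypothetical values).
var typeNames = [...]string{"PUNCT", "NUMBER", "ALPHA", "WORD"}

// String returns the name of typ, or a numeric fallback if out of range.
func (typ Type) String() string {
	if typ < 0 || int(typ) >= len(typeNames) {
		return fmt.Sprintf("Type(%d)", int(typ))
	}
	return typeNames[typ]
}

func main() {
	fmt.Println(WORD, Type(9)) // WORD Type(9)
}
```

Because Type then satisfies fmt.Stringer, token types print by name in fmt output without an explicit call to String.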



Package cwsharp imports 8 packages and is imported by 2 packages. Updated 2017-11-14.