cwsharp

package module

Version: v0.0.0-...-b16415c | Published: Oct 31, 2017 | License: MIT | Imports: 8 | Imported by: 0

Repository

github.com/zhengchun/cwsharp-go

README ¶

cwsharp-go

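cwsharp-go is a Chinese word segmentation library implemented in Go. It supports several segmentation modes as well as custom dictionaries and extensions.

.NET version: CWSharp-C#

Python version: CWSharp-Python

Installation & testing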

$ go get github.com/zhengchun/cwsharp-go
$ cd main
$ go run main.go Hello,World!你好，世界!

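Segmentation algorithms

cwsharp-go supports several segmentation algorithms. Pick the one that fits your needs, or write your own.

mmseg-tokenizer

The standard dictionary-based segmentation method.

tips: use a single tokenizer instance so that the dictionary is not reloaded on every call.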

tokenizer, err := cwsharp.New("../data/cwsharp.dawg") // load the dictionary
if err != nil {
	log.Fatal(err) // the dictionary file could not be loaded
}
iter := tokenizer.Tokenize(strings.NewReader("Hello,world!你好,世界!"))
for tok := iter.Next(); tok != nil; tok = iter.Next() {
	fmt.Printf("%s/%s ", tok.Text, tok.Type)
}
>> hello/w ,/p world/w !/p 你好/w ,/p 世界/w !/p
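
To illustrate what "dictionary-based" means here, the following is a minimal sketch of forward maximum matching, the simplest dictionary lookup strategy. It is not the library's implementation: full MMSEG adds disambiguation rules on top of this idea, and cwsharp-go loads its lexicon from a DAWG file. The function `segment`, the `dict` map, and the `maxLen` parameter are names assumed for this example only.

```go
package main

import "fmt"

// segment splits text by forward maximum matching: at each position it takes
// the longest dictionary entry that starts there, or a single rune if no
// entry matches. dict and maxLen are illustrative, not part of cwsharp-go.
func segment(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var out []string
	for i := 0; i < len(runes); {
		n := maxLen
		if rest := len(runes) - i; rest < n {
			n = rest
		}
		if n < 1 {
			n = 1 // always advance by at least one rune
		}
		// Shrink the candidate until it is in the dictionary or one rune long.
		for n > 1 && !dict[string(runes[i:i+n])] {
			n--
		}
		out = append(out, string(runes[i:i+n]))
		i += n
	}
	return out
}

func main() {
	dict := map[string]bool{"你好": true, "世界": true, "世界人民": true}
	fmt.Println(segment("你好世界人民", dict, 4))
}
```

With this toy dictionary the longer entry 世界人民 wins over 世界, which is the greedy behaviour that the MMSEG rules are designed to refine.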

bigram-tokenizer

Bigram segmentation. It needs no dictionary, is fast, and keeps English words and numbers as whole tokens.

iter := cwsharp.BigramTokenize(strings.NewReader("世界人民大团结万岁!"))
for token := iter.Next(); token != nil; token = iter.Next() {
	fmt.Printf("%s/%s ", token.Text, token.Type)
}
>> 世界/w 界人/w 人民/w 民大/w 大团/w 团结/w 结万/w 万岁/w !/p
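
The overlapping pairs in the output above can be reproduced with a short sketch. This is an assumed, simplified version of the idea rather than the code behind `BigramTokenize`: it only handles runs of Han characters and emits every other non-space rune on its own, so it does not keep English words or numbers together. The helper name `bigrams` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// bigrams emits overlapping two-rune tokens for every run of Han characters.
// Any other non-space rune is emitted on its own, which is a simplification
// compared with the library's handling of English words and numbers.
func bigrams(text string) []string {
	var out []string
	var run []rune
	flush := func() {
		switch {
		case len(run) == 1:
			out = append(out, string(run)) // a lone character stays as is
		case len(run) > 1:
			for i := 0; i+1 < len(run); i++ {
				out = append(out, string(run[i:i+2]))
			}
		}
		run = run[:0]
	}
	for _, r := range text {
		if unicode.Is(unicode.Han, r) {
			run = append(run, r)
			continue
		}
		flush()
		if !unicode.IsSpace(r) {
			out = append(out, string(r))
		}
	}
	flush()
	return out
}

func main() {
	fmt.Println(bigrams("世界人民大团结万岁!"))
}
```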

whitespace-tokenizer

Standard English tokenization. It needs no dictionary and suits English text; each Chinese character is output as a separate token.

iter := cwsharp.WhitespaceTokenize(strings.NewReader("Hello,world!你好!"))
for token := iter.Next(); token != nil; token = iter.Next() {
	fmt.Printf("%s/%s ", token.Text, token.Type)
}
>> hello/w ,/p world/w !/p 你/w 好/w !/p
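
A rough sketch of the splitting rule described above, assuming that letters and digits are grouped into lowercased words, that each Han character becomes its own token, and that punctuation is emitted rune by rune. The function `split` is hypothetical and only mirrors the sample output; the rules used by `WhitespaceTokenize` may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// split groups letters and digits into lowercased words, emits each Han
// character and each punctuation rune separately, and drops whitespace.
func split(text string) []string {
	var out []string
	var word []rune
	flush := func() {
		if len(word) > 0 {
			out = append(out, strings.ToLower(string(word)))
			word = word[:0]
		}
	}
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Han, r):
			// Han characters are letters too, so this case must come first.
			flush()
			out = append(out, string(r))
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			word = append(word, r)
		default:
			flush()
			if !unicode.IsSpace(r) {
				out = append(out, string(r))
			}
		}
	}
	flush()
	return out
}

func main() {
	fmt.Println(split("Hello,world!你好!"))
}
```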

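Version history

2.0 [2017-01]

Rewrote the code and the directory layout to simplify the code and follow Go conventions.
- Merged the three separate packages bigram, mmseg, and simple into one.
- The standard library's io.Reader replaces the custom cwsharp.Reader implementation.
- WhitespaceTokenize replaces the simple package.
- BigramTokenize replaces the bigram package.
- Token.Type replaces Token.Kind. (PUNC,NUMBER,WORD)
- Added the Tokenizer interface constraint.
1.1
- Reworked the architecture to support custom tokenizer extensions.
1.0
- Port of the C# version.

TODO-List

Custom filter support, for example a StopwordFilter.
Port the custom dictionary feature from the earlier versions to the new version.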
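
Since the filter feature is still on the TODO list, the following is only a sketch of how a stopword filter could be layered over the Iterator interface documented further down this page. Nothing here exists in cwsharp-go: `stopwordFilter`, `fromSlice`, and `texts` are hypothetical helpers, and the types are local copies (with Token reduced to its Text field) so that the example compiles on its own.

```go
package main

import "fmt"

// Local, reduced copies of the documented declarations.
type Token struct{ Text string }

type Iterator interface{ Next() *Token }

type IteratorFunc func() *Token

// Next calls f; the body is assumed from the documented signature.
func (f IteratorFunc) Next() *Token { return f() }

// stopwordFilter (hypothetical) wraps src and skips tokens listed in stop.
func stopwordFilter(src Iterator, stop map[string]bool) Iterator {
	return IteratorFunc(func() *Token {
		for tok := src.Next(); tok != nil; tok = src.Next() {
			if !stop[tok.Text] {
				return tok
			}
		}
		return nil // the source is exhausted
	})
}

// fromSlice (hypothetical) iterates over a fixed list of words.
func fromSlice(words []string) Iterator {
	i := 0
	return IteratorFunc(func() *Token {
		if i >= len(words) {
			return nil
		}
		tok := &Token{Text: words[i]}
		i++
		return tok
	})
}

// texts drains an iterator into a slice of token texts.
func texts(it Iterator) []string {
	var out []string
	for tok := it.Next(); tok != nil; tok = it.Next() {
		out = append(out, tok.Text)
	}
	return out
}

func main() {
	src := fromSlice([]string{"the", "cat", "the", "mat"})
	fmt.Println(texts(stopwordFilter(src, map[string]bool{"the": true})))
}
```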

Documentation ¶

Overview ¶

CWSharp is a text segmentation package for Chinese.

Index ¶

Constants
type Iterator
- func BigramTokenize(r io.Reader) Iterator
- func WhitespaceTokenize(r io.Reader) Iterator
type IteratorFunc
- func (f IteratorFunc) Next() *Token
type Token
type Tokenizer
- func New(file string) (Tokenizer, error)
type TokenizerFunc
- func (f TokenizerFunc) Tokenize(r io.Reader) Iterator
type Type
- func (typ Type) String() string

Constants ¶

const (
	PUNCT  = iota // .,| []
	NUMBER        // 12345 12.34
	ALPHA         // [a-z]
	WORD          // abc 中文 ABC123 wi-fi
)

Types ¶

type Iterator ¶

type Iterator interface {
	Next() *Token
}

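Iterator iterates over a sequence of tokens. As the README examples show, Next returns nil once the input is exhausted.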

func BigramTokenize ¶

func BigramTokenize(r io.Reader) Iterator

BigramTokenize tokenizes the text from the given reader using the bigram (N-gram) algorithm.

func WhitespaceTokenize ¶

func WhitespaceTokenize(r io.Reader) Iterator

WhitespaceTokenize tokenizes the text from the given reader using the whitespace algorithm.

type IteratorFunc ¶

type IteratorFunc func() *Token

func (IteratorFunc) Next ¶

func (f IteratorFunc) Next() *Token
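
IteratorFunc has no description on this page. Judging by its signature, it appears to be an adapter that lets a plain function or closure act as an Iterator. The sketch below assumes exactly that: the types are local copies (Token without its Type field) so the example compiles by itself, the body of `Next` is an assumption, and `fromSlice` is a hypothetical helper.

```go
package main

import "fmt"

// Local copies of the documented declarations, reduced for the example.
type Token struct {
	Text string
	Pos  int
}

type Iterator interface{ Next() *Token }

type IteratorFunc func() *Token

// Next calls f itself; this body is assumed from the documented signature.
func (f IteratorFunc) Next() *Token { return f() }

// fromSlice (hypothetical) returns an Iterator over fixed words. The closure
// keeps its position in i and returns nil once the words are used up.
func fromSlice(words []string) Iterator {
	i := 0
	return IteratorFunc(func() *Token {
		if i >= len(words) {
			return nil
		}
		tok := &Token{Text: words[i], Pos: i}
		i++
		return tok
	})
}

func main() {
	it := fromSlice([]string{"你好", "世界"})
	for tok := it.Next(); tok != nil; tok = it.Next() {
		fmt.Printf("%d:%s ", tok.Pos, tok.Text)
	}
	fmt.Println()
}
```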

type Token ¶

type Token struct {
	// A token text.
	Text string
	// A token type.
	Type Type
	// An arbitrary source position location.
	Pos int
}

Token represents a piece of text together with its type.

type Tokenizer ¶

type Tokenizer interface {
	// Tokenize reads a text stream and divides into a
	// sequence of tokens.
	Tokenize(io.Reader) Iterator
}

Tokenizer is an interface that divides text into a sequence of tokens.

func New ¶

func New(file string) (Tokenizer, error)

New returns a standard tokenizer using a specified lexicon file.

type TokenizerFunc ¶

type TokenizerFunc func(io.Reader) Iterator

TokenizerFunc is an adapter that allows an ordinary tokenize function to be used as a Tokenizer.

func (TokenizerFunc) Tokenize ¶

func (f TokenizerFunc) Tokenize(r io.Reader) Iterator
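
A minimal sketch of how such an adapter might be used, in the same spirit as `http.HandlerFunc` in the standard library. The types below are local copies so the example is self-contained, the method bodies are assumed from the signatures, and `fields` and `collect` are hypothetical helpers that are not part of cwsharp-go.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Local, reduced copies of the documented declarations.
type Token struct{ Text string }

type Iterator interface{ Next() *Token }

type IteratorFunc func() *Token

func (f IteratorFunc) Next() *Token { return f() }

type Tokenizer interface{ Tokenize(io.Reader) Iterator }

type TokenizerFunc func(io.Reader) Iterator

// Tokenize calls f(r); the body is assumed from the documented signature.
func (f TokenizerFunc) Tokenize(r io.Reader) Iterator { return f(r) }

// fields (hypothetical) yields one token per whitespace-separated word.
func fields(r io.Reader) Iterator {
	sc := bufio.NewScanner(r)
	sc.Split(bufio.ScanWords)
	return IteratorFunc(func() *Token {
		if !sc.Scan() {
			return nil
		}
		return &Token{Text: sc.Text()}
	})
}

// collect runs any Tokenizer over text and gathers the token texts.
func collect(t Tokenizer, text string) []string {
	var out []string
	it := t.Tokenize(strings.NewReader(text))
	for tok := it.Next(); tok != nil; tok = it.Next() {
		out = append(out, tok.Text)
	}
	return out
}

func main() {
	var t Tokenizer = TokenizerFunc(fields)
	fmt.Println(collect(t, "hello big world"))
}
```

Code that accepts a Tokenizer, such as `collect` here, then works unchanged with a dictionary-based tokenizer or with any wrapped function.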

type Type ¶

type Type int

A token type.

func (Type) String ¶

func (typ Type) String() string

Directories ¶

Path	Synopsis
dawg
main
