tokenizers

package
v0.0.0-...-36c17a1
Published: Dec 3, 2022 License: AGPL-3.0 Imports: 8 Imported by: 0

Documentation

Overview

Example
package main

import (
	"fmt"

	"github.com/fumiama/jieba/tokenizers"
)

func main() {
	sentence := []byte("永和服装饰品有限公司")

	// default mode
	tokenizer, _ := tokenizers.NewJiebaTokenizerAt("../dict.txt", true, false)
	fmt.Println("Default Mode:")
	for _, token := range tokenizer.Tokenize(sentence) {
		fmt.Printf(
			"Term: %s Start: %d End: %d Position: %d Type: %d\n",
			token.Term, token.Start, token.End, token.Position, token.Type)
	}

	// search mode
	tokenizer, _ = tokenizers.NewJiebaTokenizerAt("../dict.txt", true, true)
	fmt.Println("Search Mode:")
	for _, token := range tokenizer.Tokenize(sentence) {
		fmt.Printf(
			"Term: %s Start: %d End: %d Position: %d Type: %d\n",
			token.Term, token.Start, token.End, token.Position, token.Type)
	}
}
Output:

Default Mode:
Term: 永和 Start: 0 End: 6 Position: 1 Type: 1
Term: 服装 Start: 6 End: 12 Position: 2 Type: 1
Term: 饰品 Start: 12 End: 18 Position: 3 Type: 1
Term: 有限公司 Start: 18 End: 30 Position: 4 Type: 1
Search Mode:
Term: 永和 Start: 0 End: 6 Position: 1 Type: 1
Term: 服装 Start: 6 End: 12 Position: 2 Type: 1
Term: 饰品 Start: 12 End: 18 Position: 3 Type: 1
Term: 有限 Start: 18 End: 24 Position: 4 Type: 1
Term: 公司 Start: 24 End: 30 Position: 5 Type: 1
Term: 有限公司 Start: 18 End: 30 Position: 6 Type: 1
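
Note that Start and End count bytes, not characters: each CJK character above occupies three bytes in UTF-8, so the two-character word 永和 spans bytes 0 to 6.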
Example (BeleveSearch)
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/blevesearch/bleve"
	_ "github.com/fumiama/jieba/tokenizers"
)

func main() {
	// open a new index
	indexMapping := bleve.NewIndexMapping()

	err := indexMapping.AddCustomTokenizer("jieba",
		map[string]interface{}{
			"file": "../dict.txt",
			"type": "jieba",
		})
	if err != nil {
		log.Fatal(err)
	}

	// create a custom analyzer
	err = indexMapping.AddCustomAnalyzer("jieba",
		map[string]interface{}{
			"type":      "custom",
			"tokenizer": "jieba",
			"token_filters": []string{
				"possessive_en",
				"to_lower",
				"stop_en",
			},
		})

	if err != nil {
		log.Fatal(err)
	}

	indexMapping.DefaultAnalyzer = "jieba"
	cacheDir := "jieba.beleve"
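	// bleve.New fails if the path already exists, so clear the cache dir first.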
	os.RemoveAll(cacheDir)
	index, err := bleve.New(cacheDir, indexMapping)

	if err != nil {
		log.Fatal(err)
	}

	docs := []struct {
		Title string
		Name  string
	}{
		{
			Title: "Doc 1",
			Name:  "This is the first document we’ve added",
		},
		{
			Title: "Doc 2",
			Name:  "The second one 你 中文测试中文 is even more interesting! 吃水果",
		},
		{
			Title: "Doc 3",
			Name:  "买水果然后来世博园。",
		},
		{
			Title: "Doc 4",
			Name:  "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作",
		},
		{
			Title: "Doc 5",
			Name:  "咱俩交换一下吧。",
		},
	}
	// index docs
	for _, doc := range docs {
		if err := index.Index(doc.Title, doc); err != nil {
			log.Fatal(err)
		}
	}

	// search for some text
	for _, keyword := range []string{"水果世博园", "你", "first", "中文", "交换机", "交换"} {
		query := bleve.NewQueryStringQuery(keyword)
		search := bleve.NewSearchRequest(query)
		search.Highlight = bleve.NewHighlight()
		searchResults, err := index.Search(search)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("Result of \"%s\": %d matches:\n", keyword, searchResults.Total)
		for i, hit := range searchResults.Hits {
			rv := fmt.Sprintf("%d. %s, (%f)\n", i+searchResults.Request.From+1, hit.ID, hit.Score)
			for fragmentField, fragments := range hit.Fragments {
				rv += fmt.Sprintf("%s: ", fragmentField)
				for _, fragment := range fragments {
					rv += fragment
				}
			}
			fmt.Printf("%s\n", rv)
		}
	}
}
Output:

Result of "水果世博园": 2 matches:
1. Doc 3, (1.099550)
Name: 买<mark>水果</mark>然后来<mark>世博</mark>园。
2. Doc 2, (0.031941)
Name: The second one 你 中文测试中文 is even more interesting! 吃<mark>水果</mark>
Result of "你": 1 matches:
1. Doc 2, (0.391161)
Name: The second one <mark>你</mark> 中文测试中文 is even more interesting! 吃水果
Result of "first": 1 matches:
1. Doc 1, (0.512150)
Name: This is the <mark>first</mark> document we’ve added
Result of "中文": 1 matches:
1. Doc 2, (0.553186)
Name: The second one 你 <mark>中文</mark>测试<mark>中文</mark> is even more interesting! 吃水果
Result of "交换机": 2 matches:
1. Doc 4, (0.608495)
Name: 工信处女干事每月经过下属科室都要亲口交代24口<mark>交换机</mark>等技术性器件的安装工作
2. Doc 5, (0.086700)
Name: 咱俩<mark>交换</mark>一下吧。
Result of "交换": 2 matches:
1. Doc 5, (0.534158)
Name: 咱俩<mark>交换</mark>一下吧。
2. Doc 4, (0.296297)
Name: 工信处女干事每月经过下属科室都要亲口交代24口<mark>交换</mark>机等技术性器件的安装工作

Constants

const Name = "jieba"

Name is the jieba tokenizer name.
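
Importing this package for side effects, as the blank import in the Bleve example above does, is what registers the tokenizer under this name and makes "type": "jieba" resolve in AddCustomTokenizer.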

Variables

This section is empty.

Functions

func JiebaTokenizerConstructor

func JiebaTokenizerConstructor(config map[string]interface{}, cache *registry.Cache) (analysis.Tokenizer, error)

JiebaTokenizerConstructor creates a JiebaTokenizer.

Parameter config must contain at least one entry (a registration sketch follows the list):

file: the path of the dictionary file, or an io.Reader.

hmm: optional, whether to use the Hidden Markov Model; see NewJiebaTokenizer for details.

search: optional, whether to use search mode; see NewJiebaTokenizer for details.
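
A minimal registration sketch with all documented keys set. Only "file" is required; the boolean values for "hmm" and "search" are an assumption, since the value types are not spelled out above:

package main

import (
	"log"

	"github.com/blevesearch/bleve"
	_ "github.com/fumiama/jieba/tokenizers"
)

func main() {
	indexMapping := bleve.NewIndexMapping()
	// "type" selects this tokenizer; "file" is required, the rest optional.
	// Boolean values for "hmm" and "search" are an assumption.
	err := indexMapping.AddCustomTokenizer("jieba",
		map[string]interface{}{
			"type":   "jieba",
			"file":   "../dict.txt",
			"hmm":    true,
			"search": true,
		})
	if err != nil {
		log.Fatal(err)
	}
}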

func NewJiebaTokenizer

func NewJiebaTokenizer(dictFile io.Reader, hmm, searchMode bool) (analysis.Tokenizer, error)

NewJiebaTokenizer creates a new JiebaTokenizer.

Parameters:

dictFile: an io.Reader that provides the dictionary.

hmm: whether to use the Hidden Markov Model to cut unknown words,
i.e. words not found in the dictionary. For example, the word "安卓"
("Android") is not in the dictionary file. If hmm is false, it is cut
into the two single-character words "安" and "卓"; if hmm is true, it is
kept as one word, because Jieba uses a Hidden Markov Model with the
Viterbi algorithm to guess the most likely segmentation.

searchMode: whether to further cut long words into several short words.
In Chinese, some long words may contain other words; for example, "交换机"
is the Chinese word for "switch" (the network device). If searchMode is
false, "交换机" is treated as a single word. If searchMode is true, the
contained words "交换" and "换机", which are valid Chinese words, are
emitted as well.
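
A minimal sketch of the hmm flag, using a hypothetical "安卓手机" input; per the description above, with hmm enabled "安卓" should come out as one token (the exact result depends on the dictionary):

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/fumiama/jieba/tokenizers"
)

func main() {
	// NewJiebaTokenizer takes an io.Reader, so open the dictionary file.
	dict, err := os.Open("../dict.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer dict.Close()

	// hmm=true: unknown words such as "安卓" are guessed by the HMM
	// instead of being cut into single characters.
	tokenizer, err := tokenizers.NewJiebaTokenizer(dict, true, false)
	if err != nil {
		log.Fatal(err)
	}
	for _, token := range tokenizer.Tokenize([]byte("安卓手机")) {
		fmt.Println(string(token.Term))
	}
}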

func NewJiebaTokenizerAt

func NewJiebaTokenizerAt(dictFilePath string, hmm, searchMode bool) (analysis.Tokenizer, error)

NewJiebaTokenizerAt creates a new JiebaTokenizer.

Parameters:

dictFilePath: the path of the dictionary file.

hmm: whether to use the Hidden Markov Model to cut unknown words,
i.e. words not found in the dictionary. For example, the word "安卓"
("Android") is not in the dictionary file. If hmm is false, it is cut
into the two single-character words "安" and "卓"; if hmm is true, it is
kept as one word, because Jieba uses a Hidden Markov Model with the
Viterbi algorithm to guess the most likely segmentation.

searchMode: whether to further cut long words into several short words.
In Chinese, some long words may contain other words; for example, "交换机"
is the Chinese word for "switch" (the network device). If searchMode is
false, "交换机" is treated as a single word. If searchMode is true, the
contained words "交换" and "换机", which are valid Chinese words, are
emitted as well.
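
A minimal sketch of the searchMode flag on "交换机"; per the description above, search mode should additionally emit the contained words (the exact tokens depend on the dictionary):

package main

import (
	"fmt"
	"log"

	"github.com/fumiama/jieba/tokenizers"
)

func main() {
	sentence := []byte("交换机")

	// searchMode=false: "交换机" stays a single token.
	plain, err := tokenizers.NewJiebaTokenizerAt("../dict.txt", true, false)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("plain tokens:", len(plain.Tokenize(sentence)))

	// searchMode=true: contained words such as "交换" are emitted too.
	search, err := tokenizers.NewJiebaTokenizerAt("../dict.txt", true, true)
	if err != nil {
		log.Fatal(err)
	}
	for _, token := range search.Tokenize(sentence) {
		fmt.Println(string(token.Term))
	}
}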

Types

type JiebaTokenizer

type JiebaTokenizer struct {
	// contains filtered or unexported fields
}

JiebaTokenizer is the bleve tokenizer for jieba.

func (*JiebaTokenizer) Tokenize

func (jt *JiebaTokenizer) Tokenize(input []byte) analysis.TokenStream

Tokenize cuts the input into a bleve token stream.
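
Judging by the Example output above, Start and End are byte offsets into the UTF-8 input, so a token's text can also be recovered by slicing the input. A small sketch under that assumption:

package main

import (
	"fmt"
	"log"

	"github.com/fumiama/jieba/tokenizers"
)

func main() {
	tokenizer, err := tokenizers.NewJiebaTokenizerAt("../dict.txt", true, false)
	if err != nil {
		log.Fatal(err)
	}
	input := []byte("永和服装饰品有限公司")
	// Assumption: Start/End are byte offsets, so input[Start:End]
	// reproduces the term bytes.
	for _, token := range tokenizer.Tokenize(input) {
		fmt.Printf("%s == %s\n", token.Term, input[token.Start:token.End])
	}
}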
