chinese

Published: Oct 27, 2020 License: MIT Imports: 17 Imported by: 0

README

chinese

Package chinese provides utilities for dealing with Chinese text, including text segmentation.

Download:

go get github.com/smhanov/chinese

Chinese text is commonly written without any spaces between the words. This package uses the Viterbi algorithm and word-frequency information to find the best placement of spaces in a sentence.

It is designed to take up very little memory. In my tests, loading the default dictionary temporarily uses 160 MB of RAM, but that memory is released as soon as loading completes, so the dictionary of 589,000 words and frequencies occupies only 1.1 MB.

To use it, create a new text segmenter. By default, a model of word frequencies from the web is loaded. Then call Segment(), passing in some text. The return value is the text split into strings, each containing an individual word, an unrecognized word, or spaces and punctuation. You can recover the original input by concatenating the results.


Automatically generated by autoreadme on 2019.04.08

Documentation

Overview

Package chinese provides utilities for dealing with Chinese text, including text segmentation.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Model

type Model interface {
	FindAllPrefixesOf(input string) []WordFreq
}

Model is a dictionary that can find all words that are a prefix of the given string.
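Any type with this method satisfies Model. As a sketch, here is a toy implementation backed by a plain map; the WordFreq and Model types are reproduced from this documentation so the example is self-contained, and the map-based lookup is an illustration, not how the package's own WordModel works:

```go
package main

import "fmt"

// Types reproduced from the package documentation for illustration.
type WordFreq struct {
	Word           string
	LogProbability float32
}

type Model interface {
	FindAllPrefixesOf(input string) []WordFreq
}

// mapModel is a toy Model backed by a plain map from word to log probability.
type mapModel map[string]float32

// FindAllPrefixesOf checks every rune-prefix of input against the map.
func (m mapModel) FindAllPrefixesOf(input string) []WordFreq {
	var result []WordFreq
	runes := []rune(input)
	for i := 1; i <= len(runes); i++ {
		prefix := string(runes[:i])
		if lp, ok := m[prefix]; ok {
			result = append(result, WordFreq{prefix, lp})
		}
	}
	return result
}

func main() {
	var model Model = mapModel{"儿": -4, "儿子": -3}
	for _, wf := range model.FindAllPrefixesOf("儿子四岁") {
		fmt.Printf("%s %.0f\n", wf.Word, wf.LogProbability)
	}
}
```

A real implementation would use a compact prefix structure rather than probing every prefix, but the interface contract is the same.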

type Segmenter

type Segmenter struct {
	// contains filtered or unexported fields
}

Segmenter breaks Chinese text into words, based on a single word-frequency model that you provide.

func NewSegmenter

func NewSegmenter(args ...interface{}) *Segmenter

NewSegmenter returns a new text segmenter. When called with no arguments, it loads the default model from the web. Otherwise, create a model and pass it as the first argument.

Example

In this example, we load the default model (from the web) and use it to segment some text.

package main

import (
	"fmt"
	"strings"

	"github.com/smhanov/chinese"
)

func main() {
	segments := chinese.NewSegmenter().Segment("我儿子四岁。他的名字叫Zack。")
	fmt.Printf("%s\n", strings.Join(segments, " "))
}
Output:

我 儿子 四岁 。 他 的 名字 叫 Zack。

func (*Segmenter) Segment

func (s *Segmenter) Segment(inputStr string) []string

Segment breaks the input string into separate words. Whitespace and other characters are returned as their own entries in the result, so the original input can be obtained by concatenating the strings in the result.

type WordFreq

type WordFreq struct {
	Word           string
	LogProbability float32
}

WordFreq represents a word and its log probability, as returned from a model.

type WordModel

type WordModel struct {
	// contains filtered or unexported fields
}

WordModel is a structure that can both find all words that are prefixes of a given string, and return the log frequencies of those words.

func LoadModel

func LoadModel(args ...interface{}) (*WordModel, error)

LoadModel loads a model from the given source. The model is a text file; each line contains a word and its raw frequency (not a logarithm), separated by a space. The format is inferred from the first line. If the filename ends in .bz2 or .gz, the file is decompressed. If the argument is a URL, it is fetched; if it is an io.Reader, the model is read from it.
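As an illustration of this format, a tiny model file might look like the following; the words are real, but the counts are invented for the example:

```
儿子 3421
名字 2896
的 79031
```

A file like this can be supplied to LoadModel as a filename, a URL, or an io.Reader.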

func NewWordModel

func NewWordModel() *WordModel

NewWordModel returns a new word model. You must add words to this using AddWord() and then call Finish() before using it.

Example

This example creates a simple model containing some Chinese words, then splits a sentence.

package main

import (
	"fmt"
	"strings"

	"github.com/smhanov/chinese"
)

func main() {
	model := chinese.NewWordModel()
	model.AddWord("他", 1)
	model.AddWord("儿", 1)
	model.AddWord("儿子", 2)
	model.AddWord("叫", 1)
	model.AddWord("名", 1)
	model.AddWord("名字", 2)
	model.AddWord("四", 1)
	model.AddWord("子", 1)
	model.AddWord("字", 1)
	model.AddWord("岁", 1)
	model.AddWord("的", 1)
	model.Finish()

	segmenter := chinese.NewSegmenter(model)
	segments := segmenter.Segment("我儿子四岁。他的名字叫Zack。")

	fmt.Printf("%s", strings.Join(segments, " "))
}
Output:

我 儿子 四 岁 。 他 的 名字 叫 Zack。

func (*WordModel) AddWord

func (m *WordModel) AddWord(word string, freqCount float32)

AddWord adds a word and its log frequency to the model. If the word frequencies are not known, use the length of the word as the frequency; this causes the segmenter to break the text into the fewest possible words.

Words must be added in alphabetical order and must not be repeated; otherwise, AddWord panics.

func (*WordModel) FindAllPrefixesOf

func (m *WordModel) FindAllPrefixesOf(input string) []WordFreq

FindAllPrefixesOf finds all prefixes of the input string that are words, and returns their log inverse probabilities.

func (*WordModel) Finish

func (m *WordModel) Finish()

Finish signals that the word model is finished and ready to be used.
