Package tokenizer
v0.0.181
Published: Mar 19, 2024 License: Apache-2.0 Imports: 4 Imported by: 5

Documentation

Overview

Package tokenizer tokenizes Chinese text into multi-character terms and their corresponding English equivalents.

Index

type DictTokenizer
	func NewDictTokenizer[V any](wDict map[string]V) *DictTokenizer[V]
	func (tokenizer DictTokenizer[V]) Tokenize(text string) []TextToken
type TextSegment
	func Segment(text string) []TextSegment
type TextToken
type Tokenizer

Examples

Segment

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DictTokenizer

type DictTokenizer[V any] struct {
	// contains filtered or unexported fields
}

DictTokenizer tokenizes Chinese text using a dictionary.

func NewDictTokenizer added in v0.0.101

func NewDictTokenizer[V any](wDict map[string]V) *DictTokenizer[V]

func (DictTokenizer[V]) Tokenize

func (tokenizer DictTokenizer[V]) Tokenize(text string) []TextToken

Tokenize tokenizes a Chinese text string into words and other terms found in the dictionary. If a term is not found in the dictionary, its individual characters are returned instead. The method runs both left-to-right and right-to-left greedy matching and keeps the result with the fewest tokens. Long text is handled by breaking the string into segments delimited by punctuation or non-Chinese characters.
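For example, a tokenizer can be built from a small in-memory dictionary. A minimal sketch follows; the chinesenotes-go import paths and the Simplified field on dicttypes.Word are assumptions, so adjust them to your own module.

package main

import (
	"fmt"

	"github.com/alexamies/chinesenotes-go/dicttypes"
	"github.com/alexamies/chinesenotes-go/tokenizer"
)

func main() {
	// A tiny dictionary keyed by headword. Any value type works since
	// NewDictTokenizer is generic over V; the Simplified field name is
	// an assumption about dicttypes.Word.
	wDict := map[string]*dicttypes.Word{
		"你好": {Simplified: "你好"},
		"世界": {Simplified: "世界"},
	}
	tok := tokenizer.NewDictTokenizer(wDict)

	// Terms found in the dictionary come back as multi-character
	// tokens; anything else falls back to single characters.
	tokens := tok.Tokenize("你好世界")
	for _, t := range tokens {
		fmt.Println(t.Token)
	}
}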

type TextSegment added in v0.0.28

type TextSegment struct {

	// The text contained in the segment
	Text string

	// False if punctuation or non-Chinese text
	Chinese bool
}

TextSegment is a text segment that contains either Chinese or non-Chinese text.

func Segment added in v0.0.28

func Segment(text string) []TextSegment

Segment splits a text document into segments of Chinese text separated by punctuation or non-Chinese text.

Example

A basic example of the function Segment

segments := Segment("你好 means hello")
fmt.Printf("Text: %s, Chinese: %t\n", segments[0].Text, segments[0].Chinese)
fmt.Printf("Text: %s, Chinese: %t\n", strings.TrimSpace(segments[1].Text), segments[1].Chinese)
Output:

Text: 你好, Chinese: true
Text: means hello, Chinese: false

type TextToken

type TextToken struct {
	Token     string
	DictEntry dicttypes.Word
	Senses    []dicttypes.WordSense
}

TextToken contains the results of tokenizing a string.
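Continuing the dictionary sketch above, a caller can read the documented fields directly; only Token and Senses are used here:

	// Summarize each token and the number of dictionary senses found.
	for _, t := range tokens {
		fmt.Printf("%s: %d senses\n", t.Token, len(t.Senses))
	}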

type Tokenizer

type Tokenizer interface {
	Tokenize(fragment string) []TextToken
}

Tokenizer tokenizes Chinese text.
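Since Tokenize on DictTokenizer has a value receiver, DictTokenizer[V] satisfies this interface, so callers can be written against Tokenizer and handed a test double in unit tests. A minimal sketch, where countTerms is a hypothetical helper rather than part of the package:

// countTerms reports how many tokens the tokenizer finds in a fragment.
func countTerms(t tokenizer.Tokenizer, fragment string) int {
	return len(t.Tokenize(fragment))
}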
