uax29

package

v0.2.0 (latest) Published: Dec 8, 2021 License: BSD-3-Clause, Unlicense Imports: 7 Imported by: 1

Repository

github.com/npillmayer/uax

Documentation ¶

Overview ¶

Package uax29 implements Unicode Annex #29 word breaking.

Content ¶

UAX#29 is the Unicode Annex for breaking text into graphemes, words and sentences. It defines code-point classes and sets of rules for how to place break points and break inhibitors. This file is about word breaking.

This segmenter passes all 1823 tests of the Unicode UAX#29 test suite for word breaking.

Typical Usage ¶

Clients instantiate a WordBreaker object and use it as the breaking engine for a segmenter.

onWords := uax29.NewWordBreaker(1)
segmenter := segment.NewSegmenter(onWords)
segmenter.Init(...)
for segmenter.Next() ...

Attention ¶

Before using word breakers, clients should usually initialize the classes and rules:

SetupUAX29Classes()

This initializes all the code-point range tables. Initialization is not done at package load time, because the tables consume a considerable amount of memory. However, the word breaker will call it itself if the range tables are not yet initialized.

______________________________________________________________________

License ¶

This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:

SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'

You may use the project under the terms of either license.

Licenses are reproduced in the license file in the root folder of this module.

Index ¶

Variables
func SetupUAX29Classes()
type UAX29Class
- func ClassForRune(r rune) UAX29Class
- func (c UAX29Class) String() string
type WordBreaker
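- func NewWordBreaker(weight int) *WordBreaker
- func (gb *WordBreaker) CodePointClassFor(r rune) int
- func (gb *WordBreaker) LongestActiveMatch() int
- func (gb *WordBreaker) Penalties() []int
- func (gb *WordBreaker) ProceedWithRune(r rune, cpClass int)
- func (gb *WordBreaker) StartRulesFor(r rune, cpClass int)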

Examples ¶

WordBreaker

Constants ¶

This section is empty.

Variables ¶

var (
	PenaltyForBreak        = 50
	PenaltyToSuppressBreak = 10000
	PenaltyForMustBreak    = -10000
)

Penalties (inter-word optional break, suppress break and mandatory break).

var ALetter, CR, Double_Quote, Extend, ExtendNumLet, Format, Hebrew_Letter, Katakana, LF, MidLetter,
	MidNum, MidNumLet, Newline, Numeric, Regional_Indicator, Single_Quote, WSegSpace, ZWJ *unicode.RangeTable

Range tables for UAX#29 code-point classes. They are initialized by SetupUAX29Classes(). Clients can test membership with unicode.Is(..., rune).

Functions ¶

func SetupUAX29Classes ¶

func SetupUAX29Classes()

SetupUAX29Classes is the top-level preparation function: it creates the code-point classes for word breaking and, in turn, sets up the emoji classes as well. It is safe for concurrent use.

The word breaker will call this transparently if it has not been called beforehand.

Types ¶

type UAX29Class ¶

type UAX29Class int

Type for UAX#29 code-point classes. Must be convertible to int.

const (
	ALetterClass            UAX29Class = 0
	CRClass                 UAX29Class = 1
	Double_QuoteClass       UAX29Class = 2
	ExtendClass             UAX29Class = 3
	ExtendNumLetClass       UAX29Class = 4
	FormatClass             UAX29Class = 5
	Hebrew_LetterClass      UAX29Class = 6
	KatakanaClass           UAX29Class = 7
	LFClass                 UAX29Class = 8
	MidLetterClass          UAX29Class = 9
	MidNumClass             UAX29Class = 10
	MidNumLetClass          UAX29Class = 11
	NewlineClass            UAX29Class = 12
	NumericClass            UAX29Class = 13
	Regional_IndicatorClass UAX29Class = 14
	Single_QuoteClass       UAX29Class = 15
	WSegSpaceClass          UAX29Class = 16
	ZWJClass                UAX29Class = 17

	Other UAX29Class = 999
)

These are all the UAX#29 breaking classes.

func ClassForRune ¶

func ClassForRune(r rune) UAX29Class

ClassForRune gets the UAX#29 word class for a Unicode code-point.

func (UAX29Class) String ¶

func (c UAX29Class) String() string

String implements fmt.Stringer for type UAX29Class.
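
To illustrate how a rune-to-class lookup and a Stringer fit together, here is a minimal sketch. It does not use the package's tables: wordClass, classTables and classFor are hypothetical names, and the standard tables unicode.L and unicode.Nd serve only as rough approximations of the ALetter and Numeric classes. The fallback name "Other" is likewise an assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

// wordClass is a hypothetical stand-in for UAX29Class.
type wordClass int

const (
	aLetter wordClass = 0
	numeric wordClass = 13
	other   wordClass = 999
)

// classTables pairs each class with a name and a range table.
// ASSUMPTION: unicode.Nd and unicode.L only approximate the real
// UAX#29 classes, which are derived from the Unicode data files.
var classTables = []struct {
	class wordClass
	name  string
	table *unicode.RangeTable
}{
	{numeric, "Numeric", unicode.Nd},
	{aLetter, "ALetter", unicode.L},
}

// classFor returns the first class whose table contains r,
// or other if no table matches.
func classFor(r rune) wordClass {
	for _, e := range classTables {
		if unicode.Is(e.table, r) {
			return e.class
		}
	}
	return other
}

// String makes wordClass satisfy fmt.Stringer.
func (c wordClass) String() string {
	for _, e := range classTables {
		if e.class == c {
			return e.name
		}
	}
	return "Other"
}

func main() {
	for _, r := range "a7!" {
		fmt.Printf("%q -> %v\n", r, classFor(r))
	}
}
```

With the real package, the equivalent call would be ClassForRune(r), and printing the result with %v would go through the String method documented here.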

type WordBreaker ¶

type WordBreaker struct {
	// contains filtered or unexported fields
}

WordBreaker is a Breaker type used by a uax.Segmenter to break text up according to UAX#29 / Words. It implements the uax.UnicodeBreaker interface.

Example ¶

package main

import (
	"fmt"
	"strings"

	"github.com/npillmayer/uax/segment"
	"github.com/npillmayer/uax/uax29"
)

func main() {
	onWords := uax29.NewWordBreaker(1)
	segmenter := segment.NewSegmenter(onWords)
	segmenter.Init(strings.NewReader("Hello World🇩🇪!"))
	for segmenter.Next() {
		fmt.Printf("'%s'\n", segmenter.Text())
	}
}

Output:

'Hello'
' '
'World'
'🇩🇪'
'!'
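
The flag emoji stays in one segment because UAX#29 (rules WB15 and WB16) does not break inside a pair of regional indicator symbols. The following sketch shows only that pairing idea in isolation. It is not the package's recognizer: groupFlags and isRegionalIndicator are hypothetical helpers, and every rune that is not a regional indicator is simply emitted on its own.

```go
package main

import "fmt"

// isRegionalIndicator reports whether r is one of the 26 regional
// indicator symbols U+1F1E6..U+1F1FF.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

// groupFlags is a hypothetical helper: it emits every rune as its own
// piece, except that two adjacent regional indicators stay together.
func groupFlags(s string) []string {
	var out []string
	runes := []rune(s)
	for i := 0; i < len(runes); i++ {
		if isRegionalIndicator(runes[i]) && i+1 < len(runes) &&
			isRegionalIndicator(runes[i+1]) {
			out = append(out, string(runes[i:i+2])) // keep the pair together
			i++                                     // skip the second half
			continue
		}
		out = append(out, string(runes[i]))
	}
	return out
}

func main() {
	// Regional indicators D+E, then F+R, then '!'.
	text := "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7!"
	for _, piece := range groupFlags(text) {
		fmt.Printf("'%s'\n", piece)
	}
}
```

Pairs are formed from the left, so a run of four indicators yields two flags rather than one long segment.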

func NewWordBreaker ¶

func NewWordBreaker(weight int) *WordBreaker

NewWordBreaker creates a new UAX#29 word breaker.

Usage:

onWords := NewWordBreaker(1)
segmenter := segment.NewSegmenter(onWords)
segmenter.Init(...)
for segmenter.Next() ...

weight is a multiplying factor for penalties. It must be in the range 0 ≤ weight ≤ 5 and will be capped for values outside this range. Currently this is not used by any test and should probably be left at 1.
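
As a rough picture of the capping rule, the sketch below clamps a weight into the documented range and applies it to the documented default break penalty of 50. capWeight and weightedPenalty are hypothetical helpers, and plain multiplication is an assumption about how the factor is applied, not a statement about the package's internals.

```go
package main

import "fmt"

// penaltyForBreak mirrors the documented default of PenaltyForBreak.
const penaltyForBreak = 50

// capWeight clamps w into the documented range 0..5.
func capWeight(w int) int {
	if w < 0 {
		return 0
	}
	if w > 5 {
		return 5
	}
	return w
}

// weightedPenalty scales the break penalty by the capped weight.
// ASSUMPTION: the weight is applied by plain multiplication.
func weightedPenalty(w int) int {
	return capWeight(w) * penaltyForBreak
}

func main() {
	for _, w := range []int{-3, 1, 9} {
		fmt.Printf("weight %d -> capped %d, penalty %d\n",
			w, capWeight(w), weightedPenalty(w))
	}
}
```

Under these assumptions a weight of 1 leaves the default penalty unchanged, which matches the recommendation to leave it at 1.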

func (*WordBreaker) CodePointClassFor ¶

func (gb *WordBreaker) CodePointClassFor(r rune) int

CodePointClassFor returns the UAX#29 word code-point class for a rune (= code-point). (Interface uax.UnicodeBreaker)

func (*WordBreaker) LongestActiveMatch ¶

func (gb *WordBreaker) LongestActiveMatch() int

LongestActiveMatch collects the current match length from all active recognizers and returns the longest one. (Interface uax.UnicodeBreaker)

func (*WordBreaker) Penalties ¶

func (gb *WordBreaker) Penalties() []int

Penalties gets all active penalties for all active recognizers combined. Index 0 belongs to the most recently read rune, i.e., represents the penalty for breaking after it. (Interface uax.UnicodeBreaker)

func (*WordBreaker) ProceedWithRune ¶

func (gb *WordBreaker) ProceedWithRune(r rune, cpClass int)

ProceedWithRune is a signal: A new code-point has been read and this breaker receives a message to consume it. (Interface uax.UnicodeBreaker)

func (*WordBreaker) StartRulesFor ¶

func (gb *WordBreaker) StartRulesFor(r rune, cpClass int)

StartRulesFor starts all recognizers where the starting symbol is rune r. r is of code-point-class cpClass. (Interface uax.UnicodeBreaker)

Directories ¶

Path	Synopsis
internal
generator	Package for a generator for UAX#29 word breaking classes.
