Documentation ¶
Overview ¶
Package uax29 implements Unicode Annex #29 word breaking.
Content ¶
UAX#29 is the Unicode Annex for breaking text into graphemes, words and sentences. It defines code-point classes and sets of rules for how to place break points and break inhibitors. This file is about word breaking.
This segmenter passes all 1823 tests of the Unicode UAX#29 test suite for word breaking.
Typical Usage ¶
Clients instantiate a WordBreaker object and use it as the breaking engine for a segmenter.
    onWords := uax29.NewWordBreaker(1)
    segmenter := segment.NewSegmenter(onWords)
    segmenter.Init(...)
    for segmenter.Next() ...
Attention ¶
Before using word breakers, clients usually should initialize the classes and rules:
SetupUAX29Classes()
This initializes all the code-point range tables. Initialization is not done at package load time because the tables consume a considerable amount of memory. If the range tables are not yet initialized, the word breaker calls SetupUAX29Classes itself.
______________________________________________________________________
License ¶
This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:
SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'
You may use the project under the terms of either license.
Licenses are reproduced in the license file in the root folder of this module.
Copyright © 2021 Norbert Pillmayer <norbert@pillmayer.com>
Constants ¶
This section is empty.
Variables ¶
var (
	PenaltyForBreak        = 50
	PenaltyToSuppressBreak = 10000
	PenaltyForMustBreak    = -10000
)
Penalties (inter-word optional break, suppress break and mandatory break).
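As a rough illustration of how the three penalty values relate, the sketch below classifies a penalty by comparing it with the documented values. The helper describePenalty and the threshold logic are assumptions made for this example; the package does not document how a segmenter interprets penalties.

```go
package main

import "fmt"

// Values copied from the package variables documented above.
var (
	PenaltyForBreak        = 50
	PenaltyToSuppressBreak = 10000
	PenaltyForMustBreak    = -10000
)

// describePenalty is a hypothetical helper, not part of uax29.
// Assumption: the lower a penalty, the more desirable a break is.
func describePenalty(p int) string {
	switch {
	case p <= PenaltyForMustBreak:
		return "mandatory break"
	case p >= PenaltyToSuppressBreak:
		return "break suppressed"
	default:
		return "optional break"
	}
}

func main() {
	for _, p := range []int{PenaltyForBreak, PenaltyToSuppressBreak, PenaltyForMustBreak} {
		fmt.Printf("%6d: %s\n", p, describePenalty(p))
	}
}
```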
var ALetter, CR, Double_Quote, Extend, ExtendNumLet, Format, Hebrew_Letter, Katakana, LF, MidLetter,
MidNum, MidNumLet, Newline, Numeric, Regional_Indicator, Single_Quote, WSegSpace, ZWJ *unicode.RangeTable
Range tables for the UAX#29 code-point classes. They are initialized by SetupUAX29Classes(). Clients can test a rune for class membership with unicode.Is(..., rune)
Functions ¶
func SetupUAX29Classes ¶
func SetupUAX29Classes()
SetupUAX29Classes is the top-level preparation function: it creates the code-point classes for word breaking and, in turn, sets up the emoji classes as well. It is safe for concurrent use.
The word breaker will call this transparently if it has not been called beforehand.
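A common way to get lazy, concurrency-safe setup of this kind in Go is sync.Once. The sketch below shows that pattern with hypothetical names (setupClasses, isDigit, and a stand-in table); whether the package itself uses sync.Once is an assumption, not something the documentation states.

```go
package main

import (
	"fmt"
	"sync"
	"unicode"
)

var (
	setupOnce  sync.Once
	setupCalls int                 // counts how often the tables were built
	digits     *unicode.RangeTable // hypothetical table, nil until set up
)

// setupClasses builds the table at most once, however many goroutines call it.
func setupClasses() {
	setupOnce.Do(func() {
		setupCalls++
		digits = unicode.Digit // stand-in for an expensive table construction
	})
}

// isDigit triggers setup transparently, as the word breaker is said to do.
func isDigit(r rune) bool {
	setupClasses()
	return unicode.Is(digits, r)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = isDigit('3')
		}()
	}
	wg.Wait()
	fmt.Println("setup ran", setupCalls, "time(s); '3' is a digit:", isDigit('3'))
}
```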
Types ¶
type UAX29Class ¶
type UAX29Class int
Type for UAX#29 code-point classes. Must be convertible to int.
const (
	ALetterClass            UAX29Class = 0
	CRClass                 UAX29Class = 1
	Double_QuoteClass       UAX29Class = 2
	ExtendClass             UAX29Class = 3
	ExtendNumLetClass       UAX29Class = 4
	FormatClass             UAX29Class = 5
	Hebrew_LetterClass      UAX29Class = 6
	KatakanaClass           UAX29Class = 7
	LFClass                 UAX29Class = 8
	MidLetterClass          UAX29Class = 9
	MidNumClass             UAX29Class = 10
	MidNumLetClass          UAX29Class = 11
	NewlineClass            UAX29Class = 12
	NumericClass            UAX29Class = 13
	Regional_IndicatorClass UAX29Class = 14
	Single_QuoteClass       UAX29Class = 15
	WSegSpaceClass          UAX29Class = 16
	ZWJClass                UAX29Class = 17
	Other                   UAX29Class = 999
)
These are all the UAX#29 breaking classes.
func ClassForRune ¶
func ClassForRune(r rune) UAX29Class
ClassForRune returns the UAX#29 word class for a Unicode code-point.
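One plausible way to implement such a lookup is to test the rune against each class table in turn and fall back to a catch-all class. The sketch below does this with standard-library tables as stand-ins; wordClass, classTables and classForRune are hypothetical names, and the real ClassForRune may work differently.

```go
package main

import (
	"fmt"
	"unicode"
)

// wordClass is a hypothetical, much reduced analogue of UAX29Class.
type wordClass int

const (
	letterClass  wordClass = 0
	numericClass wordClass = 1
	otherClass   wordClass = 999 // catch-all, like Other above
)

// classTables pairs each class with a range table. unicode.Letter and
// unicode.Digit are stand-ins for the UAX#29 tables.
var classTables = []struct {
	class wordClass
	table *unicode.RangeTable
}{
	{letterClass, unicode.Letter},
	{numericClass, unicode.Digit},
}

// classForRune returns the first class whose table contains r.
func classForRune(r rune) wordClass {
	for _, c := range classTables {
		if unicode.Is(c.table, r) {
			return c.class
		}
	}
	return otherClass
}

func main() {
	for _, r := range "a7!" {
		fmt.Printf("%q -> class %d\n", r, classForRune(r))
	}
}
```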
type WordBreaker ¶
type WordBreaker struct {
// contains filtered or unexported fields
}
WordBreaker is a Breaker type used by a uax.Segmenter to break text up according to UAX#29 / Words. It implements the uax.UnicodeBreaker interface.
Example ¶
package main

import (
	"fmt"
	"strings"

	"github.com/npillmayer/uax/segment"
	"github.com/npillmayer/uax/uax29"
)

func main() {
	onWords := uax29.NewWordBreaker(1)
	segmenter := segment.NewSegmenter(onWords)
	segmenter.Init(strings.NewReader("Hello World🇩🇪!"))
	for segmenter.Next() {
		fmt.Printf("'%s'\n", segmenter.Text())
	}
}

Output:
'Hello'
' '
'World'
'🇩🇪'
'!'
func NewWordBreaker ¶
func NewWordBreaker(weight int) *WordBreaker
NewWordBreaker creates a new UAX#29 word breaker.
Usage:
    onWords := NewWordBreaker(1)
    segmenter := segment.NewSegmenter(onWords)
    segmenter.Init(...)
    for segmenter.Next() ...
weight is a multiplying factor for penalties. It must lie in the range 0…5; values outside this range are capped. Currently this is not used by any test and should probably be left at 1.
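The capping described above can be sketched as follows. clampWeight and weightedPenalty are hypothetical helpers, and the assumption that the weight simply multiplies the base penalty follows the wording "multiplying factor" rather than the package source.

```go
package main

import "fmt"

// basePenalty mirrors the documented value of PenaltyForBreak.
const basePenalty = 50

// clampWeight caps w to the documented range 0…5.
func clampWeight(w int) int {
	if w < 0 {
		return 0
	}
	if w > 5 {
		return 5
	}
	return w
}

// weightedPenalty applies the clamped weight as a plain multiplier
// (an assumption made for this sketch).
func weightedPenalty(w int) int {
	return clampWeight(w) * basePenalty
}

func main() {
	for _, w := range []int{-1, 1, 3, 9} {
		fmt.Printf("weight %2d -> clamped %d -> penalty %d\n",
			w, clampWeight(w), weightedPenalty(w))
	}
}
```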
func (*WordBreaker) CodePointClassFor ¶
func (gb *WordBreaker) CodePointClassFor(r rune) int
CodePointClassFor returns the UAX#29 word code-point class for a rune (= code-point). (Interface uax.UnicodeBreaker)
func (*WordBreaker) LongestActiveMatch ¶
func (gb *WordBreaker) LongestActiveMatch() int
LongestActiveMatch collects the current match length from all active recognizers and returns the longest one. (Interface uax.UnicodeBreaker)
func (*WordBreaker) Penalties ¶
func (gb *WordBreaker) Penalties() []int
Penalties gets all active penalties for all active recognizers combined. Index 0 belongs to the most recently read rune, i.e., represents the penalty for breaking after it. (Interface uax.UnicodeBreaker)
func (*WordBreaker) ProceedWithRune ¶
func (gb *WordBreaker) ProceedWithRune(r rune, cpClass int)
ProceedWithRune is a signal: A new code-point has been read and this breaker receives a message to consume it. (Interface uax.UnicodeBreaker)
func (*WordBreaker) StartRulesFor ¶
func (gb *WordBreaker) StartRulesFor(r rune, cpClass int)
StartRulesFor starts all recognizers where the starting symbol is rune r. r is of code-point-class cpClass. (Interface uax.UnicodeBreaker)