uax29

package
v0.2.0
Published: Dec 8, 2021 License: BSD-3-Clause, Unlicense Imports: 7 Imported by: 1

Documentation

Overview

Package uax29 implements Unicode Annex #29 word breaking.

Content

UAX#29 is the Unicode Annex for breaking text into graphemes, words and sentences. It defines code-point classes and sets of rules that determine where break points and break inhibitors are placed. This package covers word breaking.

This segmenter passes all 1823 tests of the Unicode UAX#29 test suite for word breaking.

Typical Usage

Clients instantiate a WordBreaker object and use it as the breaking engine for a segmenter.

onWords := uax29.NewWordBreaker(1)
segmenter := segment.NewSegmenter(onWords)
segmenter.Init(...)
for segmenter.Next() ...

Attention

Before using word breakers, clients usually should initialize the classes and rules:

SetupUAX29Classes()

This initializes all the code-point range tables. Initialization is not done up front because it consumes a fair amount of memory. However, the word breaker will call it if the range tables have not yet been initialized.
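In Go, this kind of on-demand, run-once initialization is commonly written with sync.Once. The following is a minimal sketch of that pattern only, not the package's actual implementation; the names setupClasses, digits, setupCount and isDigitClass are hypothetical and use a tiny stand-in table.

```go
package main

import (
	"fmt"
	"sync"
	"unicode"
)

// digits is a hypothetical stand-in for one of the package's range tables.
var digits *unicode.RangeTable

var setupOnce sync.Once

// setupCount records how often the expensive setup really ran (illustration only).
var setupCount int

// setupClasses builds the table at most once, even under concurrent calls.
func setupClasses() {
	setupOnce.Do(func() {
		setupCount++
		digits = &unicode.RangeTable{
			R16:         []unicode.Range16{{Lo: '0', Hi: '9', Stride: 1}},
			LatinOffset: 1,
		}
	})
}

// isDigitClass triggers the setup lazily before using the table.
func isDigitClass(r rune) bool {
	setupClasses()
	return unicode.Is(digits, r)
}

func main() {
	fmt.Println(isDigitClass('7'), isDigitClass('x'), setupCount)
}
```

Calling setupClasses explicitly at program start moves the memory cost to a predictable point; relying on the lazy call defers it until the first lookup.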

______________________________________________________________________

License

This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:

SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'

You may use the project under the terms of either license.

Licenses are reproduced in the license file in the root folder of this module.

Copyright © 2021 Norbert Pillmayer <norbert@pillmayer.com>


Variables

var (
	PenaltyForBreak        = 50
	PenaltyToSuppressBreak = 10000
	PenaltyForMustBreak    = -10000
)

Penalties (inter-word optional break, suppress break and mandatory break).
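To illustrate how a client might read these values, here is a small sketch that sorts a penalty into one of the three kinds. The constants are copied from the declaration above; the function breakKind and the threshold comparisons are assumptions made for illustration, not part of the package.

```go
package main

import "fmt"

// Values copied from the documented package variables.
const (
	penaltyForBreak        = 50
	penaltyToSuppressBreak = 10000
	penaltyForMustBreak    = -10000
)

// breakKind is a hypothetical helper. It assumes that anything at or below
// the must-break value is mandatory and anything at or above the suppress
// value is inhibited; everything in between is an optional break.
func breakKind(p int) string {
	switch {
	case p <= penaltyForMustBreak:
		return "mandatory"
	case p >= penaltyToSuppressBreak:
		return "suppressed"
	default:
		return "optional"
	}
}

func main() {
	for _, p := range []int{penaltyForMustBreak, penaltyForBreak, penaltyToSuppressBreak} {
		fmt.Println(p, breakKind(p))
	}
}
```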

var ALetter, CR, Double_Quote, Extend, ExtendNumLet, Format, Hebrew_Letter, Katakana, LF, MidLetter,
	MidNum, MidNumLet, Newline, Numeric, Regional_Indicator, Single_Quote, WSegSpace, ZWJ *unicode.RangeTable

Range tables for UAX#29 code-point classes. They are initialized by SetupUAX29Classes(). Clients can test membership with unicode.Is(..., rune).
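A membership check with unicode.Is looks roughly as follows. To keep the sketch self-contained it uses a hand-written table, regionalIndicator, covering the regional indicator symbols U+1F1E6 to U+1F1FF; this table and the helper isRegionalIndicator are stand-ins, and the table shipped with the package may be built differently.

```go
package main

import (
	"fmt"
	"unicode"
)

// regionalIndicator is a hand-written stand-in for a package-level table.
// It covers the regional indicator symbols U+1F1E6..U+1F1FF.
var regionalIndicator = &unicode.RangeTable{
	R32: []unicode.Range32{{Lo: 0x1F1E6, Hi: 0x1F1FF, Stride: 1}},
}

// isRegionalIndicator reports whether r falls into the stand-in table.
func isRegionalIndicator(r rune) bool {
	return unicode.Is(regionalIndicator, r)
}

func main() {
	// The flag is a pair of regional indicator code points.
	for _, r := range "D🇩🇪" {
		fmt.Printf("%U %v\n", r, isRegionalIndicator(r))
	}
}
```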

Functions

func SetupUAX29Classes

func SetupUAX29Classes()

SetupUAX29Classes is the top-level preparation function: it creates the code-point classes for word breaking and, in turn, sets up the emoji classes as well. It is safe for concurrent use.

The word breaker will call this transparently if it has not been called beforehand.

Types

type UAX29Class

type UAX29Class int

Type for UAX#29 code-point classes. Must be convertible to int.

const (
	ALetterClass            UAX29Class = 0
	CRClass                 UAX29Class = 1
	Double_QuoteClass       UAX29Class = 2
	ExtendClass             UAX29Class = 3
	ExtendNumLetClass       UAX29Class = 4
	FormatClass             UAX29Class = 5
	Hebrew_LetterClass      UAX29Class = 6
	KatakanaClass           UAX29Class = 7
	LFClass                 UAX29Class = 8
	MidLetterClass          UAX29Class = 9
	MidNumClass             UAX29Class = 10
	MidNumLetClass          UAX29Class = 11
	NewlineClass            UAX29Class = 12
	NumericClass            UAX29Class = 13
	Regional_IndicatorClass UAX29Class = 14
	Single_QuoteClass       UAX29Class = 15
	WSegSpaceClass          UAX29Class = 16
	ZWJClass                UAX29Class = 17

	Other UAX29Class = 999
)

These are all the UAX#29 breaking classes.

func ClassForRune

func ClassForRune(r rune) UAX29Class

ClassForRune returns the UAX#29 word class for a Unicode code-point.

func (UAX29Class) String

func (c UAX29Class) String() string

String implements fmt.Stringer for type UAX29Class.
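The combination of an int-based class type and a Stringer can be sketched as below. The type wordClass, its three constants and the classNames map are hypothetical stand-ins that mirror a few of the documented values; the package's own String method may produce different text.

```go
package main

import "fmt"

// wordClass is a hypothetical stand-in for UAX29Class.
type wordClass int

const (
	aLetterClass wordClass = 0
	crClass      wordClass = 1
	otherClass   wordClass = 999
)

// classNames maps the illustrative class values to display names.
var classNames = map[wordClass]string{
	aLetterClass: "ALetter",
	crClass:      "CR",
	otherClass:   "Other",
}

// String makes wordClass satisfy fmt.Stringer.
func (c wordClass) String() string {
	if name, ok := classNames[c]; ok {
		return name
	}
	return fmt.Sprintf("wordClass(%d)", int(c))
}

func main() {
	c := crClass
	// Printing uses String(); the explicit conversion yields the plain int
	// that an interface method such as CodePointClassFor returns.
	fmt.Println(c, int(c))
}
```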

type WordBreaker

type WordBreaker struct {
	// contains filtered or unexported fields
}

WordBreaker is a Breaker type used by a uax.Segmenter to break text up according to UAX#29 / Words. It implements the uax.UnicodeBreaker interface.

Example
package main

import (
	"fmt"
	"strings"

	"github.com/npillmayer/uax/segment"
	"github.com/npillmayer/uax/uax29"
)

func main() {
	onWords := uax29.NewWordBreaker(1)
	segmenter := segment.NewSegmenter(onWords)
	segmenter.Init(strings.NewReader("Hello World🇩🇪!"))
	for segmenter.Next() {
		fmt.Printf("'%s'\n", segmenter.Text())
	}
}
Output:

'Hello'
' '
'World'
'🇩🇪'
'!'

func NewWordBreaker

func NewWordBreaker(weight int) *WordBreaker

NewWordBreaker creates a new UAX#29 word breaker.

Usage:

onWords := NewWordBreaker(1)
segmenter := segment.NewSegmenter(onWords)
segmenter.Init(...)
for segmenter.Next() ...

weight is a multiplying factor for penalties. It must satisfy 0 ≤ weight ≤ 5; values outside this range are capped. Currently no test uses it, and it should probably be left at 1.
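The capping described above can be pictured with a short sketch. The helpers capWeight and weightedPenalty are hypothetical, and the plain multiplication is an assumption about how a "multiplying factor" would be applied; the package may combine weight and penalty differently.

```go
package main

import "fmt"

// Bounds taken from the documented range of the weight parameter.
const (
	minWeight = 0
	maxWeight = 5
)

// basePenalty mirrors the documented default of PenaltyForBreak.
const basePenalty = 50

// capWeight clamps w into [minWeight, maxWeight].
func capWeight(w int) int {
	if w < minWeight {
		return minWeight
	}
	if w > maxWeight {
		return maxWeight
	}
	return w
}

// weightedPenalty assumes the weight is applied by simple multiplication.
func weightedPenalty(base, weight int) int {
	return base * capWeight(weight)
}

func main() {
	for _, w := range []int{-1, 1, 3, 9} {
		fmt.Println(w, capWeight(w), weightedPenalty(basePenalty, w))
	}
}
```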

func (*WordBreaker) CodePointClassFor

func (gb *WordBreaker) CodePointClassFor(r rune) int

CodePointClassFor returns the UAX#29 word code-point class for a rune (= code-point). (Interface uax.UnicodeBreaker)

func (*WordBreaker) LongestActiveMatch

func (gb *WordBreaker) LongestActiveMatch() int

LongestActiveMatch collects the current match length from all active recognizers and returns the longest one. (Interface uax.UnicodeBreaker)

func (*WordBreaker) Penalties

func (gb *WordBreaker) Penalties() []int

Penalties gets all active penalties for all active recognizers combined. Index 0 belongs to the most recently read rune, i.e., represents the penalty for breaking after it. (Interface uax.UnicodeBreaker)
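Because index 0 refers to the most recently read rune, the slice runs backwards relative to the text. The sketch below shows one way to map a penalty index back to a rune position; runeIndexFor is a hypothetical helper and the penalty values are made up for illustration, not produced by the word breaker.

```go
package main

import "fmt"

// runeIndexFor maps penalty index i to a 0-based rune position, given that
// n runes have been read and index 0 belongs to the most recent one.
func runeIndexFor(n, i int) int {
	return n - 1 - i
}

func main() {
	text := []rune("Hi!")
	// Made-up penalties: index 0 is the penalty for breaking after '!'.
	penalties := []int{-10000, 50, 10000}
	for i, p := range penalties {
		pos := runeIndexFor(len(text), i)
		fmt.Printf("after %q: %d\n", text[pos], p)
	}
}
```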

func (*WordBreaker) ProceedWithRune

func (gb *WordBreaker) ProceedWithRune(r rune, cpClass int)

ProceedWithRune is a signal: A new code-point has been read and this breaker receives a message to consume it. (Interface uax.UnicodeBreaker)

func (*WordBreaker) StartRulesFor

func (gb *WordBreaker) StartRulesFor(r rune, cpClass int)

StartRulesFor starts all recognizers where the starting symbol is rune r. r is of code-point-class cpClass. (Interface uax.UnicodeBreaker)

Directories

Path: internal/generator
Synopsis: Package for a generator for UAX#29 word breaking classes.
