uax

package module
v0.2.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 8, 2021 License: BSD-3-Clause, Unlicense Imports: 5 Imported by: 1

README

UAX Logo

Unicode Text Segmentation Algorithms

Text processing applications need to segment text into pieces. Segments may be

  • words
  • sentences
  • paragraphs

and so on. For western languages this is not too hard of a problem, but it may become an involved endeavor if you consider Arabic or Asian languages. From a typographic viewpoint some of these languages present serious challenges for correct segmenting. The Unicode consortium publishes recommendations and algorithms for various aspects of text segmentation in their Unicode Annexes (UAX).

Text Segmentation in Go

There exist a number of Unicode standards describing best practices for text segmentation. Unfortunately, implementations in Go are sparse. Marcel van Lohuizen from the Go Core Team seems to be working on text segmenting, but with low priority. In the long run, it will be best to wait for the standard library to include functions for text segmentation. However, for now I will implement my own.

Status

This is very much work in progress, not intended for production use. Please be patient.

Documentation

Overview

Package uax is about Unicode Annexes and their algorithms.

Description

From the Unicode Consortium:

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

[...]

A string of Unicode‐encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end‐of‐sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user.

[...]

There are many different ways to divide text elements corresponding to user‐perceived characters, words, and sentences, and the Unicode Standard does not restrict the ways in which implementations can produce these divisions.

This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments. For example, reliable detection of word boundaries in languages such as Thai, Lao, Chinese, or Japanese requires the use of dictionary lookup, analogous to English hyphenation.

Package Contents

Implementations of specific UAX algorithms is done in the various sub-packages of uax. The driver type for some of the breaking-algorithms sits in sub-package segment and will use breaker-algorithms from other sub-packages.

Base package uax provides some of the necessary means to implement UAX breaking algorithms. Please note that it is in now way mandatory to use the supporting types and functions of this package. Implementors of additional breaking algorithms are free to ignore some or all of the helpers and instead implement their breaking algorithms from scratch.

Every implementation of UAX breaking algorithms has to handle the trade-off between efficiency and understandability. Algorithms as described in the Unicodes Annex documents are no easy read when considering all the details and edge cases. Getting it 100% right therefore sometimes may be tricky. Implementations in the sub-packages of uax try to strike a balance between efficiency and readability. The helper classes of uax allow implementors to transform UAX-rules into fairly readable small functions. From a maintenance point-of-view this is preferrable to huge and complex cascades of if-statements, which may provide better performance, but are hard to understand. Most of the breaking algorithms within sub-packages of uax therefore utilize the helper types from package uax.

We perform segmenting Unicode text based on rules, which are short regular expressions, i.e. finite state automata. This corresponds well with the formal UAX description of rules (except for the Bidi-rules, which are better understood as rules for a context-sensitive grammar). Every step within a rule is performed by executing a function. This function recognizes a single code-point class and returns another function. The returned function represents the expectation for the next code-point(-class). These kind of matching by function is continued until a rule is accepted or aborted.

An example let's consider rule WB13b “Do not break from extenders” from UAX#29:

ExtendNumLet x (ALetter | Hebrew_Letter| Numeric | Katakana)

The 'x' denotes a suppressed break. All the identifiers are UAX#29-specific classes for code-points. Matching them will call two functions in sequence:

   rule_WB13b( … )   // match ExtendNumLet
-> finish_WB13b( … ) // match any of ALetter … Katakana

The final return value will either signal an accept or abort.

The uax helper to perform this kind of matching is called Recognizer. A set of Recognizers comprises an NFA and will match break opportunities for a given UAX rule-set. Recognizers receive rune events and therefore implement interface RuneSubscriber.

Rune Events

Walking the runes (= code-points) of a Unicode text and firing rules to match segments will produce a high fluctuation of short-lived Recognizers. Every Recognizer will have to react to the next rune read. Package uax provides a publish-subscribe mechanism for signalling new runes to all active Recognizers.

The default rune-publisher will distribute rune events to rune-subscribers and collect return values. Subscribers are required to return active matches and possible break-opportunities (or suppression thereof). After all subscribers are done consuming the rune, the publisher harvests subscribers which have ended their life-cycle (i.e., either accepted or aborted). Dead subscribers are flagging this with Done()==true and get unsubscribed.

Breaking algorithms are performed by `UnicodeBreaker`s (an interface type). The UnicodeBreakers in sub-packages of this package utilize UnicodePublishers as described above. The segment-driver needs one or more UnicodeBreakers to perform breaking logic.

Penalties

Algorithms in this package will signal break opportunities for Unicode text. However, breaks are not signalled with true/false, but rather with a weighted “penalty.” Every break is connoted with an integer value, representing the desirability of the break. Negative values denote a negative penalty, i.e., a merit. High enough penalties signal the complete suppression of a break opportunity, causing the segmenter to not report this break.

The UnicodeBreakers in this package (including sub-packages) will apply the following logic:

(1) Mandatory breaks will have a penalty/merit of ≤ -10000 (uax.InfinitePenalty).

(2) Inhibited breaks will have penalty ≥ 10000 (uax.InfiniteMerits).

(3) Neutral positions will have a penalty of 0. The segmenter can be configured to regard the zero value as breakable or not.

The segmenter will aggregate penalties from its breakers and output aggregated penalties to the client.

______________________________________________________________________

License

This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:

SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'

You may use the project under the terms of either license.

Licenses are reproduced in the license file in the root folder of this module.

Copyright © 2021 Norbert Pillmayer <norbert@pillmayer.com>

Index

Constants

View Source
const (
	InfinitePenalty = 10000
	InfiniteMerits  = -10000
)

We define constants for flagging break points as infinitely bad and infinitely good, respectively.

Variables

This section is empty.

Functions

func PositionOfFirstLegalRune

func PositionOfFirstLegalRune(s string) (int, []byte)

PositionOfFirstLegalRune returns a legal Unicode code point start position and cut-off prefix, if any.

Types

type DefaultRunePublisher

type DefaultRunePublisher struct {
	// contains filtered or unexported fields
}

DefaultRunePublisher is a type to organize RuneSubscribers.

Rune publishers have to maintain a list of subscribers. Subscribers are then notified on the arrival of new runes (code-points) by sending them rune-events. When a subscriber is done with consuming runes (subscribers are often short-lived), it signals Done()=true.

The DefaultRunePublisher data structure "prioritizes" subscribers with Done()=true within a queue. It maintains a "gap" position between done and not-done. The queue grows as needed.

A DefaultRunePublisher implements RunePublisher.

func NewRunePublisher

func NewRunePublisher() *DefaultRunePublisher

NewRunePublisher creates a new default RunePublisher.

func (*DefaultRunePublisher) Fix

func (pq *DefaultRunePublisher) Fix(at int)

Fix signals that the Done()-flag of a subscriber has changed: inform the queue to let it re-organize.

func (DefaultRunePublisher) Len

func (pq DefaultRunePublisher) Len() int

Len returns the number of subscribers held.

func (*DefaultRunePublisher) Pop

Pop the topmost subscriber.

func (*DefaultRunePublisher) PopDone

func (pq *DefaultRunePublisher) PopDone() RuneSubscriber

PopDone pops the topmost subscriber if it is Done(), otherwise return nil. If the method returns nil, the queue either is empty or holds subscribers with Done()=false only (i.e., subscribers still active).

func (*DefaultRunePublisher) PublishRuneEvent

func (rpub *DefaultRunePublisher) PublishRuneEvent(r rune, codePointClass int) (int, []int)

PublishRuneEvent triggers a rune event notification to all subscribers. Rune events include the rune (code-point) and an optional code-point class for the rune.

Return values are: the longest active match and a slice of penalties. These values usually are collected from the RuneSubscribers. Penalties will be overwritten by the next call to PublishRuneEvent(). Clients will have to make a copy if they want to preserve penalty values.

Interface RunePublisher

func (*DefaultRunePublisher) Push

func (pq *DefaultRunePublisher) Push(subscr RuneSubscriber)

Push puts a new subscriber to the queue.

func (*DefaultRunePublisher) SubscribeMe

func (rpub *DefaultRunePublisher) SubscribeMe(rsub RuneSubscriber) RunePublisher

SubscribeMe lets a client subscribe to a RunePublisher.

Part of interface RunePublisher.

func (DefaultRunePublisher) Top

Top subscriber in queue. If there is at last one Done() subscriber, top() will return one.

type NfaStateFn

type NfaStateFn func(*Recognizer, rune, int) NfaStateFn

NfaStateFn represents a state in a non-deterministic finite automata. Functions of type NfaStateFn try to match a rune (Unicode code-point). The caller may provide a third argument, which should be a rune class. Rune (code-point) classes are described in various Unicode standards and annexes. One such annex, UAX#29, describes classes to help splitting up text into graphemes or words. An example class may be a class of western language alphabetic characters “AL”, of which runes ‘X’ and ‘é’ would be part of.

The first argument is a Recognizer (see the definition of type Recognizer in this package), which carries this state function.

NfaStateFn – after matching a rune – must return another NfaStateFn, which will then in turn be called to process the next rune. The process of matching a string will stop as soon as a NfaStateFn returns nil.

func DoAbort

func DoAbort(rec *Recognizer) NfaStateFn

DoAbort returns a state function which signals abort.

func DoAccept

func DoAccept(rec *Recognizer, penalties ...int) NfaStateFn

DoAccept returns a state function which signals accept, together with break penalties for matched runes (in reverse sequence).

type Recognizer

type Recognizer struct {
	Expect   int         // next code-point to expect; semantics are up to the client
	MatchLen int         // length of active match
	UserData interface{} // clients may need to store additional information
	// contains filtered or unexported fields
}

A Recognizer represents an automata to recognize sequences of runes (i.e. Unicode code-points). Its main functionality is performed by an embedded NfaStateFn. The first NfaStateFn to use is provided with the constructor.

Recognizer's state functions must be careful to increment MatchLen with each matched rune. Failing to do so may result in incorrect splits of text.

Semantics of Expect and UserData are up to the client and not used by the default mechanism.

It is not mandatory to use Recognizers for segmenting text. The type is provided for easier implementation of types implementing UnicodeBreaker. Recognizers implement interface RuneSubscriber and UnicodeBreakers will use a UnicodePublisher to interact with them.

func NewPooledRecognizer

func NewPooledRecognizer(cpClass int, stateFn NfaStateFn) *Recognizer

NewPooledRecognizer returns a new Recognizer, pre-filled with an expected code-point class and a state function. The Recognizer is pooled for efficiency.

func NewRecognizer

func NewRecognizer(codePointClass int, next NfaStateFn) *Recognizer

NewRecognizer creates a new Recognizer. This is rarely used, as clients rather should call NewPooledRecognizer().

see NewPooledRecognizer.

func (*Recognizer) Done

func (rec *Recognizer) Done() bool

Done is used by a Recognizer that it is done matching runes. If MatchLength() > 0 is has been accepting a sequence of runes, otherwise it has aborted to further try a match.

Interface RuneSubscriber

func (*Recognizer) MatchLength

func (rec *Recognizer) MatchLength() int

MatchLength is part of interface RuneSubscriber.

func (*Recognizer) RuneEvent

func (rec *Recognizer) RuneEvent(r rune, codePointClass int) []int

RuneEvent is part of interface RuneSubscriber.

func (*Recognizer) String

func (rec *Recognizer) String() string

Simple stringer for debugging purposes.

func (*Recognizer) Unsubscribed

func (rec *Recognizer) Unsubscribed()

Unsubscribed signals to a Recognizer that it has been unsubscribed from a RunePublisher; usually after the Recognizer's NfaStateFn has returned nil.

Interface RuneSubscriber

type RunePublisher

type RunePublisher interface {
	SubscribeMe(RuneSubscriber) RunePublisher // subscribe an additional rune subscriber
	PublishRuneEvent(r rune, codePointClass int) (longestDistance int, penalties []int)
}

A RunePublisher notifies subscribers with rune events: a new rune has been read and the subscriber – usually a recognizer rule – has to react to it.

UnicodeBreakers are not required to use the RunePublisher/RuneSubscriber pattern, but it is convenient to stick to it. UnicodeBreakers often rely on sets of rules, which are tested interleavingly. To releave UnicodeBreakers from managing rune-distribution to all the rules, it may be advantageous hold a RunePublisher within a UnicodeBreaker and let all rules implement the RuneSubscriber interface.

type RuneSubscriber

type RuneSubscriber interface {
	RuneEvent(r rune, codePointClass int) []int // receive a new code-point
	MatchLength() int                           // length (in # of code-point) of the match up to now
	Done() bool                                 // is this subscriber done?
	Unsubscribed()                              // this subscriber has been unsubscribed
}

A RuneSubscriber is a receiver of rune events, i.e. messages to process a new code-point (rune). If they can match the rune, they will expect further runes, otherwise they abort. To they are finished, either by accepting or rejecting input, they set Done() to true. A successful acceptance of input is signalled by Done()==true and MatchLength()>0.

type UnicodeBreaker

type UnicodeBreaker interface {
	CodePointClassFor(rune) int
	StartRulesFor(rune, int)
	ProceedWithRune(rune, int)
	LongestActiveMatch() int
	Penalties() []int
}

UnicodeBreaker represents a logic to split up Unicode sequences into smaller parts. They are used by Segmenters to supply breaking logic.

Directories

Path Synopsis
Package emoji implements Unicode UTS #51 emoji classes.
Package emoji implements Unicode UTS #51 emoji classes.
internal/generator
Package for a generator for UTS#51 Emoji character classes.
Package for a generator for UTS#51 Emoji character classes.
Package grapheme implements Unicode Annex #29 grapheme breaking.
Package grapheme implements Unicode Annex #29 grapheme breaking.
internal/generator
Package for a generator for UAX#29 Grapheme classes.
Package for a generator for UAX#29 Grapheme classes.
internal
tablegen
tablegen is a helper CLI to create Go source files from Unicode Character Data files.
tablegen is a helper CLI to create Go source files from Unicode Character Data files.
ucdparse
Package ucdparse provides a parser for Unicode Character Database files.
Package ucdparse provides a parser for Unicode Character Database files.
Package segment is about Unicode text segmenting.
Package segment is about Unicode text segmenting.
Package shaping provides tables corresponding to Unicode® Character Data tables relevant for text shaping.
Package shaping provides tables corresponding to Unicode® Character Data tables relevant for text shaping.
Package uax11 provides utilities for Unicode® Standard Annex #11 “East Asian Width”.
Package uax11 provides utilities for Unicode® Standard Annex #11 “East Asian Width”.
Package uax14 implements Unicode Annex #14 line wrap.
Package uax14 implements Unicode Annex #14 line wrap.
internal/generator
Package generator is a generator for UAX#14 classes.
Package generator is a generator for UAX#14 classes.
Package uax29 implements Unicode Annex #29 word breaking.
Package uax29 implements Unicode Annex #29 word breaking.
internal/generator
Package for a generator for UAX#29 word breaking classes.
Package for a generator for UAX#29 word breaking classes.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL