Documentation ¶
Overview ¶
Package uax14 implements Unicode Annex #14 line wrap.
Under active development; use at your own risk
Contents ¶
UAX#14 is the Unicode Annex for Line Breaking (Line Wrap). It defines a set of code-point classes and a set of rules for where to place break points and break inhibitors.
Typical Usage ¶
Clients instantiate a UAX#14 line breaker object and use it as the breaking engine for a segmenter.
    breaker := uax14.NewLineWrap()
    segmenter := unicode.NewSegmenter(breaker)
    segmenter.Init(...)
    for segmenter.Next() {
        ... // do something with segmenter.Text() or segmenter.Bytes()
    }
Before using line breakers, clients will usually want to initialize the UAX#14 classes and rules.
SetupClasses()
This initializes all the code-point range tables. Initialization is not done in advance because the tables consume a considerable amount of memory and using UAX#14 is optional. However, NewLineWrap() calls SetupClasses() automatically.
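The deferred, run-once setup described above can be sketched with sync.Once. This is only an illustration of the pattern, not the package's actual code: setupClasses, spTable, setupRuns, and isSpaceClass are hypothetical names, and the single table is a stand-in for the real class tables.

```go
package main

import (
	"fmt"
	"sync"
	"unicode"
)

// Hypothetical stand-ins; the real package exposes many more tables.
var (
	setupOnce sync.Once
	spTable   *unicode.RangeTable // stand-in for a class table such as SP
	setupRuns int                 // counts how often the tables were built
)

// setupClasses builds the tables at most once, even when it is called
// from several goroutines at the same time.
func setupClasses() {
	setupOnce.Do(func() {
		setupRuns++
		spTable = &unicode.RangeTable{
			R16:         []unicode.Range16{{Lo: 0x0020, Hi: 0x0020, Stride: 1}},
			LatinOffset: 1, // one R16 entry lies within Latin-1
		}
	})
}

// isSpaceClass triggers the setup lazily and then checks membership.
func isSpaceClass(r rune) bool {
	setupClasses()
	return unicode.Is(spTable, r)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			setupClasses()
		}()
	}
	wg.Wait()
	fmt.Println(setupRuns, isSpaceClass(' '), isSpaceClass('x'))
}
```

With this pattern, programs that never touch line breaking pay nothing for the tables, while callers that do need them can call the setup function as often as they like.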
Status ¶
The current implementation passes all but 3 of the tests from the UAX#14 test file:
uax14_test.go:65: 3 TEST CASES OUT of 7001 FAILED
License ¶
This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:
SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'
You may use the project under the terms of either license.
Licenses are reproduced in the license file in the root folder of this module.
Copyright © 2021 Norbert Pillmayer <norbert@pillmayer.com>
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
	PenaltyToSuppressBreak = 10000  // Suppress break: ×
	PenaltyForMustBreak    = -19000 // Break: !
	DefaultPenalty         = 1      // Rule LB31: ÷ fragile, do not change!
)
Penalties (suppress break and mandatory break).
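As a rough illustration of how a client might read these values, the sketch below maps a penalty to a textual verdict. The constants are copied from the declaration above; the helper describe and its threshold comparisons are assumptions made for this example, not part of the package.

```go
package main

import "fmt"

// Values copied from the documented variables.
const (
	penaltyToSuppressBreak = 10000
	penaltyForMustBreak    = -19000
	defaultPenalty         = 1
)

// describe is a hypothetical helper. It assumes that penalties at or above
// the suppress value forbid a break and that penalties at or below the
// must-break value force one; everything in between permits a break.
func describe(penalty int) string {
	switch {
	case penalty >= penaltyToSuppressBreak:
		return "no break"
	case penalty <= penaltyForMustBreak:
		return "mandatory break"
	default:
		return "break allowed"
	}
}

func main() {
	for _, p := range []int{penaltyToSuppressBreak, defaultPenalty, penaltyForMustBreak} {
		fmt.Printf("%6d: %s\n", p, describe(p))
	}
}
```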
var AI, AL, B2, BA, BB, BK, CB, CJ, CL, CM,
CP, CR, EB, EM, EX, GL, H2, H3, HL, HY,
ID, IN, IS, JL, JT, JV, LF, NL, NS, NU,
OP, PO, PR, QU, RI, SA, SG, SP, SY, WJ,
XX, ZW, ZWJ *unicode.RangeTable
Range tables for UAX#14 code-point classes. They are initialized by SetupClasses(). Clients can test a rune for membership with unicode.Is(..., rune).
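A membership check against a range table might look like the following sketch. To keep it self-contained, it uses the standard library's unicode.Is with two hand-built tables; nuTable and spTable are hypothetical stand-ins covering only ASCII digits and the space character, whereas the package's NU and SP tables would cover the full classes.

```go
package main

import (
	"fmt"
	"unicode"
)

// Hypothetical stand-ins for two class tables.
var (
	nuTable = &unicode.RangeTable{ // ASCII digits only
		R16:         []unicode.Range16{{Lo: '0', Hi: '9', Stride: 1}},
		LatinOffset: 1,
	}
	spTable = &unicode.RangeTable{ // U+0020 SPACE only
		R16:         []unicode.Range16{{Lo: ' ', Hi: ' ', Stride: 1}},
		LatinOffset: 1,
	}
)

// countInClass reports how many runes of s belong to the given table.
func countInClass(table *unicode.RangeTable, s string) int {
	n := 0
	for _, r := range s {
		if unicode.Is(table, r) {
			n++
		}
	}
	return n
}

func main() {
	text := "a1 b22"
	fmt.Println(countInClass(nuTable, text), countInClass(spTable, text))
}
```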
Functions ¶
func SetupClasses ¶
func SetupClasses()
SetupClasses is the top-level preparation function: it creates the code-point classes for UAX#14 line breaking/wrap. It is safe for concurrent use.
Types ¶
type LineWrap ¶
type LineWrap struct {
// contains filtered or unexported fields
}
LineWrap is a type used by a unicode.Segmenter to break lines up according to UAX#14. It implements the unicode.UnicodeBreaker interface.
func NewLineWrap ¶
func NewLineWrap() *LineWrap
NewLineWrap creates a new UAX#14 line breaker.
Usage:
    linewrap := NewLineWrap()
    segmenter := segment.NewSegmenter(linewrap)
    segmenter.Init(...)
    for segmenter.Next() ...
func (*LineWrap) CodePointClassFor ¶
CodePointClassFor returns the UAX#14 code-point class for a rune (= code-point).
Interface unicode.UnicodeBreaker
func (*LineWrap) LongestActiveMatch ¶
LongestActiveMatch is part of interface unicode.UnicodeBreaker.
func (*LineWrap) Penalties ¶
Penalties gets all active penalties for all active recognizers combined. Index 0 belongs to the most recently read rune.
Interface unicode.UnicodeBreaker
func (*LineWrap) ProceedWithRune ¶
ProceedWithRune is part of interface unicode.UnicodeBreaker. A new code-point has been read and this breaker receives a message to consume it.
func (*LineWrap) StartRulesFor ¶
StartRulesFor starts all recognizers where the starting symbol is rune r. r is of code-point-class cpClass.
Interface unicode.UnicodeBreaker
type UAX14Class ¶
type UAX14Class int
Type for UAX#14 code-point classes. Must be convertible to int.
const (
	AIClass  UAX14Class = 0
	ALClass  UAX14Class = 1
	B2Class  UAX14Class = 2
	BAClass  UAX14Class = 3
	BBClass  UAX14Class = 4
	BKClass  UAX14Class = 5
	CBClass  UAX14Class = 6
	CJClass  UAX14Class = 7
	CLClass  UAX14Class = 8
	CMClass  UAX14Class = 9
	CPClass  UAX14Class = 10
	CRClass  UAX14Class = 11
	EBClass  UAX14Class = 12
	EMClass  UAX14Class = 13
	EXClass  UAX14Class = 14
	GLClass  UAX14Class = 15
	H2Class  UAX14Class = 16
	H3Class  UAX14Class = 17
	HLClass  UAX14Class = 18
	HYClass  UAX14Class = 19
	IDClass  UAX14Class = 20
	INClass  UAX14Class = 21
	ISClass  UAX14Class = 22
	JLClass  UAX14Class = 23
	JTClass  UAX14Class = 24
	JVClass  UAX14Class = 25
	LFClass  UAX14Class = 26
	NLClass  UAX14Class = 27
	NSClass  UAX14Class = 28
	NUClass  UAX14Class = 29
	OPClass  UAX14Class = 30
	POClass  UAX14Class = 31
	PRClass  UAX14Class = 32
	QUClass  UAX14Class = 33
	RIClass  UAX14Class = 34
	SAClass  UAX14Class = 35
	SGClass  UAX14Class = 36
	SPClass  UAX14Class = 37
	SYClass  UAX14Class = 38
	WJClass  UAX14Class = 39
	XXClass  UAX14Class = 40
	ZWClass  UAX14Class = 41
	ZWJClass UAX14Class = 42
)
These are all the UAX#14 breaking classes.
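Because the classes are plain integers, client code that logs them may want readable names. The sketch below shows one way to do that with a Stringer on a local type. lineClass, classNames, and the String method are hypothetical and mirror only the first four documented values; the package itself may already offer something equivalent.

```go
package main

import "fmt"

// lineClass is a local stand-in for the documented int-based class type.
type lineClass int

// First four values, copied from the documented constant list.
const (
	aiClass lineClass = 0
	alClass lineClass = 1
	b2Class lineClass = 2
	baClass lineClass = 3
)

// classNames is indexed by class value.
var classNames = [...]string{"AI", "AL", "B2", "BA"}

// String returns the short class name, or a numeric fallback for values
// outside the range this sketch knows about.
func (c lineClass) String() string {
	if c < 0 || int(c) >= len(classNames) {
		return fmt.Sprintf("lineClass(%d)", int(c))
	}
	return classNames[c]
}

func main() {
	fmt.Println(alClass, lineClass(99))
}
```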
func ClassForRune ¶
func ClassForRune(r rune) UAX14Class
ClassForRune gets the line breaking/wrap class for a Unicode code-point.