uax14

package
v0.2.0
Warning

This package is not in the latest version of its module.

Published: Dec 8, 2021 License: BSD-3-Clause, Unlicense Imports: 7 Imported by: 2

Documentation

Overview

Package uax14 implements line breaking (line wrap) as specified in Unicode Standard Annex #14.

Under active development; use at your own risk.

Contents

UAX#14 is the Unicode Annex for Line Breaking (Line Wrap). It defines a set of code-point classes and a set of rules that determine where breaks are allowed, required, or prohibited.

Typical Usage

Clients instantiate a UAX#14 line breaker object and use it as the breaking engine for a segmenter.

breaker := uax14.NewLineWrap()
segmenter := segment.NewSegmenter(breaker)
segmenter.Init(...)
for segmenter.Next() {
  ... // do something with segmenter.Text() or segmenter.Bytes()
}

Before using line breakers, clients will usually want to initialize the UAX#14 classes and rules.

SetupClasses()

This initializes all the code-point range tables. Initialization is not done ahead of time because the tables consume a fair amount of memory and using UAX#14 is not mandatory. Clients that call NewLineWrap() do not need to call SetupClasses() themselves, however, as NewLineWrap() calls it automatically.

Status

The current implementation passes all tests from the UAX#14 test file, except 3:

uax14_test.go:65: 3 TEST CASES OUT of 7001 FAILED

______________________________________________________________________

License

This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:

SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'

You may use the project under the terms of either license.

Licenses are reproduced in the license file in the root folder of this module.

Copyright © 2021 Norbert Pillmayer <norbert@pillmayer.com>

Index

Constants

This section is empty.

Variables

var (
	PenaltyToSuppressBreak = 10000  // Suppress break: ×
	PenaltyForMustBreak    = -19000 // Break: !
	DefaultPenalty         = 1      // Rule LB31: ÷    fragile, do not change!
)

Penalties (suppress break and mandatory break).
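As a rough illustration of how a client might read these values, the sketch below maps a penalty to one of the three UAX#14 break symbols. The constants are copied from the declaration above; the helper `breakKind` and its threshold comparisons are assumptions made for this example and are not part of the package.

```go
package main

import "fmt"

// Values copied from the package's Variables section.
const (
	penaltyToSuppressBreak = 10000
	penaltyForMustBreak    = -19000
	defaultPenalty         = 1
)

// breakKind is a hypothetical helper, not part of uax14. It assumes that a
// penalty at or above the suppress value forbids a break and a penalty at
// or below the must-break value forces one.
func breakKind(penalty int) string {
	switch {
	case penalty >= penaltyToSuppressBreak:
		return "no break (×)"
	case penalty <= penaltyForMustBreak:
		return "mandatory break (!)"
	default:
		return "break opportunity (÷)"
	}
}

func main() {
	for _, p := range []int{penaltyToSuppressBreak, penaltyForMustBreak, defaultPenalty} {
		fmt.Printf("%6d: %s\n", p, breakKind(p))
	}
}
```

How combined penalties from several rules are weighed against each other is up to the segmenter; the thresholds here are only one plausible reading.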

var AI, AL, B2, BA, BB, BK, CB, CJ, CL, CM,
	CP, CR, EB, EM, EX, GL, H2, H3, HL, HY,
	ID, IN, IS, JL, JT, JV, LF, NL, NS, NU,
	OP, PO, PR, QU, RI, SA, SG, SP, SY, WJ,
	XX, ZW, ZWJ *unicode.RangeTable

Range tables for UAX#14 code-point classes. They are initialized by SetupClasses(). Clients can test class membership of a rune with unicode.Is(..., rune).
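The membership check uses the standard library's `unicode.Is`. The sketch below shows the mechanics with a hand-written table standing in for a class table such as SP; the variable `spaceTable` and the helper `isSpaceClass` are hypothetical, and the package's real tables are generated from Unicode data.

```go
package main

import (
	"fmt"
	"unicode"
)

// spaceTable is a hand-written stand-in for a class table such as uax14.SP.
// It holds the single code point U+0020 and is for illustration only.
var spaceTable = &unicode.RangeTable{
	R16: []unicode.Range16{{Lo: 0x0020, Hi: 0x0020, Stride: 1}},
}

// isSpaceClass reports whether r is contained in the stand-in table.
func isSpaceClass(r rune) bool {
	return unicode.Is(spaceTable, r)
}

func main() {
	fmt.Println(isSpaceClass(' ')) // true
	fmt.Println(isSpaceClass('a')) // false
}
```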

Functions

func SetupClasses

func SetupClasses()

SetupClasses is the top-level preparation function. It creates the code-point classes for UAX#14 line breaking/wrap and is safe for concurrent use.
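One common way to get lazy, concurrency-safe initialization in Go is `sync.Once`. The sketch below shows that pattern under the assumption that something similar is at work here; the source does not say how SetupClasses is implemented, and all names in the example are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"unicode"
)

var (
	setupOnce sync.Once
	spTable   *unicode.RangeTable // hypothetical stand-in for a class table
	setupRuns int                 // counts how often the expensive setup ran
)

// setupClasses is a hypothetical lazy initializer. sync.Once guarantees that
// the function body runs exactly once, even with concurrent callers.
func setupClasses() {
	setupOnce.Do(func() {
		setupRuns++
		spTable = &unicode.RangeTable{
			R16: []unicode.Range16{{Lo: 0x0020, Hi: 0x0020, Stride: 1}},
		}
	})
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			setupClasses()
		}()
	}
	wg.Wait()
	fmt.Println("setup ran", setupRuns, "time(s)")
}
```

With this pattern, repeated calls after the first are cheap, so a constructor can call the setup function unconditionally.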

Types

type LineWrap

type LineWrap struct {
	// contains filtered or unexported fields
}

LineWrap is a type used by a unicode.Segmenter to break lines up according to UAX#14. It implements the unicode.UnicodeBreaker interface.

func NewLineWrap

func NewLineWrap() *LineWrap

NewLineWrap creates a new UAX#14 line breaker.

Usage:

linewrap := NewLineWrap()
segmenter := segment.NewSegmenter(linewrap)
segmenter.Init(...)
for segmenter.Next() ...

func (*LineWrap) CodePointClassFor

func (uax14 *LineWrap) CodePointClassFor(r rune) int

CodePointClassFor returns the UAX#14 code-point class for a rune (= code-point).

Interface unicode.UnicodeBreaker

func (*LineWrap) LongestActiveMatch

func (uax14 *LineWrap) LongestActiveMatch() int

LongestActiveMatch is part of interface unicode.UnicodeBreaker.

func (*LineWrap) Penalties

func (uax14 *LineWrap) Penalties() []int

Penalties gets all active penalties for all active recognizers combined. Index 0 belongs to the most recently read rune.

Interface unicode.UnicodeBreaker

func (*LineWrap) ProceedWithRune

func (uax14 *LineWrap) ProceedWithRune(r rune, cpClass int)

ProceedWithRune is part of interface unicode.UnicodeBreaker. A new code-point has been read and this breaker receives a message to consume it.

func (*LineWrap) StartRulesFor

func (uax14 *LineWrap) StartRulesFor(r rune, cpClass int)

StartRulesFor starts all recognizers where the starting symbol is rune r. r is of code-point-class cpClass.

Interface unicode.UnicodeBreaker
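To make the interplay of these methods more concrete, here is a minimal sketch of a driver loop feeding runes to a breaker. The `breaker` interface is a local stand-in assembled from the method signatures listed above, `spaceBreaker` is a toy that allows a break after a space and nowhere else, and the call order in `split` is an assumption; the real segmenter may work differently.

```go
package main

import "fmt"

// breaker mirrors the method set listed for *LineWrap in this document.
// It is a local stand-in, not the library's interface definition.
type breaker interface {
	CodePointClassFor(r rune) int
	StartRulesFor(r rune, cpClass int)
	ProceedWithRune(r rune, cpClass int)
	LongestActiveMatch() int
	Penalties() []int
}

const (
	suppress = 10000 // value of PenaltyToSuppressBreak
	allow    = 1     // value of DefaultPenalty
)

// spaceBreaker is a toy: it permits a break after a space and nowhere else.
type spaceBreaker struct{ penalties []int }

func (b *spaceBreaker) CodePointClassFor(r rune) int {
	if r == ' ' {
		return 1 // toy class "space"
	}
	return 0 // toy class "other"
}

// StartRulesFor does nothing because the toy has no multi-rune rules.
func (b *spaceBreaker) StartRulesFor(r rune, cpClass int) {}

func (b *spaceBreaker) ProceedWithRune(r rune, cpClass int) {
	if cpClass == 1 {
		b.penalties = []int{allow}
	} else {
		b.penalties = []int{suppress}
	}
}

func (b *spaceBreaker) LongestActiveMatch() int { return 0 }
func (b *spaceBreaker) Penalties() []int        { return b.penalties }

// split drives a breaker rune by rune and cuts after a rune whenever
// index 0 of the penalties (the most recently read rune) is below the
// suppress value.
func split(b breaker, text string) []string {
	var out []string
	start := 0
	for i, r := range text {
		c := b.CodePointClassFor(r)
		b.StartRulesFor(r, c)
		b.ProceedWithRune(r, c)
		if p := b.Penalties(); len(p) > 0 && p[0] < suppress {
			end := i + len(string(r))
			out = append(out, text[start:end])
			start = end
		}
	}
	if start < len(text) {
		out = append(out, text[start:])
	}
	return out
}

func main() {
	for _, s := range split(&spaceBreaker{}, "to be or not") {
		fmt.Printf("%q\n", s)
	}
}
```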

type UAX14Class

type UAX14Class int

Type for UAX#14 code-point classes. Must be convertible to int.

const (
	AIClass  UAX14Class = 0
	ALClass  UAX14Class = 1
	B2Class  UAX14Class = 2
	BAClass  UAX14Class = 3
	BBClass  UAX14Class = 4
	BKClass  UAX14Class = 5
	CBClass  UAX14Class = 6
	CJClass  UAX14Class = 7
	CLClass  UAX14Class = 8
	CMClass  UAX14Class = 9
	CPClass  UAX14Class = 10
	CRClass  UAX14Class = 11
	EBClass  UAX14Class = 12
	EMClass  UAX14Class = 13
	EXClass  UAX14Class = 14
	GLClass  UAX14Class = 15
	H2Class  UAX14Class = 16
	H3Class  UAX14Class = 17
	HLClass  UAX14Class = 18
	HYClass  UAX14Class = 19
	IDClass  UAX14Class = 20
	INClass  UAX14Class = 21
	ISClass  UAX14Class = 22
	JLClass  UAX14Class = 23
	JTClass  UAX14Class = 24
	JVClass  UAX14Class = 25
	LFClass  UAX14Class = 26
	NLClass  UAX14Class = 27
	NSClass  UAX14Class = 28
	NUClass  UAX14Class = 29
	OPClass  UAX14Class = 30
	POClass  UAX14Class = 31
	PRClass  UAX14Class = 32
	QUClass  UAX14Class = 33
	RIClass  UAX14Class = 34
	SAClass  UAX14Class = 35
	SGClass  UAX14Class = 36
	SPClass  UAX14Class = 37
	SYClass  UAX14Class = 38
	WJClass  UAX14Class = 39
	XXClass  UAX14Class = 40
	ZWClass  UAX14Class = 41
	ZWJClass UAX14Class = 42
)

These are all the UAX#14 breaking classes.

func ClassForRune

func ClassForRune(r rune) UAX14Class

ClassForRune gets the line breaking/wrap class for a Unicode code-point.
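A class lookup of this kind can be pictured as a search through per-class range tables. The sketch below is an assumption about the general approach, not the package's implementation: it uses three tiny hand-written tables (CR, LF, SP), reuses the class numbers from the constant list above, and falls back to XX for everything else. The names `single`, `classTables` and `classFor` are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// Class numbers copied from the constant list in this document.
const (
	crClass = 11
	lfClass = 26
	spClass = 37
	xxClass = 40
)

// single builds a table holding exactly one code point below U+10000.
func single(r uint16) *unicode.RangeTable {
	return &unicode.RangeTable{R16: []unicode.Range16{{Lo: r, Hi: r, Stride: 1}}}
}

// classTables is a tiny hand-written excerpt; real tables are far larger.
var classTables = []struct {
	class int
	table *unicode.RangeTable
}{
	{crClass, single(0x000D)},
	{lfClass, single(0x000A)},
	{spClass, single(0x0020)},
}

// classFor is a hypothetical lookup: the first matching table wins, and
// runes found in no table are reported as XX.
func classFor(r rune) int {
	for _, ct := range classTables {
		if unicode.Is(ct.table, r) {
			return ct.class
		}
	}
	return xxClass
}

func main() {
	for _, r := range "a \n" {
		fmt.Printf("%U -> %d\n", r, classFor(r))
	}
}
```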

func (UAX14Class) String

func (c UAX14Class) String() string

Stringer for type UAX14Class
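For readers unfamiliar with the pattern, a Stringer on an int-based class type is typically backed by a name table. The sketch below uses a local stand-in type `lbClass` that mirrors only the first four class values; the strings it returns are assumed for illustration, and the output format of the real String method is not documented here.

```go
package main

import "fmt"

// lbClass is a local stand-in for UAX14Class; only the first four values
// of the constant list are mirrored.
type lbClass int

const (
	aiClass lbClass = iota // 0
	alClass                // 1
	b2Class                // 2
	baClass                // 3
)

// lbClassNames maps class values to their two-letter UAX#14 abbreviations.
var lbClassNames = [...]string{"AI", "AL", "B2", "BA"}

// String implements fmt.Stringer. Values outside the name table are
// rendered numerically instead of causing an out-of-range panic.
func (c lbClass) String() string {
	if c < 0 || int(c) >= len(lbClassNames) {
		return fmt.Sprintf("lbClass(%d)", int(c))
	}
	return lbClassNames[c]
}

func main() {
	fmt.Println(alClass, baClass, lbClass(99))
}
```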

Directories

Path                  Synopsis
internal/generator    Package generator is a generator for UAX#14 classes.
