grapheme

package
v0.2.0 Latest
Warning

This package is not in the latest version of its module.

Published: Dec 8, 2021 License: BSD-3-Clause, Unlicense Imports: 12 Imported by: 2

Documentation

Overview

Package grapheme implements Unicode Annex #29 grapheme breaking.

UAX#29 is the Unicode Annex for breaking text into graphemes, words and sentences. It defines code-point classes and sets of rules for how to place break points and break inhibitors. This file is about grapheme breaking.
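
To make the idea of break points and break inhibitors concrete, here is a minimal sketch that applies a tiny subset of the UAX#29 grapheme rules (GB3, GB4/GB5 restricted to CR and LF, GB9, GB999). It is not this package's implementation: the helper names `breakAllowed` and `clusters` are hypothetical, and the Extend class is only approximated by the general category Mn.

```go
package main

import (
	"fmt"
	"unicode"
)

// breakAllowed reports whether a break may be placed between two adjacent
// code points, using a deliberately tiny subset of the UAX#29 rules.
func breakAllowed(prev, next rune) bool {
	if prev == '\r' && next == '\n' {
		return false // GB3: do not break between CR and LF
	}
	if prev == '\r' || prev == '\n' || next == '\r' || next == '\n' {
		return true // GB4/GB5, restricted to CR and LF in this sketch
	}
	if unicode.Is(unicode.Mn, next) || next == '\u200D' {
		return false // GB9: no break before Extend (approximated by Mn) or ZWJ
	}
	return true // GB999: otherwise, break everywhere
}

// clusters splits s at every position where breakAllowed permits a break.
func clusters(s string) []string {
	var out []string
	start := 0
	var prev rune
	for i, r := range s {
		if i > 0 && breakAllowed(prev, r) {
			out = append(out, s[start:i])
			start = i
		}
		prev = r
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	// "e" + combining acute accent, "a", CR LF, "b"
	fmt.Println(len(clusters("e\u0301a\r\nb"))) // 4
}
```

The full rule set covers many more classes (Hangul syllables, regional indicators, emoji sequences), which is why the package relies on generated range tables instead of hand-written checks like these.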

Typical Usage with a Segmenter

Clients instantiate a grapheme object and use it as the breaking engine for a segmenter.

onGraphemes := grapheme.NewBreaker(1) // weight argument (0…5); 1 is an example value
segmenter := uax.NewSegmenter(onGraphemes)
segmenter.Init(…)
for segmenter.Next() {
    grphm := segmenter.Bytes()
    …
}

Grapheme Strings

This package provides an additional convenience type `grapheme.String`. Grapheme strings are a read-only data structure and not intended for large texts, but rather for small to medium-sized strings. For larger texts clients should use a segmenter.

s := grapheme.StringFromString("世界")
fmt.Printf("number of graphemes: %d", s.Len())                      // => 2
fmt.Printf("number of bytes for 2nd grapheme: %d", len(s.Nth(1)))   // => 3
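
The numbers in the example above can be cross-checked with the standard library alone. The sketch below uses only `unicode/utf8` (the helper `byteAndRuneCount` is hypothetical) and also shows why counting code points is not the same as counting graphemes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount reports the byte length and the code-point count of s.
func byteAndRuneCount(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "世界": two code points, each encoded in 3 bytes of UTF-8.
	b, r := byteAndRuneCount("世界")
	fmt.Printf("bytes=%d runes=%d\n", b, r) // bytes=6 runes=2

	// "e" followed by U+0301 (combining acute accent): two code points,
	// but a single user-perceived character.
	b, r = byteAndRuneCount("e\u0301")
	fmt.Printf("bytes=%d runes=%d\n", b, r) // bytes=3 runes=2
}
```

For the second string a grapheme-aware length would be 1, which is the kind of difference the `Len()` method of a grapheme string is meant to capture.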

Attention

Before using grapheme breakers, clients will have to initialize the classes and rules.

SetupGraphemeClasses()

This initializes all the code-point range tables. Initialization is not done beforehand, as it consumes quite some memory. As grapheme breaking involves knowledge of emoji classes, a call to SetupGraphemeClasses() will in turn call SetupEmojisClasses().

Usage of grapheme.String will take care of doing the setup behind the scenes.

Conformance

This UnicodeBreaker originally passed all 672 grapheme-breaking tests of UAX#29 (GraphemeBreakTest.txt). UPDATE: Due to a small change in the segmenter's semantics, 11 of the 672 tests currently fail. I did not have the time to look into it.

____________________________________________________________________________

License

This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:

SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'

You may use the project under the terms of either license.

Licenses are reproduced in the license file in the root folder of this module.

Copyright © 2021 Norbert Pillmayer <norbert@pillmayer.com>


Constants

const (
	GlueBREAK int = -500
	GlueJOIN  int = 10000
	GlueBANG  int = -20000
)

GlueBREAK, GlueJOIN and GlueBANG set default penalty values.

const MaxByteLen int = 32766

MaxByteLen is the maximum byte count a grapheme string may consist of.

const Version = "11.0.0"

Version is the Unicode version this package conforms to.

Variables

var CR, LF, Prepend, Control, Extend, Regional_Indicator, SpacingMark, L, V, T,
	LV, LVT, ZWJ *unicode.RangeTable

Range tables for grapheme code-point classes. They are initialized by SetupGraphemeClasses(). Clients can test membership with unicode.Is(..., rune).
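
A membership test against a `*unicode.RangeTable` looks the same for any table. The sketch below stays within the standard library, so it substitutes a hand-built table (`crTable`, a hypothetical stand-in holding only U+000D) for the package's CR table, which is only initialized by SetupGraphemeClasses().

```go
package main

import (
	"fmt"
	"unicode"
)

// crTable is a hand-built stand-in for a code-point class table.
// It contains exactly one code point: U+000D (carriage return).
var crTable = &unicode.RangeTable{
	R16: []unicode.Range16{{Lo: 0x000D, Hi: 0x000D, Stride: 1}},
}

// isCR reports whether r is a member of crTable.
func isCR(r rune) bool {
	return unicode.Is(crTable, r)
}

func main() {
	fmt.Println(isCR('\r')) // true
	fmt.Println(isCR('\n')) // false

	// The same call works with the standard library's own tables,
	// e.g. the general category Mn (nonspacing marks).
	fmt.Println(unicode.Is(unicode.Mn, '\u0301')) // true
}
```

With this package, the first argument would be one of the exported tables listed above, passed after the setup call has run.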

Functions

func SetupGraphemeClasses

func SetupGraphemeClasses()

SetupGraphemeClasses is the top-level preparation function: Create code-point classes for grapheme breaking. Will in turn set up emoji classes as well. (Concurrency-safe).
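
The documentation does not say how the concurrency-safe, on-demand setup is achieved. One common Go idiom for this kind of lazy initialization is `sync.Once`; the following is a sketch under that assumption, with `setupClasses` and `classes` as hypothetical stand-ins for the real function and tables.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	setupOnce  sync.Once
	setupCalls int            // how often the expensive setup actually ran
	classes    map[string]int // hypothetical stand-in for the range tables
)

// setupClasses performs the expensive initialization at most once,
// no matter how many goroutines call it or how often.
func setupClasses() {
	setupOnce.Do(func() {
		setupCalls++
		classes = map[string]int{"CR": 0, "LF": 1}
	})
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			setupClasses()
		}()
	}
	wg.Wait()
	fmt.Println("setup ran", setupCalls, "time(s)") // setup ran 1 time(s)
}
```

The appeal of this pattern is that callers pay the memory cost only if they use grapheme breaking at all, and repeated calls are cheap.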

Types

type Breaker

type Breaker struct {
	// contains filtered or unexported fields
}

Breaker is a type to be used by a uax.Segmenter to break text up according to UAX#29 / Graphemes. It implements the uax.UnicodeBreaker interface.

func NewBreaker

func NewBreaker(weight int) *Breaker

NewBreaker creates a new UAX#29 grapheme breaker.

Usage:

onGraphemes := NewBreaker(1) // weight argument (0…5); 1 is an example value
segmenter := uax.NewSegmenter(onGraphemes)
segmenter.Init(...)
for segmenter.Next() ...

weight is a multiplying factor for penalties. It must be in the range 0 ≤ weight ≤ 5; values outside this range are capped.
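
The capping rule can be pictured as a simple clamp. This is a sketch only: `capWeight` and `scaledPenalty` are hypothetical helpers, and how the package actually combines the weight with a base penalty is an assumption.

```go
package main

import "fmt"

const (
	minWeight = 0
	maxWeight = 5
	glueBreak = -500 // mirrors the documented GlueBREAK default
)

// capWeight clamps w into the range [minWeight, maxWeight].
func capWeight(w int) int {
	if w < minWeight {
		return minWeight
	}
	if w > maxWeight {
		return maxWeight
	}
	return w
}

// scaledPenalty multiplies a base penalty by the capped weight.
// Assumption: a plain product; the real combination may differ.
func scaledPenalty(base, w int) int {
	return base * capWeight(w)
}

func main() {
	fmt.Println(capWeight(-3), capWeight(2), capWeight(9)) // 0 2 5
	fmt.Println(scaledPenalty(glueBreak, 9))               // -2500
}
```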

func (*Breaker) CodePointClassFor

func (gb *Breaker) CodePointClassFor(r rune) int

CodePointClassFor returns the grapheme code-point class for a rune (= code-point). (Interface uax.UnicodeBreaker)

func (*Breaker) LongestActiveMatch

func (gb *Breaker) LongestActiveMatch() int

LongestActiveMatch collects information from all active recognizers about their current match length and returns the longest one. (Interface uax.UnicodeBreaker)

func (*Breaker) Penalties

func (gb *Breaker) Penalties() []int

Penalties gets all active penalties for all active recognizers combined. Index 0 belongs to the most recently read rune, i.e., represents the penalty for breaking after it. (Interface uax.UnicodeBreaker)

func (*Breaker) ProceedWithRune

func (gb *Breaker) ProceedWithRune(r rune, cpClass int)

ProceedWithRune is a signal to a Breaker: A new code-point has been read and this breaker receives a message to consume it. (Interface uax.UnicodeBreaker)

func (*Breaker) StartRulesFor

func (gb *Breaker) StartRulesFor(r rune, cpClass int)

StartRulesFor starts all recognizers where the starting symbol is rune r. r is of code-point-class cpClass. (Interface uax.UnicodeBreaker)

TODO merge this with ProceedWithRune(), it is unnecessary

type GraphemeClass

type GraphemeClass int

Type for UAX#29 grapheme code-point classes. Must be convertible to int.

const (
	CRClass                 GraphemeClass = 0
	LFClass                 GraphemeClass = 1
	PrependClass            GraphemeClass = 2
	ControlClass            GraphemeClass = 3
	ExtendClass             GraphemeClass = 4
	Regional_IndicatorClass GraphemeClass = 5
	SpacingMarkClass        GraphemeClass = 6
	LClass                  GraphemeClass = 7
	VClass                  GraphemeClass = 8
	TClass                  GraphemeClass = 9
	LVClass                 GraphemeClass = 10
	LVTClass                GraphemeClass = 11
	ZWJClass                GraphemeClass = 12

	Any GraphemeClass = 999
)

These are all the grapheme breaking classes.

func ClassForRune

func ClassForRune(r rune) GraphemeClass

ClassForRune gets the grapheme class for a Unicode code-point.

func (GraphemeClass) String

func (c GraphemeClass) String() string

Stringer for type GraphemeClass
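
As an illustration of an int-based class type with a Stringer, here is a minimal sketch. The type `class`, its constants and the returned names are hypothetical; the strings that GraphemeClass.String() really produces are not documented here.

```go
package main

import "fmt"

// class is a hypothetical mirror of an int-based code-point class type.
type class int

const (
	crClass class = iota
	lfClass
	prependClass
	anyClass class = 999
)

// classNames maps known classes to display names (assumed names).
var classNames = map[class]string{
	crClass:      "CR",
	lfClass:      "LF",
	prependClass: "Prepend",
	anyClass:     "Any",
}

// String implements fmt.Stringer; unknown values fall back to their number.
func (c class) String() string {
	if name, ok := classNames[c]; ok {
		return name
	}
	return fmt.Sprintf("class(%d)", int(c))
}

func main() {
	fmt.Println(lfClass, int(lfClass)) // LF 1
	fmt.Println(class(42))             // class(42)
}
```

Because the type is convertible to int, a value can be handed to interfaces that work with plain int class codes, such as the cpClass parameters of the Breaker methods.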

type String

type String interface {
	Nth(int) string // return nth grapheme
	Len() int       // length of string in units of user perceived characters
	String() string // return the underlying Go string
}

String is a type to represent a grapheme string, i.e. a sequence of “user perceived characters” as defined by Unicode. A grapheme string is a read-only data structure.

Finding graphemes from a string (or array of bytes) is an operation with runtime complexity O(N). Clients should not convert large texts into grapheme strings in one go, but rather operate on manageable fragments.

func StringFromBytes

func StringFromBytes(b []byte) String

StringFromBytes creates a grapheme string from an array of bytes. As grapheme strings are a read-only data structure, StringFromBytes will create a private copy of the input.

As grapheme strings are not meant to be created for large amounts of text, but rather for manageable segments, b is not allowed to exceed MaxByteLen = 32766 bytes.

StringFromBytes will panic if a larger input slice is given.

StringFromBytes will trim the input to valid Unicode code point (rune) boundaries. If b does not contain any legal runes, the resulting grapheme string may be of length 0 even if the input slice is not.
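
What trimming to rune boundaries could look like is sketched below with `unicode/utf8`. The helper `trimToRunes` is hypothetical and only drops invalid bytes at both ends; the package's exact trimming behaviour is not specified here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trimToRunes drops bytes at both ends of b that are not part of a
// complete, valid UTF-8 encoded rune.
func trimToRunes(b []byte) []byte {
	// Leading bytes: an invalid encoding decodes as (RuneError, 1).
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r != utf8.RuneError || size != 1 {
			break
		}
		b = b[1:]
	}
	// Trailing bytes: same check, decoding from the end.
	for len(b) > 0 {
		r, size := utf8.DecodeLastRune(b)
		if r != utf8.RuneError || size != 1 {
			break
		}
		b = b[:len(b)-1]
	}
	return b
}

func main() {
	full := []byte("世界") // 6 bytes: two runes of 3 bytes each

	fmt.Println(string(trimToRunes(full[1:])))           // 界
	fmt.Println(string(trimToRunes(full[:5])))           // 世
	fmt.Println(len(trimToRunes([]byte{0x80, 0x80})))    // 0
}
```

The last call shows the case described above: a non-empty input without any legal rune yields an empty result.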

func StringFromString

func StringFromString(s string) String

StringFromString creates a grapheme string from a Go string. As grapheme strings are not meant to be created for large amounts of text, but rather for manageable segments, s is not allowed to exceed MaxByteLen = 32766 bytes.

StringFromString will panic if a larger input string is given.
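
Since oversized input causes a panic, callers may want to check the length up front or convert the panic into an error. The sketch below shows both options with hypothetical helpers (`fits`, `mustFit`, `tryFit`); `mustFit` merely imitates the documented panic and is not the package's code.

```go
package main

import (
	"fmt"
	"strings"
)

const maxByteLen = 32766 // mirrors the documented MaxByteLen

// fits reports whether s is within the documented size limit.
func fits(s string) bool {
	return len(s) <= maxByteLen
}

// mustFit imitates a constructor that panics on oversized input.
func mustFit(s string) string {
	if !fits(s) {
		panic(fmt.Sprintf("input of %d bytes exceeds limit of %d", len(s), maxByteLen))
	}
	return s
}

// tryFit calls mustFit and turns a panic into an error value.
func tryFit(s string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("input rejected: %v", r)
		}
	}()
	mustFit(s)
	return nil
}

func main() {
	big := strings.Repeat("x", maxByteLen+1)
	fmt.Println(fits("世界"), fits(big)) // true false
	fmt.Println(tryFit(big) != nil)    // true
	fmt.Println(tryFit("世界") == nil)   // true
}
```

Checking with a guard like `fits` is usually preferable to recovering, because it keeps the control flow explicit.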

StringFromString will trim the input Go string to valid Unicode code point (rune) boundaries. If s does not contain any legal runes, the resulting grapheme string may be of length 0 even if the input string is not.

Directories

Path Synopsis
internal/generator    Package for a generator for UAX#29 Grapheme classes.
