Documentation ¶
Overview ¶
Package grapheme implements Unicode Annex #29 grapheme breaking.
UAX#29 is the Unicode Annex for breaking text into graphemes, words and sentences. It defines code-point classes and sets of rules for how to place break points and break inhibitors. This package covers grapheme breaking only.
Typical Usage with a Segmenter ¶
Clients instantiate a grapheme breaker and use it as the breaking engine for a segmenter.

    onGraphemes := grapheme.NewBreaker()
    segmenter := uax.NewSegmenter(onGraphemes)
    segmenter.Init(…)
    for segmenter.Next() {
        grphm := segmenter.Bytes()
        …
    }
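The loop above follows the same Next/Bytes pattern as the standard library's bufio.Scanner. As a rough illustration of what a breaking engine has to decide, the following standard-library-only sketch approximates two of the UAX#29 grapheme rules: CR LF is kept together, and no break is placed before an extending mark or ZWJ. It does not use this package; the names isExtendLike, splitSimpleClusters and simpleClusters are hypothetical, and approximating the Extend class by the general categories Mn and Me is an assumption made for brevity.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isExtendLike approximates the Extend and ZWJ classes (assumption: the
// real Extend class contains more code points than Mn and Me).
func isExtendLike(r rune) bool {
	return r == '\u200D' || unicode.Is(unicode.Mn, r) || unicode.Is(unicode.Me, r)
}

// splitSimpleClusters is a bufio.SplitFunc that emits one simplified
// cluster per token: CR LF, a single CR or LF, or one rune followed by
// any number of extend-like runes.
func splitSimpleClusters(data []byte, atEOF bool) (int, []byte, error) {
	if len(data) == 0 {
		return 0, nil, nil
	}
	if !atEOF && !utf8.FullRune(data) {
		return 0, nil, nil // wait for the rest of the first rune
	}
	first, n := utf8.DecodeRune(data)
	if first == '\r' {
		if n == len(data) && !atEOF {
			return 0, nil, nil // an LF may still follow
		}
		if n < len(data) && data[n] == '\n' {
			n++ // keep CR LF together
		}
		return n, data[:n], nil
	}
	if first == '\n' {
		return n, data[:n], nil
	}
	for n < len(data) {
		if !atEOF && !utf8.FullRune(data[n:]) {
			return 0, nil, nil // wait for the rest of the next rune
		}
		r, size := utf8.DecodeRune(data[n:])
		if !isExtendLike(r) {
			return n, data[:n], nil // break before the next base rune
		}
		n += size
	}
	if !atEOF {
		return 0, nil, nil // more extending marks may follow
	}
	return n, data[:n], nil
}

// simpleClusters collects all simplified clusters of s.
func simpleClusters(s string) []string {
	scanner := bufio.NewScanner(strings.NewReader(s))
	scanner.Split(splitSimpleClusters)
	var out []string
	for scanner.Scan() {
		out = append(out, scanner.Text())
	}
	return out
}

func main() {
	fmt.Printf("%q\n", simpleClusters("e\u0301a\r\nb"))
}
```

For real text the full rule set (Hangul syllables, regional indicators, emoji sequences and so on) is needed, which is what the breaker in this package is for.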
Grapheme Strings ¶
This package provides an additional convenience type `grapheme.String`. Grapheme strings are a read-only data structure and not intended for large texts, but rather for small to medium-sized strings. For larger texts clients should use a segmenter.
    s := grapheme.StringFromString("世界")
    fmt.Printf("number of graphemes: %d", s.Len())                     // => 2
    fmt.Printf("number of bytes for 2nd grapheme: %d", len(s.Nth(1)))  // => 3
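To put the numbers in the example above into context, the following standard-library-only sketch contrasts byte and rune counts. It does not call this package; byteAndRuneCount is a hypothetical helper. The string "世界" has 6 bytes and 2 runes, so each of its two graphemes is 3 bytes long. The string "e\u0301" (e followed by a combining acute accent) has 2 runes but is a single user-perceived character, which is the case where counting graphemes differs from counting runes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the length of s in bytes and in runes.
// Neither number is the grapheme count; that requires UAX#29 rules.
func byteAndRuneCount(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"世界", "e\u0301"} {
		b, r := byteAndRuneCount(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```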
Attention ¶
Before using grapheme breakers, clients will have to initialize the classes and rules.
SetupGraphemeClasses()
This initializes all the code-point range tables. Initialization is not done beforehand, as the tables consume a considerable amount of memory. As grapheme breaking involves knowledge of emoji classes, a call to SetupGraphemeClasses() will in turn call SetupEmojisClasses().
Usage of grapheme.String will take care of doing the setup behind the scenes.
Conformance ¶
This UnicodeBreaker originally passed all 672 tests for grapheme breaking of UAX#29 (GraphemeBreakTest.txt). UPDATE: Due to a small change in the segmenter's semantics, 11 out of 672 tests currently fail. I have not yet had the time to look into it.
____________________________________________________________________________
License ¶
This project is provided under the terms of the UNLICENSE or the 3-Clause BSD license denoted by the following SPDX identifier:
SPDX-License-Identifier: 'Unlicense' OR 'BSD-3-Clause'
You may use the project under the terms of either license.
Licenses are reproduced in the license file in the root folder of this module.
Copyright © 2021 Norbert Pillmayer <norbert@pillmayer.com>
Constants ¶
const (
    GlueBREAK int = -500
    GlueJOIN  int = 10000
    GlueBANG  int = -20000
)
GlueBREAK, GlueJOIN and GlueBANG set default penalty values.
const MaxByteLen int = 32766
MaxByteLen is the maximum byte count a grapheme string may consist of.
const Version = "11.0.0"
Version is the Unicode version this package conforms to.
Variables ¶
var CR, LF, Prepend, Control, Extend, Regional_Indicator, SpacingMark, L, V, T,
LV, LVT, ZWJ *unicode.RangeTable
Range tables for grapheme code-point classes. They will be initialized by SetupGraphemeClasses(). Clients can test a rune for class membership with unicode.Is(..., rune).
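As a minimal sketch of such a membership check, the following program builds two hand-written range tables and queries them with unicode.Is. The tables regionalIndicator and zwj are hypothetical stand-ins, not the package's own variables; they only use code points that are fixed by Unicode (regional indicators are U+1F1E6 to U+1F1FF, ZWJ is U+200D).

```go
package main

import (
	"fmt"
	"unicode"
)

// regionalIndicator is a stand-in for a Regional_Indicator table.
var regionalIndicator = &unicode.RangeTable{
	R32: []unicode.Range32{{Lo: 0x1F1E6, Hi: 0x1F1FF, Stride: 1}},
}

// zwj is a stand-in for a ZWJ table holding the single code point U+200D.
var zwj = &unicode.RangeTable{
	R16: []unicode.Range16{{Lo: 0x200D, Hi: 0x200D, Stride: 1}},
}

func isRegionalIndicator(r rune) bool { return unicode.Is(regionalIndicator, r) }

func isZWJ(r rune) bool { return unicode.Is(zwj, r) }

func main() {
	// Two regional indicators, a Latin letter and a zero-width joiner.
	for _, r := range "\U0001F1E9\U0001F1EAa\u200D" {
		fmt.Printf("%U RI=%t ZWJ=%t\n", r, isRegionalIndicator(r), isZWJ(r))
	}
}
```

With the real package, the same unicode.Is call would be made against the exported tables once SetupGraphemeClasses() has run.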
Functions ¶
func SetupGraphemeClasses ¶
func SetupGraphemeClasses()
SetupGraphemeClasses is the top-level preparation function: Create code-point classes for grapheme breaking. Will in turn set up emoji classes as well. (Concurrency-safe).
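One common way in Go to get lazy, concurrency-safe, one-time initialization of this kind is sync.Once. The sketch below shows that pattern with a single stand-in table; it is an assumption about how such a setup function can be written, not a copy of this package's implementation. The names setupClasses, crTable, setupCalls and isCR are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"unicode"
)

var (
	setupOnce  sync.Once
	crTable    *unicode.RangeTable // stand-in for a class table such as CR
	setupCalls int                 // counts how often the tables were built
)

// setupClasses builds the table on first use only. Concurrent callers
// block until the first call has finished.
func setupClasses() {
	setupOnce.Do(func() {
		setupCalls++
		crTable = &unicode.RangeTable{
			R16:         []unicode.Range16{{Lo: 0x000D, Hi: 0x000D, Stride: 1}},
			LatinOffset: 1,
		}
	})
}

// isCR makes sure the table exists before querying it.
func isCR(r rune) bool {
	setupClasses()
	return unicode.Is(crTable, r)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			setupClasses()
		}()
	}
	wg.Wait()
	fmt.Println(setupCalls, isCR('\r'), isCR('x'))
}
```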
Types ¶
type Breaker ¶
type Breaker struct {
// contains filtered or unexported fields
}
Breaker is a type to be used by a uax.Segmenter to break text up according to UAX#29 / Graphemes. It implements the uax.UnicodeBreaker interface.
func NewBreaker ¶
NewBreaker creates a new UAX#29 grapheme breaker.
Usage:
    onGraphemes := NewBreaker()
    segmenter := uax.NewSegmenter(onGraphemes)
    segmenter.Init(...)
    for segmenter.Next() ...

weight is a multiplying factor for penalties. It must satisfy 0 ≤ weight ≤ 5 and will be capped for values outside this range.
func (*Breaker) CodePointClassFor ¶
CodePointClassFor returns the grapheme code-point class for a rune (= code-point). (Interface uax.UnicodeBreaker)
func (*Breaker) LongestActiveMatch ¶
LongestActiveMatch collects the current match length from all recognizers that are still active and returns the longest one. (Interface uax.UnicodeBreaker)
func (*Breaker) Penalties ¶
Penalties gets all active penalties for all active recognizers combined. Index 0 belongs to the most recently read rune, i.e., represents the penalty for breaking after it. (Interface uax.UnicodeBreaker)
func (*Breaker) ProceedWithRune ¶
ProceedWithRune is a signal to a Breaker: A new code-point has been read and this breaker receives a message to consume it. (Interface uax.UnicodeBreaker)
func (*Breaker) StartRulesFor ¶
StartRulesFor starts all recognizers where the starting symbol is rune r. r is of code-point-class cpClass. (Interface uax.UnicodeBreaker)
TODO merge this with ProceedWithRune(), it is unnecessary
type GraphemeClass ¶
type GraphemeClass int
GraphemeClass is the type for UAX#29 grapheme code-point classes. Must be convertible to int.
const (
    CRClass                 GraphemeClass = 0
    LFClass                 GraphemeClass = 1
    PrependClass            GraphemeClass = 2
    ControlClass            GraphemeClass = 3
    ExtendClass             GraphemeClass = 4
    Regional_IndicatorClass GraphemeClass = 5
    SpacingMarkClass        GraphemeClass = 6
    LClass                  GraphemeClass = 7
    VClass                  GraphemeClass = 8
    TClass                  GraphemeClass = 9
    LVClass                 GraphemeClass = 10
    LVTClass                GraphemeClass = 11
    ZWJClass                GraphemeClass = 12
    Any                     GraphemeClass = 999
)
These are all the grapheme breaking classes.
func ClassForRune ¶
func ClassForRune(r rune) GraphemeClass
ClassForRune gets the grapheme class for a Unicode code-point.
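To illustrate the idea of mapping a rune to a class, here is a small self-contained sketch covering only four classes whose code points are fixed by Unicode. It reuses the numeric values listed above but is otherwise hypothetical: the lower-case names, the String output and the fallback to the Any value for unclassified runes are assumptions, not the package's documented behavior.

```go
package main

import "fmt"

// graphemeClass mirrors the GraphemeClass type for this sketch only.
type graphemeClass int

const (
	crClass                graphemeClass = 0
	lfClass                graphemeClass = 1
	regionalIndicatorClass graphemeClass = 5
	zwjClass               graphemeClass = 12
	anyClass               graphemeClass = 999
)

// String returns a short label for the class (labels are assumptions).
func (c graphemeClass) String() string {
	switch c {
	case crClass:
		return "CR"
	case lfClass:
		return "LF"
	case regionalIndicatorClass:
		return "Regional_Indicator"
	case zwjClass:
		return "ZWJ"
	}
	return "Any"
}

// classForRune classifies a rune into one of the four sketched classes
// and falls back to anyClass for everything else.
func classForRune(r rune) graphemeClass {
	switch {
	case r == '\r':
		return crClass
	case r == '\n':
		return lfClass
	case r == '\u200D':
		return zwjClass
	case r >= 0x1F1E6 && r <= 0x1F1FF:
		return regionalIndicatorClass
	}
	return anyClass
}

func main() {
	for _, r := range "\r\n\u200D\U0001F1E6x" {
		fmt.Printf("%U -> %v\n", r, classForRune(r))
	}
}
```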
func (GraphemeClass) String ¶
func (c GraphemeClass) String() string
Stringer for type GraphemeClass
type String ¶
type String interface {
    Nth(int) string // return nth grapheme
    Len() int       // length of string in units of user perceived characters
    String() string // return the underlying Go string
}
String is a type to represent a grapheme string, i.e. a sequence of “user perceived characters” as defined by Unicode. A grapheme string is a read-only data structure.
Finding graphemes from a string (or array of bytes) is an operation with runtime complexity O(N). Clients should not convert large texts into grapheme strings in one go, but rather operate on manageable fragments.
func StringFromBytes ¶
StringFromBytes creates a grapheme string from an array of bytes. As grapheme strings are a read-only data structure, StringFromBytes will create a private copy of the input.
As grapheme strings are not meant to be created for large amounts of text, but rather for manageable segments, b is not allowed to exceed MaxByteLen = 32766 bytes.
StringFromBytes will panic if a larger input slice is given.
StringFromBytes will trim the input to valid Unicode code point (rune) boundaries. If b does not contain any legal runes, the resulting grapheme string may be of length 0 even if the input slice is not.
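The three behaviors described above (size limit with panic, private copy, trimming to rune boundaries) can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: prepareInput and trimToRunes are hypothetical helpers, maxByteLen mirrors the documented MaxByteLen, and "trimming" is interpreted here as dropping bytes at either end that do not belong to a complete UTF-8 sequence.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxByteLen mirrors the documented MaxByteLen constant.
const maxByteLen = 32766

// trimToRunes drops bytes at the start and end of b that do not decode
// as part of a complete UTF-8 sequence. Bytes in the middle stay as is.
func trimToRunes(b []byte) []byte {
	for len(b) > 0 {
		if r, size := utf8.DecodeRune(b); r != utf8.RuneError || size != 1 {
			break
		}
		b = b[1:]
	}
	for len(b) > 0 {
		if r, size := utf8.DecodeLastRune(b); r != utf8.RuneError || size != 1 {
			break
		}
		b = b[:len(b)-1]
	}
	return b
}

// prepareInput panics on oversized input, copies b so that later changes
// by the caller are not visible, and trims the copy to rune boundaries.
func prepareInput(b []byte) []byte {
	if len(b) > maxByteLen {
		panic("input exceeds maxByteLen")
	}
	private := make([]byte, len(b))
	copy(private, b)
	return trimToRunes(private)
}

func main() {
	// Two stray continuation bytes, "ab", then a truncated 3-byte rune.
	in := []byte{0x95, 0x8C, 'a', 'b', 0xE4, 0xB8}
	out := prepareInput(in)
	in[2] = 'X' // does not affect out, which is a private copy
	fmt.Printf("%q\n", out)
}
```

An input consisting only of invalid bytes yields an empty result, which matches the note above that the grapheme string may have length 0 even if the input slice does not.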
func StringFromString ¶
StringFromString creates a grapheme string from a Go string. As grapheme strings are not meant to be created for large amounts of text, but rather for manageable segments, s is not allowed to exceed MaxByteLen = 32766 bytes.
StringFromString will panic if a larger input string is given.
StringFromString will trim the input Go string to valid Unicode code point (rune) boundaries. If s does not contain any legal runes, the resulting grapheme string may be of length 0 even if the input string is not.