uniseg

v0.4.2

This package is not in the latest version of its module.

Published: Feb 8, 2024 License: MIT Imports: 4 Imported by: 0

README

Unicode Text Segmentation for Go

This Go package implements Unicode Text Segmentation according to Unicode Standard Annex #29, Unicode Line Breaking according to Unicode Standard Annex #14 (Unicode version 15.1.0), and monospace font string width calculation similar to wcwidth.

Background

Grapheme Clusters

In Go, strings are read-only slices of bytes. They can be decoded into Unicode code points with a for ... range loop or by conversion: []rune(str). However, multiple code points may combine into one user-perceived character, which the Unicode specification calls a "grapheme cluster". Here are some examples:

| String | Bytes (UTF-8) | Code points (runes) | Grapheme clusters |
|---|---|---|---|
| Käse | 6 bytes: 4b 61 cc 88 73 65 | 5 code points: 4b 61 308 73 65 | 4 clusters: [4b],[61 308],[73],[65] |
| 🏳️‍🌈 | 14 bytes: f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88 | 4 code points: 1f3f3 fe0f 200d 1f308 | 1 cluster: [1f3f3 fe0f 200d 1f308] |
| 🇩🇪 | 8 bytes: f0 9f 87 a9 f0 9f 87 aa | 2 code points: 1f1e9 1f1ea | 1 cluster: [1f1e9 1f1ea] |

This package provides tools to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
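
The byte and code point figures quoted for "Käse" can be reproduced with the standard library alone. The sketch below is only an illustration: hexBytes and hexRunes are hypothetical helper names, not part of this package, and the cluster column is deliberately left out because the standard library has no notion of grapheme clusters.

```go
package main

import (
	"fmt"
	"strings"
)

// hexBytes renders the UTF-8 bytes of s as space-separated hex values.
func hexBytes(s string) string {
	parts := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		parts = append(parts, fmt.Sprintf("%02x", s[i]))
	}
	return strings.Join(parts, " ")
}

// hexRunes renders the code points of s as space-separated hex values.
func hexRunes(s string) string {
	var parts []string
	for _, r := range s { // range decodes one code point per iteration
		parts = append(parts, fmt.Sprintf("%x", r))
	}
	return strings.Join(parts, " ")
}

func main() {
	s := "Ka\u0308se" // "a" followed by U+0308 COMBINING DIAERESIS
	fmt.Printf("%d bytes: %s\n", len(s), hexBytes(s))
	fmt.Printf("%d code points: %s\n", len([]rune(s)), hexRunes(s))
	// 6 bytes: 4b 61 cc 88 73 65
	// 5 code points: 4b 61 308 73 65
}
```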

Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. This package provides tools to determine word boundaries within strings.

Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides tools to determine sentence boundaries within strings.

Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).

Monospace Width

Most terminals or text displays / text editors using a monospace font (for example source code editors) use a fixed width for each character. Some characters such as emojis or characters found in Asian and other languages may take up more than one character cell. This package provides tools to determine the number of cells a string will take up when displayed in a monospace font. See the "Monospace Width" section of the package documentation for more information.

Installation

go get github.com/shogo82148/uniseg

Examples

Counting Characters in a String

n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
fmt.Println(n)
// 2

Calculating the Monospace String Width

width := uniseg.StringWidth("🇩🇪🏳️‍🌈!")
fmt.Println(width)
// 5

Using the Graphemes Class

This is the most convenient method of iterating over grapheme clusters:

gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
}
// [1f44d 1f3fc] [21]

Using the Step or StepString Function

This is orders of magnitude faster than the Graphemes class, but it requires the handling of states and boundaries:

str := "🇩🇪🏳️‍🌈"
var c string
var state uniseg.State // the zero value is the initial state
for len(str) > 0 {
	c, str, _, state = uniseg.StepString(str, state)
	fmt.Printf("%x ", []rune(c))
}
// [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]

Advanced Examples

Breaking into grapheme clusters and evaluating line breaks:

str := "First line.\nSecond line."
var (
	c          string
	boundaries uniseg.Boundaries
	state      uniseg.State
)
for len(str) > 0 {
	c, str, boundaries, state = uniseg.StepString(str, state)
	fmt.Print(c)
	switch boundaries.Line() {
	case uniseg.LineCanBreak:
		fmt.Print("|")
	case uniseg.LineMustBreak:
		fmt.Print("‖")
	}
}
// First |line.
// ‖Second |line.‖

If you're only interested in word segmentation, use FirstWord or FirstWordInString:

str := "Hello, world!"
var state uniseg.WordBreakState
var c string
for len(str) > 0 {
	c, str, state = uniseg.FirstWordInString(str, state)
	fmt.Printf("(%s)\n", c)
}
// (Hello)
// (,)
// ( )
// (world)
// (!)

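Similarly, use FirstGraphemeCluster or FirstGraphemeClusterInString if you only need grapheme clusters, FirstSentence or FirstSentenceInString if you only need sentence segmentation, and FirstLineSegment or FirstLineSegmentInString if you only need line breaking.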

Finally, if you need to reverse a string while preserving grapheme clusters, use ReverseString:

fmt.Println(uniseg.ReverseString("🇩🇪🏳️‍🌈"))
// 🏳️‍🌈🇩🇪

Documentation

Refer to https://pkg.go.dev/github.com/shogo82148/uniseg for the package's documentation.

Dependencies

This package does not depend on any packages outside the standard library.

Sponsor this Project

Become a Sponsor on GitHub to support this project!

Your Feedback

Open an issue on GitHub, preferably before submitting any PRs. Feel free to get in touch if you have any questions.

Documentation

Overview

Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and string width calculation for monospace fonts. Unicode Text Segmentation conforms to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode Line Breaking conforms to Unicode Standard Annex #14 (https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters (what people would usually refer to as a "character"), into words, and into sentences. Or, in its simplest case, this package allows you to count the number of characters in a string, especially when it contains complex characters such as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or other languages. Additionally, you can use it to implement line breaking (or "word wrapping"), that is, to determine where text can be broken over to the next line when the width of the line is not big enough to fit the entire text. Finally, you can use it to calculate the display width of a string for monospace fonts.

Getting Started

If you just want to count the number of characters in a string, you can use GraphemeClusterCount. If you want to determine the display width of a string, you can use StringWidth. If you want to iterate over a string, you can use Step, StepString, or the Graphemes class (more convenient but less performant). This will provide you with all information: grapheme clusters, word boundaries, sentence boundaries, line breaks, and monospace character widths. The specialized functions FirstGraphemeCluster, FirstGraphemeClusterInString, FirstWord, FirstWordInString, FirstSentence, and FirstSentenceInString can be used if only one type of information is needed.

Grapheme Clusters

Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one character. But its string representation actually has 14 bytes, so counting bytes (or using len("🏳️‍🌈")) will not work as expected. Counting runes won't, either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function utf8.RuneCountInString("🏳️‍🌈") and len([]rune("🏳️‍🌈")) will both return 4.

The GraphemeClusterCount function will return 1 for the rainbow flag emoji. The Graphemes class and a variety of functions in this package will allow you to split strings into their grapheme clusters.
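
The byte and rune counts quoted above are easy to confirm with the standard library, and the same sketch shows a practical consequence: truncating by runes can cut a cluster apart. The helper names counts and truncateRunes are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rainbowFlag is U+1F3F3 U+FE0F U+200D U+1F308, written with escapes so
// that the individual code points are visible in the source.
const rainbowFlag = "\U0001F3F3\uFE0F\u200D\U0001F308"

// counts returns the byte length and the rune count of s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

// truncateRunes keeps at most n code points of s. It knows nothing about
// grapheme clusters and may therefore cut one in half.
func truncateRunes(s string, n int) string {
	r := []rune(s)
	if n < 0 {
		n = 0
	}
	if len(r) > n {
		r = r[:n]
	}
	return string(r)
}

func main() {
	b, r := counts(rainbowFlag)
	fmt.Println(b, r) // 14 4

	// Only the first code point (the white flag) survives; the rainbow
	// flag as a user-perceived character is gone.
	fmt.Printf("%x\n", []rune(truncateRunes(rainbowFlag, 1))) // [1f3f3]
}
```

Truncating or counting by grapheme cluster instead avoids this, which is what the functions in this package are for.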

Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. This package provides methods for determining word boundaries.

Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides methods for determining sentence boundaries.

Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides methods to determine the positions in a string where a line must be broken, may be broken, or must not be broken.

Monospace Width

Monospace width, as referred to in this package, is the width of a string in a monospace font. This is commonly used in terminal user interfaces or text displays or editors that don't support proportional fonts. A width of 1 corresponds to a single character cell. The C function wcswidth() and its implementations in other programming languages are in widespread use for the same purpose. However, there is no standard for the calculation of such widths, and this package differs from wcswidth() in a number of ways, presumably to generate more visually pleasing results.

To start, we assume that every code point has a width of 1, with the following exceptions:

  • Code points with grapheme cluster break properties Control, CR, LF, Extend, and ZWJ have a width of 0.
  • U+2E3A, Two-Em Dash, has a width of 3.
  • U+2E3B, Three-Em Dash, has a width of 4.
  • Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide" (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both have a width of 1.)
  • Code points with grapheme cluster break property Regional Indicator have a width of 2.
  • Code points with grapheme cluster break property Extended Pictographic have a width of 2, unless their Emoji Presentation flag is "No", in which case the width is 1.

For Hangul grapheme clusters composed of conjoining Jamo and for Regional Indicators (flags), all code points except the first one have a width of 0. For grapheme clusters starting with an Extended Pictographic, any additional code point will force a total width of 2, except if the Variation Selector-15 (U+FE0E) is included, in which case the total width is always 1. Grapheme clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.

Note that whether these widths appear correct depends on your application's render engine, the extent to which it conforms to the Unicode Standard, and its choice of font.
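
As a rough illustration of the "start at 1, then apply exceptions" approach, the sketch below implements only those exceptions that need no East Asian Width or emoji data. It is not this package's implementation: baseWidth and naiveWidth are hypothetical names, unicode.IsControl and the Mn/Me categories are only approximations of the Control and Extend properties, and wide characters such as "世" would wrongly come out as 1.

```go
package main

import (
	"fmt"
	"unicode"
)

// baseWidth is a simplified per-code-point width table. It covers only
// rules that can be expressed with Go's standard library tables.
func baseWidth(r rune) int {
	switch {
	case r == 0x2E3A: // TWO-EM DASH
		return 3
	case r == 0x2E3B: // THREE-EM DASH
		return 4
	case r == 0x200D: // ZERO WIDTH JOINER
		return 0
	case unicode.IsControl(r): // approximates Control, CR, and LF
		return 0
	case unicode.Is(unicode.Mn, r), unicode.Is(unicode.Me, r):
		// Nonspacing and enclosing marks approximate the Extend property.
		return 0
	default:
		return 1
	}
}

// naiveWidth sums baseWidth over all code points of s. It applies none of
// the cluster-level rules (flags, Hangul, variation selectors).
func naiveWidth(s string) int {
	w := 0
	for _, r := range s {
		w += baseWidth(r)
	}
	return w
}

func main() {
	fmt.Println(naiveWidth("Ka\u0308se")) // 4: the combining diaeresis adds no width
	fmt.Println(naiveWidth("\u2E3A"))     // 3
}
```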

Variables

var DefaultParser = defaultParser()

Functions

func GraphemeClusterCount

func GraphemeClusterCount(s string) (n int)

GraphemeClusterCount returns the number of user-perceived characters (grapheme clusters) for the given string.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	n := uniseg.GraphemeClusterCount("🇩🇪🏳️\u200d🌈")
	fmt.Println(n)
}
Output:

2

func HasTrailingLineBreak

func HasTrailingLineBreak(b []byte) bool

HasTrailingLineBreak returns true if the last rune in the given byte slice is one of the hard line break code points defined in LB4 and LB5 of UAX #14.

func HasTrailingLineBreakInString

func HasTrailingLineBreakInString(str string) bool

HasTrailingLineBreakInString is like HasTrailingLineBreak but for a string.
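
For orientation, here is a standard-library sketch of what such a check might look like. The set of code points is an assumption based on the line break classes that LB4 and LB5 treat as hard breaks (BK, CR, LF, NL); endsWithHardBreak is a hypothetical name, and the package's own functions should be preferred in real code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// endsWithHardBreak reports whether the last rune of s is one of the code
// points this sketch treats as a hard line break: LF, VT, FF, CR, NEL
// (U+0085), LINE SEPARATOR (U+2028), or PARAGRAPH SEPARATOR (U+2029).
func endsWithHardBreak(s string) bool {
	r, size := utf8.DecodeLastRuneInString(s)
	if size == 0 { // empty string
		return false
	}
	switch r {
	case '\n', '\v', '\f', '\r', 0x85, 0x2028, 0x2029:
		return true
	}
	return false
}

func main() {
	fmt.Println(endsWithHardBreak("First line.\n")) // true
	fmt.Println(endsWithHardBreak("Second line."))  // false
}
```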

func IsEastAsian

func IsEastAsian() bool

IsEastAsian returns true if the current locale is CJK.

func ReverseString

func ReverseString(s string) string

ReverseString reverses the given string while observing grapheme cluster boundaries.
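
To see why cluster boundaries matter here, consider what happens with a plain rune-by-rune reversal. The sketch below uses only the standard library; reverseRunes is a hypothetical helper and is not how ReverseString works.

```go
package main

import "fmt"

// reverseRunes reverses s code point by code point, ignoring grapheme
// cluster boundaries.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	in := "Ka\u0308se" // the diaeresis belongs to the "a"
	out := reverseRunes(in)
	// After reversal the combining diaeresis follows the "s" instead,
	// so the accent is rendered on the wrong letter.
	fmt.Println(out == "es\u0308aK") // true
}
```

A cluster-aware reversal keeps "a" and U+0308 together, so the accent stays on the "a".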

func Step

func Step(b []byte, state State) (cluster, rest []byte, boundaries Boundaries, newState State)

Step returns the first grapheme cluster (user-perceived character) found in the given byte slice. It also returns information about the boundary between that grapheme cluster and the one following it as well as the monospace width of the grapheme cluster. There are three types of boundary information: word boundaries, sentence boundaries, and line breaks. This function is therefore a combination of FirstGraphemeCluster, FirstWord, FirstSentence, and FirstLineSegment.

This function can be called continuously to extract all grapheme clusters from a byte slice, as illustrated in the examples below.

If you don't know which state to pass, for example when calling the function for the first time, you must pass 0. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified grapheme cluster. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "cluster" byte slice is the sub-slice of the input slice containing the first identified grapheme cluster.

Given an empty byte slice "b", the function returns nil values.

While slightly less convenient than using the Graphemes class, this function has much better performance and makes no allocations. It lends itself well to large byte slices.

Note that in accordance with UAX #14 LB3, the final segment will end with a mandatory line break (boundaries.Line() == LineMustBreak). You can choose to ignore this by checking if the length of the "rest" slice is 0 and calling HasTrailingLineBreak or HasTrailingLineBreakInString on the final cluster.

Example (Graphemes)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	b := []byte("🇩🇪🏳️\u200d🌈!")
	var (
		c          []byte
		boundaries uniseg.Boundaries
		state      uniseg.State
	)
	for len(b) > 0 {
		c, b, boundaries, state = uniseg.Step(b, state)
		fmt.Println(string(c), boundaries.Width())
	}
}
Output:

🇩🇪 2
🏳️‍🌈 2
! 1
Example (LineBreaking)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	b := []byte("First line.\nSecond line.")
	var (
		c          []byte
		boundaries uniseg.Boundaries
		state      uniseg.State
	)
	for len(b) > 0 {
		c, b, boundaries, state = uniseg.Step(b, state)
		fmt.Print(string(c))
		switch boundaries.Line() {
		case uniseg.LineCanBreak:
			fmt.Print("|")
		case uniseg.LineMustBreak:
			fmt.Print("‖")
		}
	}
}
Output:

First |line.
‖Second |line.‖
Example (Sentence)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	b := []byte("This is sentence 1.0. And this is sentence two.")
	var (
		c          []byte
		boundaries uniseg.Boundaries
		state      uniseg.State
	)
	for len(b) > 0 {
		c, b, boundaries, state = uniseg.Step(b, state)
		fmt.Print(string(c))
		if boundaries.Sentence() {
			fmt.Print("|")
		}
	}
}
Output:

This is sentence 1.0. |And this is sentence two.|
Example (Word)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	b := []byte("Hello, world!")
	var (
		c          []byte
		boundaries uniseg.Boundaries
		state      uniseg.State
	)
	for len(b) > 0 {
		c, b, boundaries, state = uniseg.Step(b, state)
		fmt.Print(string(c))
		if boundaries.Word() {
			fmt.Print("|")
		}
	}
}
Output:

Hello|,| |world|!|

func StepString

func StepString(str string, state State) (cluster, rest string, boundaries Boundaries, newState State)

StepString is like Step but its input and outputs are strings.

Example (Graphemes)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	str := "🇩🇪🏳️\u200d🌈!"
	var c string
	var state uniseg.State
	for len(str) > 0 {
		var boundaries uniseg.Boundaries
		c, str, boundaries, state = uniseg.StepString(str, state)
		fmt.Println(c, boundaries.Width())
	}
}
Output:

🇩🇪 2
🏳️‍🌈 2
! 1
Example (LineBreaking)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	str := "First line.\nSecond line."
	var (
		c          string
		boundaries uniseg.Boundaries
		state      uniseg.State
	)
	for len(str) > 0 {
		c, str, boundaries, state = uniseg.StepString(str, state)
		fmt.Print(c)
		switch boundaries.Line() {
		case uniseg.LineCanBreak:
			fmt.Print("|")
		case uniseg.LineMustBreak:
			fmt.Print("‖")
		}
	}
}
Output:

First |line.
‖Second |line.‖
Example (Sentence)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	str := "This is sentence 1.0. And this is sentence two."
	var (
		c          string
		boundaries uniseg.Boundaries
		state      uniseg.State
	)
	for len(str) > 0 {
		c, str, boundaries, state = uniseg.StepString(str, state)
		fmt.Print(c)
		if boundaries.Sentence() {
			fmt.Print("|")
		}
	}
}
Output:

This is sentence 1.0. |And this is sentence two.|
Example (Word)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	str := "Hello, world!"
	var (
		c          string
		boundaries uniseg.Boundaries
		state      uniseg.State
	)
	for len(str) > 0 {
		c, str, boundaries, state = uniseg.StepString(str, state)
		fmt.Print(c)
		if boundaries.Word() {
			fmt.Print("|")
		}
	}
}
Output:

Hello|,| |world|!|

func StringWidth

func StringWidth(s string) (width int)

StringWidth returns the monospace width for the given string, that is, the number of same-size cells to be occupied by the string.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	fmt.Println(uniseg.StringWidth("Hello, 世界"))
}
Output:

11

Types

type Boundaries

type Boundaries int

Boundaries is the type of the boundary information returned by Step.

func (Boundaries) Line

func (b Boundaries) Line() LineBreak

Line returns the line break information from b.

func (Boundaries) Sentence

func (b Boundaries) Sentence() bool

Sentence returns the sentence break information from b.

func (Boundaries) Width

func (b Boundaries) Width() int

Width returns the width information from b.

func (Boundaries) Word

func (b Boundaries) Word() bool

Word returns the word break information from b.

type GraphemeBreakState

type GraphemeBreakState int

GraphemeBreakState is the type of the grapheme cluster parser's states.

func FirstGraphemeCluster

func FirstGraphemeCluster(b []byte, state GraphemeBreakState) (cluster, rest []byte, width int, newState GraphemeBreakState)

FirstGraphemeCluster returns the first grapheme cluster found in the given byte slice according to the rules of Unicode Standard Annex #29, Grapheme Cluster Boundaries. This function can be called continuously to extract all grapheme clusters from a byte slice, as illustrated in the example below.

If you don't know the current state, for example when calling the function for the first time, you must pass 0. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified grapheme cluster. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "cluster" byte slice is the sub-slice of the input slice containing the identified grapheme cluster.

The returned width is the width of the grapheme cluster for most monospace fonts where a value of 1 represents one character cell.

Given an empty byte slice "b", the function returns nil values.

While slightly less convenient than using the Graphemes class, this function has much better performance and makes no allocations. It lends itself well to large byte slices.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	b := []byte("🇩🇪🏳️\u200d🌈!")
	var state uniseg.GraphemeBreakState
	var c []byte
	for len(b) > 0 {
		var width int
		c, b, width, state = uniseg.FirstGraphemeCluster(b, state)
		fmt.Println(string(c), width)
	}
}
Output:

🇩🇪 2
🏳️‍🌈 2
! 1

func FirstGraphemeClusterInString

func FirstGraphemeClusterInString(str string, state GraphemeBreakState) (cluster, rest string, width int, newState GraphemeBreakState)

FirstGraphemeClusterInString is like FirstGraphemeCluster but its input and outputs are strings.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	str := "🇩🇪🏳️\u200d🌈!"
	var state uniseg.GraphemeBreakState
	var c string
	for len(str) > 0 {
		var width int
		c, str, width, state = uniseg.FirstGraphemeClusterInString(str, state)
		fmt.Println(c, width)
	}
}
Output:

🇩🇪 2
🏳️‍🌈 2
! 1

type Graphemes

type Graphemes struct {
	// contains filtered or unexported fields
}

Graphemes implements an iterator over Unicode grapheme clusters, or user-perceived characters. While iterating, it also provides information about word boundaries, sentence boundaries, line breaks, and monospace character widths.

After constructing the class via NewGraphemes for a given string "str", Graphemes.Next is called for every grapheme cluster in a loop until it returns false. Inside the loop, information about the grapheme cluster as well as boundary information and character width is available via the various methods (see examples below).

Using this class to iterate over a string is convenient but it is much slower than using this package's Step or StepString functions or any of the other specialized functions starting with "First".

Example (Graphemes)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	g := uniseg.NewGraphemes("🇩🇪🏳️\u200d🌈")
	for g.Next() {
		fmt.Println(g.Str())
	}
}
Output:

🇩🇪
🏳️‍🌈
Example (LineBreaking)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	g := uniseg.NewGraphemes("First line.\nSecond line.")
	for g.Next() {
		fmt.Print(g.Str())
		switch g.LineBreak() {
		case uniseg.LineCanBreak:
			fmt.Print("|")
		case uniseg.LineMustBreak:
			fmt.Print("‖")
		}
	}
}
Output:

First |line.
‖Second |line.‖
Example (Sentence)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	g := uniseg.NewGraphemes("This is sentence 1.0. And this is sentence two.")
	for g.Next() {
		fmt.Print(g.Str())
		if g.IsSentenceBoundary() {
			fmt.Print("|")
		}
	}
}
Output:

This is sentence 1.0. |And this is sentence two.|
Example (Word)
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	g := uniseg.NewGraphemes("Hello, world!")
	for g.Next() {
		fmt.Print(g.Str())
		if g.IsWordBoundary() {
			fmt.Print("|")
		}
	}
}
Output:

Hello|,| |world|!|

func NewGraphemes

func NewGraphemes(str string) *Graphemes

NewGraphemes returns a new grapheme cluster iterator.

func (*Graphemes) Bytes

func (g *Graphemes) Bytes() []byte

Bytes returns a byte slice which corresponds to the current grapheme cluster. If the iterator is already past the end or Graphemes.Next has not yet been called, nil is returned.

func (*Graphemes) IsSentenceBoundary

func (g *Graphemes) IsSentenceBoundary() bool

IsSentenceBoundary returns true if a sentence ends after the current grapheme cluster.

func (*Graphemes) IsWordBoundary

func (g *Graphemes) IsWordBoundary() bool

IsWordBoundary returns true if a word ends after the current grapheme cluster.

func (*Graphemes) LineBreak

func (g *Graphemes) LineBreak() LineBreak

LineBreak returns whether the line can be broken after the current grapheme cluster. A value of LineDontBreak means the line may not be broken, a value of LineMustBreak means the line must be broken, and a value of LineCanBreak means the line may or may not be broken.

func (*Graphemes) Next

func (g *Graphemes) Next() bool

Next advances the iterator by one grapheme cluster and returns false if no clusters are left. This function must be called before the first cluster is accessed.

func (*Graphemes) Positions

func (g *Graphemes) Positions() (int, int)

Positions returns the interval of the current grapheme cluster as byte positions into the original string. The first returned value "from" indexes the first byte and the second returned value "to" indexes the first byte that is not included anymore, i.e. str[from:to] is the current grapheme cluster of the original string "str". If Graphemes.Next has not yet been called, both values are 0. If the iterator is already past the end, both values are 1.

func (*Graphemes) Reset

func (g *Graphemes) Reset()

Reset puts the iterator into its initial state such that the next call to Graphemes.Next sets it to the first grapheme cluster again.

func (*Graphemes) Runes

func (g *Graphemes) Runes() []rune

Runes returns a slice of runes (code points) which corresponds to the current grapheme cluster. If the iterator is already past the end or Graphemes.Next has not yet been called, nil is returned.

func (*Graphemes) Str

func (g *Graphemes) Str() string

Str returns a substring of the original string which corresponds to the current grapheme cluster. If the iterator is already past the end or Graphemes.Next has not yet been called, an empty string is returned.

func (*Graphemes) Width

func (g *Graphemes) Width() int

Width returns the monospace width of the current grapheme cluster.

type LineBreak

type LineBreak int

LineBreak defines whether a given text may be broken into the next line.

const (
	LineDontBreak LineBreak = iota // You may not break the line here.
	LineCanBreak                   // You may or may not break the line here.
	LineMustBreak                  // You must break the line here.
)

These constants define whether a given text may be broken into the next line. If the break is optional (LineCanBreak), you may choose to break or not based on your own criteria, for example, if the text has reached the available width.
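
One possible "own criterion" is a greedy fit: break at an optional opportunity only when the next piece would overflow the line. The sketch below is self-contained and uses hypothetical stand-ins: segment mimics the (segment, mustBreak) pairs that FirstLineSegmentInString returns, and width is approximated by the rune count, which is only adequate for simple text. In real code the segments and widths would come from this package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// segment is a stand-in for one line segmentation result: a piece of text
// and whether a break after it is mandatory.
type segment struct {
	text      string
	mustBreak bool
}

// wrap greedily packs segments into lines of at most maxWidth cells.
// A segment wider than maxWidth is placed on a line of its own and
// overflows; splitting it further is out of scope for this sketch.
func wrap(segs []segment, maxWidth int) []string {
	var lines []string
	var cur strings.Builder
	curWidth := 0
	flush := func() {
		lines = append(lines, strings.TrimRight(cur.String(), " \n"))
		cur.Reset()
		curWidth = 0
	}
	for _, s := range segs {
		// Trailing spaces and newlines do not count toward overflow.
		w := utf8.RuneCountInString(strings.TrimRight(s.text, " \n"))
		if curWidth > 0 && curWidth+w > maxWidth {
			flush() // optional break taken: the segment would not fit
		}
		cur.WriteString(s.text)
		curWidth += utf8.RuneCountInString(s.text)
		if s.mustBreak {
			flush() // mandatory break
		}
	}
	if cur.Len() > 0 {
		flush()
	}
	return lines
}

func main() {
	segs := []segment{
		{"First ", false}, {"line.\n", true},
		{"Second ", false}, {"line.", true},
	}
	fmt.Printf("%q\n", wrap(segs, 20)) // ["First line." "Second line."]
	fmt.Printf("%q\n", wrap(segs, 8))  // ["First" "line." "Second" "line."]
}
```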

type LineBreakState

type LineBreakState int

LineBreakState is the type of the line break parser's states.

func FirstLineSegment

func FirstLineSegment(b []byte, state LineBreakState) (segment, rest []byte, mustBreak bool, newState LineBreakState)

FirstLineSegment returns the prefix of the given byte slice after which a decision to break the string over to the next line can or must be made, according to the rules of Unicode Standard Annex #14. This is used to implement line breaking.

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area.

The returned "segment" may not be broken into smaller parts, unless no other breaking opportunities present themselves, in which case you may break by grapheme clusters (using the FirstGraphemeCluster function to determine the grapheme clusters).

The "mustBreak" flag indicates whether you MUST break the line after the given segment (true), for example after newline characters, or you MAY break the line after the given segment (false).

This function can be called continuously to extract all non-breaking sub-sets from a byte slice, as illustrated in the example below.

If you don't know the current state, for example when calling the function for the first time, pass the zero value of LineBreakState, as the example below does. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified line segment. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "segment" byte slice is the sub-slice of the input slice containing the identified line segment.

Given an empty byte slice "b", the function returns nil values.

Note that in accordance with UAX #14 LB3, the final segment will end with "mustBreak" set to true. You can choose to ignore this by checking if the length of the "rest" slice is 0 and calling HasTrailingLineBreak or HasTrailingLineBreakInString on the final segment.

Note also that this algorithm may break within grapheme clusters. This is addressed in Section 8.2 Example 6 of UAX #14. To avoid this, you can use the Step function instead.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	b := []byte("First line.\nSecond line.")
	var (
		c         []byte
		mustBreak bool
		state     uniseg.LineBreakState
	)
	for len(b) > 0 {
		c, b, mustBreak, state = uniseg.FirstLineSegment(b, state)
		fmt.Printf("(%s)", string(c))
		if mustBreak {
			fmt.Print("!")
		}
	}
}
Output:

(First )(line.
)!(Second )(line.)!

func FirstLineSegmentInString

func FirstLineSegmentInString(str string, state LineBreakState) (segment, rest string, mustBreak bool, newState LineBreakState)

FirstLineSegmentInString is like FirstLineSegment but its input and outputs are strings.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	str := "First line.\nSecond line."
	var (
		c         string
		mustBreak bool
		state     uniseg.LineBreakState
	)
	for len(str) > 0 {
		c, str, mustBreak, state = uniseg.FirstLineSegmentInString(str, state)
		fmt.Printf("(%s)", c)
		if mustBreak {
			fmt.Println(" < must break")
		} else {
			fmt.Println(" < may break")
		}
	}
}
Output:

(First ) < may break
(line.
) < must break
(Second ) < may break
(line.) < must break

type Parser

type Parser struct {
	// EastAsianWidth controls the width of characters
	// with the East Asian Width Ambiguous attribute.
	//
	// If it is true, the parser treats Unicode text
	// in the context of East Asian traditional character encodings.
	// The width of characters with the East Asian Width Ambiguous attribute is 2.
	//
	// If it is false, the parser treats Unicode text
	// in the context of non-East Asian traditional character encodings.
	// The width of characters with the East Asian Width Ambiguous attribute is 1.
	EastAsianWidth bool

	// WideEmoji controls the width of Emoji characters.
	// [UAX #11] recommends that Emoji characters should be rendered
	// with a width of 1, however some fonts render Emoji characters
	// wider than other characters.
	// WideEmoji is used to maintain compatibility with such fonts.
	//
	// If it is true, the width of Emoji characters is 2.
	// Otherwise, the width of Emoji characters is 1.
	// It is effective only when EastAsianWidth is true.
	//
	// [UAX #11]: https://www.unicode.org/reports/tr11/tr11-40.html
	WideEmoji bool
}

Parser is a parser for Unicode text.

func (*Parser) FirstGraphemeCluster

func (p *Parser) FirstGraphemeCluster(b []byte, state GraphemeBreakState) (cluster, rest []byte, width int, newState GraphemeBreakState)

FirstGraphemeCluster returns the first grapheme cluster found in the given byte slice according to the rules of Unicode Standard Annex #29, Grapheme Cluster Boundaries. This function can be called continuously to extract all grapheme clusters from a byte slice, as illustrated in the example for the package-level FirstGraphemeCluster function.

If you don't know the current state, for example when calling the function for the first time, you must pass 0. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified grapheme cluster. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "cluster" byte slice is the sub-slice of the input slice containing the identified grapheme cluster.

The returned width is the width of the grapheme cluster for most monospace fonts where a value of 1 represents one character cell.

Given an empty byte slice "b", the function returns nil values.

While slightly less convenient than using the Graphemes class, this function has much better performance and makes no allocations. It lends itself well to large byte slices.

func (*Parser) FirstGraphemeClusterInString

func (p *Parser) FirstGraphemeClusterInString(str string, state GraphemeBreakState) (cluster, rest string, width int, newState GraphemeBreakState)

FirstGraphemeClusterInString is like Parser.FirstGraphemeCluster but its input and outputs are strings.

func (*Parser) FirstLineSegment

func (*Parser) FirstLineSegment(b []byte, state LineBreakState) (segment, rest []byte, mustBreak bool, newState LineBreakState)

FirstLineSegment returns the prefix of the given byte slice after which a decision to break the string over to the next line can or must be made, according to the rules of Unicode Standard Annex #14. This is used to implement line breaking.

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area.

The returned "segment" may not be broken into smaller parts, unless no other breaking opportunities present themselves, in which case you may break by grapheme clusters (using the FirstGraphemeCluster function to determine the grapheme clusters).

The "mustBreak" flag indicates whether you MUST break the line after the given segment (true), for example after newline characters, or you MAY break the line after the given segment (false).
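The difference between optional and mandatory breaks can be illustrated with a greedy wrapping loop. The following is a minimal sketch, not part of this package: the `segment` type and the `wrap` function are hypothetical, the segments are written by hand rather than produced by FirstLineSegment, and width is approximated by the rune count purely to keep the sketch free of dependencies.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// segment mirrors one result of a FirstLineSegmentInString-style call:
// the text plus whether a break after it is mandatory.
// It is a hypothetical type for this sketch.
type segment struct {
	text      string
	mustBreak bool
}

// wrap greedily packs segments into lines of at most maxWidth cells.
// Assumption: one rune occupies one cell, which is not true in general.
func wrap(segs []segment, maxWidth int) []string {
	var lines []string
	var cur strings.Builder
	curWidth := 0
	flush := func() {
		lines = append(lines, cur.String())
		cur.Reset()
		curWidth = 0
	}
	for _, s := range segs {
		w := utf8.RuneCountInString(s.text)
		// An optional break is taken only if the segment would overflow.
		if curWidth > 0 && curWidth+w > maxWidth {
			flush()
		}
		cur.WriteString(s.text)
		curWidth += w
		// A mandatory break always ends the current line.
		if s.mustBreak {
			flush()
		}
	}
	if cur.Len() > 0 {
		flush()
	}
	return lines
}

func main() {
	segs := []segment{
		{"Hello, ", false},
		{"wide ", false},
		{"world!\n", true},
		{"Bye.", true},
	}
	for _, l := range wrap(segs, 12) {
		fmt.Printf("%q\n", l)
	}
}
```

With a limit of 12 cells the sketch yields three lines: "Hello, wide ", "world!\n", and "Bye.". A segment wider than the limit is placed on a line of its own here; a fuller implementation would fall back to breaking it by grapheme clusters as described above.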

This function can be called repeatedly to extract all unbreakable segments from a byte slice.

If you don't know the current state, for example when calling the function for the first time, you must pass -1. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified line segment. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "segment" byte slice is the sub-slice of the input slice containing the identified line segment.

Given an empty byte slice "b", the function returns nil values.

Note that in accordance with UAX #14 LB3, the final segment will end with "mustBreak" set to true. You can choose to ignore this by checking if the length of the "rest" slice is 0 and calling HasTrailingLineBreak or HasTrailingLineBreakInString on the last rune.

Note also that this algorithm may break within grapheme clusters. This is addressed in Section 8.2 Example 6 of UAX #14. To avoid this, you can use the Step function instead.
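The LB3 note above can be acted on as sketched below. This is a dependency-free illustration: `endsWithHardBreak` is a hypothetical stand-in for HasTrailingLineBreak that hard-codes the code points UAX #14 assigns to the classes BK, CR, LF, and NL, and `isRealBreak` is likewise a hypothetical helper, not part of this package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// endsWithHardBreak reports whether the last rune of b belongs to one of
// the UAX #14 classes BK, CR, LF, or NL. It is a stand-in written for
// this sketch, not the package's implementation.
func endsWithHardBreak(b []byte) bool {
	r, _ := utf8.DecodeLastRune(b)
	switch r {
	case '\n', '\r', '\v', '\f', 0x0085, 0x2028, 0x2029:
		return true
	}
	return false
}

// isRealBreak reports whether a mustBreak flag reflects an actual line
// break in the text rather than the end-of-text break required by LB3.
func isRealBreak(segment, rest []byte, mustBreak bool) bool {
	if !mustBreak {
		return false
	}
	if len(rest) == 0 && !endsWithHardBreak(segment) {
		return false // the flag only marks the end of the text
	}
	return true
}

func main() {
	fmt.Println(isRealBreak([]byte("line one\n"), []byte("line two"), true)) // true
	fmt.Println(isRealBreak([]byte("line two"), nil, true))                  // false
	fmt.Println(isRealBreak([]byte("line two\n"), nil, true))                // true
}
```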

func (*Parser) FirstLineSegmentInString

func (*Parser) FirstLineSegmentInString(str string, state LineBreakState) (segment, rest string, mustBreak bool, newState LineBreakState)

FirstLineSegmentInString is like Parser.FirstLineSegment but its input and outputs are strings.

func (*Parser) FirstSentence

func (*Parser) FirstSentence(b []byte, state SentenceBreakState) (sentence, rest []byte, newState SentenceBreakState)

FirstSentence returns the first sentence found in the given byte slice according to the rules of Unicode Standard Annex #29, Sentence Boundaries. This function can be called repeatedly to extract all sentences from a byte slice.

If you don't know the current state, for example when calling the function for the first time, you must pass 0. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified sentence. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "sentence" byte slice is the sub-slice of the input slice containing the identified sentence.

Given an empty byte slice "b", the function returns nil values.
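The calling convention described above (start from the zero state, then feed each call's rest slice and state into the next call) can be wrapped in a small generic driver. The following is a minimal sketch, not part of this package: `collect` and `firstChunk` are hypothetical names, and `firstChunk` merely cuts after each ASCII space instead of applying the UAX #29 rules.

```go
package main

import (
	"bytes"
	"fmt"
)

// collect repeatedly calls first, threading the rest slice and the state
// from each call into the next one, until the input is used up.
func collect[S any](b []byte, first func([]byte, S) ([]byte, []byte, S)) [][]byte {
	var state S // zero value: the state is not yet known
	var parts [][]byte
	for len(b) > 0 {
		var part []byte
		part, b, state = first(b, state)
		parts = append(parts, part)
	}
	return parts
}

// firstChunk is a hypothetical stand-in with the same shape as a First*
// function. It cuts after the first ASCII space and ignores the state.
func firstChunk(b []byte, state int) (part, rest []byte, newState int) {
	if len(b) == 0 {
		return nil, nil, state
	}
	i := bytes.IndexByte(b, ' ')
	if i < 0 {
		return b, nil, state
	}
	return b[:i+1], b[i+1:], state
}

func main() {
	for _, p := range collect([]byte("One. Two. Three."), firstChunk) {
		fmt.Printf("(%s)\n", p)
	}
}
```

Going by the signatures listed on this page, a function such as FirstSentence or a method value such as p.FirstSentence has the same shape and could presumably be passed as `first` in place of `firstChunk`.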

func (*Parser) FirstSentenceInString

func (*Parser) FirstSentenceInString(str string, state SentenceBreakState) (sentence, rest string, newState SentenceBreakState)

FirstSentenceInString is like Parser.FirstSentence but its input and outputs are strings.

func (*Parser) FirstWord

func (*Parser) FirstWord(b []byte, state WordBreakState) (word, rest []byte, newState WordBreakState)

FirstWord returns the first word found in the given byte slice according to the rules of Unicode Standard Annex #29, Word Boundaries. This function can be called repeatedly to extract all words from a byte slice.

If you don't know the current state, for example when calling the function for the first time, you must pass 0. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified word. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "word" byte slice is the sub-slice of the input slice containing the identified word.

Given an empty byte slice "b", the function returns nil values.

func (*Parser) FirstWordInString

func (*Parser) FirstWordInString(str string, state WordBreakState) (word, rest string, newState WordBreakState)

FirstWordInString is like Parser.FirstWord but its input and outputs are strings.

func (*Parser) GraphemeClusterCount

func (p *Parser) GraphemeClusterCount(s string) (n int)

GraphemeClusterCount returns the number of user-perceived characters (grapheme clusters) for the given string.
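To see why a dedicated count is needed, it helps to compare it with the two counts the standard library offers. The sketch below uses only the standard library; `counts` is a hypothetical helper, and the input is the decomposed "Käse" from the table in the README.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the UTF-8 byte length and the code point count of s.
func counts(s string) (byteLen, runeLen int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "Käse" with a decomposed ä: 'a' followed by U+0308 COMBINING DIAERESIS.
	s := "Ka\u0308se"
	b, r := counts(s)
	fmt.Println(b, r) // 6 5
}
```

Neither number matches what a reader sees: according to the README table, the same string consists of 4 grapheme clusters, which is the value a grapheme cluster count is meant to report.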

func (*Parser) HasTrailingLineBreak

func (*Parser) HasTrailingLineBreak(b []byte) bool

HasTrailingLineBreak returns true if the last rune in the given byte slice is one of the hard line break code points defined in LB4 and LB5 of UAX #14.

func (*Parser) HasTrailingLineBreakInString

func (*Parser) HasTrailingLineBreakInString(str string) bool

HasTrailingLineBreakInString is like HasTrailingLineBreak but for a string.

func (*Parser) NewGraphemes

func (p *Parser) NewGraphemes(str string) *Graphemes

func (*Parser) Step

func (p *Parser) Step(b []byte, state State) (cluster, rest []byte, boundaries Boundaries, newState State)

Step returns the first grapheme cluster (user-perceived character) found in the given byte slice. It also returns information about the boundary between that grapheme cluster and the one following it as well as the monospace width of the grapheme cluster. There are three types of boundary information: word boundaries, sentence boundaries, and line breaks. This function is therefore a combination of FirstGraphemeCluster, FirstWord, FirstSentence, and FirstLineSegment.

This function can be called repeatedly to extract all grapheme clusters from a byte slice.

If you don't know which state to pass, for example when calling the function for the first time, you must pass 0. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified grapheme cluster. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "cluster" byte slice is the sub-slice of the input slice containing the first identified grapheme cluster.

Given an empty byte slice "b", the function returns nil values.

While slightly less convenient than using the Graphemes type, this function performs much better and makes no allocations. It lends itself well to large byte slices.

Note that in accordance with UAX #14 LB3, the final segment will end with a mandatory line break (boundaries&maskLine == LineMustBreak). You can choose to ignore this by checking if the length of the "rest" slice is 0 and calling HasTrailingLineBreak or HasTrailingLineBreakInString on the last rune.
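An expression such as boundaries&maskLine == LineMustBreak suggests that several pieces of information are packed into one integer. The sketch below shows how such a packed value can be taken apart. It is an illustration under assumptions: every constant is defined locally, the bit layout is chosen for the example and is not read from this package, and the `info` type and `decode` function are hypothetical.

```go
package main

import "fmt"

// Illustrative layout only. These constants are assumptions made for the
// sketch; they are not the package's own identifiers or values.
const (
	lineDontBreak = 0
	lineCanBreak  = 1
	lineMustBreak = 2

	maskLine     = 3      // two low bits: line break status
	maskWord     = 1 << 2 // set if a word boundary follows
	maskSentence = 1 << 3 // set if a sentence boundary follows
	shiftWidth   = 4      // remaining bits: monospace width
)

// info holds the unpacked fields of one boundaries value.
type info struct {
	line     int // one of the line* constants
	word     bool
	sentence bool
	width    int
}

// decode splits a packed boundaries value into its fields.
func decode(boundaries int) info {
	return info{
		line:     boundaries & maskLine,
		word:     boundaries&maskWord != 0,
		sentence: boundaries&maskSentence != 0,
		width:    boundaries >> shiftWidth,
	}
}

func main() {
	// Width 2, followed by a word boundary and a mandatory line break.
	b := 2<<shiftWidth | maskWord | lineMustBreak
	d := decode(b)
	fmt.Println(d.line == lineMustBreak, d.word, d.sentence, d.width) // true true false 2
}
```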

func (*Parser) StepString

func (p *Parser) StepString(str string, state State) (cluster, rest string, boundaries Boundaries, newState State)

StepString is like Parser.Step but its input and outputs are strings.

func (*Parser) StringWidth

func (p *Parser) StringWidth(s string) (width int)

StringWidth returns the monospace width for the given string, that is, the number of same-size cells to be occupied by the string.

type SentenceBreakState

type SentenceBreakState int

SentenceBreakState is the state of the sentence break parser.

func FirstSentence

func FirstSentence(b []byte, state SentenceBreakState) (sentence, rest []byte, newState SentenceBreakState)

FirstSentence returns the first sentence found in the given byte slice according to the rules of Unicode Standard Annex #29, Sentence Boundaries. This function can be called repeatedly to extract all sentences from a byte slice, as illustrated in the example below.

If you don't know the current state, for example when calling the function for the first time, you must pass 0. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified sentence. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "sentence" byte slice is the sub-slice of the input slice containing the identified sentence.

Given an empty byte slice "b", the function returns nil values.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	b := []byte("This is sentence 1.0. And this is sentence two.")
	var state uniseg.SentenceBreakState
	var c []byte
	for len(b) > 0 {
		c, b, state = uniseg.FirstSentence(b, state)
		fmt.Printf("(%s)\n", string(c))
	}
}
Output:

(This is sentence 1.0. )
(And this is sentence two.)

func FirstSentenceInString

func FirstSentenceInString(str string, state SentenceBreakState) (sentence, rest string, newState SentenceBreakState)

FirstSentenceInString is like FirstSentence but its input and outputs are strings.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	str := "This is sentence 1.0. And this is sentence two."
	var state uniseg.SentenceBreakState
	var c string
	for len(str) > 0 {
		c, str, state = uniseg.FirstSentenceInString(str, state)
		fmt.Printf("(%s)\n", c)
	}
}
Output:

(This is sentence 1.0. )
(And this is sentence two.)

type State

type State int

State is the state of the Step parser.

type WordBreakState

type WordBreakState int

WordBreakState is the state of the word break parser.

func FirstWord

func FirstWord(b []byte, state WordBreakState) (word, rest []byte, newState WordBreakState)

FirstWord returns the first word found in the given byte slice according to the rules of Unicode Standard Annex #29, Word Boundaries. This function can be called repeatedly to extract all words from a byte slice, as illustrated in the example below.

If you don't know the current state, for example when calling the function for the first time, you must pass 0. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified word. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "word" byte slice is the sub-slice of the input slice containing the identified word.

Given an empty byte slice "b", the function returns nil values.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	b := []byte("Hello, world!")
	var state uniseg.WordBreakState
	var c []byte
	for len(b) > 0 {
		c, b, state = uniseg.FirstWord(b, state)
		fmt.Printf("(%s)\n", string(c))
	}
}
Output:

(Hello)
(,)
( )
(world)
(!)

func FirstWordInString

func FirstWordInString(str string, state WordBreakState) (word, rest string, newState WordBreakState)

FirstWordInString is like FirstWord but its input and outputs are strings.

Example
package main

import (
	"fmt"

	"github.com/shogo82148/uniseg"
)

func main() {
	str := "Hello, world!"
	var state uniseg.WordBreakState
	var c string
	for len(str) > 0 {
		c, str, state = uniseg.FirstWordInString(str, state)
		fmt.Printf("(%s)\n", c)
	}
}
Output:

(Hello)
(,)
( )
(world)
(!)

Directories

Path Synopsis
internal
cmd/gen_properties
This program generates a Go property file from Unicode Character Database auxiliary data files.
