goutfs

Version: v0.0.3 | Published: Feb 27, 2023 | License: BSD-2-Clause

README

Go UTF-8 string.

Provides a String struct type that allows per-character addressing, slicing, and truncation, while ensuring that characters requiring multiple code points are never split mid-character.

Example Use

Documentation is available via GoDoc and https://pkg.go.dev/github.com/karrick/goutfs?tab=doc.

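package goutfs_test

import (
    "fmt"

    "github.com/karrick/goutfs"
)

func ExampleString() {
    s := goutfs.NewString("cafés")
    fmt.Println(s.Len())                 // number of characters, not bytes
    fmt.Println(string(s.Char(3)))       // bytes of the character at index 3
    fmt.Println(string(s.Slice(0, 4)))   // characters 0 through 3
    fmt.Println(string(s.Slice(4, -1)))  // character 4 to the end
    s.Trunc(3)                           // keep only the first 3 characters
    fmt.Println(string(s.Bytes()))
    // Output:
    // 5
    // é
    // café
    // s
    // caf
}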

Definitions

For the purposes of this library, I have tried to adopt both the universal and the Go-specific terminology for characters, code points, runes, and bytes. I may have misread a resource and made an error in my terminology, but I have made a best effort.

Character

Each character occupies a single column in the output, and roughly corresponds to what a human sees when they look at the printed text. A human might see the Latin letter e with a grave accent over it, for example.

Characters are stored and transmitted using some encoding. In Unicode parlance, a character is represented by one or more code points, which are in turn encoded as bytes. Because of how combining characters work in Unicode, some characters have multiple code point representations. For instance, the lower case letter e with a grave accent could be encoded as a single Unicode code point or, alternatively, as two code points: the first would be the lower case Latin letter e, and the second would be what is known as a combining code point, in this case the combining grave accent. Both representations result in the same character being displayed, but each is encoded by a different sequence of bytes. There are libraries to normalize these representations to one of several canonical forms. However, I am not certain character normalization needs to be addressed in this library.
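As a minimal sketch of the two representations described above, the following standalone program compares the precomposed and decomposed forms of e with a grave accent. It uses only the standard library; the names precomposed, decomposed, and describe are chosen for illustration and are not part of goutfs.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// precomposed is U+00E8, LATIN SMALL LETTER E WITH GRAVE, as one code point.
const precomposed = "\u00e8"

// decomposed is U+0065 (e) followed by U+0300, COMBINING GRAVE ACCENT.
const decomposed = "e\u0300"

// describe reports the byte length and the code point count of s.
func describe(s string) (byteLen, codePoints int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{precomposed, decomposed} {
		b, n := describe(s)
		fmt.Printf("% x: %d bytes, %d code point(s)\n", s, b, n)
	}
	// Output:
	// c3 a8: 2 bytes, 1 code point(s)
	// 65 cc 80: 3 bytes, 2 code point(s)
}
```

Both strings render as the same character, yet they compare unequal as Go strings because their bytes differ.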

Code Point, a.k.a. Go rune

A code point is called a rune in Go parlance. A Go rune is stored as an int32 value. Remember that a rune is not necessarily a single character: some characters have multiple Unicode representations, each of which may consist of one or more code points.

Another point (no pun intended) is that there are look-alike characters in Unicode. These are not just different code point sequences that represent the same character, but two different characters that happen to look alike. For instance, the Latin capital K looks identical to the Kelvin symbol, which has its own Unicode code point. This library need not concern itself with look-alike characters. In order to function correctly, it merely needs to know at which byte offset one character ends and the next character begins.
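To make the look-alike case concrete, here is a small sketch using only the standard library. The constant names and the encodedLen helper are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

const (
	latinK = 'K'      // U+004B LATIN CAPITAL LETTER K
	kelvin = '\u212A' // U+212A KELVIN SIGN
)

// encodedLen returns the number of bytes r occupies when encoded as UTF-8.
func encodedLen(r rune) int {
	return len(string(r))
}

func main() {
	fmt.Printf("%U is %d byte(s)\n", latinK, encodedLen(latinK))
	fmt.Printf("%U is %d byte(s)\n", kelvin, encodedLen(kelvin))
	fmt.Println(latinK == kelvin)
	// Output:
	// U+004B is 1 byte(s)
	// U+212A is 3 byte(s)
	// false
}
```

The two runes are distinct code points with different byte lengths, so finding character boundaries does not require knowing that they look the same.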

Strings

Go places no restrictions on the sequence of bytes stored in a string. The only related constraint is that Go source code is defined to be UTF-8, which means most string literal values are valid UTF-8 encodings. This is not always the case, however, because Go allows byte-level escapes in string literals, and those may or may not represent valid UTF-8 encoded data.
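The difference between a Unicode escape and a byte-level escape can be sketched as follows. This uses only the standard library; escaped, rawByte, and inspect are illustrative names, not goutfs identifiers.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	// escaped uses a Unicode escape, which the compiler encodes as UTF-8.
	escaped = "caf\u00e9"
	// rawByte uses a byte-level escape: a lone 0xE9 byte is not valid UTF-8.
	rawByte = "caf\xe9"
)

// inspect reports the byte length of s and whether it is valid UTF-8.
func inspect(s string) (int, bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	for _, s := range []string{escaped, rawByte} {
		n, ok := inspect(s)
		fmt.Printf("%+q: %d bytes, valid UTF-8: %t\n", s, n, ok)
	}
	// Output:
	// "caf\u00e9": 5 bytes, valid UTF-8: true
	// "caf\xe9": 4 bytes, valid UTF-8: false
}
```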

Iterating over a UTF-8 string will yield some runes that occupy multiple bytes and other runes that occupy a single byte.
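A short sketch of that iteration, using Go's range loop over a string, which advances by one rune at a time and reports the byte offset where each rune begins. The runeOffsets helper is a hypothetical name used only here.

```go
package main

import "fmt"

// runeOffsets returns the byte offset at which each rune of s begins.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "caf\u00e9s" // "cafés" with é as the single code point U+00E9
	fmt.Println(len(s))
	fmt.Println(runeOffsets(s))
	// Output:
	// 6
	// [0 1 2 3 5]
}
```

The jump from offset 3 to offset 5 shows that é occupies two bytes while every other rune in this string occupies one.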

Starting vs Non-Starting (Combining) Rune

Unicode defines many code points that are called starting code points. They may be displayed independently of any other code point, and they may be modified indefinitely by appending what are called non-starting code points. Non-starting code points are more frequently called combining code points in the literature. Each valid Unicode character sequence begins with a starting code point and may be followed by zero or more non-starting code points.
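One way to turn this rule into character boundaries is sketched below. It is an approximation under a stated assumption: it treats any rune for which the standard library's unicode.IsMark returns true as a non-starting code point, which is not necessarily how goutfs itself decides. The charOffsets function is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// charOffsets returns the byte offset where each character of s begins,
// followed by a final entry equal to len(s). A new character begins at
// every rune that is not a mark (assumption: marks approximate
// non-starting code points). A leading mark still begins a character.
func charOffsets(s string) []int {
	var offsets []int
	for i, r := range s {
		if i == 0 || !unicode.IsMark(r) {
			offsets = append(offsets, i)
		}
	}
	return append(offsets, len(s))
}

func main() {
	s := "cafe\u0301s" // é written as e followed by U+0301 COMBINING ACUTE ACCENT
	offsets := charOffsets(s)
	fmt.Println(offsets)
	fmt.Println(len(offsets) - 1)                  // character count
	fmt.Printf("%+q\n", s[offsets[3]:offsets[4]]) // bytes of character 3
	// Output:
	// [0 1 2 3 6 7]
	// 5
	// "e\u0301"
}
```

With the offsets in hand, addressing, slicing, or truncating by character reduces to slicing the underlying bytes at those offsets, so a combining code point stays attached to its starting code point.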

References

  1. https://blog.golang.org/strings
  2. https://blog.golang.org/normalization
  3. https://pkg.go.dev/golang.org/x/text/transform?tab=doc

Documentation

Types

type String

type String struct {
	// contains filtered or unexported fields
}

String is a UTF-8 encoded string that allows per-character addressing, slicing, and truncation.

Example
s := NewString("cafés")
fmt.Println(s.Len())
fmt.Println(string(s.Char(3)))
fmt.Println(string(s.Slice(0, 4)))
fmt.Println(string(s.Slice(4, -1)))
s.Trunc(3)
fmt.Println(string(s.Bytes()))
Output:

5
é
café
s
caf

func NewString

func NewString(s string) *String

NewString returns a new String by evaluating its input as a UTF-8 sequence of bytes and storing the offset to each addressable character.

func (*String) Bytes

func (s *String) Bytes() []byte

Bytes returns the entire slice of bytes that encode all characters in the String.

func (*String) Char

func (s *String) Char(i int) []byte

Char returns the slice of bytes that encode the Ith character.

func (*String) Len

func (s *String) Len() int

Len returns the number of characters in the String.

func (*String) Slice

func (s *String) Slice(i, j int) []byte

Slice returns the slice of bytes that encode the Ith through (J-1)th characters of the String. There are two special cases: when i is -1, this returns nil; when j is -1, this returns from the Ith character to the end of the String.

func (*String) Trunc

func (s *String) Trunc(i int)

Trunc truncates the String to a maximum of i characters. As a special case, the String is truncated to the empty string when i is less than or equal to 0. No operation is performed when i is greater than or equal to the number of characters in the String.
