utf8

package module
v0.0.0-...-e474e88
Published: Mar 31, 2024 License: MIT Imports: 3 Imported by: 9

README

go-utf8

Package utf8 implements encoding and decoding of UTF-8, for the Go programming language.

This package is meant to be a replacement for Go's built-in "unicode/utf8" package.

Documentation

Online documentation, which includes examples, can be found at: http://godoc.org/sourcecode.social/reiver/go-utf8

Reading a Single UTF-8 Character

This is the simplest way of reading a single UTF-8 character.

var reader io.Reader

// ...

r, n, err := utf8.ReadRune(reader)

Writing a Single UTF-8 Character

This is the simplest way of writing a single UTF-8 character.

var writer io.Writer

// ...

var r rune

// ...

n, err := utf8.WriteRune(writer, r)

io.RuneReader

This is how you can create an io.RuneReader:

var reader io.Reader

// ...

var runeReader io.RuneReader = utf8.NewRuneReader(reader)

// ...

r, n, err := runeReader.ReadRune()

io.RuneScanner

This is how you can create an io.RuneScanner:

var reader io.Reader

// ...

var runeScanner io.RuneScanner = utf8.NewRuneScanner(reader)

// ...

r, n, err := runeScanner.ReadRune()

// ...

err = runeScanner.UnreadRune()

UTF-8

UTF-8 is a variable-length encoding of Unicode. The encoding of a single Unicode code point is from 1 to 4 bytes long.

Some examples of UTF-8 encoding of Unicode code points are:

| byte 1      | byte 2      | byte 3      | byte 4      | value | code point | decimal | binary                      | name                             |
|-------------|-------------|-------------|-------------|-------|------------|---------|-----------------------------|----------------------------------|
| 0b0,1000001 |             |             |             | A     | U+0041     | 65      | 0b0000,0000,0100,0001       | LATIN CAPITAL LETTER A           |
| 0b0,1110010 |             |             |             | r     | U+0072     | 114     | 0b0000,0000,0111,0010       | LATIN SMALL LETTER R             |
| 0b110,00010 | 0b10,100001 |             |             | ¡     | U+00A1     | 161     | 0b0000,0000,1010,0001       | INVERTED EXCLAMATION MARK        |
| 0b110,11011 | 0b10,110101 |             |             | ۵     | U+06F5     | 1781    | 0b0000,0110,1111,0101       | EXTENDED ARABIC-INDIC DIGIT FIVE |
| 0b1110,0010 | 0b10,000000 | 0b10,110001 |             | ‱     | U+2031     | 8241    | 0b0010,0000,0011,0001       | PER TEN THOUSAND SIGN            |
| 0b1110,0010 | 0b10,001001 | 0b10,100001 |             | ≡     | U+2261     | 8801    | 0b0010,0010,0110,0001       | IDENTICAL TO                     |
| 0b11110,000 | 0b10,010000 | 0b10,001111 | 0b10,010101 | 𐏕     | U+000103D5 | 66517   | 0b0001,0000,0011,1101,0101  | OLD PERSIAN NUMBER HUNDRED       |
| 0b11110,000 | 0b10,011111 | 0b10,011001 | 0b10,000010 | 🙂     | U+0001F642 | 128578  | 0b0001,1111,0110,0100,0010  | SLIGHTLY SMILING FACE            |
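
As a rough illustration of how the rows above are read, the sketch below drops the prefix bits of each byte and joins the remaining payload bits to recover the code point. It is not this package's implementation: decodePayload is a hypothetical helper, and it assumes its input is one complete, well-formed sequence (it does no validation).

```go
package main

import "fmt"

// decodePayload recovers a code point from one complete UTF-8 sequence by
// dropping the prefix bits of each byte and joining the payload bits.
// Hypothetical helper for illustration only: it does not validate its input.
func decodePayload(b []byte) rune {
	if len(b) == 0 {
		return -1
	}
	var r rune
	switch len(b) {
	case 1:
		r = rune(b[0] & 0x7F) // 0xxxxxxx
	case 2:
		r = rune(b[0] & 0x1F) // 110xxxxx
	case 3:
		r = rune(b[0] & 0x0F) // 1110xxxx
	default:
		r = rune(b[0] & 0x07) // 11110xxx
	}
	for _, c := range b[1:] {
		r = r<<6 | rune(c&0x3F) // 10xxxxxx contributes 6 bits
	}
	return r
}

func main() {
	// The rows for U+06F5 and U+2261 from the table.
	fmt.Printf("U+%04X\n", decodePayload([]byte{0b11011011, 0b10110101}))
	fmt.Printf("U+%04X\n", decodePayload([]byte{0b11100010, 0b10001001, 0b10100001}))
}
```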

UTF-8 Versus ASCII

UTF-8 was (partially) designed to be backwards compatible with 7-bit ASCII.

Thus, all 7-bit ASCII is valid UTF-8.
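
A small check of this property, written against Go's standard "unicode/utf8" package rather than this one (so that it runs without extra dependencies). The helper isASCII is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether every byte of s is in the 7-bit ASCII range.
// Hypothetical helper for illustration.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	s := "Hello, UTF-8!"
	// For 7-bit ASCII text, the bytes are valid UTF-8 as-is, and each
	// character occupies exactly one byte.
	fmt.Println(isASCII(s), utf8.ValidString(s), len(s) == utf8.RuneCountInString(s))
}
```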

UTF-8 Encoding

At least as of 2003, Unicode fits into 21 bits, and thus UTF-8 was designed to support at most 21 bits of information.

This is done as described in the following table:

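| # of bytes | # bits for code point | 1st code point | last code point | byte 1   | byte 2   | byte 3   | byte 4   |
|------------|-----------------------|----------------|-----------------|----------|----------|----------|----------|
| 1          | 7                     | U+000000       | U+00007F        | 0xxxxxxx |          |          |          |
| 2          | 11                    | U+000080       | U+0007FF        | 110xxxxx | 10xxxxxx |          |          |
| 3          | 16                    | U+000800       | U+00FFFF        | 1110xxxx | 10xxxxxx | 10xxxxxx |          |
| 4          | 21                    | U+010000       | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |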
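
The table can be turned into code fairly directly. The following is a minimal sketch of an encoder that follows the four rows; the function name encode is hypothetical and this is not necessarily how this package does it. One detail that is not in the table: the surrogate range U+D800 to U+DFFF is not valid in UTF-8, so the sketch rejects it.

```go
package main

import "fmt"

// encode returns the UTF-8 bytes for r, one case per row of the table.
// Hypothetical helper; it returns nil for surrogates and out-of-range values.
func encode(r rune) []byte {
	switch {
	case r < 0:
		return nil
	case r <= 0x7F: // 7 bits: 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 11 bits: 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r)&0x3F}
	case r >= 0xD800 && r <= 0xDFFF: // surrogates are not encodable
		return nil
	case r <= 0xFFFF: // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	case r <= 0x10FFFF: // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{0xF0 | byte(r>>18), 0x80 | byte(r>>12)&0x3F, 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	}
	return nil
}

func main() {
	for _, r := range []rune{'A', '¡', '≡', '🙂'} {
		fmt.Printf("U+%04X -> % x\n", r, encode(r))
	}
}
```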

Documentation

Constants

const (
	RuneError = '\uFFFD' // Unicode Replacement Character (U+FFFD).
)

Variables

This section is empty.

Functions

func FormatBinary

func FormatBinary(r rune) string

FormatBinary returns a representation of a rune as a sequence of bytes, given in binary format.

Example

utf8.FormatBinary('۵')

// Outputs:
// <<0b11011011 ; 0b10110101>>
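
To make the output format concrete, here is a sketch that produces the same string using only Go's standard library. The name formatBinary is hypothetical, and this is not the package's own code. Only the two-byte form shown above is confirmed by the documentation; the layout for other lengths is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// formatBinary renders the UTF-8 bytes of r as "<<0b... ; 0b...>>".
// Hypothetical re-creation of the documented output format.
func formatBinary(r rune) string {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r) // standard library encoder
	parts := make([]string, 0, n)
	for _, b := range buf[:n] {
		parts = append(parts, fmt.Sprintf("0b%08b", b))
	}
	return "<<" + strings.Join(parts, " ; ") + ">>"
}

func main() {
	fmt.Println(formatBinary('۵'))
}
```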

func ReadRune

func ReadRune(reader io.Reader) (rune, int, error)

ReadRune reads a single UTF-8 encoded Unicode character from an io.Reader, and returns the Unicode character (as a Go rune) and the size of the rune.

Note that it returns the size-of-the-rune rather than the number-of-bytes-read. This is to match the documentation of io.RuneReader in Go's standard library:

“ReadRune reads a single encoded Unicode character and returns the rune and its size in bytes. If no character is available, err will be set.”

If ‘reader’ is nil then ReadRune will return an error that matches utf8.NilReaderError.

Example

Here is an example usage of ReadRune:

r, n, err := utf8.ReadRune(reader)
if nil != err {
	switch err.(type) {
	case utf8.NilReaderError:
		//@TODO
	case utf8.InvalidUTF8Error:
		//@TODO
	default:
		//@TODO
	}
}
if utf8.RuneError == r {
	//@TODO
}

Number Of Bytes

Note that a single UTF-8 encoded Unicode character could be more than one byte.

For example, the Unicode "≡" (IDENTICAL TO) character gets encoded using 3 bytes under UTF-8.
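
To show where the multi-byte reads come from, here is a minimal sketch of decoding one rune from an io.Reader: read the first byte, derive the sequence length from its prefix bits, then read the continuation bytes. It is an illustration only, not this package's implementation. The names readRune, readFirst, errNilReader and errInvalidUTF8 are hypothetical, and validation is delegated to the standard "unicode/utf8" package.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// Hypothetical sentinel errors for this sketch.
var (
	errNilReader   = errors.New("nil reader")
	errInvalidUTF8 = errors.New("invalid UTF-8")
)

// readRune reads one UTF-8 encoded rune and returns it with its size in bytes.
func readRune(reader io.Reader) (rune, int, error) {
	if reader == nil {
		return utf8.RuneError, 0, errNilReader
	}
	var buf [utf8.UTFMax]byte
	if _, err := io.ReadFull(reader, buf[:1]); err != nil {
		return utf8.RuneError, 0, err
	}
	// The prefix bits of the first byte give the length of the sequence.
	b := buf[0]
	size := 0
	switch {
	case b&0x80 == 0x00:
		size = 1
	case b&0xE0 == 0xC0:
		size = 2
	case b&0xF0 == 0xE0:
		size = 3
	case b&0xF8 == 0xF0:
		size = 4
	default:
		return utf8.RuneError, 1, errInvalidUTF8
	}
	if _, err := io.ReadFull(reader, buf[1:size]); err != nil {
		return utf8.RuneError, 0, err
	}
	// DecodeRune reports (RuneError, 1) for any malformed sequence.
	r, n := utf8.DecodeRune(buf[:size])
	if r == utf8.RuneError && n == 1 {
		return utf8.RuneError, 1, errInvalidUTF8
	}
	return r, n, nil
}

// readFirst is a convenience wrapper that reads the first rune of a string.
func readFirst(s string) (rune, int, error) {
	return readRune(strings.NewReader(s))
}

func main() {
	r, n, err := readFirst("≡ is three bytes long")
	fmt.Printf("%c %d %v\n", r, n, err)
}
```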

func RuneLength

func RuneLength(r rune) int

RuneLength returns the number of bytes in a UTF-8 encoding of this Unicode code point.

Example

length := utf8.RuneLength('A')

// length == 1

Example

length := utf8.RuneLength('r')

// length == 1

Example

length := utf8.RuneLength('¡')

// length == 2

Example

length := utf8.RuneLength('۵')

// length == 2
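
For reference, the lengths in these examples follow from the code point ranges of UTF-8. A minimal sketch of that mapping is below; runeLength is a hypothetical stand-in, and returning 0 for out-of-range values is an arbitrary choice of this sketch, not documented behavior of RuneLength.

```go
package main

import "fmt"

// runeLength maps a code point to its UTF-8 length using the range limits
// U+007F, U+07FF, U+FFFF and U+10FFFF. Hypothetical helper.
func runeLength(r rune) int {
	switch {
	case r < 0 || r > 0x10FFFF:
		return 0 // out of range; arbitrary choice for this sketch
	case r <= 0x7F:
		return 1
	case r <= 0x7FF:
		return 2
	case r <= 0xFFFF:
		return 3
	default:
		return 4
	}
}

func main() {
	for _, r := range []rune{'A', 'r', '¡', '۵', '≡', '🙂'} {
		fmt.Printf("%c %d\n", r, runeLength(r))
	}
}
```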

func WriteRune

func WriteRune(writer io.Writer, r rune) (int, error)

WriteRune writes a single UTF-8 encoded Unicode character and returns the number of bytes written.

If ‘writer’ is nil then WriteRune will return an error that matches utf8.NilWriterError.

Example

Here is an example usage of WriteRune:

n, err := utf8.WriteRune(writer, r)
if nil != err {
	switch err.(type) {
	case utf8.NilWriterError:
		//@TODO
	default:
		//@TODO
	}
}
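
For comparison, a function with the same shape can be sketched on top of Go's standard library. This is only an illustration of the behavior described above (nil check, encode, write, return the byte count), not this package's code; writeRune, encodeToString and errNilWriter are hypothetical names.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"unicode/utf8"
)

// errNilWriter is a hypothetical sentinel error for this sketch.
var errNilWriter = errors.New("nil writer")

// writeRune encodes r as UTF-8, writes it, and returns the bytes written.
func writeRune(writer io.Writer, r rune) (int, error) {
	if writer == nil {
		return 0, errNilWriter
	}
	var buf [utf8.UTFMax]byte
	n := utf8.EncodeRune(buf[:], r)
	return writer.Write(buf[:n])
}

// encodeToString writes r into an in-memory buffer and returns the result.
func encodeToString(r rune) (string, int, error) {
	var out bytes.Buffer
	n, err := writeRune(&out, r)
	return out.String(), n, err
}

func main() {
	s, n, err := encodeToString('🙂')
	fmt.Printf("%q %d %v\n", s, n, err)
}
```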

Types

type InvalidUTF8Error

type InvalidUTF8Error interface {
	error
	InvalidUTF8Error()
}

InvalidUTF8Error is a type of error that could be returned by the utf8.ReadRune() function, by the utf8.RuneReader.ReadRune() method, and by the utf8.RuneScanner.ReadRune() method.

Here is how one might use this type:

r, n, err := utf8.ReadRune(reader)
if nil != err {
	switch err.(type) {
	case utf8.InvalidUTF8Error:
		//@TODO
	default:
		//@TODO
	}
}

type NilReaderError

type NilReaderError interface {
	error
	NilReaderError()
}

type NilWriterError

type NilWriterError interface {
	error
	NilWriterError()
}
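
The three error types above share one shape: an interface that embeds error and adds a marker method. Here is a self-contained sketch of how such an interface can be matched, including through wrapped errors with errors.As. The concrete type internalInvalidUTF8 and the helper classify are hypothetical; the package's concrete error types are unexported and not shown in this documentation.

```go
package main

import (
	"errors"
	"fmt"
)

// InvalidUTF8Error mirrors the shape of the interface shown above: an error
// plus a marker method that only serves to make the type distinguishable.
type InvalidUTF8Error interface {
	error
	InvalidUTF8Error()
}

// internalInvalidUTF8 is a hypothetical concrete type satisfying the interface.
type internalInvalidUTF8 struct{}

func (internalInvalidUTF8) Error() string     { return "invalid UTF-8" }
func (internalInvalidUTF8) InvalidUTF8Error() {}

// errUnrelated is an ordinary error that lacks the marker method.
var errUnrelated = errors.New("unrelated")

// classify matches on the marker interface; errors.As also sees through
// errors wrapped with %w, which a plain type switch would not.
func classify(err error) string {
	var invalid InvalidUTF8Error
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &invalid):
		return "invalid UTF-8"
	default:
		return "other"
	}
}

func main() {
	wrapped := fmt.Errorf("reading header: %w", internalInvalidUTF8{})
	fmt.Println(classify(wrapped))
	fmt.Println(classify(errUnrelated))
}
```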

type RuneReader

type RuneReader struct {
	// contains filtered or unexported fields
}

A utf8.RuneReader implements the io.RuneReader interface by reading from an io.Reader.

func NewRuneReader

func NewRuneReader(reader io.Reader) *RuneReader

func WrapRuneReader

func WrapRuneReader(reader io.Reader) RuneReader

func (RuneReader) ReadRune

func (receiver RuneReader) ReadRune() (rune, int, error)

type RuneScanner

type RuneScanner struct {
	// contains filtered or unexported fields
}

A utf8.RuneScanner implements the io.RuneScanner interface by reading from an io.Reader.

func NewRuneScanner

func NewRuneScanner(reader io.Reader) *RuneScanner

func WrapRuneScanner

func WrapRuneScanner(reader io.Reader) RuneScanner

func (*RuneScanner) Buffered

func (receiver *RuneScanner) Buffered() int

Buffered returns the number of bytes the UTF-8 encoding of the current buffered rune takes up, if there is a buffered rune.

A buffered rune would come from someone calling .UnreadRune().

If there is no buffered rune then .Buffered() returns zero (0).

So, for example, if .UnreadRune() was called for the rune 'A' (U+0041), then .Buffered() would return 1.

Also, for example, if .UnreadRune() was called for the rune '۵' (U+06F5), then .Buffered() would return 2.

And, for example, if .UnreadRune() was called for the rune '≡' (U+2261), then .Buffered() would return 3.

And also, for example, if .UnreadRune() was called for the rune '🙂' (U+1F642), then .Buffered() would return 4.

This method is meant to be semantically the same as bufio.Reader.Buffered().
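
The behavior described above can be sketched with a scanner that keeps a single rune of pushback. This is an illustration under stated assumptions, not the package's implementation: the type runeScanner, its fields, newStringScanner and errNothingToUnread are all hypothetical, and the sketch reads from an io.RuneReader for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errNothingToUnread is a hypothetical sentinel error for this sketch.
var errNothingToUnread = errors.New("no rune to unread")

// runeScanner keeps one rune of pushback on top of an io.RuneReader.
type runeScanner struct {
	src       io.RuneReader
	last      rune // most recently read rune
	lastSize  int  // its UTF-8 size in bytes
	canUnread bool // true right after a successful ReadRune
	pending   bool // true while an unread rune waits to be re-read
}

func (s *runeScanner) ReadRune() (rune, int, error) {
	if s.pending {
		s.pending, s.canUnread = false, true
		return s.last, s.lastSize, nil
	}
	r, n, err := s.src.ReadRune()
	if err != nil {
		s.canUnread = false
		return r, n, err
	}
	s.last, s.lastSize, s.canUnread = r, n, true
	return r, n, nil
}

func (s *runeScanner) UnreadRune() error {
	if !s.canUnread {
		return errNothingToUnread
	}
	s.canUnread, s.pending = false, true
	return nil
}

// Buffered returns the encoded size of the pending rune, or 0 if none.
func (s *runeScanner) Buffered() int {
	if !s.pending {
		return 0
	}
	return s.lastSize
}

// newStringScanner builds a scanner over an in-memory string.
func newStringScanner(text string) *runeScanner {
	return &runeScanner{src: strings.NewReader(text)}
}

func main() {
	s := newStringScanner("۵A")
	s.ReadRune()
	s.UnreadRune()
	fmt.Println(s.Buffered()) // size of the unread rune '۵'
}
```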

func (*RuneScanner) ReadRune

func (receiver *RuneScanner) ReadRune() (rune, int, error)

func (*RuneScanner) UnreadRune

func (receiver *RuneScanner) UnreadRune() error

type RuneWriter

type RuneWriter struct {
	// contains filtered or unexported fields
}

RuneWriter writes UTF-8 encoded Unicode characters, one at a time.

func WrapRuneWriter

func WrapRuneWriter(writer io.Writer) RuneWriter

WrapRuneWriter wraps an io.Writer and returns a RuneWriter.

func (RuneWriter) WriteRune

func (receiver RuneWriter) WriteRune(r rune) (int, error)

WriteRune writes a single UTF-8 encoded Unicode character and returns the number of bytes written.
