utf8

Version: v0.0.0-...-e474e88 | Published: Mar 31, 2024 | License: MIT | Imports: 3 | Imported by: 9

Repository

sourcecode.social/reiver/go-utf8

README ¶

go-utf8

Package utf8 implements encoding and decoding of UTF-8, for the Go programming language.

This package is meant to be a replacement for Go's built-in "unicode/utf8" package.

Documentation

Online documentation, which includes examples, can be found at: http://godoc.org/sourcecode.social/reiver/go-utf8

Reading a Single UTF-8 Character

This is the simplest way of reading a single UTF-8 character.

var reader io.Reader

// ...

r, n, err := utf8.ReadRune(reader)

Writing a Single UTF-8 Character

This is the simplest way of writing a single UTF-8 character.

var writer io.Writer

// ...

var r rune

// ...

n, err := utf8.WriteRune(writer, r)

io.RuneReader

This is how you can create an io.RuneReader:

var reader io.Reader

// ...

var runeReader io.RuneReader = utf8.NewRuneReader(reader)

// ...

r, n, err := runeReader.ReadRune()

io.RuneScanner

This is how you can create an io.RuneScanner:

var reader io.Reader

// ...

var runeScanner io.RuneScanner = utf8.NewRuneScanner(reader)

// ...

r, n, err := runeScanner.ReadRune()

// ...

err = runeScanner.UnreadRune()

UTF-8

UTF-8 is a variable-length encoding of Unicode. The encoding of a single Unicode code point is 1 to 4 bytes long.

Some examples of UTF-8 encoding of Unicode code points are:

UTF-8 byte 1	UTF-8 byte 2	UTF-8 byte 3	UTF-8 byte 4	value	code point	decimal	binary	name
`0b0,1000001`				A	U+0041	65	`0b0000,0000,0100,0001`	LATIN CAPITAL LETTER A
`0b0,1110010`				r	U+0072	114	`0b0000,0000,0111,0010`	LATIN SMALL LETTER R
`0b110,00010`	`0b10,100001`			¡	U+00A1	161	`0b0000,0000,1010,0001`	INVERTED EXCLAMATION MARK
`0b110,11011`	`0b10,110101`			۵	U+06F5	1781	`0b0000,0110,1111,0101`	EXTENDED ARABIC-INDIC DIGIT FIVE
`0b1110,0010`	`0b10,000000`	`0b10,110001`		‱	U+2031	8241	`0b0010,0000,0011,0001`	PER TEN THOUSAND SIGN
`0b1110,0010`	`0b10,001001`	`0b10,100001`		≡	U+2261	8801	`0b0010,0010,0110,0001`	IDENTICAL TO
`0b11110,000`	`0b10,010000`	`0b10,001111`	`0b10,010101`	𐏕	U+000103D5	66517	`0b0001,0000,0011,1101,0101`	OLD PERSIAN NUMBER HUNDRED
`0b11110,000`	`0b10,011111`	`0b10,011001`	`0b10,000010`	🙂	U+0001F642	128578	`0b0001,1111,0110,0100,0010`	SLIGHTLY SMILING FACE

UTF-8 Versus ASCII

UTF-8 was designed, in part, to be backward compatible with 7-bit ASCII.

Thus, every 7-bit ASCII text is also valid UTF-8.
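
The ASCII compatibility claim can be checked mechanically. The following is a minimal sketch that uses Go's standard "unicode/utf8" package rather than this package; the helper name asciiRoundTrips is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// asciiRoundTrips reports whether every 7-bit ASCII value (0 through 127)
// encodes to exactly one UTF-8 byte that equals the ASCII value itself.
// (Hypothetical helper, not part of go-utf8.)
func asciiRoundTrips() bool {
	var buf [utf8.UTFMax]byte
	for c := rune(0); c <= 0x7F; c++ {
		n := utf8.EncodeRune(buf[:], c)
		if n != 1 || buf[0] != byte(c) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(asciiRoundTrips())
	fmt.Println(utf8.ValidString("plain ASCII text"))
}
```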

UTF-8 Encoding

Since at least 2003, every Unicode code point fits into 21 bits, so UTF-8 was designed to carry at most 21 bits of information per code point.

This is done as described in the following table:

# of bytes	# bits for code point	1st code point	last code point	byte 1	byte 2	byte 3	byte 4
1	7	U+000000	U+00007F	`0xxxxxxx`
2	11	U+000080	U+0007FF	`110xxxxx`	`10xxxxxx`
3	16	U+000800	U+00FFFF	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
4	21	U+010000	U+10FFFF	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
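
The bit layout in the table above can be sketched directly in code. The function encodeRune below is a hypothetical helper written for this illustration, not this package's implementation; it assumes its input is a valid code point in the range U+0000 to U+10FFFF and does not reject surrogates.

```go
package main

import "fmt"

// encodeRune applies the byte layout from the table to one code point.
// (Hypothetical helper; assumes 0 <= r <= 0x10FFFF.)
func encodeRune(r rune) []byte {
	switch {
	case r <= 0x7F: // 1 byte: 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 2 bytes: 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | byte(r&0x3F),
		}
	case r <= 0xFFFF: // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	default: // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte((r>>12)&0x3F),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	}
}

func main() {
	for _, r := range []rune{'A', '¡', '≡', '🙂'} {
		b := encodeRune(r)
		// Go's own rune-to-string conversion also produces UTF-8,
		// so it serves as a cross-check for the sketch.
		fmt.Printf("%U: % x (matches Go: %v)\n", r, b, string(b) == string(r))
	}
}
```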

Documentation ¶

Index ¶

Constants
func FormatBinary(r rune) string
func ReadRune(reader io.Reader) (rune, int, error)
func RuneLength(r rune) int
func WriteRune(writer io.Writer, r rune) (int, error)
type InvalidUTF8Error
type NilReaderError
type NilWriterError
type RuneReader
- func NewRuneReader(reader io.Reader) *RuneReader
- func WrapRuneReader(reader io.Reader) RuneReader
- func (receiver RuneReader) ReadRune() (rune, int, error)
type RuneScanner
- func NewRuneScanner(reader io.Reader) *RuneScanner
- func WrapRuneScanner(reader io.Reader) RuneScanner
type RuneWriter
- func WrapRuneWriter(writer io.Writer) RuneWriter
- func (receiver RuneWriter) WriteRune(r rune) (int, error)

Constants ¶

const (
	RuneError = '\uFFFD' // Unicode Replacement Character (U+FFFD).
)

Functions ¶

func FormatBinary ¶

func FormatBinary(r rune) string

FormatBinary returns a representation of a rune as a sequence of bytes, given in binary format.

Example

utf8.FormatBinary('۵')

// Outputs:
// <<0b11011011 ; 0b10110101>>
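
To make the output format concrete, here is a minimal sketch of how a string in that shape could be produced with Go's standard library. The function formatBinary is a hypothetical stand-in, not this package's FormatBinary; the separator and padding are assumptions inferred only from the single example above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// formatBinary renders the UTF-8 bytes of r as 8-digit binary literals,
// joined with " ; " and wrapped in "<<" and ">>".
// (Hypothetical stand-in; the format is inferred from one documented example.)
func formatBinary(r rune) string {
	var buf [utf8.UTFMax]byte
	n := utf8.EncodeRune(buf[:], r)

	parts := make([]string, 0, n)
	for _, b := range buf[:n] {
		parts = append(parts, fmt.Sprintf("0b%08b", b))
	}
	return "<<" + strings.Join(parts, " ; ") + ">>"
}

func main() {
	fmt.Println(formatBinary('۵')) // <<0b11011011 ; 0b10110101>>
}
```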

func ReadRune ¶

func ReadRune(reader io.Reader) (rune, int, error)

ReadRune reads a single UTF-8 encoded Unicode character from an io.Reader, and returns the Unicode character (as a Go rune) and the size of the rune.

Note that it returns the size of the rune rather than the number of bytes read. This matches the behavior described for io.RuneReader in Go's built-in "io" package:

“ReadRune reads a single encoded Unicode character and returns the rune and its size in bytes. If no character is available, err will be set.”

If ‘reader’ is nil then ReadRune will return an error that matches utf8.NilReaderError.

Example ¶

Here is an example usage of ReadRune:

r, n, err := utf8.ReadRune(reader)
if nil != err {
	switch err.(type) {
	case utf8.NilReaderError:
		//@TODO
	case utf8.InvalidUTF8Error:
		//@TODO
	default:
		//@TODO
	}
}
if utf8.RuneError == r {
	//@TODO
}

Number Of Bytes ¶

Note that a single UTF-8 encoded Unicode character could be more than one byte.

For example, the Unicode "≡" (IDENTICAL TO) character gets encoded using 3 bytes under UTF-8.
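
Because one character can span several bytes, reading a rune from an io.Reader means reading the lead byte first and then however many continuation bytes it announces. The following is a minimal sketch of that idea built on Go's standard library; readRune, firstRune, and errInvalidUTF8 are hypothetical names, and the error handling is an assumption rather than a description of this package's behavior.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical error used only by this sketch.
var errInvalidUTF8 = errors.New("invalid UTF-8")

// readRune reads one UTF-8 encoded code point from reader and returns
// the rune and its size in bytes. (Hypothetical sketch, not go-utf8's code.)
func readRune(reader io.Reader) (rune, int, error) {
	var buf [utf8.UTFMax]byte

	// Read the lead byte; its high bits announce the sequence length.
	if _, err := io.ReadFull(reader, buf[:1]); err != nil {
		return utf8.RuneError, 0, err
	}

	var size int
	switch {
	case buf[0]&0x80 == 0x00: // 0xxxxxxx
		size = 1
	case buf[0]&0xE0 == 0xC0: // 110xxxxx
		size = 2
	case buf[0]&0xF0 == 0xE0: // 1110xxxx
		size = 3
	case buf[0]&0xF8 == 0xF0: // 11110xxx
		size = 4
	default:
		return utf8.RuneError, 1, errInvalidUTF8
	}

	// Read the continuation bytes (none when size is 1).
	if _, err := io.ReadFull(reader, buf[1:size]); err != nil {
		return utf8.RuneError, 0, err
	}

	// Let the standard library validate and decode the sequence.
	r, n := utf8.DecodeRune(buf[:size])
	if r == utf8.RuneError && n == 1 {
		return utf8.RuneError, size, errInvalidUTF8
	}
	return r, n, nil
}

// firstRune is a convenience wrapper that reads the first rune of s.
func firstRune(s string) (rune, int, error) {
	return readRune(strings.NewReader(s))
}

func main() {
	r, n, err := firstRune("≡ is three bytes")
	fmt.Printf("%q %d %v\n", r, n, err)
}
```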

func RuneLength ¶

func RuneLength(r rune) int

RuneLength returns the number of bytes in a UTF-8 encoding of this Unicode code point.

Example

length := utf8.RuneLength('A')

// length == 1

Example

length := utf8.RuneLength('r')

// length == 1

Example

length := utf8.RuneLength('¡')

// length == 2

Example

length := utf8.RuneLength('۵')

// length == 2
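
The examples above stop at two bytes. As a minimal sketch of the full range, the hypothetical function runeLength below derives the length from the code point ranges in the UTF-8 Encoding table and compares it with the standard library's utf8.RuneLen. Returning 0 for out-of-range input is a choice made for this sketch, not documented behavior of RuneLength.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLength returns how many bytes the UTF-8 encoding of r takes,
// based only on the code point ranges. (Hypothetical sketch; it returns 0
// for values outside U+0000..U+10FFFF and does not special-case surrogates.)
func runeLength(r rune) int {
	switch {
	case r < 0:
		return 0
	case r <= 0x7F:
		return 1
	case r <= 0x7FF:
		return 2
	case r <= 0xFFFF:
		return 3
	case r <= 0x10FFFF:
		return 4
	default:
		return 0
	}
}

func main() {
	for _, r := range []rune{'A', '۵', '≡', '🙂'} {
		fmt.Printf("%U sketch=%d stdlib=%d\n", r, runeLength(r), utf8.RuneLen(r))
	}
}
```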

func WriteRune ¶

func WriteRune(writer io.Writer, r rune) (int, error)

WriteRune writes a single UTF-8 encoded Unicode character and returns the number of bytes written.

If ‘writer’ is nil then WriteRune will return an error that matches utf8.NilWriterError.

Example ¶

Here is an example usage of WriteRune:

n, err := utf8.WriteRune(writer, r)
if nil != err {
	switch err.(type) {
	case utf8.NilWriterError:
		//@TODO
	default:
		//@TODO
	}
}

Types ¶

type InvalidUTF8Error ¶

type InvalidUTF8Error interface {
	error
	InvalidUTF8Error()
}

InvalidUTF8Error is a type of error that could be returned by the utf8.ReadRune() function, by the utf8.RuneReader.ReadRune() method, and by the utf8.RuneScanner.ReadRune() method.

Here is how one might use this type:

r, n, err := utf8.ReadRune(reader)
if nil != err {
	switch err.(type) {
	case utf8.InvalidUTF8Error:
		//@TODO
	default:
		//@TODO
	}
}
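
The error types in this package are interfaces with a marker method, so callers match on the interface instead of on a concrete type. Here is a minimal, self-contained sketch of that pattern; invalidUTF8Error, internalInvalid, errOther, and classify are hypothetical names that mirror the shape of the interface above and are not taken from this package's source.

```go
package main

import (
	"errors"
	"fmt"
)

// invalidUTF8Error mirrors the shape of the interface documented above:
// an error plus a marker method. (Hypothetical local copy.)
type invalidUTF8Error interface {
	error
	InvalidUTF8Error()
}

// internalInvalid is a hypothetical concrete error that satisfies the
// marker interface.
type internalInvalid struct{}

func (internalInvalid) Error() string     { return "invalid UTF-8" }
func (internalInvalid) InvalidUTF8Error() {}

// errOther is an unrelated error used to exercise the default branch.
var errOther = errors.New("some other failure")

// classify shows how a type switch separates the marker interface
// from every other kind of error.
func classify(err error) string {
	switch err.(type) {
	case nil:
		return "ok"
	case invalidUTF8Error:
		return "invalid-utf8"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify(nil))
	fmt.Println(classify(internalInvalid{}))
	fmt.Println(classify(errOther))
}
```

A type switch only inspects the outermost error value. If the error might be wrapped, errors.As with a target of the interface type is the usual alternative.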

type NilReaderError ¶

type NilReaderError interface {
	error
	NilReaderError()
}

type NilWriterError ¶

type NilWriterError interface {
	error
	NilWriterError()
}

type RuneReader ¶

type RuneReader struct {
	// contains filtered or unexported fields
}

A utf8.RuneReader implements the io.RuneReader interface by reading from an io.Reader.

func NewRuneReader ¶

func NewRuneReader(reader io.Reader) *RuneReader

func WrapRuneReader ¶

func WrapRuneReader(reader io.Reader) RuneReader

func (RuneReader) ReadRune ¶

func (receiver RuneReader) ReadRune() (rune, int, error)

type RuneScanner ¶

type RuneScanner struct {
	// contains filtered or unexported fields
}

A utf8.RuneScanner implements the io.RuneScanner interface by reading from an io.Reader.

func NewRuneScanner ¶

func NewRuneScanner(reader io.Reader) *RuneScanner

func WrapRuneScanner ¶

func WrapRuneScanner(reader io.Reader) RuneScanner

func (*RuneScanner) Buffered ¶

func (receiver *RuneScanner) Buffered() int

Buffered returns the number of bytes the UTF-8 encoding of the current buffered rune takes up, if there is a buffered rune.

A buffered rune would come from someone calling .UnreadRune().

If there is no buffered rune then .Buffered() returns zero (0).

So, for example, if .UnreadRune() was called for the rune 'A' (U+0041), then .Buffered() would return 1.

Also, for example, if .UnreadRune() was called for the rune '۵' (U+06F5), then .Buffered() would return 2.

And, for example, if .UnreadRune() was called for the rune '≡' (U+2261), then .Buffered() would return 3.

And also, for example, if .UnreadRune() was called for the rune '🙂' (U+1F642), then .Buffered() would return 4.

This method has been made to be semantically the same as bufio.Reader.Buffered().
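
The interaction between .UnreadRune() and .Buffered() described above can be sketched with a one-rune pushback wrapper. The type pushbackScanner and the helper bufferedAfterUnread are hypothetical and are not this package's implementation; the sketch assumes that only the most recently read rune can be unread.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errNothingToUnread is a hypothetical error used only by this sketch.
var errNothingToUnread = errors.New("no rune to unread")

// pushbackScanner remembers the last rune read so that it can be
// handed out again after UnreadRune. (Hypothetical sketch.)
type pushbackScanner struct {
	src       io.RuneReader
	last      rune
	lastSize  int
	canUnread bool // true right after a successful ReadRune
	pending   bool // true while an unread rune is waiting
}

func (s *pushbackScanner) ReadRune() (rune, int, error) {
	if s.pending {
		s.pending = false
		s.canUnread = true
		return s.last, s.lastSize, nil
	}
	r, n, err := s.src.ReadRune()
	if err != nil {
		s.canUnread = false
		return r, n, err
	}
	s.last, s.lastSize, s.canUnread = r, n, true
	return r, n, nil
}

func (s *pushbackScanner) UnreadRune() error {
	if !s.canUnread {
		return errNothingToUnread
	}
	s.canUnread = false
	s.pending = true
	return nil
}

// Buffered returns the encoded size of the pending rune, or 0 if none.
func (s *pushbackScanner) Buffered() int {
	if s.pending {
		return s.lastSize
	}
	return 0
}

// bufferedAfterUnread reads the first rune of str, unreads it, and
// reports Buffered(). It returns -1 if either step fails.
func bufferedAfterUnread(str string) int {
	sc := &pushbackScanner{src: strings.NewReader(str)}
	if _, _, err := sc.ReadRune(); err != nil {
		return -1
	}
	if err := sc.UnreadRune(); err != nil {
		return -1
	}
	return sc.Buffered()
}

func main() {
	for _, str := range []string{"A", "۵", "≡", "🙂"} {
		fmt.Println(str, bufferedAfterUnread(str))
	}
}
```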

func (*RuneScanner) ReadRune ¶

func (receiver *RuneScanner) ReadRune() (rune, int, error)

func (*RuneScanner) UnreadRune ¶

func (receiver *RuneScanner) UnreadRune() error

type RuneWriter ¶

type RuneWriter struct {
	// contains filtered or unexported fields
}

RuneWriter writes single UTF-8 encoded Unicode characters to an io.Writer.

func WrapRuneWriter ¶

func WrapRuneWriter(writer io.Writer) RuneWriter

WrapRuneWriter wraps an io.Writer and returns a RuneWriter.

func (RuneWriter) WriteRune ¶

func (receiver RuneWriter) WriteRune(r rune) (int, error)

WriteRune writes a single UTF-8 encoded Unicode character and returns the number of bytes written.
