utf8

Version: v2.0.1+incompatible | Published: Jul 19, 2022 | License: MIT | Imports: 3 | Imported by: 5

Repository

github.com/reiver/go-utf8s

README ¶

go-utf8

Package utf8 implements encoding and decoding of UTF-8, for the Go programming language.

This package is meant to be a replacement for Go's built-in "unicode/utf8" package.

Documentation

Online documentation, which includes examples, can be found at: http://godoc.org/github.com/reiver/go-utf8

Reading a Single UTF-8 Character

This is the simplest way of reading a single UTF-8 character.

var reader io.Reader

// ...

r, n, err := utf8.ReadRune(reader)

Writing a Single UTF-8 Character

This is the simplest way of writing a single UTF-8 character.

var writer io.Writer

// ...

var r rune

// ...

n, err := utf8.WriteRune(writer, r)

io.RuneReader

This is how you can create an io.RuneReader:

var reader io.Reader

// ...

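wrapped := utf8.RuneReaderWrap(reader)

// ReadRune has a pointer receiver, so it is &wrapped that satisfies io.RuneReader.
var runeReader io.RuneReader = &wrapped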

// ...

r, n, err := runeReader.ReadRune()

io.RuneScanner

This is how you can create an io.RuneScanner:

var reader io.Reader

// ...

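wrapped := utf8.RuneScannerWrap(reader)

// ReadRune and UnreadRune have pointer receivers, so it is &wrapped that satisfies io.RuneScanner.
var runeScanner io.RuneScanner = &wrapped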

// ...

r, n, err := runeScanner.ReadRune()

// ...

err = runeScanner.UnreadRune()

UTF-8

UTF-8 is a variable-length encoding of Unicode. The encoding of a single Unicode code point can be from 1 to 4 bytes long.

Some examples of UTF-8 encoding of Unicode code points are:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                    UTF-8 encoding                     ┃       ┃            ┃         ┃                            ┃                                  ┃
┣━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┫       ┃            ┃         ┃                            ┃                                  ┃
┃    byte 1   ┋    byte 2   ┋    byte 3   ┋    byte 4   ┃ value ┃ code point ┃ decimal ┃           binary           ┃               name               ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0b0,1000001 ┊             ┊             ┊             │   A   │     U+0041 │      65 │      0b0000,0000,0100,0001 │ LATIN CAPITAL LETTER A           │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b0,1110010 ┊             ┊             ┊             │   r   │     U+0072 │     114 │      0b0000,0000,0111,0010 │ LATIN SMALL LETTER R             │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b110,00010 ┊ 0b10,100001 ┊             ┊             │   ¡   │     U+00A1 │     161 │      0b0000,0000,1010,0001 │ INVERTED EXCLAMATION MARK        │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b110,11011 ┊ 0b10,110101 ┊             ┊             │   ۵   │     U+06F5 │    1781 │      0b0000,0110,1111,0101 │ EXTENDED ARABIC-INDIC DIGIT FIVE │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b1110,0010 ┊ 0b10,000000 ┊ 0b10,110001 ┊             │   ‱   │     U+2031 │    8241 │      0b0010,0000,0011,0001 │ PER TEN THOUSAND SIGN            │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b1110,0010 ┊ 0b10,001001 ┊ 0b10,100001 ┊             │   ≡   │     U+2261 │    8801 │      0b0010,0010,0110,0001 │ IDENTICAL TO                     │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b11110,000 ┊ 0b10,010000 ┊ 0b10,001111 ┊ 0b10,010101 │   𐏕   │ U+000103D5 │   66517 │ 0b0001,0000,0011,1101,0101 │ OLD PERSIAN NUMBER HUNDRED       │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b11110,000 ┊ 0b10,011111 ┊ 0b10,011001 ┊ 0b10,000010 │   🙂   │ U+0001F642 │  128578 │ 0b0001,1111,0110,0100,0010 │ SLIGHTLY SMILING FACE            │
└─────────────┴─────────────┴─────────────┴─────────────┴───────┴────────────┴─────────┴────────────────────────────┴──────────────────────────────────┘

UTF-8 Versus ASCII

UTF-8 was (partially) designed to be backwards compatible with 7-bit ASCII.

Thus, all 7-bit ASCII is valid UTF-8.
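
This claim is easy to check. The sketch below builds every 7-bit ASCII byte and validates the result as UTF-8. Note that it uses the Go standard library's "unicode/utf8" package for validation, not this package, and that isASCII is a hypothetical helper written for the illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8" // standard library, used here only for validation
)

// isASCII reports whether every byte of b has its high bit clear,
// i.e. whether b is 7-bit ASCII.
func isASCII(b []byte) bool {
	for _, c := range b {
		if c >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	// Build every 7-bit ASCII byte, 0x00 through 0x7F.
	all := make([]byte, 0, 128)
	for c := 0; c < 128; c++ {
		all = append(all, byte(c))
	}
	fmt.Println(isASCII(all), utf8.Valid(all)) // true true

	// The converse does not hold: valid UTF-8 need not be ASCII.
	nonASCII := []byte("≡")
	fmt.Println(isASCII(nonASCII), utf8.Valid(nonASCII)) // false true
}
```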

UTF-8 Encoding

Since at least 2003, Unicode has fit into 21 bits (code points up to U+10FFFF), so UTF-8 is designed to carry at most 21 bits of code-point information.

This is done as described in the following table:

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ # of bytes ┃ # bits for code point ┃ 1st code point ┃  last code point ┃  byte 1  ┃  byte 2  ┃  byte 3  ┃  byte 4  ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│     1      │            7          │    U+000000    │     U+00007F     │ 0xxxxxxx │          │          │          │
├────────────┼───────────────────────┼────────────────┼──────────────────┼──────────┼──────────┼──────────┼──────────┤
│     2      │           11          │    U+000080    │     U+0007FF     │ 110xxxxx │ 10xxxxxx │          │          │
├────────────┼───────────────────────┼────────────────┼──────────────────┼──────────┼──────────┼──────────┼──────────┤
│     3      │           16          │    U+000800    │     U+00FFFF     │ 1110xxxx │ 10xxxxxx │ 10xxxxxx │          │
├────────────┼───────────────────────┼────────────────┼──────────────────┼──────────┼──────────┼──────────┼──────────┤
│     4      │           21          │    U+010000    │     U+10FFFF     │ 11110xxx │ 10xxxxxx │ 10xxxxxx │ 10xxxxxx │
└────────────┴───────────────────────┴────────────────┴──────────────────┴──────────┴──────────┴──────────┴──────────┘
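
To make the table concrete, here is a minimal sketch of an encoder that follows it row by row. encodeRune is a hypothetical function written for this illustration; it is not this package's implementation, and it simplifies by not rejecting surrogate code points (U+D800 to U+DFFF).

```go
package main

import "fmt"

// encodeRune is an illustrative encoder that follows the table row by row.
// It returns nil for values outside U+0000..U+10FFFF. Simplification: it
// does not reject surrogates (U+D800..U+DFFF), which a strict encoder would.
func encodeRune(r rune) []byte {
	switch {
	case r < 0:
		return nil
	case r <= 0x7F: // 1 byte, 7 bits: 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 2 bytes, 11 bits: 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | byte(r&0x3F),
		}
	case r <= 0xFFFF: // 3 bytes, 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	case r <= 0x10FFFF: // 4 bytes, 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte((r>>12)&0x3F),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	default:
		return nil
	}
}

func main() {
	for _, r := range []rune{'A', '¡', '≡', '🙂'} {
		fmt.Printf("U+%04X -> % X\n", r, encodeRune(r))
	}
	// Output:
	// U+0041 -> 41
	// U+00A1 -> C2 A1
	// U+2261 -> E2 89 A1
	// U+1F642 -> F0 9F 99 82
}
```

Each continuation byte carries 6 bits of the code point, which is why the shifts step by 6 and the mask is 0x3F.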

Documentation ¶

Index ¶

Constants
func FormatBinary(r rune) string
func ReadRune(reader io.Reader) (rune, int, error)
func RuneLength(r rune) int
func WriteRune(writer io.Writer, r rune) (int, error)
type InvalidUTF8Complainer
type NilReaderComplainer
type NilWriterComplainer
type RuneReader
- func RuneReaderWrap(reader io.Reader) RuneReader
- func (receiver *RuneReader) ReadRune() (rune, int, error)
type RuneScanner
- func RuneScannerWrap(reader io.Reader) RuneScanner
- func (receiver *RuneScanner) ReadRune() (rune, int, error)
- func (receiver *RuneScanner) UnreadRune() error
type RuneWriter
- func RuneWriterWrap(writer io.Writer) RuneWriter
- func (receiver *RuneWriter) WriteRune(r rune) (int, error)

Constants ¶

const (
	RuneError = '\uFFFD' // Unicode Replacement Character (U+FFFD).
)

Functions ¶

func FormatBinary ¶

func FormatBinary(r rune) string

FormatBinary returns a representation of a rune as a sequence of bytes, given in binary format.

Example

utf8.FormatBinary('۵')

// Outputs:
// <<0b11011011 ; 0b10110101>>
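
For readers who want to see where those bytes come from, the following sketch produces the same string using only the Go standard library. formatBinary is a hypothetical stand-in, not this package's implementation; the output format is copied from the example above, and the assumption is that runes of other lengths use the same " ; " separator.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8" // standard library, used here only to produce the bytes
)

// formatBinary is a hypothetical stand-in for utf8.FormatBinary: it encodes r
// as UTF-8 and prints each byte as eight binary digits with a 0b prefix.
func formatBinary(r rune) string {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)

	parts := make([]string, 0, n)
	for _, b := range buf[:n] {
		parts = append(parts, fmt.Sprintf("0b%08b", b))
	}
	return "<<" + strings.Join(parts, " ; ") + ">>"
}

func main() {
	fmt.Println(formatBinary('۵')) // <<0b11011011 ; 0b10110101>>
}
```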

func ReadRune ¶

func ReadRune(reader io.Reader) (rune, int, error)

ReadRune reads a single UTF-8 encoded Unicode character from an io.Reader, and returns the Unicode character (as a Go rune) and the number of bytes read.

If ‘reader’ is nil then ReadRune will return an error that matches utf8.NilReaderComplainer.

Example ¶

Here is an example usage of ReadRune:

r, n, err := utf8.ReadRune(reader)
if nil != err {

	switch err.(type) {
	case utf8.NilReaderComplainer:
		//@TODO
	case utf8.InvalidUTF8Complainer:
		//@TODO
	default:
		//@TODO
	}

}
if utf8.RuneError == r {
	//@TODO
}

Number Of Bytes ¶

Note that a single UTF-8 encoded Unicode character could be more than one byte.

For example, the Unicode "≡" (IDENTICAL TO) character gets encoded using 3 bytes under UTF-8.

func RuneLength ¶

func RuneLength(r rune) int

RuneLength returns the number of bytes in a UTF-8 encoding of this Unicode code point.

Example

length := utf8.RuneLength('A')

// length == 1

Example

length := utf8.RuneLength('r')

// length == 1

Example

length := utf8.RuneLength('¡')

// length == 2

Example

length := utf8.RuneLength('۵')

// length == 2
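
The examples above stop at two bytes. As a sketch of the full range, the length can be derived directly from the code point ranges of the UTF-8 encoding. runeLength below is a hypothetical function, not this package's implementation; returning -1 for values outside U+0000 to U+10FFFF is an assumption, since this documentation does not say what RuneLength returns in that case.

```go
package main

import "fmt"

// runeLength is a hypothetical stand-in for utf8.RuneLength.
// Assumption: it returns -1 for values that are not valid code points.
func runeLength(r rune) int {
	switch {
	case r < 0:
		return -1
	case r <= 0x7F:
		return 1
	case r <= 0x7FF:
		return 2
	case r <= 0xFFFF:
		return 3
	case r <= 0x10FFFF:
		return 4
	default:
		return -1
	}
}

func main() {
	fmt.Println(runeLength('A'), runeLength('۵'), runeLength('≡'), runeLength('🙂')) // 1 2 3 4
}
```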

func WriteRune ¶

func WriteRune(writer io.Writer, r rune) (int, error)

WriteRune writes a single UTF-8 encoded Unicode character and returns the number of bytes written.

If ‘writer’ is nil then WriteRune will return an error that matches utf8.NilWriterComplainer.

Example ¶

Here is an example usage of WriteRune:

n, err := utf8.WriteRune(writer, r)
if nil != err {

	switch err.(type) {
	case utf8.NilWriterComplainer:
		//@TODO
	default:
		//TODO
	}

}

Types ¶

type InvalidUTF8Complainer ¶

type InvalidUTF8Complainer interface {
	error
	InvalidUTF8Complainer()
}

InvalidUTF8Complainer is a type of error that could be returned by the utf8.ReadRune() function, by the utf8.RuneReader.ReadRune() method, and by the utf8.RuneScanner.ReadRune() method.

Here is how one might use this type:

r, n, err := utf8.ReadRune(reader)
if nil != err {
	switch err.(type) {
	case utf8.InvalidUTF8Complainer:
		//@TODO
	default:
		//@TODO
	}
}

type NilReaderComplainer ¶

type NilReaderComplainer interface {
	error
	NilReaderComplainer()
}

type NilWriterComplainer ¶

type NilWriterComplainer interface {
	error
	NilWriterComplainer()
}

type RuneReader ¶

type RuneReader struct {
	// contains filtered or unexported fields
}

A utf8.RuneReader implements the io.RuneReader interface by reading from an io.Reader.

func RuneReaderWrap ¶

func RuneReaderWrap(reader io.Reader) RuneReader

func (*RuneReader) ReadRune ¶

func (receiver *RuneReader) ReadRune() (rune, int, error)

type RuneScanner ¶

type RuneScanner struct {
	// contains filtered or unexported fields
}

A utf8.RuneScanner implements the io.RuneScanner interface by reading from an io.Reader.

func RuneScannerWrap ¶

func RuneScannerWrap(reader io.Reader) RuneScanner

func (*RuneScanner) ReadRune ¶

func (receiver *RuneScanner) ReadRune() (rune, int, error)

func (*RuneScanner) UnreadRune ¶

func (receiver *RuneScanner) UnreadRune() error

type RuneWriter ¶

type RuneWriter struct {
	// contains filtered or unexported fields
}

RuneWriter writes UTF-8 encoded Unicode characters, one at a time.

func RuneWriterWrap ¶

func RuneWriterWrap(writer io.Writer) RuneWriter

RuneWriterWrap wraps an io.Writer and returns a RuneWriter.

func (*RuneWriter) WriteRune ¶

func (receiver *RuneWriter) WriteRune(r rune) (int, error)

WriteRune writes a single UTF-8 encoded Unicode character and returns the number of bytes written.
