utf8

Version: v2.0.1+incompatible
Published: Jul 19, 2022 License: MIT Imports: 3 Imported by: 5

README

go-utf8

Package utf8 implements encoding and decoding of UTF-8, for the Go programming language.

This package is meant to be a replacement for Go's built-in "unicode/utf8" package.

Documentation

Online documentation, which includes examples, can be found at: http://godoc.org/github.com/reiver/go-utf8

Reading a Single UTF-8 Character

This is the simplest way of reading a single UTF-8 character.

var reader io.Reader

// ...

r, n, err := utf8.ReadRune(reader)

Writing a Single UTF-8 Character

This is the simplest way of writing a single UTF-8 character.

var writer io.Writer

// ...

var r rune

// ...

n, err := utf8.WriteRune(writer, r)

io.RuneReader

This is how you can create an io.RuneReader:

var reader io.Reader

// ...

rr := utf8.RuneReaderWrap(reader)
var runeReader io.RuneReader = &rr // ReadRune has a pointer receiver

// ...

r, n, err := runeReader.ReadRune()

io.RuneScanner

This is how you can create an io.RuneScanner:

var reader io.Reader

// ...

rs := utf8.RuneScannerWrap(reader)
var runeScanner io.RuneScanner = &rs // ReadRune and UnreadRune have pointer receivers

// ...

r, n, err := runeScanner.ReadRune()

// ...

err = runeScanner.UnreadRune()

UTF-8

UTF-8 is a variable-length encoding of Unicode. The encoding of a single Unicode code point can be from 1 to 4 bytes long.

Some examples of UTF-8 encoding of Unicode code points are:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                    UTF-8 encoding                     ┃       ┃            ┃         ┃                            ┃                                  ┃
┣━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┫       ┃            ┃         ┃                            ┃                                  ┃
┃    byte 1   ┋    byte 2   ┋    byte 3   ┋    byte 4   ┃ value ┃ code point ┃ decimal ┃           binary           ┃               name               ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0b0,1000001 ┊             ┊             ┊             │   A   │     U+0041 │      65 │      0b0000,0000,0100,0001 │ LATIN CAPITAL LETTER A           │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b0,1110010 ┊             ┊             ┊             │   r   │     U+0072 │     114 │      0b0000,0000,0111,0010 │ LATIN SMALL LETTER R             │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b110,00010 ┊ 0b10,100001 ┊             ┊             │   ¡   │     U+00A1 │     161 │      0b0000,0000,1010,0001 │ INVERTED EXCLAMATION MARK        │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b110,11011 ┊ 0b10,110101 ┊             ┊             │   ۵   │     U+06F5 │    1781 │      0b0000,0110,1111,0101 │ EXTENDED ARABIC-INDIC DIGIT FIVE │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b1110,0010 ┊ 0b10,000000 ┊ 0b10,110001 ┊             │   ‱   │     U+2031 │    8241 │      0b0010,0000,0011,0001 │ PER TEN THOUSAND SIGN            │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b1110,0010 ┊ 0b10,001001 ┊ 0b10,100001 ┊             │   ≡   │     U+2261 │    8801 │      0b0010,0010,0110,0001 │ IDENTICAL TO                     │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b11110,000 ┊ 0b10,010000 ┊ 0b10,001111 ┊ 0b10,010101 │   𐏕   │ U+000103D5 │   66517 │ 0b0001,0000,0011,1101,0101 │ OLD PERSIAN NUMBER HUNDRED       │
├─────────────┼─────────────┼─────────────┼─────────────┼───────┼────────────┼─────────┼────────────────────────────┼──────────────────────────────────┤
│ 0b11110,000 ┊ 0b10,011111 ┊ 0b10,011001 ┊ 0b10,000010 │   🙂   │ U+0001F642 │  128578 │ 0b0001,1111,0110,0100,0010 │ SLIGHTLY SMILING FACE            │
└─────────────┴─────────────┴─────────────┴─────────────┴───────┴────────────┴─────────┴────────────────────────────┴──────────────────────────────────┘

UTF-8 Versus ASCII

UTF-8 was (partially) designed to be backwards compatible with 7-bit ASCII.

Thus, all 7-bit ASCII is valid UTF-8.

UTF-8 Encoding

At least as of 2003, Unicode fits into 21 bits, and thus UTF-8 was designed to support at most 21 bits of information.

This is done as described in the following table:

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ # of bytes ┃ # bits for code point ┃ 1st code point ┃  last code point ┃  byte 1  ┃  byte 2  ┃  byte 3  ┃  byte 4  ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│     1      │            7          │    U+000000    │     U+00007F     │ 0xxxxxxx │          │          │          │
├────────────┼───────────────────────┼────────────────┼──────────────────┼──────────┼──────────┼──────────┼──────────┤
│     2      │           11          │    U+000080    │     U+0007FF     │ 110xxxxx │ 10xxxxxx │          │          │
├────────────┼───────────────────────┼────────────────┼──────────────────┼──────────┼──────────┼──────────┼──────────┤
│     3      │           16          │    U+000800    │     U+00FFFF     │ 1110xxxx │ 10xxxxxx │ 10xxxxxx │          │
├────────────┼───────────────────────┼────────────────┼──────────────────┼──────────┼──────────┼──────────┼──────────┤
│     4      │           21          │    U+010000    │     U+10FFFF     │ 11110xxx │ 10xxxxxx │ 10xxxxxx │ 10xxxxxx │
└────────────┴───────────────────────┴────────────────┴──────────────────┴──────────┴──────────┴──────────┴──────────┘
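
The bit layout in the table can be sketched directly in Go. The encodeRune function below is a hypothetical helper written for this illustration, not this package's implementation; it assumes a valid code point in the range U+0000 to U+10FFFF and does not reject surrogates.

```go
package main

import "fmt"

// encodeRune is a hypothetical helper that places the bits of a code
// point into 1 to 4 bytes, following the table above.
// It assumes 0 <= r <= 0x10FFFF and performs no validation.
func encodeRune(r rune) []byte {
	switch {
	case r <= 0x7F: // 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | byte(r)&0x3F,
		}
	case r <= 0xFFFF: // 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte(r>>6)&0x3F,
			0x80 | byte(r)&0x3F,
		}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte(r>>12)&0x3F,
			0x80 | byte(r>>6)&0x3F,
			0x80 | byte(r)&0x3F,
		}
	}
}

func main() {
	for _, r := range []rune{'A', '¡', '≡', '🙂'} {
		fmt.Printf("U+%04X -> % X\n", r, encodeRune(r))
	}
}
```

Each continuation byte carries 6 bits of the code point, which is why the bit counts in the table are 7, 5+6, 4+6+6, and 3+6+6+6.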

Documentation

Constants

const (
	RuneError = '\uFFFD' // Unicode Replacement Character (U+FFFD).
)

Functions

func FormatBinary

func FormatBinary(r rune) string

FormatBinary returns a representation of a rune as a sequence of bytes, given in binary format.

Example

utf8.FormatBinary('۵')

// Outputs:
// <<0b11011011 ; 0b10110101>>
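
To make the output format concrete, here is a minimal sketch that produces the same text using only the standard library. The formatBinary function is a hypothetical stand-in; how this package builds the string internally is not shown in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// formatBinary is a hypothetical stand-in that mimics the output
// format documented above. It is not this package's implementation.
func formatBinary(r rune) string {
	encoded := []byte(string(r)) // converting a rune to a string yields its UTF-8 bytes
	parts := make([]string, len(encoded))
	for i, b := range encoded {
		parts[i] = fmt.Sprintf("0b%08b", b) // 8 binary digits per byte
	}
	return "<<" + strings.Join(parts, " ; ") + ">>"
}

func main() {
	fmt.Println(formatBinary('۵'))
}
```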

func ReadRune

func ReadRune(reader io.Reader) (rune, int, error)

ReadRune reads a single UTF-8 encoded Unicode character from an io.Reader, and returns the Unicode character (as a Go rune) and the number of bytes read.

If ‘reader’ is nil then ReadRune will return an error that matches utf8.NilReaderComplainer.

Example

Here is an example usage of ReadRune:

r, n, err := utf8.ReadRune(reader)
if nil != err {
	switch err.(type) {
	case utf8.NilReaderComplainer:
		//@TODO
	case utf8.InvalidUTF8Complainer:
		//@TODO
	default:
		//@TODO
	}
}
if utf8.RuneError == r {
	//@TODO
}

Number Of Bytes

Note that a single UTF-8 encoded Unicode character could be more than one byte.

For example, the Unicode "≡" (IDENTICAL TO) character gets encoded using 3 bytes under UTF-8.

func RuneLength

func RuneLength(r rune) int

RuneLength returns the number of bytes in a UTF-8 encoding of this Unicode code point.

Example

length := utf8.RuneLength('A')

// length == 1

Example

length := utf8.RuneLength('r')

// length == 1

Example

length := utf8.RuneLength('¡')

// length == 2

Example

length := utf8.RuneLength('۵')

// length == 2

func WriteRune

func WriteRune(writer io.Writer, r rune) (int, error)

WriteRune writes a single UTF-8 encoded Unicode character and returns the number of bytes written.

If ‘writer’ is nil then WriteRune will return an error that matches utf8.NilWriterComplainer.

Example

Here is an example usage of WriteRune:

n, err := utf8.WriteRune(writer, r)
if nil != err {

	switch err.(type) {
	case utf8.NilWriterComplainer:
		//@TODO
	default:
		//@TODO
	}

}

Types

type InvalidUTF8Complainer

type InvalidUTF8Complainer interface {
	error
	InvalidUTF8Complainer()
}

InvalidUTF8Complainer is a type of error that could be returned by the utf8.ReadRune() function, by the utf8.RuneReader.ReadRune() method, and by the utf8.RuneScanner.ReadRune() method.

Here is how one might use this type:

r, n, err := utf8.ReadRune(reader)
if nil != err {
	switch err.(type) {
	case utf8.InvalidUTF8Complainer:
		//@TODO
	default:
		//@TODO
	}
}

type NilReaderComplainer

type NilReaderComplainer interface {
	error
	NilReaderComplainer()
}

type NilWriterComplainer

type NilWriterComplainer interface {
	error
	NilWriterComplainer()
}

type RuneReader

type RuneReader struct {
	// contains filtered or unexported fields
}

A utf8.RuneReader implements the io.RuneReader interface by reading from an io.Reader.

func RuneReaderWrap

func RuneReaderWrap(reader io.Reader) RuneReader

func (*RuneReader) ReadRune

func (receiver *RuneReader) ReadRune() (rune, int, error)

type RuneScanner

type RuneScanner struct {
	// contains filtered or unexported fields
}

A utf8.RuneScanner implements the io.RuneScanner interface by reading from an io.Reader.

func RuneScannerWrap

func RuneScannerWrap(reader io.Reader) RuneScanner

func (*RuneScanner) ReadRune

func (receiver *RuneScanner) ReadRune() (rune, int, error)

func (*RuneScanner) UnreadRune

func (receiver *RuneScanner) UnreadRune() error

type RuneWriter

type RuneWriter struct {
	// contains filtered or unexported fields
}

RuneWriter writes UTF-8 encoded Unicode characters, one at a time.

func RuneWriterWrap

func RuneWriterWrap(writer io.Writer) RuneWriter

RuneWriterWrap wraps an io.Writer and returns a RuneWriter.

func (*RuneWriter) WriteRune

func (receiver *RuneWriter) WriteRune(r rune) (int, error)

WriteRune writes a single UTF-8 encoded Unicode character and returns the number of bytes written.
