encoding

package
v0.0.0-...-d3199ed
Published: Dec 18, 2020 License: Apache-2.0 Imports: 10 Imported by: 0

README

A package for dealing with encodings

Rearranged from this package.

Documentation

Constants

const (
	FallbackFail  = iota // FallbackFail behavior causes GetEncoding to fail when it cannot find an encoding.
	FallbackASCII        // FallbackASCII behavior causes GetEncoding to fall back to a 7-bit ASCII encoding, if no other encoding can be found.
	FallbackUTF8         // FallbackUTF8 behavior causes GetEncoding to assume UTF-8 can pass unmodified upon failure. Note that this behavior is not recommended unless you are sure your terminal can cope with real UTF-8 sequences.
)
const (
	Sterling = '£'
	DArrow   = '↓'
	LArrow   = '←'
	RArrow   = '→'
	UArrow   = '↑'
	Bullet   = '·'
	Board    = '░'
	CkBoard  = '▒'
	Degree   = '°'
	Diamond  = '◆'
	GEqual   = '≥'
	Pi       = 'π'
	HLine    = '─'
	Lantern  = '§'
	Plus     = '┼'
	LEqual   = '≤'
	LLCorner = '└'
	LRCorner = '┘'
	NEqual   = '≠'
	PlMinus  = '±'
	S1       = '⎺'
	S3       = '⎻'
	S7       = '⎼'
	S9       = '⎽'
	Block    = '█'
	TTee     = '┬'
	RTee     = '┤'
	LTee     = '├'
	BTee     = '┴'
	ULCorner = '┌'
	URCorner = '┐'
	VLine    = '│'
	Space    = ' '
)

The names of these constants are chosen to match the Terminfo names, modulo case and with the ACS_ prefix dropped. These are the runes we provide extra special handling for, with ASCII fallbacks for terminals that lack them.
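As a rough illustration of what an ASCII fallback for these runes could look like, here is a minimal sketch. The asciiFallback table and the toASCII helper are hypothetical names, the substitute characters are chosen for illustration only, and none of this is taken from the package's own fallback table.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiFallback is a hypothetical substitution table for a few of the
// special runes. The substitutes are illustrative choices only.
var asciiFallback = map[rune]rune{
	'←': '<',
	'→': '>',
	'↑': '^',
	'↓': 'v',
	'─': '-',
	'│': '|',
	'┼': '+',
	'┌': '+',
	'┐': '+',
	'└': '+',
	'┘': '+',
}

// toASCII replaces every non-ASCII rune with its fallback, or with '?'
// when the table has no entry for it.
func toASCII(s string) string {
	var b strings.Builder
	for _, r := range s {
		if r < 0x80 {
			b.WriteRune(r)
			continue
		}
		if f, ok := asciiFallback[r]; ok {
			b.WriteRune(f)
		} else {
			b.WriteByte('?')
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toASCII("┌─┐ ← →")) // prints "+-+ < >"
}
```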

Variables

This section is empty.

Functions

func GetEncoding

func GetEncoding(charset string) encoding.Encoding

GetEncoding is used by Screen implementors who want to locate an encoding for the given character set name. Note that this will return nil for either the Unicode (UTF-8) or ASCII encodings, since we don't use encodings for them but instead have our own native methods.

func Register

func Register()

Register registers all known encodings. This is a short-cut to add full character set support to your program. Note that this can add several megabytes to your program's size, because some of the encodings are rather large (particularly those from East Asia).

func RegisterEncoding

func RegisterEncoding(charset string, enc encoding.Encoding)

RegisterEncoding may be called by the application to register an encoding. The presence of additional encodings will facilitate application usage with terminal environments where the I/O subsystem does not support Unicode. Windows systems use Unicode natively, and do not need any of the encoding subsystem when using Windows Console screens.

Please see the Go documentation for golang.org/x/text/encoding -- most of the common ones exist already as stock variables. For example, ISO8859-15 can be registered using the following code:

import "golang.org/x/text/encoding/charmap"
...
RegisterEncoding("ISO8859-15", charmap.ISO8859_15)

Aliases can be registered as well, for example "8859-15" could be an alias for "ISO8859-15".

For POSIX systems, the term package will check the environment variables LC_ALL, LC_CTYPE, and LANG (in that order) to determine the character set. These are expected to have the following pattern:

$language[.$codeset][@$variant]

We extract only the $codeset part, which will usually be something like UTF-8 or ISO8859-15 or KOI8-R. Note that if the locale is either "POSIX" or "C", then we assume US-ASCII (the POSIX 'portable character set') and assume all other characters are somehow invalid.
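The lookup order and the codeset extraction described above can be sketched roughly as follows. This is a standalone illustration rather than the package's code: firstLocale and codesetFromLocale are hypothetical helper names, and a locale that names no codeset is simply reported as "not found" here instead of guessing a default.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// firstLocale returns the first non-empty value among LC_ALL, LC_CTYPE
// and LANG, checked in that order. getenv is injected so the lookup can
// be exercised without touching the real environment.
func firstLocale(getenv func(string) string) string {
	for _, name := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		if v := getenv(name); v != "" {
			return v
		}
	}
	return ""
}

// codesetFromLocale extracts the $codeset part of a locale string.
// "POSIX" and "C" are treated as US-ASCII. ok is false when the locale
// names no codeset at all.
func codesetFromLocale(locale string) (codeset string, ok bool) {
	if locale == "POSIX" || locale == "C" {
		return "US-ASCII", true
	}
	// Drop the @variant suffix, if any.
	if i := strings.IndexByte(locale, '@'); i >= 0 {
		locale = locale[:i]
	}
	// Everything after the first '.' is the codeset.
	if i := strings.IndexByte(locale, '.'); i >= 0 && i+1 < len(locale) {
		return locale[i+1:], true
	}
	return "", false
}

func main() {
	locale := firstLocale(os.Getenv)
	if cs, ok := codesetFromLocale(locale); ok {
		fmt.Println("codeset:", cs)
	} else {
		fmt.Println("no codeset named in locale", locale)
	}
}
```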

Modern POSIX systems and terminal emulators may use UTF-8, and for those systems, this API is also unnecessary. For example, Darwin (Mac OS X) and modern Linux running modern xterm generally will work out of the box without any of this. Use of UTF-8 is recommended when possible, as it saves quite a lot of processing overhead.

Note that some encodings are quite large (for example GB18030, which covers all of Unicode), so the application size can be expected to increase quite a bit as each encoding is added. The East Asian encodings have been seen to add 100-200K per encoding to the application size.

func SetEncodingFallback

func SetEncodingFallback(fb Fallback)

SetEncodingFallback changes the behavior of GetEncoding when a suitable encoding is not found. The default is FallbackFail, which causes GetEncoding to simply return nil.
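To make the three fallback behaviors concrete, here is a small standalone sketch. It does not call this package: the lowercase constants and the resolve function are hypothetical stand-ins that work on charset names only, so the difference between the modes can be seen without any real encodings.

```go
package main

import "fmt"

// Local stand-ins for the three fallback behaviors (hypothetical names).
const (
	fallbackFail = iota
	fallbackASCII
	fallbackUTF8
)

// resolve reports which charset name would be used for the requested
// charset. registered holds the names that have an encoding available.
func resolve(registered map[string]bool, charset string, fb int) (name string, ok bool) {
	if registered[charset] {
		return charset, true
	}
	switch fb {
	case fallbackASCII:
		return "US-ASCII", true // fall back to 7-bit ASCII
	case fallbackUTF8:
		return "UTF-8", true // assume UTF-8 passes through unmodified
	default:
		return "", false // fail: nothing suitable was found
	}
}

func main() {
	registered := map[string]bool{"KOI8-R": true}
	for _, fb := range []int{fallbackFail, fallbackASCII, fallbackUTF8} {
		name, ok := resolve(registered, "EUC-JP", fb)
		fmt.Printf("mode %d: name=%q ok=%v\n", fb, name, ok)
	}
}
```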

Types

type CharMap

type CharMap struct {
	transform.NopResetter

	// The map between bytes and runes.  To indicate that a specific
	// byte value is invalid for a character set, use the rune
	// utf8.RuneError.  Values that are absent from this map will
	// be assumed to have the identity mapping -- that is the default
	// is to assume ISO8859-1, where all 8-bit characters have the same
	// numeric value as their Unicode runes.  (Not to be confused with
	// the UTF-8 values, which *will* be different for non-ASCII runes.)
	//
	// If no values less than RuneSelf are changed (or have non-identity
	// mappings), then the character set is assumed to be an ASCII
	// superset, and certain assumptions and optimizations become
	// available for ASCII bytes.
	Map map[byte]rune

	// The ReplacementChar is the byte value to use for substitution.
	// It should normally be ASCIISub for ASCII encodings.  This may be
	// unset (left to zero) for mappings that are strictly ASCII supersets.
	// In that case ASCIISub will be assumed instead.
	ReplacementChar byte
	// contains filtered or unexported fields
}

CharMap is a structure for setting up encodings for 8-bit character sets, for transforming between UTF-8 and that other character set. It has some ideas borrowed from golang.org/x/text/encoding/charmap, but it uses a different implementation. This implementation uses maps, and supports user-defined maps.

We do assume that a character map has a reasonable substitution character, and that valid encodings are stable (exactly a 1:1 map) and stateless (that is, there is no shift character or anything like that). Hence this approach will not work for many East Asian character sets.

Measurement shows little or no measurable difference in the performance of the two approaches. The difference was down to a couple of nsec/op, and no consistent pattern as to which ran faster. With the conversion to UTF-8 the code takes about 25 nsec/op. The conversion in the reverse direction takes about 100 nsec/op. The larger cost for conversion from UTF-8 is most likely due to the need to convert the UTF-8 byte stream to a rune before conversion.
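The map-based idea described above (identity mapping by default, explicit overrides in a map, a substitution byte for runes that cannot be encoded) can be sketched in a few lines. This is a simplified standalone illustration, not the CharMap implementation: decode and encode are hypothetical helpers, and the single override shown is the ISO8859-15 euro sign at byte 0xA4.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeByte maps one byte of an 8-bit character set to a rune. Bytes
// absent from m use the identity (ISO8859-1) mapping.
func decodeByte(m map[byte]rune, b byte) rune {
	if r, ok := m[b]; ok {
		return r // may be utf8.RuneError for an invalid byte
	}
	return rune(b)
}

// decode converts bytes in the 8-bit character set to a UTF-8 string.
func decode(m map[byte]rune, src []byte) string {
	out := make([]rune, 0, len(src))
	for _, b := range src {
		out = append(out, decodeByte(m, b))
	}
	return string(out)
}

// encode converts UTF-8 text to the 8-bit character set. Runes with no
// mapping are replaced by sub.
func encode(m map[byte]rune, s string, sub byte) []byte {
	// Build the reverse (rune to byte) table from the forward mapping.
	rev := make(map[rune]byte, 256)
	for i := 0; i < 256; i++ {
		if r := decodeByte(m, byte(i)); r != utf8.RuneError {
			rev[r] = byte(i)
		}
	}
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if b, ok := rev[r]; ok {
			out = append(out, b)
		} else {
			out = append(out, sub)
		}
	}
	return out
}

func main() {
	m := map[byte]rune{0xA4: '€'} // the one override used in this sketch
	fmt.Println(decode(m, []byte{'5', 0xA4}))
	fmt.Printf("% x\n", encode(m, "5€→", 0x1A))
}
```

A real transformer would build the reverse table once rather than on every call, which is presumably part of what Init is for; the sketch rebuilds it each time only to stay short.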

func (*CharMap) Init

func (c *CharMap) Init()

Init initializes internal values of a character map. This should be done early, to minimize the cost of allocation of transforms later. It is not strictly necessary however, as the allocation functions will arrange to call it if it has not already been done.

func (*CharMap) NewDecoder

func (c *CharMap) NewDecoder() *encoding.Decoder

NewDecoder returns a Decoder that converts from the 8-bit character set to UTF-8. Unknown mappings, if any, are mapped to '\uFFFD'.

func (*CharMap) NewEncoder

func (c *CharMap) NewEncoder() *encoding.Encoder

NewEncoder returns an Encoder that converts from UTF-8 to the 8-bit character set. Unknown mappings are mapped to 0x1A.

type Fallback

type Fallback int

Fallback describes how the system behaves when the locale requires a character set that we do not support. The system always supports UTF-8 and US-ASCII. On Windows consoles, UTF-16LE is also supported automatically. Other character sets must be added using the RegisterEncoding API. (Nearly all of them can be added as a group using the Register function.)
