utfutil

package module
v0.0.0-...-5d1025d Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 7, 2019 License: BSD-3-Clause Imports: 8 Imported by: 1

README

utfutil

FOSSA Status

Utilities to make it easier to read text encoded as UTF-16.

Dealing with UTF-16 files you receive from Windows.

Have you encountered this situation? Code that has worked for years suddenly breaks. It turns out someone tried to use it with a file that came from a MS-Windows system. Now this perfectly good code stops working. Looking at a hex dump you realize every other byte is \0. WTF? No, UTF. More specifically UTF-16LE with an optional BOM.

What does all that mean? Well, first you should read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.

Now you understand what the problem is, but how do you fix it? Well, you can spend a week trying to figure out how to use golang.org/x/text/encoding/unicode and you'll be able to decode UTF-16LE files. (No offense to the authors of that module. It is a fantastic module but if you aren't already an expert in Unicode encoding, it is pretty difficult to use.)

If you don't have a week, you can just use this module. Take the easy way out! Just change ioutil.ReadFile() to utfutil.ReadFile(). Everything will just work.

The goal of utfutl is to provide replacement functions that magically do the right thing. There is a demo program that shows how to use it called catutf.

utfutil.ReadFile() is the equivalent of ioutil.ReadFile()

OLD: Works with UTF8 and ASCII files:

		data, err := ioutil.ReadFile(filename)

NEW: Works if someone gives you a Windows UTF-16LE file occasionally but normally you are processing UTF8 files:

		data, err := utfutil.ReadFile(filename, utfutil.UTF8)
utfutil.OpenFile() is the equivalent of os.Open().

OLD: Works with UTF8 and ASCII files:

		data, err := os.Open(filename)

NEW: Works if someone gives you a file with a BOM:

		data, err := utfutil.OpenFile(filename, utfutil.HTML5)
utfutil.NewScanner() is for reading files line-by-line

It works like os.Open():

		s, err := utfutil.NewScanner(filename, utfutil.HTML5)

Encoding hints:

What's that second argument all about? utfutil.UTF8? utfutil.HTML5?

If a file has no BOM, it is impossible to guess the file encoding with 100% accuracy. Therefore, the 2nd parameter is an "EncodingHint" that specifies what to assume for BOM-less files.

UTF8        No BOM?  Assume UTF-8
UTF16LE     No BOM?  Assume UTF 16 Little Endian
UTF16BE     No BOM?  Assume UTF 16 Big Endian
WINDOWS = UTF16LE   (i.e. a reasonable guess if file is from MS-Windows)
POSIX   = UTF8      (i.e. a reasonable guess if file is from Unix or Unix-like systems)
HTML5   = UTF8      (i.e. a reasonable guess if file is from the web)

Future Directions

If someone writes a golang equivalent of uchatdet, I'll add a hint called "AUTO" which uses it. That would be awesome. Volunteers?

License

FOSSA Status

Documentation

Overview

Package utfutil provides methods that make it easy to read data in an UTF-encoding agnostic.

Index

Constants

View Source
const (
	// UTF8 indicates the specified encoding.
	UTF8 EncodingHint = iota
	// UTF16LE indicates the specified encoding.
	UTF16LE
	// UTF16BE indicates the specified encoding.
	UTF16BE
	// WINDOWS indicates that the file came from a MS-Windows system
	WINDOWS = UTF16LE
	// POSIX indicates that the file came from Unix or Unix-like systems
	POSIX = UTF8
	// HTML5 indicates that the file came from the web
	HTML5 = UTF8
)

Variables

This section is empty.

Functions

func BytesReader

func BytesReader(b []byte, d EncodingHint) io.Reader

BytesReader is a convenience function that takes a []byte and decodes them to UTF-8.

func ReadFile

func ReadFile(name string, d EncodingHint) ([]byte, error)

ReadFile is the equivalent of ioutil.ReadFile()

Types

type EncodingHint

type EncodingHint int

EncodingHint indicates the file's encoding if there is no BOM.

type UTFReadCloser

type UTFReadCloser interface {
	Read(p []byte) (n int, err error)
	Close() error
}

UTFReadCloser describes the utfutil ReadCloser structure.

func NewReader

func NewReader(r io.Reader, d EncodingHint) UTFReadCloser

NewReader wraps a Reader to decode Unicode to UTF-8 as it reads.

func OpenFile

func OpenFile(name string, d EncodingHint) (UTFReadCloser, error)

OpenFile is the equivalent of os.Open().

type UTFScanCloser

type UTFScanCloser interface {
	Buffer(buf []byte, max int)
	Bytes() []byte
	Err() error
	Scan() bool
	Split(split bufio.SplitFunc)
	Text() string
	Close() error
}

UTFScanCloser describes a new utfutil ScanCloser structure. It's similar to ReadCloser, but with a scanner instead of a reader.

func NewScanner

func NewScanner(name string, d EncodingHint) (UTFScanCloser, error)

NewScanner is a convenience function that takes a filename and returns a scanner.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL