isUTF8

command module
v0.0.0-...-bc4bfdd
Published: Feb 24, 2021 License: MIT Imports: 5 Imported by: 0

README

Detect whether a file is well-formed UTF-8 or not.

isUTF8 is written in Go and uses memory-mapped files to run as quickly as possible. It depends on the golang.org/x/sys/unix package and will probably run only on Unix-like systems (e.g., macOS, Linux). A portable and simpler but slower approach could use ordinary file I/O and utf8.Valid or utf8.ValidString.

On a 2016 MacBook Pro, isUTF8 checked a 1 GB file in around 1 second, about 30% faster than a nearly identical C program compiled with gcc's -O3 flag. Run times will vary depending on the system and how much of the file is already in the memory cache.

For information about well-formed UTF-8 see The Unicode Standard, Chapter 3 Conformance, Table 3-7 Well-Formed UTF-8 Byte Sequences.

Prerequisites

Go programming language.

golang.org/x/sys/unix package. It is not part of the standard Go installation, so it must be installed separately:

go get golang.org/x/sys/unix

Building

git clone https://github.com/mfuhr/isUTF8.git
cd isUTF8
go test
go build

To install under $GOPATH/bin:

go install

To see test coverage:

go test -coverprofile=coverage.out
go tool cover -func=coverage.out
go tool cover -html=coverage.out

Examples

$ ./isUTF8 testdata/test_utf8.txt
true testdata/test_utf8.txt
$ echo $?
0
$ ./isUTF8 testdata/test_latin1.txt
false testdata/test_latin1.txt
$ echo $?
1
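Because the result is also reported through the exit status, a calling program can use it without parsing the output. The sketch below shows one way to do that from Go with os/exec. The helper name checkWith is hypothetical, and it assumes that exit status 1 always means "not well-formed" and that any other non-zero status is an error; the examples above show only statuses 0 and 1, so that interpretation is a guess.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// checkWith runs tool on path and interprets its exit status:
// 0 means well-formed UTF-8, 1 means not well-formed.
// Assumption: any other failure is reported as an error.
func checkWith(tool, path string) (bool, error) {
	err := exec.Command(tool, path).Run()
	if err == nil {
		return true, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
		return false, nil
	}
	return false, err
}

func main() {
	// Check every file named on the command line using the built binary
	// in the current directory.
	for _, path := range os.Args[1:] {
		ok, err := checkWith("./isUTF8", path)
		if err != nil {
			fmt.Fprintln(os.Stderr, "check failed:", err)
			continue
		}
		fmt.Println(ok, path)
	}
}
```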

Status

In active development (June 2017). Behavior, especially the output, is subject to change.
