easychars

v0.0.2
Published: Jan 13, 2023 | License: MIT | Imports: 11 | Imported by: 0

README

easychars

Based on saintfish/chardet and golang.org/x/text/encoding, easychars makes it convenient to detect the charset of some content and convert that content to UTF-8.

Supported charsets

  • Unicode: UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE

  • Simplified Chinese: GB2312, GBK, GB18030 (includes GB2312 and GBK)

  • Traditional Chinese: Big5, EUC-TW

  • Japanese: EUC-JP, Shift_JIS, ISO-2022-JP

  • Korean: EUC-KR, ISO-2022-KR

  • Russian:

  • Others: ISO-8859-1, ISO-8859-2, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-9, Windows-1250, Windows-1251, Windows-1254, Windows-1255, Windows-1256 ...

For other charsets, call easychars.ToUtf8WithCharsetName to check whether they are supported.
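To make "convert content to UTF-8" concrete for one of the charsets listed above, here is a minimal standard-library sketch that decodes UTF-16LE bytes by hand. It does not use easychars, and decodeUTF16LE is a hypothetical helper written only for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE is a hypothetical helper (not part of easychars) that
// converts UTF-16LE bytes into a Go string, which is UTF-8 encoded.
func decodeUTF16LE(b []byte) (string, error) {
	if len(b)%2 != 0 {
		return "", errors.New("odd number of bytes in UTF-16 input")
	}
	// Read the input two bytes at a time as little-endian code units.
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	// utf16.Decode combines surrogate pairs into single runes.
	return string(utf16.Decode(units)), nil
}

func main() {
	// U+4F60 U+597D ("你好") encoded as UTF-16LE.
	src := []byte{0x60, 0x4F, 0x7D, 0x59}
	s, err := decodeUTF16LE(src)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

In practice the package is meant to spare you this kind of per-charset code; the sketch only shows what a single conversion involves.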

Example

package main

import (
    "fmt"
    "github.com/HeapStackTree/easychars"
    "os"
)

func ReadAndConvertFile(path string, charsetName string) (contentInUtf8 []byte, res *easychars.Result, err error) {
    res = &easychars.Result{
        Charset:     "unknown",
        Language:    "unknown",
        Confidence:  0,
        Convertible: false,
    }

    content, err := os.ReadFile(path)
    if err != nil {
        return
    }
    if charsetName == "" {
        contentInUtf8, res, err = easychars.DetectAndConvertToUtf8(content)
    } else {
        contentInUtf8, err = easychars.ToUtf8WithCharsetName(content, charsetName)
        if err == nil {
            res.Charset = charsetName
            res.Confidence = 100
            res.Convertible = true
        }
    }
    return
}

func main() {
    path := "tests/GB2312/_mozilla_bug171813_text.html"

    // use charset name if you are sure about it
    content, res, err := ReadAndConvertFile(path, "")
    if err != nil {
        fmt.Println("error:", err)
        return
    }

    // skip the leading ASCII part
    var gbkLoc int
    for i, v := range content {
        if v >= 0x80 {
            gbkLoc = i
            break
        }
    }

    fmt.Printf("Path: %s\nCharset: %s\nLanguage: %s\nConfidence: %d\nConvertible: %t\nContent: %s\n", path, res.Charset, res.Language, res.Confidence, res.Convertible, content[gbkLoc:])
    // Output should be:
    // Path: tests/GB2312/_mozilla_bug171813_text.html
    // Charset: GB-18030
    // Language: zh
    // Confidence: 100
    // Convertible: true
    // Content: 搜狐在线</b></font></a></div> ...
}

See the godoc for the other functions.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetEncodingFromCharsetName

func GetEncodingFromCharsetName(name string) (e encoding.Encoding, err error)

GetEncodingFromCharsetName returns the encoding.Encoding for the given charset name (case insensitive).

It returns errInvalidName if the package cannot find a corresponding encoding.Encoding.

Charset name reference:

https://encoding.spec.whatwg.org/#names-and-labels

http://www.iana.org/assignments/character-sets/character-sets.xhtml
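The case-insensitive lookup described above can be sketched with the standard library alone. This is not the package's implementation: canonicalName and the tiny labels table are hypothetical, and errInvalidName is redefined locally only to mirror the error name in the docs.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errInvalidName mirrors the error name mentioned in the docs; this local
// definition is an assumption made for the sketch.
var errInvalidName = errors.New("invalid charset name")

// labels is a hypothetical, deliberately tiny table that maps lowercase
// labels to a canonical charset name.
var labels = map[string]string{
	"utf-8":     "UTF-8",
	"utf8":      "UTF-8",
	"big5":      "Big5",
	"euc-kr":    "EUC-KR",
	"shift_jis": "Shift_JIS",
	"sjis":      "Shift_JIS",
}

// canonicalName normalizes the label (trim, lowercase) before the lookup,
// which is what makes the match case insensitive.
func canonicalName(name string) (string, error) {
	key := strings.ToLower(strings.TrimSpace(name))
	if c, ok := labels[key]; ok {
		return c, nil
	}
	return "", errInvalidName
}

func main() {
	for _, n := range []string{"UTF8", " Shift_JIS ", "no-such-charset"} {
		c, err := canonicalName(n)
		fmt.Printf("%q -> %q, err=%v\n", n, c, err)
	}
}
```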

func IsValidUTF8

func IsValidUTF8(content []byte) bool

IsValidUTF8 reports whether content is valid UTF-8.
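As a rough illustration of what such a check distinguishes, the sketch below uses the standard library's utf8.Valid. Whether IsValidUTF8 is implemented this way is an assumption; isValid is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValid is a hypothetical stand-in built on the standard library;
// it is not the package's own IsValidUTF8.
func isValid(content []byte) bool {
	return utf8.Valid(content)
}

func main() {
	// Go string literals are UTF-8, so this slice is valid UTF-8.
	utf8Bytes := []byte("搜狐")
	// A double-byte sequence of the kind found in GBK text. 0xCB starts a
	// two-byte UTF-8 sequence, but 0xD1 is not a continuation byte.
	legacyBytes := []byte{0xCB, 0xD1, 0xBA, 0xFC}

	fmt.Println(isValid(utf8Bytes), isValid(legacyBytes))
}
```

A check like this is a cheap first step: content that already passes needs no conversion, while content that fails must be in some other charset.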

func ToUtf8WithCharsetName

func ToUtf8WithCharsetName(content []byte, charsetName string) ([]byte, error)

ToUtf8WithCharsetName converts content to UTF-8 using the charset identified by charsetName.

It returns errInvalidName if the charset name is not valid, or errWrongDecoder if the content cannot be decoded by the corresponding Decoder.

Charset name reference:

https://encoding.spec.whatwg.org/#names-and-labels

http://www.iana.org/assignments/character-sets/character-sets.xhtml
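The two failure modes described above can be mimicked in a small standard-library sketch. It is an assumption-laden stand-in, not the package's code: toUTF8 is hypothetical, knows only two charsets, and redefines errInvalidName and errWrongDecoder locally.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// Local stand-ins for the error names mentioned in the docs (assumptions).
var (
	errInvalidName  = errors.New("invalid charset name")
	errWrongDecoder = errors.New("content cannot be decoded with this charset")
)

// toUTF8 is a hypothetical stand-in that supports only UTF-8 and ISO-8859-1.
func toUTF8(content []byte, charsetName string) ([]byte, error) {
	switch strings.ToLower(charsetName) {
	case "utf-8":
		// The name is known, but the bytes may still not match it.
		if !utf8.Valid(content) {
			return nil, errWrongDecoder
		}
		return content, nil
	case "iso-8859-1":
		// Every ISO-8859-1 byte maps to the code point of the same value.
		runes := make([]rune, len(content))
		for i, b := range content {
			runes[i] = rune(b)
		}
		return []byte(string(runes)), nil
	default:
		return nil, errInvalidName
	}
}

func main() {
	// "café" in ISO-8859-1: 0xE9 is é.
	out, err := toUTF8([]byte{0x63, 0x61, 0x66, 0xE9}, "ISO-8859-1")
	fmt.Printf("%s %v\n", out, err)

	_, err = toUTF8([]byte{0xE9}, "utf-8")
	fmt.Println(errors.Is(err, errWrongDecoder))

	_, err = toUTF8(nil, "no-such-charset")
	fmt.Println(errors.Is(err, errInvalidName))
}
```

Callers can tell the two cases apart with errors.Is: an unknown name means the charset is unsupported, while a decoding failure suggests the content is not actually in the named charset.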

func ToUtf8WithDecoder

func ToUtf8WithDecoder(content []byte, d Decoder) ([]byte, error)

ToUtf8WithDecoder converts content to UTF-8 using the given Decoder.

func ToUtf8WithEncoding

func ToUtf8WithEncoding(content []byte, e encoding.Encoding) ([]byte, error)

ToUtf8WithEncoding converts content to UTF-8 using the given encoding.Encoding.

Types

type Decoder

type Decoder interface {
	transform.Transformer
}

Decoder is an interface that embeds transform.Transformer.

func GetDecoderFromCharsetName

func GetDecoderFromCharsetName(charsetName string) (decoder Decoder, err error)

GetDecoderFromCharsetName returns the Decoder for the given charset name (case insensitive).

It returns errInvalidName if the package cannot find a corresponding Decoder.

Reference: http://www.iana.org/assignments/character-sets/character-sets.xhtml

type Result

type Result struct {
	// IANA name of the detected charset.
	Charset string
	// IANA name of the detected language. It may be empty for some charsets.
	Language string
	// Confidence of the Result. Scale from 1 to 100. The bigger, the more confident.
	Confidence int
	// Decoder converts content in Result.Charset to UTF-8. It defaults to encoding.Nop.NewDecoder(), which leaves the content unchanged.
	Decoder transform.Transformer
	// Whether the charset can be converted by this package
	Convertible bool
}

Result contains all the information that the charset detector gives.

func DetectAll

func DetectAll(content []byte) (results []*Result, err error)

DetectAll returns all chardet.Results which have non-zero Confidence. The Results are sorted by Confidence in descending order.

It behaves like chardet.NewTextDetector().DetectAll() from saintfish/chardet, but also stores the matching Decoder in each Result.
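The filtering and ordering described above can be sketched as follows. The result type is a hypothetical, trimmed-down stand-in for Result, rank is not a function of this package, and the confidence values are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// result is a hypothetical, trimmed-down stand-in for Result.
type result struct {
	Charset    string
	Confidence int
}

// rank drops zero-confidence candidates and sorts the rest by Confidence,
// highest first.
func rank(in []result) []result {
	out := make([]result, 0, len(in))
	for _, r := range in {
		if r.Confidence > 0 {
			out = append(out, r)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Confidence > out[j].Confidence
	})
	return out
}

func main() {
	// Made-up candidates, not real detector output.
	candidates := []result{
		{Charset: "Big5", Confidence: 10},
		{Charset: "GB-18030", Confidence: 100},
		{Charset: "EUC-KR", Confidence: 0},
		{Charset: "Shift_JIS", Confidence: 42},
	}
	for _, r := range rank(candidates) {
		fmt.Printf("%s %d\n", r.Charset, r.Confidence)
	}
}
```

With results ordered this way, the first element is the candidate with the highest Confidence.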

func DetectAndConvertToUtf8

func DetectAndConvertToUtf8(content []byte) (convertedContent []byte, res *Result, err error)

DetectAndConvertToUtf8 detects the charset of content and converts the content to UTF-8.

func DetectEncoding

func DetectEncoding(content []byte) (result *Result, err error)

DetectEncoding returns the Result with the highest Confidence.
