charsetx

package module
v0.0.0-...-58a84c9 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 15, 2017 License: MIT Imports: 12 Imported by: 0

README

charsetx

GoDoc Go Report Card Build Status

charsetx detects charset encoding of an HTML document, and convert a non-UTF8 page body to UTF8 string.

There are 3 steps for charset detection:

  1. Return the result of charset.DetermineEncoding() if certain is true.
  2. Return the result of chardet.Detector.DetectBest() if Confidence is 100.
  3. Return charset in Content-Type meta tag if exists.

If all 3 steps fails, it returns error.

Install

go get -u github.com/philipjkim/charsetx

Example

Getting UTF-8 string of body for given URL
// Invalid UTF-8 characters are discarded to give a result 
// rather than giving error 
// if the second bool param is set to true.
r, err := charsetx.GetUTF8BodyFromURL("http://www.godoc.org", false)
if err != nil {
    fmt.Println(err)
    return
}

fmt.Println(r)

If you want to reuse or customize *http.Client instead of http.DefaultClient, use GetUTF8Body().

Getting the charset of given URL
client := http.DefaultClient
resp, err := client.Get("http://www.godoc.org")
if err != nil {
    fmt.Println(err)
    return
}

defer resp.Body.Close()
byt, err := ioutil.ReadAll(resp.Body)
if err != nil {
    fmt.Println(err)
    return
}

cs, err := charsetx.DetectCharset(byt, resp.Header.Get("Content-Type"))
if err != nil {
    fmt.Println(err)
    return
}

fmt.Println(cs)

License

MIT

Documentation

Overview

Package charsetx provides functions for detecting charset encoding of an HTML document and UTF-8 conversion.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func DetectCharset

func DetectCharset(body []byte, contentType string) (string, error)

DetectCharset returns charset for contents of urlStr by 3 steps:

  1. Use the result of charset.DetermineEncoding() if certain is true.
  2. Use the result of chardet.Detector.DetectBest() if Confidence is 100.
  3. Use charset in `Content-Type` meta tag if exists.

If no charset is detected by 3 steps, it returns error.

Argument body is response body as byte[], and contentType is `Content-Type` response header value.

Example
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"

	"github.com/philipjkim/charsetx"
)

func main() {
	client := http.DefaultClient
	resp, err := client.Get("http://www.godoc.org")
	if err != nil {
		fmt.Println(err)
		return
	}

	defer resp.Body.Close()
	byt, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println(err)
		return
	}

	cs, err := charsetx.DetectCharset(byt, resp.Header.Get("Content-Type"))
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println(cs)
}
Output:

func GetUTF8Body

func GetUTF8Body(body []byte, contentType string,
	ignoreInvalidUTF8Chars bool) (string, error)

GetUTF8Body returns string of body. If charset for body is detected as non-UTF8, this function converts to UTF-8 string and return it. Illegal byte sequences are silently discarded if ignoreInvalidUTF8Chars is set to true.

Example
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"

	"github.com/philipjkim/charsetx"
)

func main() {
	client := http.DefaultClient
	resp, err := client.Get("https://golang.org/")
	if err != nil {
		fmt.Println(err)
		return
	}

	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println(err)
		return
	}

	r, err := charsetx.GetUTF8Body(body, resp.Header.Get("Content-Type"), false)
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println(r)
}
Output:

func GetUTF8BodyFromURL

func GetUTF8BodyFromURL(urlStr string, ignoreIBS bool) (string, error)

GetUTF8BodyFromURL returns response body of urlStr as string. It converts charset to UTF-8 if original charset is non UTF-8. Illegal byte sequences are silently discarded if ignoreIBS is set to true.

Example
package main

import (
	"fmt"

	"github.com/philipjkim/charsetx"
)

func main() {
	r, err := charsetx.GetUTF8BodyFromURL("http://www.godoc.org", false)
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println(r)
}
Output:

Types

This section is empty.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL