extract

package module
v0.0.0-...-1eaae7b Latest
Published: Mar 24, 2016 License: MIT Imports: 11 Imported by: 0

README

Extract

Extract is an HTML extractor based on wedata.

Acknowledgement

  • items.json is originally from http://wedata.net/databases/LDRFullFeed/items.json.
  • Currently, Extract only works for URLs that are registered in wedata.

How to use

From _example:

package main

import (
	"flag"
	"fmt"
	"log"
	"os"

	"github.com/suzuken/extract"
)

func main() {
	var (
		rawurl = flag.String("url", "http://example.com", "url for extract")
	)
	flag.Parse()
	ex := extract.New()
	// Check whether the URL is covered by a wedata rule before fetching it.
	if rule := ex.Match(*rawurl); rule == nil {
		log.Printf("%s doesn't match any rule", *rawurl)
		os.Exit(0)
	}
	// Fetch the page, apply the matching rule, and print the extracted content.
	c, err := ex.ExtractURL(*rawurl)
	if err != nil {
		log.Fatalf("extract failed: %s", err)
	}
	fmt.Printf("content: %v", c)
}

LICENSE

MIT

All data in wedata are in the public domain. See also: http://wedata.net/help/about.

Special Thanks

  • Wedata project and members.

Author

Kenta Suzuki (a.k.a. suzuken)

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Content

type Content struct {
	Title       string
	Description string
	Text        string // Text is the extracted body.
}

Content is the extracted content.

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor is the actual extractor.

func New

func New() *Extractor

New creates an extractor.

func (*Extractor) Dump

func (e *Extractor) Dump() []Rule

Dump returns the loaded rules.

func (*Extractor) Extract

func (e *Extractor) Extract(r io.Reader, rawurl string) (*Content, error)

Extract loads the body from the io.Reader and extracts content according to the matching rule. If the URL does not match any wedata rule, it is skipped and nil is returned. The io.Reader should be a UTF-8 byte stream.
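
A minimal sketch of driving Extract directly, assuming you fetch the page yourself (for example with net/http) and the body is already UTF-8; the article URL below is hypothetical and has to be covered by a wedata rule:

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/suzuken/extract"
)

func main() {
	// Hypothetical article URL; it has to match one of the wedata rules.
	rawurl := "http://example.com/article/1"

	// Fetch the page ourselves; Extract only needs the body and the URL
	// used for rule matching. The body is assumed to already be UTF-8.
	resp, err := http.Get(rawurl)
	if err != nil {
		log.Fatalf("fetch failed: %s", err)
	}
	defer resp.Body.Close()

	ex := extract.New()
	c, err := ex.Extract(resp.Body, rawurl)
	if err != nil {
		log.Fatalf("extract failed: %s", err)
	}
	if c == nil {
		log.Printf("%s doesn't match any rule", rawurl)
		return
	}
	fmt.Println(c.Title)
	fmt.Println(c.Text)
}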

func (*Extractor) ExtractURL

func (e *Extractor) ExtractURL(rawurl string) (*Content, error)

ExtractURL fetches the contents from the URL, parses them, and extracts the content.

func (*Extractor) Init

func (e *Extractor) Init()

Init initializes the extractor.

func (*Extractor) Load

func (e *Extractor) Load(r io.Reader) error

Load reads rule JSON from the reader and extracts the rules.

func (*Extractor) LoadFile

func (e *Extractor) LoadFile(path string) error

LoadFile loads LDRFullFeed JSON from the given file path.
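
A minimal sketch of loading rules from a local file and then checking how many rules the extractor holds via Dump; "items.json" is a hypothetical local copy of the LDRFullFeed rules, and whether LoadFile replaces or adds to the rules already held is not specified here:

package main

import (
	"fmt"
	"log"

	"github.com/suzuken/extract"
)

func main() {
	ex := extract.New()
	// "items.json" is a hypothetical local copy of the wedata rule set.
	if err := ex.LoadFile("items.json"); err != nil {
		log.Fatalf("load rules: %s", err)
	}
	// Dump returns the rules currently held by the extractor.
	fmt.Printf("%d rules loaded\n", len(ex.Dump()))
}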

func (*Extractor) Match

func (e *Extractor) Match(rawurl string) *Rule

Match returns the wedata rule that matches the provided URL, or nil if none matches.

TODO: use prefix match for faster matching.
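
A minimal sketch of inspecting a matched rule through the URL and XPath getters documented under type Rule; the article URL below is hypothetical:

package main

import (
	"fmt"

	"github.com/suzuken/extract"
)

func main() {
	ex := extract.New()
	// Hypothetical URL; Match returns nil when no wedata rule covers it.
	rule := ex.Match("http://example.com/article/1")
	if rule == nil {
		fmt.Println("no matching rule")
		return
	}
	// URL() is the rule's raw regular expression; XPath() is the
	// expression used to pick the article body out of the page.
	fmt.Println("pattern:", rule.URL())
	fmt.Println("xpath:", rule.XPath())
}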

type IExtractor

type IExtractor interface {
	Extract(r io.Reader, rawurl string)
}

IExtractor is the interface of the extractor.

type Rule

type Rule struct {
	ResourceURL         string    `json:"resource_url"`
	Name                string    `json:"name"`
	CreatedBy           string    `json:"created_by"`
	DatabaseResourceURL string    `json:"database_resource_url"`
	UpdatedAt           time.Time `json:"updated_at"`
	CreatedAt           time.Time `json:"created_at"`
	Data                struct {
		URL          string `json:"url"` // URL is the regex pattern of target pages
		Type         string `json:"type"`
		Enc          string `json:"enc"` // Enc is encoding of the page contents
		XPath        string `json:"xpath"`
		Base         string `json:"base"`
		MicroFormats string `json:"microformats"`
	}
	// contains filtered or unexported fields
}

Rule is a wedata LDRFullFeed extraction rule.

func (*Rule) URL

func (r *Rule) URL() string

URL gets the raw URL from the rule. The URL is a raw regular expression string.

func (*Rule) XPath

func (r *Rule) XPath() string

XPath gets the XPath from the rule.
