extract

package module
v0.0.0-...-1eaae7b Latest
Published: Mar 24, 2016 License: MIT Imports: 11 Imported by: 0

README

Extract

Extract is an HTML extractor based on wedata.

Acknowledgement

  • items.json is originally from http://wedata.net/databases/LDRFullFeed/items.json.
  • Currently, Extract only works for URLs that are registered in wedata.

How to use

From _example:

package main

import (
	"flag"
	"fmt"
	"log"
	"os"

	"github.com/suzuken/extract"
)

func main() {
	var (
		rawurl = flag.String("url", "http://example.com", "url for extract")
	)
	flag.Parse()
	ex := extract.New()
	// Check whether the URL is covered by a wedata rule before fetching it.
	if rule := ex.Match(*rawurl); rule == nil {
		log.Printf("%s doesn't match any rule", *rawurl)
		os.Exit(0)
	}
	// Fetch the page, apply the matching rule, and print the extracted content.
	c, err := ex.ExtractURL(*rawurl)
	if err != nil {
		log.Fatalf("extract failed: %s", err)
	}
	fmt.Printf("content: %v", c)
}

LICENSE

MIT

All data in wedata are in the public domain. See also: http://wedata.net/help/about.

Special Thanks

  • Wedata project and members.

Author

Kenta Suzuki (a.k.a. suzuken)

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Content

type Content struct {
	Title       string
	Description string
	Text        string // Text is the extracted body.
}

Content is the extracted content.

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor is the actual extractor.

func New

func New() *Extractor

New creates an extractor.

func (*Extractor) Dump

func (e *Extractor) Dump() []Rule

Dump returns the loaded rules.

func (*Extractor) Extract

func (e *Extractor) Extract(r io.Reader, rawurl string) (*Content, error)

Extract loads the body from the io.Reader and extracts content according to the matching rule. If the URL does not match any wedata rule, it is skipped and nil is returned. The io.Reader should be a UTF-8 byte stream.
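
A minimal sketch of driving Extract directly, assuming you fetch the page yourself (for example with net/http) and the body is already UTF-8; the article URL below is hypothetical and has to be covered by a wedata rule:

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/suzuken/extract"
)

func main() {
	// Hypothetical article URL; it has to match one of the wedata rules.
	rawurl := "http://example.com/article/1"

	// Fetch the page ourselves; Extract only needs the body and the URL
	// used for rule matching. The body is assumed to already be UTF-8.
	resp, err := http.Get(rawurl)
	if err != nil {
		log.Fatalf("fetch failed: %s", err)
	}
	defer resp.Body.Close()

	ex := extract.New()
	c, err := ex.Extract(resp.Body, rawurl)
	if err != nil {
		log.Fatalf("extract failed: %s", err)
	}
	if c == nil {
		log.Printf("%s doesn't match any rule", rawurl)
		return
	}
	fmt.Println(c.Title)
	fmt.Println(c.Text)
}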

func (*Extractor) ExtractURL

func (e *Extractor) ExtractURL(rawurl string) (*Content, error)

ExtractURL fetches the contents from the URL, parses them, and extracts the content.

func (*Extractor) Init

func (e *Extractor) Init()

Init initializes the extractor.

func (*Extractor) Load

func (e *Extractor) Load(r io.Reader) error

Load reads rule JSON from the reader and extracts the rules.

func (*Extractor) LoadFile

func (e *Extractor) LoadFile(path string) error

LoadFile loads LDRFullFeed JSON from the given file path.
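
A minimal sketch of loading rules from a local file and then checking how many rules the extractor holds via Dump; "items.json" is a hypothetical local copy of the LDRFullFeed rules, and whether LoadFile replaces or adds to the rules already held is not specified here:

package main

import (
	"fmt"
	"log"

	"github.com/suzuken/extract"
)

func main() {
	ex := extract.New()
	// "items.json" is a hypothetical local copy of the wedata rule set.
	if err := ex.LoadFile("items.json"); err != nil {
		log.Fatalf("load rules: %s", err)
	}
	// Dump returns the rules currently held by the extractor.
	fmt.Printf("%d rules loaded\n", len(ex.Dump()))
}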

func (*Extractor) Match

func (e *Extractor) Match(rawurl string) *Rule

Match returns the wedata rule that matches the provided URL, or nil if none matches.

TODO: use prefix match for faster matching.
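
A minimal sketch of inspecting a matched rule through the URL and XPath getters documented under type Rule; the article URL below is hypothetical:

package main

import (
	"fmt"

	"github.com/suzuken/extract"
)

func main() {
	ex := extract.New()
	// Hypothetical URL; Match returns nil when no wedata rule covers it.
	rule := ex.Match("http://example.com/article/1")
	if rule == nil {
		fmt.Println("no matching rule")
		return
	}
	// URL() is the rule's raw regular expression; XPath() is the
	// expression used to pick the article body out of the page.
	fmt.Println("pattern:", rule.URL())
	fmt.Println("xpath:", rule.XPath())
}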

type IExtractor

type IExtractor interface {
	Extract(r io.Reader, rawurl string)
}

IExtractor is the interface of the extractor.

type Rule

type Rule struct {
	ResourceURL         string    `json:"resource_url"`
	Name                string    `json:"name"`
	CreatedBy           string    `json:"created_by"`
	DatabaseResourceURL string    `json:"database_resource_url"`
	UpdatedAt           time.Time `json:"updated_at"`
	CreatedAt           time.Time `json:"created_at"`
	Data                struct {
		URL          string `json:"url"` // URL is the regex pattern of target pages
		Type         string `json:"type"`
		Enc          string `json:"enc"` // Enc is encoding of the page contents
		XPath        string `json:"xpath"`
		Base         string `json:"base"`
		MicroFormats string `json:"microformats"`
	}
	// contains filtered or unexported fields
}

Rule is a wedata LDRFullFeed extraction rule.

func (*Rule) URL

func (r *Rule) URL() string

URL gets the raw URL from the rule. The URL is a raw regular expression string.

func (*Rule) XPath

func (r *Rule) XPath() string

XPath gets the XPath from the rule.
