dtoo

package module
v0.0.0-...-a9ca5d3
Published: Dec 13, 2014 License: MIT Imports: 5 Imported by: 0

README

Overview

Dtoo exposes an HTML scraper API inspired by the artoo.js Scrape API (https://medialab.github.io/artoo/scrape/).

The dtoo scrape API closely follows artoo's scrape API, but is slightly modified to suit Go. The biggest changes are that "scrapeTable" is not implemented and "scrapeOne" is implemented via the ScrapeFromXxxWithLimit functions.

The artoo example

artoo.scrape('li', {id: 'id', content: 'text'});

written for dtoo would look like this.

dtoo.ScrapeFromUrl("li", dtoo.Model{"id": "id", "content": "text"}, url)

The above example will return a slice of dtoo.Model objects each with the following keys: {id, content}.
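
Each ScrapeXxx function returns a []interface{} and an error. Below is a minimal sketch of consuming that result; the assertion of each element back to a dtoo.Model is an assumption that holds when a Model data model is used.

results, err := dtoo.ScrapeFromUrl("li", dtoo.Model{"id": "id", "content": "text"}, url)
if err != nil {
	log.Fatal(err)
}
for _, r := range results {
	// Assumption: each element is a dtoo.Model when a Model data model is used.
	if item, ok := r.(dtoo.Model); ok {
		fmt.Println(item["id"], item["content"])
	}
}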

The dtoo data model passed to the ScrapeXxx and ScrapeXxxWithLimit functions can be a string, func (s *goquery.Selection) (interface{}, error), dtoo.Model or dtoo.RetrieverModel.

Retrieves a slice of post id attributes using a string data model.

dtoo.ScrapeFromUrl(".post", "id", url)

Retrieves a slice of post comments using a function data model. When using a function you work directly with the low-level goquery API to extract content from the DOM.

type Comment struct {
	Author string
	Content string
}

dtoo.ScrapeFromUrl(".post", func (s *goquery.Selection) (interface{}, error) {
	var err error
	comments := make([]Comment, 0)

	s.Find(".comments").EachWithBreak(func (i int, s *goquery.Selection) bool {
		comment := Comment{
			Author: s.Find(".comment-author").Text(),
		}

		if content, e := s.Find(".comment-content").Html(); e == nil {
			comment.Content = content
		} else {
			// Save error and exit
			err = e
			return false
		}

		comments = append(comments, comment)
		return true
	})

	return comments, err
}, url)

Retrieves a slice of dtoo.Model objects each with the following keys {Id, Title, PublishDate}.

dtoo.ScrapeFromUrl(".post", dtoo.Model{
	"Id": "id",
	"Title": "title",
	"PublishDate": "datetime"
}, url)

Using dtoo.Model you can nest dtoo.RetrieverModel objects and functions as values in your model to retrieve arbitrarily complex models. This example retrieves a slice of dtoo.Model objects with the following properties: {Id, Title, PublishDate, Comments: [{Id, Author, Content}]}.

dtoo.ScrapeFromUrl(".post", dtoo.Model{
	"Id": "id",
	"Title": dtoo.RetrieverModel{Sel: ".post-title", Method: "text"},
	"PublishDate": dtoo.RetrieverModel{Sel: ".publisehed-date", Attr: "datetime"},
	"Comments": func (s *goquery.Selection) (interface{}, error) {
		// We make a recursive call to dtoo.Scrape to make things easy.
		return dtoo.Scrape(".comment", dtoo.Model{
			"Id": "id",
			"Author": dtoo.RetrieverModel{Sel: ".comment-author", Method: "text"},
			"Content": dtoo.RetrieverModel{Sel: ".comment-content", Method: "html"},
		}, s, 0)
	},
}, url)

The above example can alternatively be achieved using the recursive Scrape setting of RetrieverModel.

dtoo.ScrapeFromUrl(".post", dtoo.Model{
	"Id": "id",
	"Title": dtoo.RetrieverModel{Sel: ".post-title", Method: "text"},
	"PublishDate": dtoo.RetrieverModel{Sel: ".publisehed-date", Attr: "datetime"},
	"Comments": dtoo.RetrieverModel{
		Scrape: dtoo.ScrapeObject{
			Iterator: ".comment", 
			Data: dtoo.Model{
				"Id": "id",
				"Author": dtoo.RetrieverObject{Sel: ".comment-author", Method: "text"},
				"Content": dtoo.RetrieverObject{Sel: ".comment-content", Method: "html"},
			},
		},
	},
}, url)

Documentation

Index

Constants

const (
	EMPTYSTRING = ""
)

EMPTYSTRING is a convenient constant for an empty string.

Variables

This section is empty.

Functions

func Scrape

func Scrape(iterator string, model interface{}, s *goquery.Selection, limit uint) ([]interface{}, error)

Scrape scrapes content from a goquery.Selection object according to the data model specified. Takes a selector as its root iterator and then takes the data model you intend to extract at each iteration. Will iterate up to limit iterations; if limit is 0 then no limit is applied. The data model can be a string, func(s *goquery.Selection) (interface{}, error), Model or RetrieverModel.

Returns a value based on the data model specified. See package examples for more info.

Example:

doc, err := goquery.NewDocument(url)
if err == nil {
  Scrape("li", dtoo.Model{id: 'id', content: 'text'}, doc.Selection, 0)
}

func ScrapeFromReader

func ScrapeFromReader(iterator string, model interface{}, r io.Reader) ([]interface{}, error)

ScrapeFromReader scrapes content from an io.Reader according to the data model specified. Takes a selector as its root iterator and then takes the data model you intend to extract at each iteration. The data model can be a string, func(s *goquery.Selection) (interface{}, error), Model or RetrieverModel.

Returns a value based on the data model specified. See package examples for more info.

Example:

dtoo.ScrapeFromReader("li", dtoo.Model{id: 'id', content: 'text'}, reader)

func ScrapeFromReaderWithLimit

func ScrapeFromReaderWithLimit(iterator string, model interface{}, r io.Reader, limit uint) ([]interface{}, error)

ScrapeFromReaderWithLimit scrapes content from an io.Reader according to the data model specified, up to a limit. Takes a selector as its root iterator and then takes the data model you intend to extract at each iteration. The data model can be a string, func(s *goquery.Selection) (interface{}, error), Model or RetrieverModel. Will iterate up to limit iterations; if limit is 0 then no limit is applied.

Returns a value based on the data model specified. See package examples for more info.

Example:

dtoo.ScrapeFromReaderWithLimit("li", dtoo.Model{id: 'id', content: 'text'}, reader, 0)

func ScrapeFromString

func ScrapeFromString(iterator string, model interface{}, html string) ([]interface{}, error)

ScrapeFromString scrapes content from an HTML string according to the data model specified. Takes a selector as its root iterator and then takes the data model you intend to extract at each iteration. The data model can be a string, func(s *goquery.Selection) (interface{}, error), Model or RetrieverModel.

Returns a value based on the data model specified. See package examples for more info.

Example:

dtoo.ScrapeFromString("li", dtoo.Model{id: 'id', content: 'text'}, html)

func ScrapeFromStringWithLimit

func ScrapeFromStringWithLimit(iterator string, model interface{}, html string, limit uint) ([]interface{}, error)

ScrapeFromStringWithLimit scrapes content from an HTML string according to the data model specified, up to a limit. Takes a selector as its root iterator and then takes the data model you intend to extract at each iteration. The data model can be a string, func(s *goquery.Selection) (interface{}, error), Model or RetrieverModel. Will iterate up to limit iterations; if limit is 0 then no limit is applied.

Returns a value based on the data model specified. See package examples for more info.

Example:

dtoo.ScrapeFromStringWithLimit("li", dtoo.Model{id: 'id', content: 'text'}, html, 0)

func ScrapeFromUrl

func ScrapeFromUrl(iterator string, model interface{}, url string) ([]interface{}, error)

ScrapeFromUrl scrapes content from a URL according to the data model specified. Takes a selector as its root iterator and then takes the data model you intend to extract at each iteration. The data model can be a string, func(s *goquery.Selection) (interface{}, error), Model or RetrieverModel.

Returns a value based on the data model specified. See package examples for more info.

Example:

dtoo.ScrapeFromUrl("li", dtoo.Model{id: 'id', content: 'text'}, url)

func ScrapeFromUrlWithLimit

func ScrapeFromUrlWithLimit(iterator string, model interface{}, url string, limit uint) ([]interface{}, error)

ScrapeFromUrlWithLimit scrapes content from a URL according to the data model specified, up to a limit. Takes a selector as its root iterator and then takes the data model you intend to extract at each iteration. The data model can be a string, func(s *goquery.Selection) (interface{}, error), Model or RetrieverModel. Will iterate up to limit iterations; if limit is 0 then no limit is applied.

Returns a value based on the data model specified. See package examples for more info.

Example:

dtoo.ScrapeFromUrlWithLimit("li", dtoo.Model{id: 'id', content: 'text'}, url, 0)

Types

type Model

type Model map[string]interface{}

Model is a composite data model represented as a map[string]interface{}.

type RetrieverModel

type RetrieverModel struct {
	// The name of the attribute to extract.
	Attr string
	// The CSS selector to extract from.
	Sel string
	// The method to use for the extraction. Can be set to "text", "html" or a func(*goquery.Selection)(interface{}, error). Required if Sel is set.
	Method interface{}
	// If set to a dtoo.ScrapeObject then a recursive scrape will be executed.
	Scrape ScrapeObject
	// If set and the retrieved value is nil or the empty string then returns this value.
	DefaultValue interface{}
}

RetrieverModel is an expressive object that can perform subselection scraping. This data model type behaves in a similar fashion to the retriever object interface accepted by artoo.

Examples:

Retrieves a slice of post titles.

dtoo.ScrapeFromUrl(".post", dtoo.RetrieverModel{Sel: ".post-title", Method: "text", DefaultValue: "Unknown Post"}, url)

Retrieves a slice of post published dates via the datetime attribute.

dtoo.ScrapeFromUrl(".post", dtoo.RetrieverModel{Sel: ".publisehed-date", Attr: "datetime"}, url)

Retrieves a slice of slices of comment authors.

dtoo.ScrapeFromUrl(".post", dtoo.RetrieverModel{Scrape: dtoo.ScrapeObject{Iterator: ".comment-author", Data: "text"}}, url)

type ScrapeObject

type ScrapeObject struct {
	// The selector that will act as the root iterator for a recursive scrape.
	Iterator string
	// The data model for the recursive scrape. Can be any data model type accepted by the Scrape functions.
	Data interface{}
}

ScrapeObject is an object that specifies settings for recursive scraping.
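
Because Data can itself be a Model containing further RetrieverModel values, ScrapeObject scrapes can nest. A sketch of comments with nested replies; the .reply selector and key names are illustrative assumptions.

dtoo.ScrapeFromUrl(".post", dtoo.RetrieverModel{
	Scrape: dtoo.ScrapeObject{
		Iterator: ".comment",
		Data: dtoo.Model{
			"Author": dtoo.RetrieverModel{Sel: ".comment-author", Method: "text"},
			// A second-level recursive scrape of each comment's replies (hypothetical markup).
			"Replies": dtoo.RetrieverModel{
				Scrape: dtoo.ScrapeObject{Iterator: ".reply", Data: "id"},
			},
		},
	},
}, url)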
