heroscrape

Version: v0.0.0-...-011d13f | Published: Dec 24, 2018 | License: MIT | Imports: 10 | Imported by: 0

Repository

github.com/v-braun/hero-scrape

README ¶

hero-scrape

Find the hero (main) image of a URL

By v-braun - viktor-braun.de.

Demo

See a demo at https://hero-scrape.viktor-braun.de

Description

hero-scrape extracts the main image of a webpage. It uses different strategies to find the main image (OpenGraph HTML tags and heuristic search). You can use the existing strategies or implement your own.

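To find the "biggest" image, the size of each candidate must be known, which normally means downloading it. fastimage suits that job because it determines the type and size of a remote image by fetching as little of it as needed.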

Installation

go get github.com/v-braun/hero-scrape

Usage

With preconfigured strategies

pageUrl, _ := url.Parse("https://github.com/v-braun/hero-scrape")
res, _ := http.Get(pageUrl.String())
defer res.Body.Close()

result, _ := heroscrape.Scrape(pageUrl, res.Body)
fmt.Println(result.Image)

With custom strategies

pageUrl, _ := url.Parse("https://github.com/v-braun/hero-scrape")
res, _ := http.Get(pageUrl.String())
defer res.Body.Close()

result, _ := heroscrape.ScrapeWithStrategy(pageUrl, res.Body, heroscrape.NewOgStrategy(), heroscrape.NewHeuristicStrategy(), YourOwnStrategy())
fmt.Println(result.Image)

Related Projects

hero-scrape: Demo for this lib
fastimage: Finds the type and/or size of a remote image given its uri, by fetching as little as needed.
goquery: A little like that j-thing, only in Go.

Known Issues

If you discover any bugs, feel free to create an issue on GitHub, or fork the repository and send me a pull request.

Issues List.

Authors

v-braun

Contributing

Fork it
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request

License

See LICENSE.

Documentation ¶

Index ¶

Variables
func Debug()
func GetAttrFromSelector(doc *goquery.Document, selector string, attrName string) string
type ImageLocation
type SearchResult
- func Scrape(srcURL *url.URL, html io.Reader) (*SearchResult, error)
- func ScrapeWithStrategy(srcURL *url.URL, html io.Reader, strategies ...Strategy) (*SearchResult, error)
- func (sr *SearchResult) Complete() bool
type Strategy
- func NewHeuristicStrategy() Strategy
- func NewOgStrategy() Strategy

Examples ¶

Scrape

Variables ¶

var ErrNotComplete = errors.New("Not complete")

ErrNotComplete is returned if the scrape did not complete

var Logger = log.New(ioutil.Discard, "hero-scrape", log.LstdFlags)

Logger is the logger instance for the entire module. By default it discards all output (ioutil.Discard).

Functions ¶

func Debug ¶

func Debug()

Debug enables debug logging for the module

func GetAttrFromSelector ¶

func GetAttrFromSelector(doc *goquery.Document, selector string, attrName string) string

Types ¶

type ImageLocation ¶

type ImageLocation string

type SearchResult ¶

type SearchResult struct {
	Image       string
	Title       string
	Description string
}

SearchResult represents the scrape result

func Scrape ¶

func Scrape(srcURL *url.URL, html io.Reader) (*SearchResult, error)

Scrape scrapes the HTML of the given URL with the preconfigured strategies

Example ¶

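package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"

	heroscrape "github.com/v-braun/hero-scrape"
)

func main() {
	pageUrl, err := url.Parse("https://github.com/v-braun/hero-scrape")
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the page; on error res is nil, so stop before touching res.Body.
	res, err := http.Get(pageUrl.String())
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	// Scrape the fetched HTML with the preconfigured strategies.
	result, err := heroscrape.Scrape(pageUrl, res.Body)
	if err != nil {
		log.Printf("scrape failed: %v", err)
		return
	}
	fmt.Println(result.Image)
}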

func ScrapeWithStrategy ¶

func ScrapeWithStrategy(srcURL *url.URL, html io.Reader, strategies ...Strategy) (*SearchResult, error)

ScrapeWithStrategy scrapes the given URL with the given strategies

func (*SearchResult) Complete ¶

func (sr *SearchResult) Complete() bool

Complete returns true if the SearchResult has found everything

type Strategy ¶

type Strategy interface {
	Scrape(srcURL *url.URL, doc *goquery.Document) (*SearchResult, error)
}

Strategy is the interface for scraping a website

func NewHeuristicStrategy ¶

func NewHeuristicStrategy() Strategy

func NewOgStrategy ¶

func NewOgStrategy() Strategy

NewOgStrategy returns a new Strategy that searches for OpenGraph (OG) meta tags
