heroscrape

package module
v0.0.0-...-011d13f
Published: Dec 24, 2018 License: MIT Imports: 10 Imported by: 0

README

hero-scrape

Find the hero (main) image of a URL

By v-braun - viktor-braun.de.

Demo

See a demo on https://hero-scrape.viktor-braun.de

Description

hero-scrape extracts the main image of a webpage. It uses different strategies to find the main image (OpenGraph HTML tags and a heuristic search). You can use the existing strategies or implement your own.

To find the "biggest" image, the candidate images have to be downloaded. fastimage is a good fit for that job because it fetches only as much of each image as it needs to determine its size.
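The size comparison described above can be illustrated with the standard library alone. The sketch below is not how hero-scrape or fastimage work internally. It assumes the candidate images are already available as byte slices and uses `image.DecodeConfig`, which reads only the image header, to compare pixel areas. The names `encodePNG`, `area`, and `biggest` are hypothetical helpers for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png"
)

// encodePNG builds an in-memory PNG of the given size (sample data only).
func encodePNG(w, h int) []byte {
	var buf bytes.Buffer
	if err := png.Encode(&buf, image.NewRGBA(image.Rect(0, 0, w, h))); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// area decodes only the image header and returns width*height.
func area(data []byte) (int, error) {
	cfg, _, err := image.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return 0, err
	}
	return cfg.Width * cfg.Height, nil
}

// biggest returns the name of the candidate with the largest pixel area.
// Candidates that cannot be decoded are skipped.
func biggest(candidates map[string][]byte) string {
	best, bestArea := "", -1
	for name, data := range candidates {
		a, err := area(data)
		if err != nil {
			continue
		}
		if a > bestArea {
			best, bestArea = name, a
		}
	}
	return best
}

func main() {
	candidates := map[string][]byte{
		"icon.png": encodePNG(16, 16),
		"hero.png": encodePNG(640, 360),
	}
	fmt.Println(biggest(candidates)) // prints "hero.png"
}
```

For remote images the same idea applies to a partially read response body, which is the kind of work fastimage is described as doing.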

Installation

go get github.com/v-braun/hero-scrape

Usage

With preconfigured strategies

pageUrl, _ := url.Parse("https://github.com/v-braun/hero-scrape")
res, _ := http.Get(pageUrl.String())
defer res.Body.Close()

result, _ := heroscrape.Scrape(pageUrl, res.Body)
fmt.Println(result.Image)

With custom strategies

pageUrl, _ := url.Parse("https://github.com/v-braun/hero-scrape")
res, _ := http.Get(pageUrl.String())
defer res.Body.Close()

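// YourOwnStrategy is a placeholder for a function returning your own heroscrape.Strategy
result, _ := heroscrape.ScrapeWithStrategy(pageUrl, res.Body, heroscrape.NewOgStrategy(), heroscrape.NewHeuristicStrategy(), YourOwnStrategy())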
fmt.Println(result.Image)

Related Projects

  • hero-scrape: Demo for this lib
  • fastimage: Finds the type and/or size of a remote image given its URI, by fetching as little as needed.
  • goquery: A little like that j-thing, only in Go.

Known Issues

If you discover any bugs, feel free to create an issue on GitHub, or fork the repository and send me a pull request.

Issues List.

Authors

v-braun

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

License

See LICENSE.

Documentation

Constants

This section is empty.

Variables

var ErrNotComplete = errors.New("Not complete")

ErrNotComplete is returned if the scrape could not be completed.

var Logger = log.New(ioutil.Discard, "hero-scrape", log.LstdFlags)

Logger is the logger instance for the entire module. By default it writes to ioutil.Discard, so nothing is logged.

Functions

func Debug

func Debug()

Debug enables debug logging for the module.
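The "silent by default, verbose on request" pattern behind Logger and Debug can be sketched with the standard library. This is an illustration of the pattern only, not the package's implementation: `enableDebug` and `capture` are hypothetical helpers, the sketch uses `io.Discard` (the newer equivalent of `ioutil.Discard`), and it sets the log flags to 0 so the output is deterministic.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
)

// logger is silent until debugging is enabled, like the module-level Logger.
var logger = log.New(io.Discard, "hero-scrape: ", 0)

// enableDebug redirects log output to w. Hypothetical helper: the
// package's own Debug function takes no arguments.
func enableDebug(w io.Writer) {
	logger.SetOutput(w)
}

// capture logs msg once and returns whatever reached the buffer.
func capture(msg string, debug bool) string {
	var buf bytes.Buffer
	if debug {
		enableDebug(&buf)
	} else {
		logger.SetOutput(io.Discard)
	}
	logger.Println(msg)
	return buf.String()
}

func main() {
	fmt.Printf("%q\n", capture("trying og strategy", false)) // prints ""
	fmt.Printf("%q\n", capture("trying og strategy", true))  // prints "hero-scrape: trying og strategy\n"
}
```

Because Logger is an exported variable, a caller could presumably also redirect it directly with `heroscrape.Logger.SetOutput(os.Stderr)`, which is a standard `*log.Logger` method.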

func GetAttrFromSelector

func GetAttrFromSelector(doc *goquery.Document, selector string, attrName string) string

Types

type ImageLocation

type ImageLocation string

type SearchResult

type SearchResult struct {
	Image       string
	Title       string
	Description string
}

SearchResult represents the scrape result

func Scrape

func Scrape(srcURL *url.URL, html io.Reader) (*SearchResult, error)

Scrape scrapes the given URL using the preconfigured strategies.

Example
package main

import (
	"fmt"
	"net/http"
	"net/url"

	heroscrape "github.com/v-braun/hero-scrape"
)

func main() {
	pageUrl, _ := url.Parse("https://github.com/v-braun/hero-scrape")
	res, _ := http.Get(pageUrl.String())
	defer res.Body.Close()

	result, _ := heroscrape.Scrape(pageUrl, res.Body)
	fmt.Println(result.Image)
}

func ScrapeWithStrategy

func ScrapeWithStrategy(srcURL *url.URL, html io.Reader, strategies ...Strategy) (*SearchResult, error)

ScrapeWithStrategy scrapes the given URL with the given strategies.

func (*SearchResult) Complete

func (sr *SearchResult) Complete() bool

Complete returns true if the SearchResult has found everything

type Strategy

type Strategy interface {
	Scrape(srcURL *url.URL, doc *goquery.Document) (*SearchResult, error)
}

Strategy is the interface for scraping a website.
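How several strategies might cooperate can be sketched as follows. Everything here is a stand-in built on assumptions, not the package's code: the local `Strategy` takes the HTML as a string instead of a `*goquery.Document`, `Complete` is assumed to mean "all three fields are set", and `scrapeAll` and `fixed` are hypothetical names. The sketch runs strategies in order, lets earlier strategies win per field, and reports `ErrNotComplete` when fields remain empty.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// SearchResult is a local stand-in with the documented fields.
type SearchResult struct{ Image, Title, Description string }

// Complete is ASSUMED to mean that every field has a value.
func (sr *SearchResult) Complete() bool {
	return sr.Image != "" && sr.Title != "" && sr.Description != ""
}

// ErrNotComplete is a local stand-in for the package variable.
var ErrNotComplete = errors.New("Not complete")

// Strategy is simplified: the real interface receives a *goquery.Document.
type Strategy interface {
	Scrape(srcURL *url.URL, html string) (*SearchResult, error)
}

// fixed is a toy strategy that always returns the same partial result.
type fixed SearchResult

func (f fixed) Scrape(_ *url.URL, _ string) (*SearchResult, error) {
	r := SearchResult(f)
	return &r, nil
}

// scrapeAll fills empty fields from each strategy in turn and stops early
// once the merged result is complete.
func scrapeAll(src *url.URL, html string, strategies ...Strategy) (*SearchResult, error) {
	merged := &SearchResult{}
	for _, s := range strategies {
		r, err := s.Scrape(src, html)
		if err != nil || r == nil {
			continue // in this sketch a failing strategy does not abort the chain
		}
		if merged.Image == "" {
			merged.Image = r.Image
		}
		if merged.Title == "" {
			merged.Title = r.Title
		}
		if merged.Description == "" {
			merged.Description = r.Description
		}
		if merged.Complete() {
			return merged, nil
		}
	}
	return merged, ErrNotComplete
}

func main() {
	src, _ := url.Parse("https://example.com/post")
	og := fixed{Title: "Post", Description: "About"}
	heuristic := fixed{Image: "https://example.com/hero.png", Title: "ignored"}
	res, err := scrapeAll(src, "", og, heuristic)
	fmt.Println(res.Image, res.Title, err) // prints "https://example.com/hero.png Post <nil>"
}
```

Ordering matters in a design like this: the strategy listed first takes precedence for every field it can fill, so cheaper or more reliable strategies would go first.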

func NewHeuristicStrategy

func NewHeuristicStrategy() Strategy

func NewOgStrategy

func NewOgStrategy() Strategy

NewOgStrategy returns a new Strategy that searches for Open Graph (OG) meta tags.
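As an illustration of what an Open Graph lookup involves, the sketch below finds the `og:image` meta tag and resolves it against the page URL. It is not the package's implementation, which is built on goquery. To stay within the standard library it assumes reasonably well-formed HTML and reads it with a lenient `encoding/xml` decoder, which will not cope with arbitrary real-world pages. The function name `ogImage` is hypothetical, and resolving relative URLs is an assumption about why a source URL is needed.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"net/url"
	"strings"
)

// ogImage returns the og:image URL declared in html as an absolute URL,
// resolved against pageURL.
func ogImage(pageURL, html string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}

	// Lenient settings so that unclosed tags such as <meta> are accepted.
	d := xml.NewDecoder(strings.NewReader(html))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	for {
		tok, err := d.Token()
		if err == io.EOF {
			return "", errors.New("no og:image meta tag found")
		}
		if err != nil {
			return "", err
		}
		el, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(el.Name.Local, "meta") {
			continue
		}
		var property, content string
		for _, a := range el.Attr {
			switch strings.ToLower(a.Name.Local) {
			case "property":
				property = a.Value
			case "content":
				content = a.Value
			}
		}
		if property == "og:image" && content != "" {
			abs, err := base.Parse(content) // resolves relative references
			if err != nil {
				return "", err
			}
			return abs.String(), nil
		}
	}
}

func main() {
	page := `<html><head><title>Demo</title>` +
		`<meta property="og:image" content="/img/hero.png">` +
		`</head><body></body></html>`
	img, err := ogImage("https://example.com/blog/post", page)
	fmt.Println(img, err) // prints "https://example.com/img/hero.png <nil>"
}
```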
