thumbscraper

Published: Jan 25, 2020 License: MIT Imports: 10 Imported by: 0

A web image scraper written in Go that extracts all images from a web page and can optionally identify the page's primary images. Given a collection of scraped images, it can also determine which one is the most appropriate to use for thumbnail generation.

It uses the colly scraper to scrape elements from the DOM.

Installation

Simply run the following command to install the package to your $GOPATH.

go get "github.com/tyncture/thumbscraper"

You can then use it in your project like so:

package main

import (
	"fmt"
	"os"

	"github.com/tyncture/thumbscraper"
)

func main() {
	imageNodes, err := thumbscraper.GetImageNodes("https://github.com/Tyncture/thumbscraper")
	if err != nil {
		// Failed to load the web page.
		fmt.Fprintln(os.Stderr, "failed to load page:", err)
		os.Exit(1)
	}

	imageNodesInfo, err := thumbscraper.GetImageNodeInfoBatch(imageNodes)
	if err != nil {
		// If RequireAll is set to true in the optional
		// GetImageNodeInfoBatchOptions, GetImageNodeInfoBatch returns an error
		// when it cannot retrieve all images. If it is false or no options
		// are passed, the batch runs to completion even if some images
		// cannot be processed successfully.
		fmt.Fprintln(os.Stderr, "failed to retrieve images:", err)
		os.Exit(1)
	}

	// Get the best thumbnail and print its URL.
	thumbnail, err := thumbscraper.DetermineThumbnail(imageNodesInfo)
	if err != nil {
		// Exit here: thumbnail must not be dereferenced when err is non-nil.
		fmt.Fprintln(os.Stderr, "No images to process")
		os.Exit(1)
	}
	fmt.Println(thumbnail.URL)
}

Documentation

type ImageNode
type ImageNode struct {
	Name           string
	Alt            string
	URL            string
	OpenGraphImage bool
}

ImageNode represents information relating to HTML image elements discovered on the requested URLs.

func GetImageNodes
func GetImageNodes(pageURL string) ([]ImageNode, error)

GetImageNodes returns an []ImageNode containing, for each image, its name, alt text, URL, and whether it comes from an OpenGraph image meta tag.

type ImageNodeInfo
type ImageNodeInfo struct {
	ImageNode
	Format string
	Height int
	Width  int
	Image  *image.Image
}

ImageNodeInfo represents information relating to image elements discovered on the requested URLs with additional useful information. Image is only populated if ScrapeImages is set to true in GetImageNodeInfoOptions or GetImageNodeInfoBatchOptions.

func GetImageNodeInfo
func GetImageNodeInfo(imageNode ImageNode, options ...GetImageNodeInfoOptions) (*ImageNodeInfo, error)

GetImageNodeInfo takes an ImageNode and returns an *ImageNodeInfo with additional properties obtained by loading and analysing the image itself. options is an optional GetImageNodeInfoOptions struct that specifies whether to keep the image in the returned ImageNodeInfo; the default is false.
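As a rough illustration of how format and dimensions can be read from image data, the sketch below uses image.DecodeConfig from the Go standard library. It is not taken from the package's source: imageMeta, readMeta and samplePNG are hypothetical names, and the image is generated in memory rather than downloaded.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png" // importing this also registers the PNG decoder
)

// imageMeta is a hypothetical stand-in for the Format, Width and Height
// fields of ImageNodeInfo.
type imageMeta struct {
	Format string
	Width  int
	Height int
}

// readMeta decodes only the image header, so the pixel data does not have
// to be decoded to learn the format and dimensions.
func readMeta(data []byte) (imageMeta, error) {
	cfg, format, err := image.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return imageMeta{}, err
	}
	return imageMeta{Format: format, Width: cfg.Width, Height: cfg.Height}, nil
}

// samplePNG builds a blank in-memory PNG so the example needs no network.
func samplePNG(w, h int) []byte {
	var buf bytes.Buffer
	if err := png.Encode(&buf, image.NewRGBA(image.Rect(0, 0, w, h))); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	meta, err := readMeta(samplePNG(64, 32))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s %dx%d\n", meta.Format, meta.Width, meta.Height)
}
```

Other formats would need their decoders registered in the same way, for example by importing image/jpeg or image/gif.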

func GetImageNodeInfoBatch
func GetImageNodeInfoBatch(imageNodes []ImageNode,
	options ...GetImageNodeInfoBatchOptions) ([]*ImageNodeInfo, error)

GetImageNodeInfoBatch does the same thing as GetImageNodeInfo, but takes an []ImageNode and returns an []*ImageNodeInfo after processing the nodes as a batch. options is an optional GetImageNodeInfoBatchOptions struct that specifies whether to keep images in the returned ImageNodeInfo structs (default false) and whether to require all image requests to complete successfully (default also false). Refer to the GetImageNodeInfoBatchOptions struct type for more information.
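The two batch modes described here can be sketched with a small self-contained program. This is only an illustration of the documented behaviour, not the package's implementation: batchOptions, fetchFunc, processBatch and fakeFetch are hypothetical names, and the loader is faked so nothing is downloaded.

```go
package main

import (
	"errors"
	"fmt"
)

// batchOptions is a local stand-in that mirrors the documented RequireAll flag.
type batchOptions struct {
	RequireAll bool
}

// fetchFunc is a hypothetical per-item loader.
type fetchFunc func(url string) (string, error)

// processBatch applies fetch to every URL. With RequireAll set, the first
// failure aborts the batch; otherwise failed items are skipped.
func processBatch(urls []string, fetch fetchFunc, options ...batchOptions) ([]string, error) {
	var opts batchOptions // zero value: RequireAll is false
	if len(options) > 0 {
		opts = options[0]
	}
	results := make([]string, 0, len(urls))
	for _, u := range urls {
		r, err := fetch(u)
		if err != nil {
			if opts.RequireAll {
				return nil, fmt.Errorf("fetching %s: %w", u, err)
			}
			continue // lenient mode: skip this item and carry on
		}
		results = append(results, r)
	}
	return results, nil
}

// fakeFetch fails for one URL so both modes can be shown offline.
func fakeFetch(url string) (string, error) {
	if url == "bad.png" {
		return "", errors.New("not found")
	}
	return "info:" + url, nil
}

func main() {
	urls := []string{"a.png", "bad.png", "b.png"}

	lenient, err := processBatch(urls, fakeFetch)
	fmt.Println(lenient, err)

	_, err = processBatch(urls, fakeFetch, batchOptions{RequireAll: true})
	fmt.Println(err)
}
```

Because the options parameter is variadic, callers who are happy with the defaults can simply omit it.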

type GetImageNodeInfoBatchOptions
type GetImageNodeInfoBatchOptions struct {
	GetImageNodeInfoOptions
	RequireAll bool
}

GetImageNodeInfoBatchOptions represents the configuration used by GetImageNodeInfoBatch. Default for RequireAll is false.

type GetImageNodeInfoOptions
type GetImageNodeInfoOptions struct {
	ScrapeImages bool
}

GetImageNodeInfoOptions represents the configuration used by GetImageNodeInfo. Default for ScrapeImages is false.
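Since GetImageNodeInfoBatchOptions embeds GetImageNodeInfoOptions, setting ScrapeImages on the batch options requires naming the embedded struct in the composite literal. The sketch below redeclares both structs locally, copied from the definitions above, so that it compiles without importing the package; newBatchOptions is a hypothetical helper.

```go
package main

import "fmt"

// Local copies of the two option structs as documented above.
type GetImageNodeInfoOptions struct {
	ScrapeImages bool
}

type GetImageNodeInfoBatchOptions struct {
	GetImageNodeInfoOptions
	RequireAll bool
}

// newBatchOptions is a hypothetical helper showing the literal syntax.
func newBatchOptions(scrape, requireAll bool) GetImageNodeInfoBatchOptions {
	return GetImageNodeInfoBatchOptions{
		// An embedded struct is keyed by its type name in a composite literal.
		GetImageNodeInfoOptions: GetImageNodeInfoOptions{ScrapeImages: scrape},
		RequireAll:              requireAll,
	}
}

func main() {
	opts := newBatchOptions(true, false)
	// The embedded field is promoted, so it can be read directly.
	fmt.Println(opts.ScrapeImages, opts.RequireAll)

	// The zero value gives the documented defaults: both flags false.
	var defaults GetImageNodeInfoBatchOptions
	fmt.Println(defaults.ScrapeImages, defaults.RequireAll)
}
```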

func EnforceURLSchema
func EnforceURLSchema(pageURL string, imageURL string) string

EnforceURLSchema normalizes an image URL into a form that can be requested. Image URLs embedded in HTML image elements are often missing the scheme prefix. GetImageNodeInfo uses it to ensure that the URL is valid before making a request for the image resource.
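For comparison, the standard library can resolve scheme-less and relative image URLs against the page URL. The sketch below shows that general technique using net/url; it is not the package's implementation, and resolveImageURL is a hypothetical helper that, unlike EnforceURLSchema, also returns an error.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveImageURL is a hypothetical helper that resolves imageURL against
// pageURL, filling in a missing scheme or host from the page.
func resolveImageURL(pageURL, imageURL string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(imageURL)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/blog/post"
	sources := []string{
		"//cdn.example.com/a.png",    // scheme missing
		"/img/b.png",                 // scheme and host missing
		"http://other.example/d.png", // already absolute
	}
	for _, src := range sources {
		abs, err := resolveImageURL(page, src)
		if err != nil {
			fmt.Println("invalid URL:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```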

func DetermineThumbnail
func DetermineThumbnail(imageNodesWithInfo []*ImageNodeInfo) (*ImageNodeInfo, error)

DetermineThumbnail returns the *ImageNodeInfo for the best thumbnail from a []*ImageNodeInfo. The *ImageNodeInfo will also have the image itself in the Image property if ScrapeImages was set to true in the GetImageNodeInfoBatchOptions passed to GetImageNodeInfoBatch. An error is returned if the supplied []*ImageNodeInfo is empty.
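This README does not say how the best thumbnail is chosen. Purely as an assumption for illustration, the sketch below prefers images flagged as OpenGraph images and otherwise picks the largest pixel area; the package's real criteria may differ. candidate, better and pickThumbnail are hypothetical names. The error on empty input matches the documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// candidate is a hypothetical, trimmed-down stand-in for ImageNodeInfo.
type candidate struct {
	URL            string
	Width, Height  int
	OpenGraphImage bool
}

// better reports whether a should be preferred over b under the assumed
// heuristic: OpenGraph images first, then larger pixel area.
func better(a, b *candidate) bool {
	if a.OpenGraphImage != b.OpenGraphImage {
		return a.OpenGraphImage
	}
	return a.Width*a.Height > b.Width*b.Height
}

// pickThumbnail returns the preferred candidate, or an error when the
// slice is empty.
func pickThumbnail(cs []*candidate) (*candidate, error) {
	if len(cs) == 0 {
		return nil, errors.New("no images to choose from")
	}
	best := cs[0]
	for _, c := range cs[1:] {
		if better(c, best) {
			best = c
		}
	}
	return best, nil
}

func main() {
	images := []*candidate{
		{URL: "https://example.com/banner.png", Width: 1200, Height: 300},
		{URL: "https://example.com/og.png", Width: 600, Height: 315, OpenGraphImage: true},
		{URL: "https://example.com/icon.png", Width: 32, Height: 32},
	}
	best, err := pickThumbnail(images)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(best.URL)
}
```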

License

MIT License

Copyright (c) 2019 John Su

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
