sauron

Version: v0.0.0-...-315fa5b
This package is not in the latest version of its module.
Published: Sep 1, 2020 | License: Apache-2.0 | Imports: 12 | Imported by: 1

README

Sauron

Sauron is an extensible page parser written in Go.

Sauron makes it easy to implement page or website parsers, with first-class support for common platforms and sites such as Reddit and YouTube.

Using

To use Sauron in your application, import github.com/TryStreambits/sauron. Then follow the documentation below or look at tests/ for example code.

Building

To compile, first ensure you have turned on Go Module support if you are working inside your GOPATH:

export GO111MODULE=on

Then run the following command to compile:

go build

License

Sauron is licensed under the Apache-2.0 license.

Documentation

Constants

const (
	// HostAlreadyRegistered is an error message for when host already has registered parser
	HostAlreadyRegistered = "Host already has a registered parser"

	// NoResponse is an error message for when we fail to get a response from a page. This may occur for timeouts.
	NoResponse = "No response from client to page"

	// PageContentNotValid is an error message for when the page requested is not HTML
	PageContentNotValid = "Page content provided is not valid HTML"

	// PageNotAccessible is an error message for when we get a non-200 status from a page
	PageNotAccessible = "Page not accessible"
)
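A minimal sketch of how calling code might react to these messages. It assumes the package returns errors whose text equals the constants, which the listing above does not confirm; the constants are copied locally so the sketch compiles on its own, and `describe`, `errRegistered`, and `errNotAccessible` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of two message constants from the listing above, so this
// sketch builds without importing the package.
const (
	HostAlreadyRegistered = "Host already has a registered parser"
	PageNotAccessible     = "Page not accessible"
)

// Hypothetical errors built from the constants. Assumption: the package
// constructs its errors from these exact strings.
var (
	errRegistered    = errors.New(HostAlreadyRegistered)
	errNotAccessible = errors.New(PageNotAccessible)
)

// describe maps an error to a short hint by comparing its message text
// against the known constants.
func describe(err error) string {
	if err == nil {
		return "ok"
	}
	switch err.Error() {
	case HostAlreadyRegistered:
		return "parser already registered; consider ForceRegister"
	case PageNotAccessible:
		return "page returned a non-200 status"
	default:
		return "unexpected error: " + err.Error()
	}
}

func main() {
	fmt.Println(describe(errRegistered))
	fmt.Println(describe(errNotAccessible))
	fmt.Println(describe(nil))
}
```

Comparing message text is brittle if the wording ever changes, so treat this as a stopgap for a package that exposes messages rather than sentinel error values.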

Variables

var ChannelRequestJSON string

var ClipRequestJSON string

var HasOverriddenInternals map[string]bool

HasOverriddenInternals is a map of our internal parsers to whether they have been overridden

var HostToParsers map[string]LinkParser

HostToParsers is our map of hostnames to custom parsers

var MetaImageNames []string

MetaImageNames is an array of meta names commonly associated with site images

var RequestLanguage string

RequestLanguage is the desired language to request a page with. Defaults to en-US / en

var UserAgent string

UserAgent is the desired User Agent to report to a page via request. Defaults to Sauron Bot $VERSION (e.g. Sauron Bot 0.1)

var YoutubeQueriesToExtras map[string]string

YoutubeQueriesToExtras is a map of YouTube URL query info to extra metadata

Functions

func ForceRegister

func ForceRegister(hostName string, parser LinkParser) error

ForceRegister will force register a LinkParser against the provided hostname. This is identical to calling Unregister then Register.

func HasOverridden

func HasOverridden(host string) (overridden bool)

HasOverridden will check if the internal parser for the provided host has been overridden

func NewHTTPClient

func NewHTTPClient(u *url.URL) (client http.Client, request http.Request)

NewHTTPClient will create a new request-specific client, with our defined user agent, for the purposes of page fetching. If successful, it will return both the client and the request for use.
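The idea of a request-specific client can be sketched with the standard library alone. This is not the package's NewHTTPClient: `newClientAndRequest`, `fromRawURL`, the lowercase `userAgent` and `requestLanguage` variables, and the 10-second timeout are all assumptions made for illustration. Unlike the documented signature, the sketch returns pointers and an error.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// Hypothetical stand-ins for the package-level UserAgent and
// RequestLanguage variables documented above.
var (
	userAgent       = "Sauron Bot 0.1"
	requestLanguage = "en-US"
)

// newClientAndRequest builds a GET request for u carrying the configured
// User-Agent and Accept-Language headers, plus a client to send it with.
func newClientAndRequest(u *url.URL) (*http.Client, *http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	req.Header.Set("Accept-Language", requestLanguage)

	// The timeout value is an assumption, chosen only so that a slow
	// page cannot block the caller indefinitely.
	client := &http.Client{Timeout: 10 * time.Second}
	return client, req, nil
}

// fromRawURL parses raw and delegates to newClientAndRequest.
func fromRawURL(raw string) (*http.Client, *http.Request, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, nil, err
	}
	return newClientAndRequest(u)
}

func main() {
	client, req, err := fromRawURL("https://example.com/some/page")
	if err != nil {
		panic(err)
	}
	// No network call is made here; the request is only inspected.
	fmt.Println(req.Method, req.URL.Host, req.Header.Get("User-Agent"), client.Timeout)
}
```

A caller would then pass the request to `client.Do(req)` and check the status code before handing the body to a parser.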

func Register

func Register(hostName string, parser LinkParser) (regErr error)

Register will attempt to register the provided parser for a specific hostname. Hostname can be an exact match, such as "google.com", or a regex. Attempting to register when a LinkParser is already associated will return an error.
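How a hostname that is "an exact match or regex" could be tested against a request host is sketched below. The listing does not say how Sauron performs the match, so `matchHost` and its exact-first, anchored-regex-second order are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchHost reports whether host satisfies pattern. It tries a plain
// string comparison first and then treats pattern as a regular
// expression anchored to the whole host. A pattern that fails to
// compile matches nothing.
func matchHost(pattern, host string) bool {
	if pattern == host {
		return true
	}
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return false
	}
	return re.MatchString(host)
}

func main() {
	fmt.Println(matchHost("google.com", "google.com"))
	fmt.Println(matchHost(`(www\.)?reddit\.com`, "www.reddit.com"))
	fmt.Println(matchHost(`(www\.)?reddit\.com`, "example.com"))
}
```

Under this scheme an unescaped dot in a plain name acts as a regex wildcard, so "google.com" would also match "googlexcom"; `regexp.QuoteMeta` is one way to avoid that when a literal name is intended.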

func SetRequestLanguage

func SetRequestLanguage(lang string) error

SetRequestLanguage will set the Accept-Language header for page requests. This does not necessarily mean the page supports the language or will return with that language.

func SetUserAgent

func SetUserAgent(agent string) error

func Unregister

func Unregister(hostName string)

Unregister will unregister the LinkParser associated with the specified hostname
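The documented relationship between Register, ForceRegister, and Unregister can be sketched with a plain map. Everything here is a simplified stand-in: `parserFunc` drops the goquery document and URL parameters of the real LinkParser so the sketch needs only the standard library, and `parsers`, `register`, `forceRegister`, `unregister`, and `titled` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Link is a trimmed stand-in for the package's Link type.
type Link struct {
	Host, Title string
}

// parserFunc is a simplified stand-in for LinkParser.
type parserFunc func(fullURL string) (*Link, error)

// parsers plays the role of HostToParsers in this sketch.
var parsers = map[string]parserFunc{}

// errAlreadyRegistered uses the HostAlreadyRegistered message text.
var errAlreadyRegistered = errors.New("Host already has a registered parser")

// register refuses to replace an existing parser for host.
func register(host string, p parserFunc) error {
	if _, ok := parsers[host]; ok {
		return errAlreadyRegistered
	}
	parsers[host] = p
	return nil
}

// unregister removes any parser for host; a missing host is a no-op.
func unregister(host string) {
	delete(parsers, host)
}

// forceRegister is unregister followed by register, as documented.
func forceRegister(host string, p parserFunc) error {
	unregister(host)
	return register(host, p)
}

// titled returns a toy parser that always reports the given title.
func titled(title string) parserFunc {
	return func(fullURL string) (*Link, error) {
		return &Link{Title: title}, nil
	}
}

func main() {
	fmt.Println(register("example.com", titled("first")))
	fmt.Println(register("example.com", titled("second")))
	fmt.Println(forceRegister("example.com", titled("second")))

	link, err := parsers["example.com"]("https://example.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println(link.Title)
}
```

The map in this sketch is not guarded by a mutex, so registrations would need to happen before any concurrent lookups.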

Types

type Link

type Link struct {
	Description, Favicon, Host, Image, Title, URI string

	// Extras is our extra metadata.
	// This may be used by internal and external parsers to communicate additional information about the URL in question
	Extras map[string]string
}

Link is our structured information about a URL, as returned by Sauron's parsers

func GetLink

func GetLink(urlPath string) (link *Link, parseErr error)

GetLink will get the link information for the provided URL

func Primitive

func Primitive(doc *goquery.Document, url *url.URL, fullURL string) (link *Link, parserErr error)

Primitive is our primitive parser. This parser will get standard page information from the most commonly supported DOM elements.

func Reddit

func Reddit(doc *goquery.Document, url *url.URL, fullURL string) (link *Link, parserErr error)

Reddit is our internal Reddit parser. This parser will get page information as well as Reddit post information such as dislikes, likes, and overall score.

func Twitch

func Twitch(_doc *goquery.Document, url *url.URL, fullURL string) (link *Link, parserErr error)

Twitch is our internal Twitch parser. This parser will leverage Twitch's GQL (used during info fetching for page content generation) to get various JSON data for the request.

func Youtube

func Youtube(doc *goquery.Document, url *url.URL, fullURL string) (link *Link, parserErr error)

Youtube is our internal YouTube parser. This parser will get page information as well as add extra metadata for various shorteners and form factors.

type LinkParser

type LinkParser func(*goquery.Document, *url.URL, string) (*Link, error)

LinkParser is a function which takes in a parsed document, a URL struct, and the full URL as a string, and returns a pointer to a Link or an error

type TwitchGqlBroadcastSettings

type TwitchGqlBroadcastSettings struct {
	ID       string        `json:"id"`
	Language string        `json:"language"`
	Game     TwitchGqlGame `json:"game,omitempty"`
	Title    string        `json:"title"`
}

TwitchGqlBroadcastSettings is various broadcast settings

type TwitchGqlChannelResponse

type TwitchGqlChannelResponse struct {
	Data TwitchGqlData `json:"data"`
}

TwitchGqlChannelResponse is part of the possible GQL response we get from the Twitch endpoint if it is a channel

type TwitchGqlClipData

type TwitchGqlClipData struct {
	Broadcaster TwitchGqlUser `json:"broadcaster"`
	Game        TwitchGqlGame `json:"game,omitempty"`
	Slug        string        `json:"slug"`
	Title       string        `json:"title"`
}

type TwitchGqlClipResponse

type TwitchGqlClipResponse struct {
	Data TwitchGqlClipRoot `json:"data"`
}

type TwitchGqlClipRoot

type TwitchGqlClipRoot struct {
	Clip TwitchGqlClipData `json:"clip"`
}

type TwitchGqlData

type TwitchGqlData struct {
	CurrentUser string          `json:"currentUser,omitempty"`
	Stream      TwitchGqlStream `json:"stream,omitempty"`
	User        TwitchGqlUser   `json:"user"`
}

TwitchGqlData is some of the possible GQL data

type TwitchGqlGame

type TwitchGqlGame struct {
	ID          string `json:"id"`
	BoxArtURL   string `json:"boxArtURL"`
	DisplayName string `json:"displayName"`
	Name        string `json:"name"`
	TypeName    string `json:"__typename"`
}

TwitchGqlGame is various game related settings

type TwitchGqlRoles

type TwitchGqlRoles struct {
	IsAffiliate bool   `json:"isAffiliate"`
	IsPartner   bool   `json:"isPartner"`
	IsStaff     bool   `json:"isStaff,omitempty"`
	TypeName    string `json:"__typename"`
}

type TwitchGqlStream

type TwitchGqlStream struct {
	Type string `json:"type,omitempty"`
}

type TwitchGqlUser

type TwitchGqlUser struct {
	ID                    string                     `json:"id"`
	BroadcastSettings     TwitchGqlBroadcastSettings `json:"broadcastSettings"`
	DisplayName           string                     `json:"displayName"`
	Login                 string                     `json:"login"`
	ProfileImageURL       string                     `json:"profileImageURL"`
	MediumProfileImageURL string                     `json:"medProfileImageUrl"`
	Roles                 TwitchGqlRoles             `json:"roles"`
}

TwitchGqlUser is various user data from GQL
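A short sketch of how the clip response types line up with a JSON payload, using `encoding/json`. The struct definitions are trimmed local copies of the types listed above, `decodeClip` is a hypothetical helper, and `samplePayload` is invented for illustration rather than captured from Twitch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed local copies of the listed types; fields not needed for the
// example are left out.
type TwitchGqlGame struct {
	ID          string `json:"id"`
	DisplayName string `json:"displayName"`
	Name        string `json:"name"`
}

type TwitchGqlUser struct {
	ID          string `json:"id"`
	DisplayName string `json:"displayName"`
	Login       string `json:"login"`
}

type TwitchGqlClipData struct {
	Broadcaster TwitchGqlUser `json:"broadcaster"`
	Game        TwitchGqlGame `json:"game,omitempty"`
	Slug        string        `json:"slug"`
	Title       string        `json:"title"`
}

type TwitchGqlClipRoot struct {
	Clip TwitchGqlClipData `json:"clip"`
}

type TwitchGqlClipResponse struct {
	Data TwitchGqlClipRoot `json:"data"`
}

// samplePayload is made up for this sketch; only its shape follows the
// struct tags above.
const samplePayload = `{"data":{"clip":{
	"broadcaster":{"id":"1","displayName":"ExampleUser","login":"exampleuser"},
	"game":{"id":"2","displayName":"Example Game","name":"example game"},
	"slug":"ExampleSlug","title":"An example clip"}}}`

// decodeClip unmarshals a clip response and returns the nested clip data.
func decodeClip(raw []byte) (TwitchGqlClipData, error) {
	var resp TwitchGqlClipResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return TwitchGqlClipData{}, err
	}
	return resp.Data.Clip, nil
}

func main() {
	clip, err := decodeClip([]byte(samplePayload))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s by %s (%s)\n", clip.Title, clip.Broadcaster.DisplayName, clip.Game.DisplayName)
}
```

When decoding, a missing "game" object simply leaves the Game field at its zero value, so callers may want to check for an empty ID before using it.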
