document

package
v0.0.0-...-c5d5a31 Latest
Published: Nov 21, 2020 License: Apache-2.0 Imports: 17 Imported by: 5

Documentation

Overview

Package document parses URLs and the HTML of a webpage.

Constants

This section is empty.

Variables

var Matcher = language.NewMatcher(available) // globals...ugh!

Matcher is a package-level language matcher. It will need to change if we figure out language customization (see the note on Languages).

Functions

func ExtractDomain

func ExtractDomain(u *url.URL) (string, error)

ExtractDomain extracts the domain from a *url.URL, e.g. "example.com" from "https://www.example.com/path/somewhere".
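As a rough illustration of the idea, the sketch below keeps the last two labels of the hostname. This is an assumption made for the example, not the package's implementation: the names naiveDomain and domainOf are hypothetical, and the approach is only right for single-label public suffixes such as "com" (it gives the wrong answer for "co.uk" hosts and for IP addresses).

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// naiveDomain is a hypothetical stand-in for ExtractDomain. It keeps the last
// two labels of the hostname, which is only correct when the public suffix is
// a single label ("com", "org"), not for suffixes like "co.uk".
func naiveDomain(u *url.URL) (string, error) {
	host := strings.ToLower(u.Hostname()) // Hostname drops any port
	if host == "" {
		return "", errors.New("url has no host")
	}
	labels := strings.Split(host, ".")
	if len(labels) < 2 {
		return "", fmt.Errorf("host %q has no registrable domain", host)
	}
	return strings.Join(labels[len(labels)-2:], "."), nil
}

// domainOf parses raw and applies naiveDomain; a convenience for the example.
func domainOf(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return naiveDomain(u)
}

func main() {
	d, err := domainOf("https://www.example.com/path/somewhere")
	fmt.Println(d, err)
}
```

A production version would more likely consult a public suffix list so that multi-label suffixes are handled correctly.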

func Languages

func Languages(supported []language.Tag) []language.Tag

Languages will (eventually) verify that the given languages are supported. An empty slice of supported languages implies that every available language is supported. Open question: how do we make this configurable? If we crawl a document in a language we don't support, it goes to a matcher, which will just match the first supported language. Tricky. Once we are ready, look at the wikipedia package implementation.

func ValidateURL

func ValidateURL(lnk string) (*url.URL, error)

ValidateURL validates a link and returns a *url.URL. Note: there seems to be a lot of overlap between this and handleLink().
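The page does not say which checks ValidateURL performs, so the following is only a sketch of what link validation for a crawler might look like. The function name validateLink and every rule in it (http/https only, non-empty host, fragment dropped) are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// validateLink is a hypothetical validator. The rules below are assumptions,
// not the documented behavior of ValidateURL.
func validateLink(lnk string) (*url.URL, error) {
	u, err := url.Parse(strings.TrimSpace(lnk))
	if err != nil {
		return nil, err
	}
	// Only absolute web links are crawlable in this sketch.
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return nil, errors.New("missing host")
	}
	// A fragment never reaches the server, so drop it to avoid duplicates.
	u.Fragment = ""
	u.RawFragment = ""
	return u, nil
}

func main() {
	for _, lnk := range []string{
		"https://example.com/a#section",
		"mailto:someone@example.com",
		"/relative/path",
	} {
		u, err := validateLink(lnk)
		fmt.Println(u, err)
	}
}
```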

Types

type Content

type Content struct {
	StatusCode int `json:"status,omitempty"`

	Canonical   bool         `json:"canonical,omitempty"`
	Language    language.Tag `json:"-"`
	Date        string       `json:"date,omitempty"`
	Title       string       `json:"title,omitempty"`
	Keywords    string       `json:"keywords,omitempty"`
	Description string       `json:"description,omitempty"`
	Policy
	// contains filtered or unexported fields
}

Content is set from the response.

type Document

type Document struct {
	ID        string   `json:"id"` // store ID also as a field as sorting on document ID is not advised in Elasticsearch
	URL       *url.URL `json:"-"`
	Scheme    string   `json:"scheme,omitempty"`
	Host      string   `json:"host,omitempty"`       // not HostName()...we want the port for the robots.txt file
	Domain    string   `json:"domain,omitempty"`     // tld+1 -> example.com
	TLD       string   `json:"tld,omitempty"`        // com, org, uk, etc (we don't want co.uk just uk)
	PathParts string   `json:"path_parts,omitempty"` // https://api.example.com/path/to/something -> "path to something"
	Crawled   string   `json:"crawled,omitempty"`

	MIME string `json:"mime,omitempty"`

	Content
	// contains filtered or unexported fields
}

Document is the URL and parsed content of the page. Note: since we want just a couple of fields from *url.URL (Scheme, Host), we set those explicitly. This is much easier than a custom MarshalJSON method.
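The design choice described above (copy a few URL fields into plain strings and hide the *url.URL from JSON) can be sketched as follows. The doc type is a hypothetical, trimmed-down mirror of Document; using the URL string as the ID and splitting the path on "/" for PathParts are assumptions based on the field comments, not confirmed behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// doc is a hypothetical, trimmed-down mirror of Document.
type doc struct {
	ID        string   `json:"id"`
	URL       *url.URL `json:"-"` // kept for internal use, never serialized
	Scheme    string   `json:"scheme,omitempty"`
	Host      string   `json:"host,omitempty"`
	PathParts string   `json:"path_parts,omitempty"`
}

// newDoc copies the fields we want out of the parsed URL.
func newDoc(raw string) (*doc, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	parts := strings.FieldsFunc(u.Path, func(r rune) bool { return r == '/' })
	return &doc{
		ID:        u.String(), // assumption: the ID is the URL itself
		URL:       u,
		Scheme:    u.Scheme,
		Host:      u.Host, // Host, not Hostname(), so the port is preserved
		PathParts: strings.Join(parts, " "),
	}, nil
}

// schemeHost mirrors the documented SchemeHost: Scheme + "://" + Host.
func (d *doc) schemeHost() string { return d.Scheme + "://" + d.Host }

// toJSON marshals the document; the URL field is skipped via its tag.
func toJSON(d *doc) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	d, err := newDoc("https://api.example.com:8443/path/to/something")
	if err != nil {
		fmt.Println(err)
		return
	}
	out, _ := toJSON(d)
	fmt.Println(d.schemeHost())
	fmt.Println(out)
}
```

Because the copied fields are ordinary strings with struct tags, the default encoder produces the desired JSON and no MarshalJSON method is needed.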

func New

func New(lnk string) (*Document, error)

New creates a new Document from a link and validates the URL.

func (*Document) SchemeHost

func (d *Document) SchemeHost() string

SchemeHost concatenates the Scheme, "://", and Host.

func (*Document) SetCanonical

func (d *Document) SetCanonical(ch chan string) *Document

SetCanonical sets Canonical to true if the Document's ID is the canonical URL

func (*Document) SetContent

func (d *Document) SetContent(bot string, maxLinks int, links chan string, images chan *img.Image,
	truncateTitle, truncateKeywords, truncateDescription int) error

SetContent parses the HTML, sets the language, title, description, etc., and extracts links and images.

func (*Document) SetCrawled

func (d *Document) SetCrawled(t time.Time) *Document

SetCrawled records the date the document was crawled.

func (*Document) SetHeader

func (d *Document) SetHeader(h http.Header) *Document

SetHeader sets the Document's header to the response header.

func (*Document) SetPolicyFromHeader

func (d *Document) SetPolicyFromHeader(bot string) *Document

SetPolicyFromHeader sets the indexing and follow policy of a document from the response header. A bot-specific directive should override a general robots directive, but this is still a TODO. The X-Robots-Tag header is processed first, so we may never get to the meta tag found in the HTML.

See https://developers.google.com/search/reference/robots_meta_tag and https://stackoverflow.com/a/18330818/776942 (end of the answer).

TODO: Process the bot directive.
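A minimal sketch of reading general X-Robots-Tag directives is shown below. The policy type, its Follow field, and both function names are hypothetical. Like the method described above, the sketch does not handle bot-specific directives: a value such as "otherbot: noindex" simply fails the exact-match test and is ignored.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// policy is a hypothetical mirror of Policy; Follow is an assumed field name.
type policy struct {
	Index  bool
	Follow bool
}

// policyFromValues applies general robots directives. Anything it does not
// recognize, including bot-prefixed values, leaves the policy unchanged.
func policyFromValues(values []string) policy {
	p := policy{Index: true, Follow: true} // permissive by default
	for _, v := range values {
		for _, tok := range strings.Split(v, ",") {
			switch strings.ToLower(strings.TrimSpace(tok)) {
			case "noindex":
				p.Index = false
			case "nofollow":
				p.Follow = false
			case "none": // shorthand for noindex, nofollow
				p.Index, p.Follow = false, false
			}
		}
	}
	return p
}

// policyFromHeader reads every X-Robots-Tag value from the response header.
func policyFromHeader(h http.Header) policy {
	return policyFromValues(h.Values("X-Robots-Tag"))
}

func main() {
	h := http.Header{}
	h.Add("X-Robots-Tag", "noindex, nofollow")
	fmt.Printf("%+v\n", policyFromHeader(h))
}
```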

func (*Document) SetStatusCode

func (d *Document) SetStatusCode(code int) *Document

SetStatusCode sets the HTTP status code.

func (*Document) SetTokenizer

func (d *Document) SetTokenizer(b io.Reader) error

SetTokenizer sets the HTML tokenizer and MIME type from the response's body (UTF-8 encoded). It is the caller's responsibility to close the response body.
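One way to determine a MIME type from a body without consuming it is to peek at the first bytes through a buffered reader. The sketch below shows only that sniffing step using the standard library; whether SetTokenizer works this way is an assumption, and the names sniffMIME and sniffString are hypothetical. The returned reader could then be handed to an HTML tokenizer.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// sniffMIME peeks at up to 512 bytes (the most http.DetectContentType
// considers) and returns the detected type plus a reader that still yields
// the whole body. Closing the underlying body remains the caller's job.
func sniffMIME(r io.Reader) (string, *bufio.Reader, error) {
	br := bufio.NewReader(r)
	head, err := br.Peek(512)
	if err != nil && err != io.EOF { // io.EOF just means a short body
		return "", nil, err
	}
	return http.DetectContentType(head), br, nil
}

// sniffString is a helper for the example: it sniffs s and then reads the
// full body back to show that peeking consumed nothing.
func sniffString(s string) (mime string, body string, err error) {
	mime, br, err := sniffMIME(strings.NewReader(s))
	if err != nil {
		return "", "", err
	}
	b, err := io.ReadAll(br)
	return mime, string(b), err
}

func main() {
	mime, body, err := sniffString("<!DOCTYPE html><html><head><title>t</title></head></html>")
	fmt.Println(mime, len(body), err)
}
```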

type ElasticSearch

type ElasticSearch struct {
	Client *elastic.Client
	Index  string
	Type   string
}

ElasticSearch holds the connection and index settings.

func (*ElasticSearch) Analyzer

func (e *ElasticSearch) Analyzer(lang language.Tag) (string, error)

Analyzer returns the appropriate analyzer for a given language.

func (*ElasticSearch) IndexName

func (e *ElasticSearch) IndexName(a string) string

IndexName returns the language-specific index, e.g. "search-english" or "search-french".
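The naming scheme in the examples above suggests a base index name joined to an analyzer name with a hyphen. The sketch below assumes exactly that. The functions analyzerFor and indexName, the use of plain language codes instead of language.Tag, and the contents of the lookup table are all hypothetical; "english", "french", and "german" are names of built-in Elasticsearch language analyzers.

```go
package main

import "fmt"

// analyzers maps a base language code to an analyzer name. The table is an
// illustrative assumption, not the package's actual mapping.
var analyzers = map[string]string{
	"en": "english",
	"fr": "french",
	"de": "german",
}

// analyzerFor returns the analyzer for a language code, or an error when the
// language has no entry in the table.
func analyzerFor(lang string) (string, error) {
	a, ok := analyzers[lang]
	if !ok {
		return "", fmt.Errorf("no analyzer for language %q", lang)
	}
	return a, nil
}

// indexName joins the base index and the analyzer, e.g. "search-french".
func indexName(base, analyzer string) string {
	return base + "-" + analyzer
}

func main() {
	a, err := analyzerFor("fr")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(indexName("search", a))
}
```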

func (*ElasticSearch) Setup

func (e *ElasticSearch) Setup() error

Setup creates the main search index and the language-specific indices for the content.

type Policy

type Policy struct {
	Index bool `json:"index,omitempty"` // are we allowed to index the page?
	// contains filtered or unexported fields
}

Policy tells us whether we can index the content and store the links.
