document

package

Version: v0.0.0-...-c5d5a31 | Published: Nov 21, 2020 | License: Apache-2.0 | Imports: 17 | Imported by: 5

Repository

github.com/jivesearch/jivesearch

Documentation ¶

Overview ¶

Package document parses URLs and the HTML of a webpage.

Index ¶

Variables
func ExtractDomain(u *url.URL) (string, error)
func Languages(supported []language.Tag) []language.Tag
func ValidateURL(lnk string) (*url.URL, error)
type Content
type Document
- func New(lnk string) (*Document, error)
type ElasticSearch
type Policy

Constants ¶

This section is empty.

Variables ¶

var Matcher = language.NewMatcher(available) // globals...ugh!

Matcher is the package-level language matcher. It will need to change if language customization is worked out (see the note above in the source).

Functions ¶

func ExtractDomain ¶

func ExtractDomain(u *url.URL) (string, error)

ExtractDomain extracts the domain from a *url.URL, e.g. "example.com" from "https://www.example.com/path/somewhere".
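To make the input/output relationship concrete, here is a minimal standard-library sketch of the same idea. It is not the package's implementation: `naiveDomain` and `domainOf` are hypothetical names, and the "keep the last two labels" rule is a simplifying assumption that breaks for multi-label suffixes such as "co.uk".

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// naiveDomain is a hypothetical stand-in for ExtractDomain. It keeps the last
// two labels of the hostname, so it is only right for single-label suffixes
// such as "com" or "org"; "co.uk" hosts and IP addresses would need more care.
func naiveDomain(u *url.URL) (string, error) {
	host := u.Hostname() // Hostname drops any ":port" suffix
	labels := strings.Split(host, ".")
	if host == "" || len(labels) < 2 {
		return "", errors.New("no domain found in " + u.String())
	}
	return strings.Join(labels[len(labels)-2:], "."), nil
}

// domainOf parses a raw link and applies naiveDomain to it.
func domainOf(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return naiveDomain(u)
}

func main() {
	domain, err := domainOf("https://www.example.com/path/somewhere")
	fmt.Println(domain, err)
}
```

A production version would more likely consult a public suffix list so that hosts under suffixes like "co.uk" resolve correctly.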

func Languages ¶

func Languages(supported []language.Tag) []language.Tag

Languages will verify that languages are supported. An empty slice of supported languages implies that you support every available language. How do we make this configurable? If we crawl a document in a language we don't support, it goes to a matcher, which will just match the first supported language. Tricky. Once we are ready, look at the wikipedia package implementation.

func ValidateURL ¶

func ValidateURL(lnk string) (*url.URL, error)

ValidateURL validates a link and returns a *url.URL. Note: there seems to be a lot of overlap between this and handleLink().
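The page does not say which checks ValidateURL performs. As an illustration only, the sketch below shows one plausible set of rules for a crawler: the link must parse, use http or https, and name a host. `checkLink` is a hypothetical function and these rules are assumptions, not the package's documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// checkLink is a hypothetical validator, not the package's ValidateURL.
// Assumed rules: the link must parse, be absolute http(s), and have a host.
func checkLink(lnk string) (*url.URL, error) {
	u, err := url.Parse(lnk)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, errors.New("unsupported scheme: " + u.Scheme)
	}
	if u.Host == "" {
		return nil, errors.New("missing host")
	}
	return u, nil
}

func main() {
	links := []string{
		"https://www.example.com/a",
		"mailto:someone@example.com",
		"/relative/path",
	}
	for _, lnk := range links {
		u, err := checkLink(lnk)
		fmt.Println(lnk, "valid:", u != nil, "err:", err)
	}
}
```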

Types ¶

type Content ¶

type Content struct {
	StatusCode int `json:"status,omitempty"`

	Canonical   bool         `json:"canonical,omitempty"`
	Language    language.Tag `json:"-"`
	Date        string       `json:"date,omitempty"`
	Title       string       `json:"title,omitempty"`
	Keywords    string       `json:"keywords,omitempty"`
	Description string       `json:"description,omitempty"`
	Policy
	// contains filtered or unexported fields
}

Content is set from the response

type Document ¶

type Document struct {
	ID        string   `json:"id"` // store ID also as a field as sorting on document ID is not advised in Elasticsearch
	URL       *url.URL `json:"-"`
	Scheme    string   `json:"scheme,omitempty"`
	Host      string   `json:"host,omitempty"`       // not HostName()...we want the port for the robots.txt file
	Domain    string   `json:"domain,omitempty"`     // tld+1 -> example.com
	TLD       string   `json:"tld,omitempty"`        // com, org, uk, etc (we don't want co.uk just uk)
	PathParts string   `json:"path_parts,omitempty"` // https://api.example.com/path/to/something -> "path to something"
	Crawled   string   `json:"crawled,omitempty"`

	MIME string `json:"mime,omitempty"`

	Content
	// contains filtered or unexported fields
}

Document is the URL and parsed content of a page. Note: since we want just a couple of fields from *url.URL (Scheme, Host), we set those explicitly, which is much easier than writing a custom MarshalJSON method.
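The design choice described above (skip the *url.URL when marshalling and copy out the few fields that matter) can be sketched with a trimmed-down struct. `page`, `newPage`, and `toJSON` are hypothetical names; using the raw link as the ID and splitting the path on "/" to build PathParts are assumptions based on the field comments, not the package's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// page is a hypothetical, trimmed-down analogue of Document.
type page struct {
	ID        string   `json:"id"` // assumption: the raw link serves as the ID
	URL       *url.URL `json:"-"`  // skipped when marshalling
	Scheme    string   `json:"scheme,omitempty"`
	Host      string   `json:"host,omitempty"` // u.Host keeps the port
	PathParts string   `json:"path_parts,omitempty"`
}

// newPage copies the fields we want to store out of the parsed URL.
func newPage(lnk string) (*page, error) {
	u, err := url.Parse(lnk)
	if err != nil {
		return nil, err
	}
	// FieldsFunc drops the empty segments that leading, trailing,
	// or doubled slashes would otherwise produce.
	parts := strings.FieldsFunc(u.Path, func(r rune) bool { return r == '/' })
	return &page{
		ID:        lnk,
		URL:       u,
		Scheme:    u.Scheme,
		Host:      u.Host,
		PathParts: strings.Join(parts, " "),
	}, nil
}

// toJSON builds a page from a link and returns its JSON encoding.
func toJSON(lnk string) (string, error) {
	p, err := newPage(lnk)
	if err != nil {
		return "", err
	}
	b, err := json.Marshal(p)
	return string(b), err
}

func main() {
	out, err := toJSON("https://api.example.com:8443/path/to/something")
	fmt.Println(out, err)
}
```

Because the URL field is tagged `json:"-"`, the encoder never looks inside it, and the flat Scheme, Host, and PathParts fields carry everything that is stored.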

func New ¶

func New(lnk string) (*Document, error)

New creates a new Document from a link and validates the url

func (*Document) SchemeHost ¶

func (d *Document) SchemeHost() string

SchemeHost simply concatenates the Scheme, '://', and Host

func (*Document) SetCanonical ¶

func (d *Document) SetCanonical(ch chan string) *Document

SetCanonical sets Canonical to true if the Document's ID is the canonical URL

func (*Document) SetContent ¶

func (d *Document) SetContent(bot string, maxLinks int, links chan string, images chan *img.Image,
	truncateTitle, truncateKeywords, truncateDescription int) error

SetContent parses the HTML, sets the language, title, and description, extracts links, etc.

func (*Document) SetCrawled ¶

func (d *Document) SetCrawled(t time.Time) *Document

SetCrawled marks the date the doc was crawled

func (*Document) SetHeader ¶

func (d *Document) SetHeader(h http.Header) *Document

SetHeader sets the Document's header to the response header.

func (*Document) SetPolicyFromHeader ¶

func (d *Document) SetPolicyFromHeader(bot string) *Document

SetPolicyFromHeader sets the indexing and follow policy of a document from the response header. A specific bot directive should override a general robots directive, but this is still a TODO: the bot directive is not yet processed. We process the X-Robots-Tag header first, so we may not even get to the meta tag found in the HTML.

See https://developers.google.com/search/reference/robots_meta_tag and https://stackoverflow.com/a/18330818/776942 (see end of answer).
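As a rough illustration of reading a general robots directive from the X-Robots-Tag header, the sketch below treats "noindex" and "none" as forbidding indexing. `indexAllowed` and `allowedFor` are hypothetical helpers, not the package's code, and bot-specific values such as "googlebot: noindex" are simply skipped, mirroring the TODO noted above.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// indexAllowed is a hypothetical helper, not the package's implementation.
// It reports false when any X-Robots-Tag value carries a general "noindex"
// or "none" directive.
func indexAllowed(h http.Header) bool {
	for _, value := range h.Values("X-Robots-Tag") {
		if strings.Contains(value, ":") {
			// Values with a colon are bot-specific ("googlebot: noindex")
			// or dated ("unavailable_after: ..."); this sketch skips them.
			continue
		}
		for _, directive := range strings.Split(value, ",") {
			switch strings.ToLower(strings.TrimSpace(directive)) {
			case "noindex", "none":
				return false
			}
		}
	}
	return true
}

// allowedFor builds a header from raw X-Robots-Tag values and checks it.
func allowedFor(values ...string) bool {
	h := http.Header{}
	for _, v := range values {
		h.Add("X-Robots-Tag", v)
	}
	return indexAllowed(h)
}

func main() {
	fmt.Println(allowedFor())                     // no header at all
	fmt.Println(allowedFor("noindex, nofollow"))  // general directive
	fmt.Println(allowedFor("googlebot: noindex")) // bot-specific, skipped
}
```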

func (*Document) SetStatusCode ¶

func (d *Document) SetStatusCode(code int) *Document

SetStatusCode sets the HTTP status code.

func (*Document) SetTokenizer ¶

func (d *Document) SetTokenizer(b io.Reader) error

SetTokenizer sets the HTML tokenizer and MIME type from the response's body (UTF-8 encoded). It is the caller's responsibility to close the response body.
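One way to determine a MIME type from a body without consuming it is to peek at the first bytes and hand the buffered reader on to the parser. The sketch below uses only the standard library; `sniff` and `sniffString` are hypothetical helpers, and whether the package detects the type this way is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// sniff is a hypothetical helper. It peeks at up to 512 bytes of the body to
// guess the MIME type and returns a reader that still yields the full body.
// Closing the underlying body remains the caller's job.
func sniff(body io.Reader) (string, io.Reader, error) {
	br := bufio.NewReader(body)
	head, err := br.Peek(512)
	if err != nil && err != io.EOF { // io.EOF only means the body is short
		return "", nil, err
	}
	return http.DetectContentType(head), br, nil
}

// sniffString runs sniff on a string and reads back the remaining body.
func sniffString(s string) (string, string, error) {
	mime, r, err := sniff(strings.NewReader(s))
	if err != nil {
		return "", "", err
	}
	rest, err := io.ReadAll(r)
	return mime, string(rest), err
}

func main() {
	mime, rest, err := sniffString("<!DOCTYPE html><html><head><title>t</title></head></html>")
	fmt.Println(mime, len(rest), err)
}
```

Peek does not advance the reader, so an HTML tokenizer given the returned reader would still see the document from its first byte.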

type ElasticSearch ¶

type ElasticSearch struct {
	Client *elastic.Client
	Index  string
	Type   string
}

ElasticSearch holds connection and index settings.

func (*ElasticSearch) Analyzer ¶

func (e *ElasticSearch) Analyzer(lang language.Tag) (string, error)

Analyzer returns the appropriate analyzer for a given language.

func (*ElasticSearch) IndexName ¶

func (e *ElasticSearch) IndexName(a string) string

IndexName returns the language-specific index, e.g. "search-english" or "search-french".
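The examples suggest a base index name joined to a language or analyzer name with a hyphen. A minimal sketch under that assumption follows; `settings` and `indexName` are hypothetical, and the actual method may build the name differently.

```go
package main

import "fmt"

// settings is a hypothetical stand-in for the Index field of ElasticSearch.
type settings struct {
	Index string
}

// indexName joins the base index and an analyzer name with a hyphen.
// The separator is an assumption inferred from "search-english".
func (s settings) indexName(analyzer string) string {
	return s.Index + "-" + analyzer
}

func main() {
	s := settings{Index: "search"}
	fmt.Println(s.indexName("english"))
	fmt.Println(s.indexName("french"))
}
```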

func (*ElasticSearch) Setup ¶

func (e *ElasticSearch) Setup() error

Setup will create our main search index and language-specific indices for the content

type Policy ¶

type Policy struct {
	Index bool `json:"index,omitempty"` // are we allowed to index the page?
	// contains filtered or unexported fields
}

Policy tells us if we can index the content & store the links
