scraping

package
Version: v0.0.0-...-3d7a921
Warning

This package is not in the latest version of its module.

Published: Dec 23, 2022 | License: Apache-2.0 | Imports: 14 | Imported by: 0

Documentation

Constants

const (
	SoftHyphen = "\u00ad"
)
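
The package does not document how SoftHyphen is used. A plausible use, given that the package extracts text from HTML, is removing soft hyphens (U+00AD) from scraped text. The sketch below shows that idea only; stripSoftHyphens is a hypothetical helper and not part of this package.

```go
package main

import "strings"

// SoftHyphen mirrors the constant documented above (U+00AD).
const SoftHyphen = "\u00ad"

// stripSoftHyphens is a hypothetical helper, not part of the package.
// It removes every soft hyphen from s, so words that a page split
// with invisible hyphenation hints come back as single tokens.
func stripSoftHyphens(s string) string {
	return strings.ReplaceAll(s, SoftHyphen, "")
}
```

Calling stripSoftHyphens("hy\u00adphen\u00adation") returns "hyphenation".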

Variables

This section is empty.

Functions

func HostToDomain

func HostToDomain(host string) string

HostToDomain gets the domain from the hostname.

func ReadFile

func ReadFile(input string) ([]api.Site, error)

ReadFile reads the JSONL file at input and returns the list of sites it contains.
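
JSONL stores one JSON object per line. The sketch below shows one way such data could be decoded with the standard library. It is not the package's implementation: the site type and its fields are hypothetical stand-ins for api.Site, whose definition is not shown here, and parseJSONL works on bytes already in memory (for example, the result of os.ReadFile).

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
)

// site is a hypothetical stand-in for api.Site; the real fields
// are not documented on this page.
type site struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}

// parseJSONL decodes one JSON object per line and skips blank lines.
// Note: bufio.Scanner rejects lines longer than 64 KiB by default.
func parseJSONL(data []byte) ([]site, error) {
	var sites []site
	scanner := bufio.NewScanner(bytes.NewReader(data))
	line := 0
	for scanner.Scan() {
		line++
		raw := bytes.TrimSpace(scanner.Bytes())
		if len(raw) == 0 {
			continue
		}
		var s site
		if err := json.Unmarshal(raw, &s); err != nil {
			return nil, fmt.Errorf("line %d: %w", line, err)
		}
		sites = append(sites, s)
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return sites, nil
}
```

Reporting the line number in the error makes a malformed record easy to locate in a large file.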

func TextFromHtml

func TextFromHtml(reader io.Reader) string

TextFromHtml returns all the text found in the HTML read from reader. Based on: https://stackoverflow.com/questions/44441665/how-to-extract-only-text-from-html-in-golang

func URLToDomain

func URLToDomain(u string) string

URLToDomain normalizes the URL to a domain. Returns empty string on error.
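
The page does not say what "normalizes" involves. As a rough illustration of the documented contract (a string in, a host-like string out, empty on error), the sketch below uses net/url from the standard library. urlToHost is a hypothetical name, and lowercasing the host and dropping the port are assumptions; the real function may reduce the host further, for instance to a registrable domain.

```go
package main

import (
	"net/url"
	"strings"
)

// urlToHost is a hypothetical sketch, not the package's URLToDomain.
// It parses raw, drops the port, lowercases the host, and returns
// the empty string if parsing fails. A scheme-less input such as
// "example.com/path" parses as a path, so it also yields "".
func urlToHost(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	return strings.ToLower(u.Hostname())
}
```

For example, urlToHost("https://Example.COM:8080/path?q=1") returns "example.com".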

func WriteFile

func WriteFile(outFile string, sites []api.Site) error

WriteFile writes the sites to outFile as JSONL.
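
Writing JSONL amounts to encoding each value as a single JSON line. The sketch below is one way to do that with encoding/json, not the package's implementation. The site type is a hypothetical stand-in for api.Site, and encodeJSONL returns bytes that a caller could pass to os.WriteFile.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// site is a hypothetical stand-in for api.Site; the real fields
// are not documented on this page.
type site struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}

// encodeJSONL serializes each site as one JSON object per line.
// json.Encoder.Encode appends a newline after every value, which
// is exactly the JSONL record separator.
func encodeJSONL(sites []site) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for _, s := range sites {
		if err := enc.Encode(s); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}
```

Using a single json.Encoder avoids building the newline handling by hand and keeps each record on its own line.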

Types

type Scraper

type Scraper struct {
	OutputDir string
	Log       logr.Logger
	Force     bool
}

Scraper scrapes a set of sites.

func (*Scraper) Scrape

func (s *Scraper) Scrape(site api.Site) error

Scrape scrapes a single site.

func (*Scraper) Sites

func (s *Scraper) Sites(sites []api.Site) error
