flyscrape

package module

v0.8.1 · Published: Feb 26, 2024 · License: MPL-2.0

README



Flyscrape is a command-line web scraping tool designed for those without
advanced programming skills, enabling precise extraction of website data.


Installation · Documentation · Releases

Demo

Features

  • Standalone: Flyscrape comes as a single binary executable.
  • jQuery-like: Extract data from HTML pages with a familiar API.
  • Scriptable: Use JavaScript to write your data extraction logic.
  • System Cookies: Give Flyscrape access to your browser's cookie store.
  • Browser Mode: Render JavaScript-heavy pages using a headless browser.

Overview

Example

This example scrapes the first few pages from Hacker News, specifically the New, Show, and Ask sections.

export const config = {
    urls: [
        "https://news.ycombinator.com/new",
        "https://news.ycombinator.com/show",
        "https://news.ycombinator.com/ask",
    ],

    // Cache requests for later.
    cache: "file",

    // Enable JavaScript rendering.
    browser: true,
    headless: false,

    // Follow pagination 5 times.
    depth: 5,
    follow: ["a.morelink[href]"],
}

export default function ({ doc, absoluteURL }) {
    const title = doc.find("title");
    const posts = doc.find(".athing");

    return {
        title: title.text(),
        posts: posts.map((post) => {
            const link = post.find(".titleline > a");

            return {
                title: link.text(),
                url: absoluteURL(link.attr("href")),
            };
        }),
    }
}

$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/new",
    "data": {
      "title": "New Links | Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - An standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  }
]

Check out the examples folder for more detailed examples.

Installation

The easiest way to install flyscrape is via its install script.

curl -fsSL https://flyscrape.com/install | bash

Homebrew

For macOS users, flyscrape is also available via Homebrew:

brew install flyscrape

Pre-compiled binary

flyscrape is available for macOS, Linux, and Windows as a downloadable binary from the releases page.

Compile from source

To compile flyscrape from source, follow these steps:

  1. Install Go: Make sure you have Go installed on your system. If not, you can download it from https://go.dev/.

  2. Install flyscrape: Open a terminal and run the following command:

    go install github.com/philippta/flyscrape/cmd/flyscrape@latest
    

Usage

Usage:

    flyscrape run SCRIPT [config flags]

Examples:

    # Run the script.
    $ flyscrape run example.js

    # Set the URL as argument.
    $ flyscrape run example.js --url "http://other.com"

    # Enable proxy support.
    $ flyscrape run example.js --proxies "http://someproxy:8043"

    # Follow paginated links.
    $ flyscrape run example.js --depth 5 --follow ".next-button > a"

    # Set the output format to ndjson.
    $ flyscrape run example.js --output.format ndjson

    # Write the output to a file.
    $ flyscrape run example.js --output.file results.json

Configuration

Below is an example scraping script that showcases the capabilities of flyscrape. For a full documentation of all configuration options, visit the documentation page.

export const config = {
    // Specify the URL to start scraping from.
    url: "https://example.com/",

    // Specify multiple URLs to start scraping from.       (default = [])
    urls: [                          
        "https://anothersite.com/",
        "https://yetanother.com/",
    ],

    // Enable rendering with headless browser.             (default = false)
    browser: true,

    // Specify whether the browser should run headless.    (default = true)
    headless: false,

    // Specify how deep links should be followed.          (default = 0, no follow)
    depth: 5,                        

    // Specify the CSS selectors to follow.                (default = ["a[href]"])
    follow: [".next > a", ".related a"],                      
 
    // Specify the allowed domains. ['*'] for all.         (default = domain from url)
    allowedDomains: ["example.com", "anothersite.com"],              
 
    // Specify the blocked domains.                        (default = none)
    blockedDomains: ["somesite.com"],              

    // Specify the allowed URLs as regex.                  (default = all allowed)
    allowedURLs: ["/posts", "/articles/\d+"],                 
 
    // Specify the blocked URLs as regex.                  (default = none)
    blockedURLs: ["/admin"],                 
   
    // Specify the rate in requests per minute.            (default = no rate limit)
    rate: 60,                       

    // Specify the number of concurrent requests.          (default = no limit)
    concurrency: 1,                       

    // Specify a single HTTP(S) proxy URL.                 (default = no proxy)
    // Note: Not compatible with browser mode.
    proxy: "http://someproxy.com:8043",

    // Specify multiple HTTP(S) proxy URLs.                (default = no proxy)
    // Note: Not compatible with browser mode.
    proxies: [
      "http://someproxy.com:8043",
      "http://someotherproxy.com:8043",
    ],                     

    // Enable file-based request caching.                  (default = no cache)
    cache: "file",                   

    // Specify the HTTP request headers.                   (default = none)
    headers: {                       
        "Authorization": "Bearer ...",
        "User-Agent": "Mozilla ...",
    },

    // Use the cookie store of your local browser.         (default = off)
    // Options: "chrome" | "edge" | "firefox"
    cookies: "chrome",

    // Specify the output options.
    output: {
        // Specify the output file.                        (default = stdout)
        file: "results.json",
        
        // Specify the output format.                      (default = json)
        // Options: "json" | "ndjson"
        format: "json",
    },
};

export default function ({ doc, url, absoluteURL }) {
    // doc              - Contains the parsed HTML document
    // url              - Contains the scraped URL
    // absoluteURL(...) - Transforms relative URLs into absolute URLs
}

Query API

// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text()                                 // "Hey"
el.html()                                 // `<div class="element" foo="bar">Hey</div>`
el.attr("foo")                            // "bar"
el.hasAttr("foo")                         // true
el.hasClass("element")                    // true

// <ul>
//   <li class="a">Item 1</li>
//   <li>Item 2</li>
//   <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children()                           // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

const items = list.find("li")
items.length()                            // 3
items.first()                             // <li>Item 1</li>
items.last()                              // <li>Item 3</li>
items.get(1)                              // <li>Item 2</li>
items.get(1).prev()                       // <li>Item 1</li>
items.get(1).next()                       // <li>Item 3</li>
items.get(1).parent()                     // <ul>...</ul>
items.get(1).siblings()                   // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
items.map(item => item.text())            // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a"))  // [<li class="a">Item 1</li>]

Flyscrape API

Document Parsing

import { parse } from "flyscrape";

const doc = parse(`<div class="foo">bar</div>`);
const text = doc.find(".foo").text();

File Downloads

import { download } from "flyscrape/http";

download("http://example.com/image.jpg")              // downloads as "image.jpg"
download("http://example.com/image.jpg", "other.jpg") // downloads as "other.jpg"
download("http://example.com/image.jpg", "dir/")      // downloads as "dir/image.jpg"

// If the server offers a filename via the Content-Disposition header and no
// destination filename is provided, Flyscrape will honor the suggested filename.
// E.g. `Content-Disposition: attachment; filename="archive.zip"`
download("http://example.com/generate_archive.php", "dir/") // downloads as "dir/archive.zip"

Issues and Suggestions

If you encounter any issues or have suggestions for improvement, please submit an issue.

Documentation

Constants

const HeaderBypassCache = "X-Flyscrape-Bypass-Cache"
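
The header name suggests a per-request cache bypass. Below is a hypothetical fragment (the CacheBuster type is invented here; see the Request and RequestBuilder types further down), assuming the flyscrape import and assuming the cache layer keys off the header's presence:

// CacheBuster is a hypothetical module hook, shown only to illustrate
// where the bypass header could be attached.
type CacheBuster struct{}

func (CacheBuster) BuildRequest(r *flyscrape.Request) {
	// Assumed: the presence of this header makes the cache layer skip
	// its stored response for this request.
	r.Headers.Set(flyscrape.HeaderBypassCache, "true")
}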

Variables

var ScriptTemplate []byte

var StopWatch = errors.New("stop watch")

var Version string

Functions

func Dev added in v0.4.0

func Dev(file string, overrides map[string]any) error

func Document added in v0.4.0

func Document(sel *goquery.Selection) map[string]any

func DocumentFromString added in v0.4.0

func DocumentFromString(s string) (map[string]any, error)
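
DocumentFromString is useful on its own for converting raw HTML into the map structure that, presumably, backs the document object handed to scraping scripts. A minimal call, grounded only in the signature above:

doc, err := flyscrape.DocumentFromString(`<div class="foo">bar</div>`)
if err != nil {
	// handle parse error
}
_ = doc // map[string]any representation of the parsed document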

func MockResponse added in v0.2.0

func MockResponse(statusCode int, html string) (*http.Response, error)

func RegisterModule added in v0.2.0

func RegisterModule(mod Module)

func Run added in v0.4.0

func Run(file string, overrides map[string]any) error
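
For embedding flyscrape in a Go program, here is a minimal sketch of Run; the override keys are an assumption, mirroring the script config fields that the CLI exposes as flags:

package main

import (
	"log"

	"github.com/philippta/flyscrape"
)

func main() {
	// Assumed: override keys match the script's config fields.
	overrides := map[string]any{
		"url":   "https://news.ycombinator.com/new",
		"depth": 1,
	}
	if err := flyscrape.Run("hackernews.js", overrides); err != nil {
		log.Fatal(err)
	}
}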

func Watch

func Watch(path string, fn func(string) error) error

Types

type Config added in v0.2.0

type Config []byte

type Context added in v0.2.0

type Context interface {
	ScriptName() string
	Visit(url string)
	MarkVisited(url string)
	MarkUnvisited(url string)
}

type Exports added in v0.4.0

type Exports map[string]any

func Compile

func Compile(src string, imports Imports) (Exports, error)

func (Exports) Config added in v0.4.0

func (e Exports) Config() []byte

func (Exports) Scrape added in v0.4.0

func (e Exports) Scrape(p ScrapeParams) (any, error)

type Finalizer added in v0.2.0

type Finalizer interface {
	Finalize()
}

type Imports added in v0.4.0

type Imports map[string]map[string]any

func NewJSLibrary added in v0.4.0

func NewJSLibrary(client *http.Client) (imports Imports, wait func())
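
Putting these pieces together, a hedged sketch of compiling and executing a script from Go, assuming NewJSLibrary supplies the imports scripts expect ("flyscrape", "flyscrape/http", ...) and wait flushes pending async work:

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"

	"github.com/philippta/flyscrape"
)

func main() {
	src, err := os.ReadFile("hackernews.js")
	if err != nil {
		log.Fatal(err)
	}

	// Provide the JS library imports and a wait function for async work.
	imports, wait := flyscrape.NewJSLibrary(http.DefaultClient)
	defer wait()

	exports, err := flyscrape.Compile(string(src), imports)
	if err != nil {
		log.Fatal(err)
	}

	// Run the script's default export against a fetched page.
	data, err := exports.Scrape(flyscrape.ScrapeParams{
		HTML: "<html>...</html>",
		URL:  "https://news.ycombinator.com/new",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(data)
}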

type Module added in v0.2.0

type Module interface {
	ModuleInfo() ModuleInfo
}

func LoadModules added in v0.2.0

func LoadModules(cfg Config) []Module

type ModuleInfo added in v0.2.0

type ModuleInfo struct {
	ID  string
	New func() Module
}

type Provisioner added in v0.2.0

type Provisioner interface {
	Provision(Context)
}

type Request added in v0.2.0

type Request struct {
	Method  string
	URL     string
	Headers http.Header
	Cookies http.CookieJar
	Depth   int
}

type RequestBuilder added in v0.2.0

type RequestBuilder interface {
	BuildRequest(*Request)
}

type RequestValidator added in v0.2.0

type RequestValidator interface {
	ValidateRequest(*Request) bool
}

type Response added in v0.2.0

type Response struct {
	StatusCode int
	Headers    http.Header
	Body       []byte
	Data       any
	Error      error
	Request    *Request

	Visit func(url string)
}

type ResponseReceiver added in v0.2.0

type ResponseReceiver interface {
	ReceiveResponse(*Response)
}
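
Custom behavior plugs in through modules: implement Module and whichever optional hooks you need (Provisioner, RequestBuilder, RequestValidator, ResponseReceiver, Finalizer, TransportAdapter). A hedged sketch of a hypothetical logging module, grounded in the interfaces above:

package logmodule

import (
	"log"

	"github.com/philippta/flyscrape"
)

// LogModule is a hypothetical module that logs every scraped response.
type LogModule struct{}

// ModuleInfo satisfies flyscrape.Module.
func (LogModule) ModuleInfo() flyscrape.ModuleInfo {
	return flyscrape.ModuleInfo{
		ID:  "log",
		New: func() flyscrape.Module { return new(LogModule) },
	}
}

// ReceiveResponse implements the optional ResponseReceiver hook.
func (LogModule) ReceiveResponse(r *flyscrape.Response) {
	log.Printf("%d %s", r.StatusCode, r.Request.URL)
}

func init() {
	// Assumed: registration makes the module visible to LoadModules.
	flyscrape.RegisterModule(LogModule{})
}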

type RoundTripFunc added in v0.2.0

type RoundTripFunc func(*http.Request) (*http.Response, error)

func MockTransport added in v0.2.0

func MockTransport(statusCode int, html string) RoundTripFunc

func (RoundTripFunc) RoundTrip added in v0.2.0

func (f RoundTripFunc) RoundTrip(r *http.Request) (*http.Response, error)
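
Because RoundTripFunc satisfies http.RoundTripper, MockTransport can stub out the network in tests. A minimal sketch, assuming the canned HTML becomes the response body:

package flyscrape_test

import (
	"io"
	"net/http"
	"strings"
	"testing"

	"github.com/philippta/flyscrape"
)

func TestScrapeWithMockTransport(t *testing.T) {
	client := &http.Client{
		Transport: flyscrape.MockTransport(200, `<div class="foo">bar</div>`),
	}

	// Every request on this client now returns the canned HTML above.
	resp, err := client.Get("https://example.com/")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if !strings.Contains(string(body), "bar") {
		t.Errorf("unexpected body: %s", body)
	}
}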

type ScrapeFunc

type ScrapeFunc func(ScrapeParams) (any, error)

type ScrapeParams

type ScrapeParams struct {
	HTML string
	URL  string
}

type Scraper

type Scraper struct {
	ScrapeFunc ScrapeFunc
	Script     string
	Modules    []Module
	Client     *http.Client
	// contains filtered or unexported fields
}

func NewScraper added in v0.2.0

func NewScraper() *Scraper

func (*Scraper) MarkUnvisited added in v0.2.0

func (s *Scraper) MarkUnvisited(url string)

func (*Scraper) MarkVisited added in v0.2.0

func (s *Scraper) MarkVisited(url string)

func (*Scraper) Run added in v0.2.0

func (s *Scraper) Run()

func (*Scraper) ScriptName added in v0.2.0

func (s *Scraper) ScriptName() string

func (*Scraper) Visit added in v0.2.0

func (s *Scraper) Visit(url string)
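
Wiring a Scraper by hand is speculative from these docs alone; the sketch below continues the Compile example above and assumes Visit seeds the request queue before Run drains it:

// runScraper is a speculative sketch; exports comes from flyscrape.Compile,
// and net/http plus the flyscrape import are assumed.
func runScraper(exports flyscrape.Exports) {
	s := flyscrape.NewScraper()
	s.Client = http.DefaultClient
	s.ScrapeFunc = exports.Scrape
	s.Modules = flyscrape.LoadModules(flyscrape.Config(exports.Config()))

	// Assumed: Visit enqueues a start URL; Run processes the queue.
	s.Visit("https://news.ycombinator.com/new")
	s.Run()
}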

type TransformError

type TransformError struct {
	Line   int
	Column int
	Text   string
}

func (TransformError) Error

func (err TransformError) Error() string

type TransportAdapter added in v0.2.0

type TransportAdapter interface {
	AdaptTransport(http.RoundTripper) http.RoundTripper
}
