dataflowkit

Module version: v0.0.0-...-d33463d
Published: Jun 12, 2020 License: BSD-3-Clause

Dataflow kit

Dataflow kit ("DFK") is a web scraping framework for Gophers. It extracts data from web pages according to the CSS selectors you specify.

You can use it in many ways for data mining, data processing or archiving.

The Web Scraping Pipeline

A web scraping pipeline consists of three general components:

  • Downloading an HTML web page (Fetch service).
  • Parsing the HTML page and retrieving the data we're interested in (Parse service).
  • Encoding the parsed data to CSV, MS Excel, JSON, JSON Lines or XML format.
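
To illustrate the encoding stage, here is a minimal sketch that writes a few parsed records as JSON Lines and as CSV using only the Go standard library. The record type, the function names and the sample values are hypothetical and chosen for illustration; this is not DFK's own encoder.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"encoding/json"
	"fmt"
)

// record is a hypothetical parsed row: field name -> extracted value.
type record map[string]string

// toJSONLines writes one JSON object per line.
func toJSONLines(rows []record) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends a newline after each value
	for _, r := range rows {
		if err := enc.Encode(r); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

// toCSV writes a header row followed by one line per record,
// using the given column order.
func toCSV(columns []string, rows []record) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(columns); err != nil {
		return "", err
	}
	for _, r := range rows {
		line := make([]string, len(columns))
		for i, c := range columns {
			line[i] = r[c] // missing fields become empty cells
		}
		if err := w.Write(line); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	rows := []record{
		{"Title": "book one", "Price": "10.00"},
		{"Title": "book two", "Price": "12.50"},
	}
	jl, err := toJSONLines(rows)
	if err != nil {
		fmt.Println("JSON Lines encoding failed:", err)
		return
	}
	fmt.Print(jl)

	out, err := toCSV([]string{"Title", "Price"}, rows)
	if err != nil {
		fmt.Println("CSV encoding failed:", err)
		return
	}
	fmt.Print(out)
}
```

JSON Lines suits large result sets because each record can be written and read independently, while CSV needs a fixed column order, which is why toCSV takes the columns explicitly.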

Fetch service

The fetch.d server downloads the content of HTML web pages. Depending on the fetcher type, page content is downloaded with either the Base fetcher or the Chrome fetcher.

The Base fetcher uses the standard Go HTTP client to fetch pages as is. It is faster than the Chrome fetcher, but it cannot render dynamic, JavaScript-driven web pages.
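
The idea behind the Base fetcher can be sketched with the standard net/http client. This is an illustration only, not DFK's fetch package: fetchHTML and fetchFromLocalServer are hypothetical names, and the page is served by a local test server so the example runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchHTML downloads a page as is, without executing any JavaScript.
func fetchHTML(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// fetchFromLocalServer serves a static page from a local test server
// and fetches it, so the example needs no external site.
func fetchFromLocalServer() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body><h1>Hello</h1></body></html>")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 10 * time.Second}
	return fetchHTML(client, srv.URL)
}

func main() {
	html, err := fetchFromLocalServer()
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(html)
}
```

A client like this only sees the HTML the server sends, so content that is inserted later by JavaScript never appears in the result.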

The Chrome fetcher is intended for rendering dynamic, JavaScript-based content. It sends requests to Chrome running in headless mode.

A fetched web page is passed to the parse.d service.

Parse service

parse.d is the service that extracts data from a downloaded web page following the rules listed in a configuration JSON file. Extracted data is returned in CSV, MS Excel, JSON or XML format.

Note: Sometimes the Parse service cannot extract data from pages retrieved by the default Base fetcher, and parsing JavaScript-generated pages may return empty results. In that case the Parse service automatically forces the Chrome fetcher to render the same dynamic content. See https://scrape.dataflowkit.com/persons/page-0 for a sample of a JavaScript-driven web page.
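
The fallback described in the note can be sketched as follows. The function types and parseWithFallback are hypothetical stand-ins for the Fetch and Parse services, and the stubs in main only simulate a page whose content appears after JavaScript runs. How DFK implements the retry internally is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fetchFunc returns the HTML of a page; parseFunc extracts records from HTML.
// Both are hypothetical stand-ins for the Fetch and Parse services.
type fetchFunc func(url string) (string, error)
type parseFunc func(html string) []string

var errNoData = errors.New("no data extracted")

// parseWithFallback tries the base fetcher first and retries with the
// Chrome fetcher only when parsing the base result yields nothing.
// It also reports which fetcher produced the returned rows.
func parseWithFallback(url string, base, chrome fetchFunc, parse parseFunc) ([]string, string, error) {
	html, err := base(url)
	if err != nil {
		return nil, "", err
	}
	if rows := parse(html); len(rows) > 0 {
		return rows, "base", nil
	}

	// Empty result: the page is probably rendered by JavaScript.
	html, err = chrome(url)
	if err != nil {
		return nil, "", err
	}
	rows := parse(html)
	if len(rows) == 0 {
		return nil, "", errNoData
	}
	return rows, "chrome", nil
}

func main() {
	// Stubs: the base fetcher sees an empty shell, Chrome sees rendered content.
	base := func(url string) (string, error) { return `<div id="app"></div>`, nil }
	chrome := func(url string) (string, error) { return `<div id="app"><p>Alice</p></div>`, nil }
	parse := func(html string) []string {
		if strings.Contains(html, "<p>Alice</p>") {
			return []string{"Alice"}
		}
		return nil
	}

	rows, used, err := parseWithFallback("https://example.com/persons", base, chrome, parse)
	fmt.Println(rows, used, err)
}
```

Trying the cheaper fetcher first keeps static pages fast and pays the cost of a headless browser only when it is needed.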

Dataflow kit benefits:

  • Scraping of JavaScript-generated pages;

  • Data extraction from paginated websites;

  • Processing of infinite-scroll pages;

  • Scraping of websites behind a login form;

  • Cookie and session handling;

  • Following links and processing detail pages;

  • Managing delays between requests per domain;

  • Following robots.txt directives;

  • Saving intermediate data in Diskv or MongoDB. The storage interface is flexible enough to add more storage types easily;

  • Encoding results to CSV, MS Excel, JSON (Lines) and XML formats;

  • Dataflow kit is fast. It takes about 4-6 seconds to fetch and then parse 50 pages;

  • Dataflow kit is suitable for processing quite large volumes of data. Our tests show that parsing approximately 4 million pages takes about 7 hours.
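
One of the items above, managing delays between requests per domain, can be sketched like this. The domainThrottle type and planWaits helper are hypothetical and only show the general technique of booking the next allowed request time per host; DFK's actual mechanism may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
	"time"
)

// domainThrottle tracks the earliest time the next request to each host may start.
type domainThrottle struct {
	mu    sync.Mutex
	delay time.Duration
	next  map[string]time.Time
}

func newDomainThrottle(delay time.Duration) *domainThrottle {
	return &domainThrottle{delay: delay, next: make(map[string]time.Time)}
}

// reserve returns how long the caller should wait before requesting rawURL,
// given the current time, and books the following slot for that host.
func (t *domainThrottle) reserve(rawURL string, now time.Time) (time.Duration, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	t.mu.Lock()
	defer t.mu.Unlock()

	start := now
	if n, ok := t.next[u.Host]; ok && n.After(now) {
		start = n
	}
	t.next[u.Host] = start.Add(t.delay)
	return start.Sub(now), nil
}

// planWaits computes the wait for each URL as if all requests arrived at once.
func planWaits(delay time.Duration, urls []string) ([]time.Duration, error) {
	th := newDomainThrottle(delay)
	now := time.Now()
	waits := make([]time.Duration, 0, len(urls))
	for _, u := range urls {
		w, err := th.reserve(u, now)
		if err != nil {
			return nil, err
		}
		waits = append(waits, w)
	}
	return waits, nil
}

func main() {
	urls := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.org/",
	}
	waits, err := planWaits(500*time.Millisecond, urls)
	if err != nil {
		fmt.Println("bad URL:", err)
		return
	}
	for i, u := range urls {
		fmt.Printf("%s: wait %v\n", u, waits[i])
	}
}
```

In a real crawler the caller would sleep for the returned duration before sending the request. Requests to different hosts do not delay each other.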

Installation

go get -u github.com/slotix/dataflowkit

Usage

Docker
  1. Install Docker and Docker Compose

  2. Start services.

cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose up

This command fetches docker images automatically and starts services.

  3. Launch parsing in a second terminal window by sending a POST request to the parse daemon. Some JSON configuration files for testing are available in the /examples folder.

curl -XPOST  127.0.0.1:8001/parse --data-binary "@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json"

Here is a sample JSON configuration file:

{
	"name":"collection",
	"request":{
	   "url":"https://example.com"
	},
	"fields":[
	   {
		  "name":"Title",
		  "selector":".product-container a",
		  "extractor":{
			 "types":["text", "href"],
			 "filters":[
				"trim",
				"lowerCase"
			 ],
			 "params":{
				"includeIfEmpty":false
			 }
		  }
	   },
	   {
		  "name":"Image",
		  "selector":"#product-container img",
		  "extractor":{
			 "types":["alt","src","width","height"],
			 "filters":[
				"trim",
				"upperCase"
			 ]
		  }
	   },
	   {
		  "name":"Buyinfo",
		  "selector":".buy-info",
		  "extractor":{
			 "types":["text"],
			 "params":{
				"includeIfEmpty":false
			 }
		  }
	   }
	],
	"paginator":{
	   "selector":".next",
	   "attr":"href",
	   "maxPages":3
	},
	"format":"json",
	"fetcherType":"chrome",
	"paginateResults":false
}

Read more about scraper configuration JSON files in our GoDoc reference.

Extractors and filters are described at https://godoc.org/github.com/slotix/dataflowkit/extract
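
For readers who want to build or inspect such a configuration from Go, the sketch below decodes a shortened version of the sample with encoding/json. The struct types are hypothetical and only mirror the keys shown in the sample JSON above; they are not DFK's own payload types, which are described in the GoDoc reference.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types that mirror the keys of the sample configuration.
type config struct {
	Name    string `json:"name"`
	Request struct {
		URL string `json:"url"`
	} `json:"request"`
	Fields          []field    `json:"fields"`
	Paginator       *paginator `json:"paginator,omitempty"`
	Format          string     `json:"format"`
	FetcherType     string     `json:"fetcherType"`
	PaginateResults bool       `json:"paginateResults"`
}

type field struct {
	Name      string    `json:"name"`
	Selector  string    `json:"selector"`
	Extractor extractor `json:"extractor"`
}

type extractor struct {
	Types   []string               `json:"types"`
	Filters []string               `json:"filters,omitempty"`
	Params  map[string]interface{} `json:"params,omitempty"`
}

type paginator struct {
	Selector string `json:"selector"`
	Attr     string `json:"attr"`
	MaxPages int    `json:"maxPages"`
}

// sample is a shortened copy of the configuration shown above.
const sample = `{
	"name": "collection",
	"request": {"url": "https://example.com"},
	"fields": [{
		"name": "Title",
		"selector": ".product-container a",
		"extractor": {
			"types": ["text", "href"],
			"filters": ["trim", "lowerCase"],
			"params": {"includeIfEmpty": false}
		}
	}],
	"paginator": {"selector": ".next", "attr": "href", "maxPages": 3},
	"format": "json",
	"fetcherType": "chrome",
	"paginateResults": false
}`

// parseConfig decodes a configuration document into the types above.
func parseConfig(data []byte) (*config, error) {
	var c config
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	c, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(c.Name, c.Request.URL, c.FetcherType)
	for _, f := range c.Fields {
		fmt.Println(f.Name, f.Selector, f.Extractor.Types)
	}
}
```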

  4. To stop the services, press Ctrl+C and run

cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose down --remove-orphans --volumes
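
The POST request sent with curl above can also be issued from Go. The sketch below is an assumption-laden illustration: postConfig and postToStub are hypothetical helpers, the application/json content type is a choice made here (curl sends a different default), and the request goes to a local stub that echoes the payload so the example runs without the DFK services.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// postConfig sends a JSON configuration to a parse endpoint and
// returns the raw response body.
func postConfig(endpoint string, config []byte) ([]byte, error) {
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(config))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("parse request failed: %s: %s", resp.Status, body)
	}
	return body, nil
}

// postToStub posts to a local stand-in for the parse daemon.
// The stub is NOT the real service; it only echoes the payload back.
func postToStub(config []byte) ([]byte, error) {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/parse" {
			http.Error(w, "unexpected request", http.StatusBadRequest)
			return
		}
		io.Copy(w, r.Body)
	}))
	defer stub.Close()

	return postConfig(stub.URL+"/parse", config)
}

func main() {
	out, err := postToStub([]byte(`{"name":"collection","format":"json"}`))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

With the services from step 2 running, the same postConfig call could target http://127.0.0.1:8001/parse with the contents of one of the files from the /examples folder.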

Manual way
  1. Start the Chrome Docker container.

docker run --init -it --rm -d --name chrome --shm-size=1024m -p=127.0.0.1:9222:9222 --cap-add=SYS_ADMIN \
  yukinying/chrome-headless-browser

Headless Chrome is used for fetching web pages to feed the Dataflow kit parser.

  2. Build and run the fetch.d service.

cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/fetch.d && go build && ./fetch.d

  3. In a new terminal window, build and run the parse.d service.

cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/parse.d && go build && ./parse.d

  4. Launch parsing. See step 3 in the previous section.

Run tests

  • docker-compose -f test-docker-compose.yml up -d
  • ./test.sh
  • To stop the services, run docker-compose -f test-docker-compose.yml down

Front-End

Try the https://dataflowkit.com/dfk front end, a point-and-click interface to Dataflow kit services. It generates a JSON config file and sends a POST request to the DFK parser.

License

This is Free Software, released under the BSD 3-Clause License.

Contributing

You are welcome to contribute to our project.

Directories

Path Synopsis
cmd
Package cmd of the Dataflow kit contains the following CLI daemons: - fetch.d service downloads html content from web pages to feed Dataflow kit scrapers.
cmd/fetch.cli
Fetcher CLI of the Dataflow kit downloads html content from web pages via Fetcher service endpoint.
cmd/fetch.d
Fetcher service of the Dataflow kit downloads html content from web pages to feed Dataflow kit scrapers.
cmd/parse.d
Parse service of the Dataflow kit parses html content from web pages following the rules described in configuration JSON file.
errs
Package errs of the Dataflow kit lists specific error types like ParseError, BadPayload.
fetch
Package fetch of the Dataflow kit is used by fetch.d service which downloads html content from web pages to feed Dataflow kit scrapers.
healthcheck
Package healthcheck of the Dataflow kit checks if specified services are alive.
parse
Package parse of the Dataflow kit is used by parse.d service which parses html content from web pages following the rules described in Payload JSON file.
scrape
Package scrape of the Dataflow kit is for structured data extraction from webpages starting from JSON payload processing to encoding scraped data to one of output formats like JSON, Excel, CSV, XML
storage
Package storage of the Dataflow kit describes Store interface for read/write operations with downloaded data and parsed results.
utils
Package utils of the Dataflow kit includes various functions and helpers to be used by other packages.
