webpluck

module

v0.0.0-...-7dd0deb Latest Latest Go to latest Published: Oct 13, 2021 License: MIT

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/codeexpress/webpluck

Links

Open Source Insights

README ¶

Web Plucker

webpluck scrapes a specific values from a web page. It works as a standalone binary as well as in a API mode.

Download

Latest Releases

webpluck is available for 64 bit linux, OS X and Windows systems. Latest versions can be downloaded from the Release tab above. This is the preferred way.

Build from source

This is a golang project. Assuming you have golang compiler installed, the following will build the binary from scratch

$ git clone https://github.com/codeexpress/webpluck
$ cd webpluck
$ go build -o webpluck cmd/main.go

Usage

webpluck takes the following input:

URL of the webpage
XPATH of the element
optional regex to further narrow the selection

and outputs the selected value.

1. URL of the webpage

This is the link of the webpage which has the desired information we want to extract.

For example, if we want to scrape the founders of the StackOverflow website from its company page, the URL is: https://stackoverflow.com/company. The desired value we want to extract is: Joel Spolsky and Jeff Atwood

2. XPATH of the element

This is the link of the xpath of the element on the page that contains the required information. A good way to get this information is to (on a chrome browser):

Right click on the place were the information is present
Click "Inspect" to open the Chrome developer tools window with the element highligted
On the highlighed value in the HTML source code, Right click -> Copy -> Copy xpath
The copied value is the xpath we need

Get xpath Step 1	Get xpath Step 2

The xpath in example above comes out to be: //*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()

3. `regex` to pluck the right value

Note the the xpath above leads us to the value: Joel Spolsky and Jeff Atwood launch Stack Overflow Since we want to trim that down further, we'll provide a regex value to extract just the names. This regex will fetch just the names (the value in parenthesis): ^(*.) launch .*

Sample hosted invocation

webpluck can be run as a standalone binary. To extract the names using the three params we just obtained, copy the targets.yml file and populate it with the parameters. The resulting targets.yml should look like this:

targetList:
  -   name: stackoverflow_founders
      baseUrl: https://stackoverflow.com/company
      xpath:  //*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()
      regex: ^(.*) launch .*

Now invoke webpluck as follows and obtain the answer:

$ ./webpluck_osx -f /path/to/targets.yml
{
  "stackoverflow_founders": "Joel Spolsky and Jeff Atwood"
}

Sample API invocation

webpluck can be run in server mode as well. Thereafter, clients written in other programming languages can scrape web pages using the webpluck API over the network.

To run webpluck in server mode listening on localhost on 8080:

$ ./webpluck -p 8080

An instance of webpluck API is running at https://api.code.express/webpluck/. You can use that for your light extraction needs. If your load is heavy, consider spinning your own server running webpluck

Armed with the knowledge of baseUrl, xpath and regex, we can now call the API endpoint by POSTing these three params: Example curl invocation for the server mode:

curl 'https://api.code.express/webpluck/' \
      --data-urlencode 'baseUrl=https://stackoverflow.com/company' \
      --data-urlencode 'xpath=//*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()' \
      --data-urlencode 'regex=^(.*) launch .*' -g

The result from the API is as follows. The pluckedData field returns the value extracted:

{
  "baseUrl": "https://stackoverflow.com/company",
  "pluckedData": "Joel Spolsky and Jeff Atwood",
  "regex": "^(.*) launch .*",
  "xpath": "//*[@id=\"content\"]/section[3]/ol/li[1]/ol/li[2]/text()"
}

Directories ¶

Path	Synopsis
cmd
logger
webpluck

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL