harvit

command module
v0.0.0-...-e601235 Latest
Published: Jun 3, 2023 License: Apache-2.0 Imports: 4 Imported by: 0

Harvit harvests data from different sources (e.g. websites, APIs), then converts and transforms it.

Requirements

Usage

Harvit reads a plan in YAML format (see the example below) that defines the data source, the fields to extract, and the transformer to apply.

$ ./harvit harvest --help
NAME:
   harvit harvest - Let's harvest some data!

USAGE:
   harvit harvest [command options] plan

OPTIONS:
   --debug     whether running in PROD or DEBUG mode (default: false) [$HARVIT_DEBUG]
   --help, -h  show help (default: false)

Example

$ ./harvit harvest plan.yml | jq
plan.yml
source: https://mgjules.dev
type: website
fields:
  - name: firstJobName
    type: raw
    selector: "#experience > div:nth-child(2) > ul > li:nth-child(1) > div.flex.flex-wrap.items-center.justify-between > h3"
  - name: secondJobStartYear
    type: datetime
    selector: "#experience > div:nth-child(2) > ul > li:nth-child(2) > div.flex.flex-wrap.items-center.justify-between > span"
    regex: \d{2}/(\d{4})\s→
    format: Y
  - name: secondJobEndDateTime
    type: datetime
    selector: "#experience > div:nth-child(2) > ul > li:nth-child(2) > div.flex.flex-wrap.items-center.justify-between > span"
    regex: →\s(?:[a-zA-Z]+|(\d{2}/\d{4}))
    format: m/Y
    timezone: Indian/Mauritius
  - name: topLinks
    type: text
    selector: "body > div.relative.px-4.pt-4.sm\\:pt-16.print\\:pt-0.sm\\:px-6.lg\\:px-8 > div.max-w-4xl.mx-auto.text-lg > div:nth-child(2) > div.flex.flex-wrap.items-center.justify-center.gap-x-4.gap-y-2.print\\:hidden > a > div > span"
  - name: experiencePlaces
    type: text
    selector: "#experience > div:nth-child(2) > ul > li > div.flex.flex-wrap.items-center.justify-between > h3"
  - name: contributionsYears
    type: datetime
    selector: "#contributions > div:nth-child(2) > ul > li > div > span"
    regex: (\d{4})
    format: Y
  - name: contributionsYearsNumbers
    type: number
    selector: "#contributions > div:nth-child(2) > ul > li > div > span"
    regex: (\d{4})
  - name: interestsTitle
    type: text
    selector: "#interests > div:nth-child(2) > ul > li > span"
transformer: transformers/sample.js
transformers/sample.js
data['interestsTitle'] = data['interestsTitle'].map(v => v === 'Space Exploration' ? 'SpaceX' : v);
Result
{
  "contributionsYears": [
    "2022-01-01T00:00:00+04:00",
    "2021-01-01T00:00:00+04:00",
    "2020-01-01T00:00:00+04:00",
    "2020-01-01T00:00:00+04:00",
    "2019-01-01T00:00:00+04:00",
    "2019-01-01T00:00:00+04:00",
    "2019-01-01T00:00:00+04:00",
    "2019-01-01T00:00:00+04:00"
  ],
  "contributionsYearsNumbers": [
    2022,
    2021,
    2020,
    2020,
    2019,
    2019,
    2019,
    2019
  ],
  "experiencePlaces": [
    "Ringier SA",
    "Bocasay",
    "La Sentinelle Digital Ltd",
    "Expat-Blog Ltd",
    "Noveo IT Ltd"
  ],
  "firstJobName": "<h3 class=\"my-0\">Ringier SA</h3>",
  "interestsTitle": [
    "SpaceX",
    "Artificial Intelligence",
    "Skateboarding",
    "Anime",
    "Gaming",
    "Movie"
  ],
  "secondJobEndDateTime": "2021-02-01T00:00:00+04:00",
  "secondJobStartYear": "2020-01-01T00:00:00+04:00",
  "topLinks": [
    "Developer",
    "Github",
    "LinkedIn",
    "Mail",
    "Mauritius"
  ]
}

License

Harvit is Apache 2.0 licensed.

Stability

This project follows SemVer strictly and is not yet v1.

Breaking changes might be introduced until v1 is released.

This project follows the Go Release Policy. Each major version of Go is supported until there are two newer major releases.
