mercrawl

Version: v0.0.0-...-cfff224
Published: Nov 16, 2017 License: MIT Imports: 15 Imported by: 0

README

mercrawl

mercrawl crawls Mercari for items that match your search condition and sends you the results by email.

Getting started

  1. Install PostgreSQL.
  2. Set up the database. Refer to migrate.sql.
  3. Set up the environment variables. See Environment Variables below. You can use the setenv.sh template for convenience.

Usage

Try it ASAP!

After setting up the environment, run:

go run mercrawl/mercrawl.go "sort_order=&keyword=iphone+x&category_root=7&category_child=100&category_grand_child%5B859%5D=1&brand_name=&brand_id=&size_group=&price_min=60000&price_max=&item_condition_id%5B1%5D=1&item_condition_id%5B2%5D=1&status_on_sale=1" & # start crawler
go run mermail/mermail.go your_mail_addr & # start mailer
go run rest-api/merest.go & # start rest api server

To quickly stop the mercrawl, mermail and merest background processes:

kill %1 %2 %3

Crawler

Usage:

mercrawl <search_condition>

Example: search for on-sale PS4 Pro items in the "家庭用ゲーム本体" (home game consoles) category, priced between ¥30,000 and ¥45,000:

mercrawl "keyword=ps4+pro&category_root=5&category_child=76&category_grand_child%5B701%5D=1&price_min=30000&price_max=45000&status_on_sale=1"

WARNING: A search condition that is too generic and returns too many result pages may get your IP address banned by Mercari. Make your search condition as precise as possible.
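The search condition is an ordinary URL query string, so it can be built programmatically rather than typed by hand. The following is a minimal sketch using Go's standard net/url package; the helper name buildSearch is hypothetical and not part of mercrawl. Note that Encode sorts the keys, so the parameter order differs from the example above; this sketch assumes the order does not matter to Mercari.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearch encodes search parameters into the query-string form that
// mercrawl takes as its argument. The helper name is hypothetical.
func buildSearch(params map[string]string) string {
	v := url.Values{}
	for key, val := range params {
		v.Set(key, val)
	}
	// Encode sorts by key and percent-encodes brackets: [701] becomes %5B701%5D.
	return v.Encode()
}

func main() {
	search := buildSearch(map[string]string{
		"keyword":                   "ps4 pro",
		"category_root":             "5",
		"category_child":            "76",
		"category_grand_child[701]": "1",
		"price_min":                 "30000",
		"price_max":                 "45000",
		"status_on_sale":            "1",
	})
	fmt.Println(search)
}
```

The printed string can be passed to mercrawl as its single quoted argument.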

Mailer

Usage:

mermail <mail_addr>

You will receive an email like the following when mercrawl successfully scrapes new items:

[Screenshot: email sent from mermail]

RESTful API Server

Usage:

merest

After the server starts, the following resources are available as JSON:

GET /items
GET /item/{id}
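A client for these resources might look like the sketch below. It assumes that GET /items returns a JSON array whose elements use the field names of the Item type documented further down, and that the server runs on localhost with the default port 8000; neither the response shape nor the host is confirmed by this README. The names fetchItems and parseItems are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// Item mirrors a subset of the documented Item struct's JSON fields.
type Item struct {
	ID    string `json:"id"`
	Name  string `json:"name"`
	Price int    `json:"price,omitempty"`
	URL   string `json:"url,omitempty"`
}

// parseItems decodes a JSON array of items (the assumed shape of GET /items).
func parseItems(data []byte) ([]Item, error) {
	var items []Item
	if err := json.Unmarshal(data, &items); err != nil {
		return nil, err
	}
	return items, nil
}

// fetchItems requests baseURL + "/items" and decodes the response body.
func fetchItems(baseURL string) ([]Item, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/items")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseItems(body)
}

func main() {
	// 8000 is the documented default for REST_PORT; localhost is an assumption.
	items, err := fetchItems("http://localhost:8000")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	for _, it := range items {
		fmt.Printf("%s  %s  ¥%d\n", it.ID, it.Name, it.Price)
	}
}
```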

Environment Variables

Global configurations:

  • USER: database username
  • SSLMODE: disable or verify-full

Crawler configurations:

  • PAGE_WORKERS (optional): maximum number of goroutines for crawling search result pages. Default: 5.
  • ITEM_WORKERS (optional): maximum number of goroutines for crawling item pages. Default: 20.
  • RECRAWL_INTERVAL (optional): interval between re-crawls of the same search condition. Default: 30 seconds.

Mailer configurations:

  • INTERVAL (optional): interval, in seconds, between emails of new item info. Default: 30 seconds.
  • SMTP_SERVER: mail server address
  • SMTP_PORT: mail server port
  • SMTP_USER: mail server login user name
  • SMTP_PWD: mail server login password
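To illustrate how the SMTP_* variables fit together, here is a hedged sketch that sends a plain-text message with Go's standard net/smtp package. Whether mermail uses net/smtp, PLAIN authentication, or SMTP_USER as the sender address is not stated in this README; all three are assumptions, and buildMessage and send are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net"
	"net/smtp"
	"os"
)

// buildMessage assembles a minimal plain-text email with CRLF line endings.
func buildMessage(from, to, subject, body string) []byte {
	return []byte("From: " + from + "\r\n" +
		"To: " + to + "\r\n" +
		"Subject: " + subject + "\r\n" +
		"\r\n" + body + "\r\n")
}

// send delivers msg using the SMTP_* environment variables.
// Assumption: SMTP_USER doubles as the sender address.
func send(to string, msg []byte) error {
	host, port := os.Getenv("SMTP_SERVER"), os.Getenv("SMTP_PORT")
	user, pwd := os.Getenv("SMTP_USER"), os.Getenv("SMTP_PWD")
	auth := smtp.PlainAuth("", user, pwd, host)
	return smtp.SendMail(net.JoinHostPort(host, port), auth, user, []string{to}, msg)
}

func main() {
	to := "you@example.com" // placeholder recipient
	msg := buildMessage(os.Getenv("SMTP_USER"), to, "New Mercari items", "1 new item found")
	if os.Getenv("SMTP_SERVER") == "" {
		fmt.Print(string(msg)) // no server configured: just show the message
		return
	}
	if err := send(to, msg); err != nil {
		fmt.Println("send failed:", err)
	}
}
```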

RESTful API Server configurations:

  • REST_PORT (optional): REST API server port. Default: 8000.

Dependency

  • PostgreSQL

Performance Tuning

Tune the itemWorkers and pageWorkers parameters (set through ITEM_WORKERS and PAGE_WORKERS) to get better performance in your environment.
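To see what a worker limit controls, consider the generic bounded-concurrency pattern below, in which a buffered channel acts as a semaphore. This is a sketch of the general technique, not mercrawl's actual implementation; runBounded is a hypothetical name. A higher limit means more pages fetched at once, at the cost of more load on your network and on Mercari.

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded calls work for every job while keeping at most limit calls in
// flight, and returns the highest concurrency it observed.
func runBounded(jobs []string, limit int, work func(string)) int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		running int
		peak    int
	)
	sem := make(chan struct{}, limit) // one slot per allowed worker
	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are busy
		go func(job string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when done

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			work(job)

			mu.Lock()
			running--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

func main() {
	pages := []string{"page1", "page2", "page3", "page4", "page5", "page6"}
	peak := runBounded(pages, 2, func(p string) { fmt.Println("crawled", p) })
	fmt.Println("peak concurrency:", peak)
}
```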

Documentation

Overview

Package mercrawl crawls pages in parallel, beginning from a starting page and continuing through the pages that follow it.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetAttr

func GetAttr(t html.Token, attr string) (ok bool, val string)

GetAttr gets the value of the named attribute (for example, href) from an HTML token.

func GetDb

func GetDb() *sql.DB

GetDb returns a PostgreSQL database handle.

func Mail

func Mail(to string)

Mail scans the database and sends new item info to the given mail address.

func MarkAsSent

func MarkAsSent(items []Item)

MarkAsSent marks items as sent

func ParsePrice

func ParsePrice(s string) (price int)

ParsePrice parses a JPY price such as ¥ 168,800 into the int 168800.
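One way to implement this kind of conversion is to keep only the digits and convert the remainder. The sketch below is a hypothetical stand-in named parsePrice, not the package's own ParsePrice; returning 0 for input without digits is an assumption about the desired behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice keeps only the ASCII digits of s and converts them to an int.
// It returns 0 when s contains no digits. Hypothetical stand-in for ParsePrice.
func parsePrice(s string) int {
	digits := strings.Map(func(r rune) rune {
		if r >= '0' && r <= '9' {
			return r
		}
		return -1 // drop currency signs, spaces and separators
	}, s)
	price, err := strconv.Atoi(digits)
	if err != nil {
		return 0
	}
	return price
}

func main() {
	fmt.Println(parsePrice("¥ 168,800")) // prints 168800
}
```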

func Start

func Start(search string)

Start starts crawling all items on the search result pages for the given search condition string.

func WaitInterupt

func WaitInterupt()

WaitInterupt waits for an interrupt signal (^C) before ending the program.

Types

type Item

type Item struct {
	ID          string    `json:"id"`
	Name        string    `json:"name"`
	Photos      [4]string `json:"photos,omitempty"`
	Status      string    `json:"status,omitempty"`
	Price       int       `json:"price,omitempty"`
	ShippingFee string    `json:"shippingFee,omitempty"`
	Description string    `json:"description,omitempty"`
	URL         string    `json:"url,omitempty"`
	Sent        bool      `json:"sent,omitempty"`
}

Item represents a Mercari item.
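The struct tags determine the JSON that Item produces. The sketch below copies the type and marshals a partially filled value with encoding/json; the helper name encode and the sample values are illustrative. One detail worth noticing: omitempty drops empty strings, zero ints and false, but never a fixed-size array, so photos is always present.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Item is copied from the type declaration above.
type Item struct {
	ID          string    `json:"id"`
	Name        string    `json:"name"`
	Photos      [4]string `json:"photos,omitempty"`
	Status      string    `json:"status,omitempty"`
	Price       int       `json:"price,omitempty"`
	ShippingFee string    `json:"shippingFee,omitempty"`
	Description string    `json:"description,omitempty"`
	URL         string    `json:"url,omitempty"`
	Sent        bool      `json:"sent,omitempty"`
}

// encode returns the JSON form of item, or "" if marshaling fails.
func encode(item Item) string {
	b, err := json.Marshal(item)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(Item{ID: "m1", Name: "PS4 Pro", Price: 39800}))
	// prints {"id":"m1","name":"PS4 Pro","photos":["","","",""],"price":39800}
}
```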

func GetAllItems

func GetAllItems() (items []Item)

GetAllItems gets all items from the database.

func GetItem

func GetItem(id string) (item Item)

GetItem gets the item with the given id from the database.

func GetUnsentItems

func GetUnsentItems() (items []Item)

GetUnsentItems gets all unsent items from the database.

func (*Item) Exists

func (item *Item) Exists() bool

Exists checks whether the item is already in the database.

func (*Item) Save

func (item *Item) Save()

Save persists the item to the database.

type PageState

type PageState struct {
	Mux *sync.Mutex
	// contains filtered or unexported fields
}

PageState records which pages have been crawled.

func (*PageState) Get

func (ps *PageState) Get(index string) bool

Get reports whether the page with the given index has been crawled.

func (*PageState) Set

func (ps *PageState) Set(index string)

Set marks the page with the given index as crawled.
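A type with this shape is typically a mutex-guarded set shared by the crawling goroutines. The sketch below shows that pattern with a hypothetical pageState type; the map field is an assumption, since the real PageState keeps its other fields unexported.

```go
package main

import (
	"fmt"
	"sync"
)

// pageState is a hypothetical stand-in for PageState. The crawled map is an
// assumption about the unexported fields.
type pageState struct {
	Mux     *sync.Mutex
	crawled map[string]bool
}

func newPageState() *pageState {
	return &pageState{Mux: &sync.Mutex{}, crawled: make(map[string]bool)}
}

// Get reports whether the page with the given index was marked.
func (ps *pageState) Get(index string) bool {
	ps.Mux.Lock()
	defer ps.Mux.Unlock()
	return ps.crawled[index]
}

// Set marks the page with the given index.
func (ps *pageState) Set(index string) {
	ps.Mux.Lock()
	defer ps.Mux.Unlock()
	ps.crawled[index] = true
}

func main() {
	ps := newPageState()
	var wg sync.WaitGroup
	for _, page := range []string{"1", "2", "2", "3"} {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			ps.Set(p) // safe to call from many goroutines
		}(page)
	}
	wg.Wait()
	fmt.Println(ps.Get("2"), ps.Get("9")) // prints true false
}
```

Locking inside both methods lets concurrent workers check and mark pages without a data race on the map.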
