vozer

Version: v0.0.3
Published: Aug 22, 2018 License: MIT Imports: 24 Imported by: 0

README

vozer

CLI to crawl images and URLs from VOZ (https://forums.voz.vn) thread.

Install

You can download compiled versions of vozer for Linux, Windows and Mac OS X from the GitHub releases page.

Alternatively, install the Go SDK and build it from source.

$ go get github.com/lnquy/vozer/cmd/vozer

or

$ go get github.com/lnquy/vozer
$ cd $GOPATH/src/github.com/lnquy/vozer
$ dep ensure
$ cd cmd/vozer
$ go build

Usage

$ vozer -h
Usage of vozer:
  -ci
    	Crawls images from posts or not
  -cu
    	Crawls URLs from posts or not
  -debug
    	Print debug log
  -o string
    	Path to directory where crawled data be saved to
  -pages string
    	List of page numbers to crawl data, separated by comma (,)
  -r uint
    	Number of time to re-crawl page if failed (default 20)
  -range string
    	Page range to crawl data, separated by hyphen (-) (default "0-0")
  -u string
    	URL to VOZ thread
  -v	Print vozer version
  -w uint
    	Number of workers to crawl data (default 10)

By default, vozer crawls every page of the thread and stores the crawled data in a data folder in the current directory.
Use the -o argument to save the data to a different folder.
To crawl only specific pages, use the -pages or -range argument.
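
The -pages and -range values are plain strings, so they have to be turned into page numbers before crawling. The following is a minimal sketch of how such values could be parsed. The helper names parsePages and parseRange are hypothetical and are not part of the vozer API; the real CLI may parse and validate these flags differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePages turns a "-pages" style value such as "1,3,10" into page numbers.
// Hypothetical helper for illustration, not part of the vozer API.
func parsePages(s string) ([]uint, error) {
	var pages []uint
	for _, field := range strings.Split(s, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue // tolerate empty input and trailing commas
		}
		n, err := strconv.ParseUint(field, 10, 32)
		if err != nil {
			return nil, fmt.Errorf("invalid page number %q: %w", field, err)
		}
		pages = append(pages, uint(n))
	}
	return pages, nil
}

// parseRange turns a "-range" style value such as "5-9" into its two bounds.
// Hypothetical helper for illustration, not part of the vozer API.
func parseRange(s string) (from, to uint, err error) {
	parts := strings.SplitN(s, "-", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("range %q is not in FROM-TO form", s)
	}
	f, err := strconv.ParseUint(strings.TrimSpace(parts[0]), 10, 32)
	if err != nil {
		return 0, 0, fmt.Errorf("invalid range start: %w", err)
	}
	t, err := strconv.ParseUint(strings.TrimSpace(parts[1]), 10, 32)
	if err != nil {
		return 0, 0, fmt.Errorf("invalid range end: %w", err)
	}
	if f > t {
		return 0, 0, fmt.Errorf("range start %d is after range end %d", f, t)
	}
	return uint(f), uint(t), nil
}

func main() {
	pages, err := parsePages("1,3,10")
	fmt.Println(pages, err) // [1 3 10] <nil>

	from, to, err := parseRange("5-9")
	fmt.Println(from, to, err) // 5 9 <nil>
}
```

In this sketch a reversed range such as "9-5" is rejected; whether vozer does the same is not stated in this README.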

Examples:

$ vozer -u "https://forums.voz.vn/showthread.php?t=7382418" -ci   # Crawls images only
$ vozer -u "https://forums.voz.vn/showthread.php?t=7382418" -cu -ci   # Crawls both images and URLs
$ vozer -u "https://forums.voz.vn/showthread.php?t=7382418" -cu -ci -pages 1,3,10   # Crawls pages 1, 3 and 10 only
$ vozer -u "https://forums.voz.vn/showthread.php?t=7382418" -cu -ci -range 5-9   # Crawls from page 5 to page 9
$ vozer -u "https://forums.voz.vn/showthread.php?t=7382418" -cu -ci -w 20 -r 10 -o ~/Desktop/voz -debug   # Saves to ~/Desktop/voz instead of the default data folder, with 20 workers, 10 retries and debug logging

License

This project is under the MIT License. See the LICENSE file for the full license text.

Roadmap
  • Crawl pages by range (from page x to page y).
  • Crawl pages by list of numbers (input from CLI args or from file).
  • Filter emoticons to separated folder.

Documentation

Functions

func Crawl

func Crawl(ctx context.Context, cfg VozerConfig) error

Types

type CrawledMeta

type CrawledMeta struct {
	Success []uint `json:"success"`
	Failed  []uint `json:"failed"`
}

type CrawledPageMeta

type CrawledPageMeta struct {
	PageNumber uint
	Document   *goquery.Document
}

type ImageMeta

type ImageMeta struct {
	URL      string `json:"url"`
	Filename string `json:"filename"`
	Seen     int    `json:"seen"`
	AtPosts  []int  `json:"at_posts"`
}
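
The Seen and AtPosts fields suggest that each distinct image is recorded once, along with how often and in which posts it appeared. The following is a minimal sketch of that bookkeeping, assuming images are deduplicated by URL. The imageIndex type and its record method are hypothetical; the package's real logic is not shown in this documentation.

```go
package main

import "fmt"

// ImageMeta mirrors the struct documented above.
type ImageMeta struct {
	URL      string `json:"url"`
	Filename string `json:"filename"`
	Seen     int    `json:"seen"`
	AtPosts  []int  `json:"at_posts"`
}

// imageIndex is a hypothetical helper that deduplicates images by URL.
type imageIndex map[string]*ImageMeta

// record notes that url appeared in the post with the given number.
// The first sighting creates the entry; later sightings only update it.
func (idx imageIndex) record(url, filename string, post int) *ImageMeta {
	m, ok := idx[url]
	if !ok {
		m = &ImageMeta{URL: url, Filename: filename}
		idx[url] = m
	}
	m.Seen++
	m.AtPosts = append(m.AtPosts, post)
	return m
}

func main() {
	idx := imageIndex{}
	idx.record("https://example.com/a.jpg", "a.jpg", 1)
	idx.record("https://example.com/b.jpg", "b.jpg", 2)
	m := idx.record("https://example.com/a.jpg", "a.jpg", 7)
	fmt.Println(len(idx), m.Seen, m.AtPosts) // 2 2 [1 7]
}
```

A plain map like this is not safe for concurrent use, so code that shares it between several workers would need a sync.Mutex or a single collecting goroutine.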

type PageURLMeta

type PageURLMeta struct {
	URL        string
	PageNumber uint
	Retries    uint
}

type Report

type Report struct {
	Config  VozerConfig `json:"config"`
	Crawled CrawledMeta `json:"crawled"`
}

type URLMeta

type URLMeta struct {
	URL     string `json:"url"`
	Text    string `json:"text"`
	Seen    int    `json:"seen"`
	AtPosts []int  `json:"at_posts"`
}

type VozerConfig

type VozerConfig struct {
	ThreadURL     string `json:"thread_url"`
	NuWorkers     uint   `json:"workers"`
	IsCrawlURLs   bool   `json:"is_crawl_urls"`
	IsCrawlImages bool   `json:"is_crawl_images"`
	DestPath      string `json:"destination_path"`
	Retries       uint   `json:"retries"`
	CrawlPages    []uint `json:"crawl_pages"`
	CrawlFromPage uint   `json:"crawl_from_page"`
	CrawlToPage   uint   `json:"crawl_to_page"`
}

func (*VozerConfig) Validate

func (c *VozerConfig) Validate() error

Directories

Path Synopsis
cmd
