Reactor Crawler


Simple CLI content crawler for Joyreactor. He'll find all media content on the page you've provided and save it. If there's any kind of pagination, he'll go through all pages as well, unless you tell him not to.


Quick start

Here's the quickest way to download something and test the crawler:

  • Download the latest build for your OS from here.
  • Pick some URL from Joyreactor.
  • Run the crawler: $ reactor-crw html -p "http://joyreactor.cc/tag/digital+art"

Crawler types

This crawler supports two modes: parsing HTML pages and querying the GraphQL API. As the example above shows, the html subcommand selects the HTML crawler.

HTML crawler

It accepts a URL and crawls the corresponding HTML response, as well as all related pages reachable through pagination. It can accept basically any URL (a tag, a fandom, a direct post URL, or a search results page). Here is the full list of HTML crawler flags:

$ reactor-crw html --help

Allows to quickly download all content by its direct url or entire tag or fandom from joyreactor.cc.
Example: reactor-crw -d "." -p "http://joyreactor.cc/tag/someTag/all" -w 2 -c "cookie-string"

Usage:
  reactor-crw html [flags]

Flags:
  -c, --cookie string        User's cookie. Some content may be unavailable without it
  -d, --destination string   Save path for content. Default value is a user's home folder
                             (example C:\Users\username for Windows) (default "/home/user")
  -h, --help                 help for html
  -p, --path string          Provide a full page URL
      --proxy string         HTTP proxy on given host + port. Example http://176.9.63.62:3128
  -s, --search string        A comma separated list of content types that should be downloaded.
                             Possible values: image,gif,webm,mp4. Example: -s "image,webm" (default "image,gif")
  -o, --single-page          Crawl only one page
      --socks5 string        SOCKS5 proxy on given host + port. Example socks5://127.0.0.1:9050
  -w, --workers int          Amount of workers (default 1)

Of all the flags, only -p, --path is required. All other flags can be omitted, in which case their default values are used.
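
For intuition, here is a rough sketch of what an HTML crawler of this kind does: fetch the page, collect media sources, and follow the pagination link until there are no more pages (or stop after one page, as with --single-page). The goquery selectors and helper names below are assumptions made for illustration, not the project's actual code.

    // Package sketch is an illustrative outline, not reactor-crw's real implementation.
    package sketch

    import (
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

    // crawl collects media URLs from a page and, unless singlePage is set,
    // keeps following the "next page" link.
    func crawl(pageURL string, singlePage bool) ([]string, error) {
        var media []string
        for pageURL != "" {
            resp, err := http.Get(pageURL)
            if err != nil {
                return nil, err
            }
            doc, err := goquery.NewDocumentFromReader(resp.Body)
            resp.Body.Close()
            if err != nil {
                return nil, err
            }
            // Collect sources of images, gifs and videos on the current page.
            doc.Find("img, source").Each(func(_ int, s *goquery.Selection) {
                if src, ok := s.Attr("src"); ok {
                    media = append(media, src)
                }
            })
            if singlePage {
                break
            }
            // Follow the pagination link if present (selector is illustrative).
            next, ok := doc.Find("a.next").Attr("href")
            if !ok {
                break
            }
            pageURL = next
        }
        return media, nil
    }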

API crawler

This crawler uses Joyreactor's GraphQL schema and retrieves content with direct API requests. Unlike the HTML crawler, it accepts tag names only (for now). Here is the previous example, but using the API: $ reactor-crw api -t "digital art". Here is the full list of API crawler flags:

$ reactor-crw api --help

Allows to quickly download all content by its tag name using joyreactor's graphQL API endpoint.
Apart from html crawler it uses a user's cookie not to fetch restricted content but for filtering it.
By default all tag's content will be fetched. Example: reactor-crw api -d "." -t "tagName" -w 2 -c "cookie-string"

Usage:
  reactor-crw api [flags]

Flags:
  -c, --cookie string        User's cookie. Used to apply content filtration based on the user tags preferences
  -d, --destination string   Save path for content. Default value is a user's home folder
                             (example C:\Users\username for Windows) (default "/home/user")
  -h, --help                 help for api
      --nsfw                 Include NSFW content to the search (default true)
      --proxy string         HTTP proxy on given host + port. Example http://176.9.63.62:3128
  -s, --search string        A comma separated list of content types that should be downloaded.
                             Possible values: image,gif. Example: -s "image" (default "image,gif")
      --socks5 string        SOCKS5 proxy on given host + port. Example socks5://127.0.0.1:9050
  -t, --tag string           Tag name
      --type string          Content type. Possible values are: new, good, best, all (default "good")
  -w, --workers int          Amount of workers (default 1)

As you can see, html and api have a lot of flags in common, but some of them serve different purposes. For example, while the --cookie flag lets the html crawler fetch restricted content, the api crawler uses the cookie to filter results, since the API returns all posts as-is, without any restrictions.
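
In both crawlers the cookie value is simply forwarded with the outgoing HTTP requests. A minimal sketch of what that amounts to (illustrative only, not the project's code):

    package sketch

    import "net/http"

    // newRequest builds a GET request that carries the raw cookie string from
    // the -c/--cookie flag, so the server treats the crawl as a logged-in user.
    func newRequest(url, cookie string) (*http.Request, error) {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        if cookie != "" {
            req.Header.Set("Cookie", cookie)
        }
        return req, nil
    }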

Please note: Joyreactor's API strictly limits the number of requests per second from a single host. To avoid being blocked quickly, the API crawler waits 2 seconds between requests. For example, if a tag has 20 pages, crawling alone will take ~40 seconds, and only after that does the download process start.
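
As a rough illustration of that pacing, here is a hedged sketch of a paged API loop with the 2-second delay. The endpoint URL, the query text, and the helper names are placeholders invented for the example; they are not Joyreactor's actual schema.

    package sketch

    import (
        "bytes"
        "encoding/json"
        "io"
        "net/http"
        "time"
    )

    const apiEndpoint = "https://example.invalid/graphql" // placeholder endpoint

    // queryPage sends one GraphQL request for a single page of results.
    func queryPage(client *http.Client, tag string, page int) ([]byte, error) {
        body, _ := json.Marshal(map[string]interface{}{
            "query":     "query($tag: String!, $page: Int!) { ... }", // illustrative query
            "variables": map[string]interface{}{"tag": tag, "page": page},
        })
        resp, err := client.Post(apiEndpoint, "application/json", bytes.NewReader(body))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        return io.ReadAll(resp.Body)
    }

    // crawlTag walks all pages of a tag, pausing 2 seconds between requests.
    // 20 pages therefore take roughly 40 seconds before downloads begin.
    func crawlTag(tag string, pages int) ([][]byte, error) {
        client := &http.Client{Timeout: 30 * time.Second}
        var results [][]byte
        for p := 1; p <= pages; p++ {
            res, err := queryPage(client, tag, p)
            if err != nil {
                return nil, err
            }
            results = append(results, res)
            time.Sleep(2 * time.Second)
        }
        return results, nil
    }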

Examples

Download all of a tag's gifs to the user's Downloads folder:

$ reactor-crw -p "http://joyreactor.cc/tag/tagname" -d "~/Downloads" -s "gif"

Using the user's cookie, download only images and mp4 files from a specific post to the current directory:

$ reactor-crw -p "http://joyreactor.cc/post/postid" -d "." -s "image,mp4" -c "users-cookie"

Download images and gifs from search results to the user's home folder, speeding up the download with additional workers:

$ reactor-crw -p "http://joyreactor.cc/search?q=query&user=&tags=tag" -w 3

Note: some content can only be parsed with a user's cookie.

FAQ

Q: What pages can I pass to the crawler?
A: Any Joyreactor page except the top posts pages. The crawler accepts links to tags, favorites, fandoms, posts, and search results.

Q: Why were some images not downloaded?
A: Some content may require a user's cookie (-c, --cookie), since it is hidden from unregistered users.

Q: What does the -w (--workers) flag do?
A: This flag sets the number of workers that download content for you. If you set it too high, Joyreactor's server will notice the frequent requests and block you for some time. For a lot of content (~5000 images), set -w up to 5; for a few images (~200), 2 workers are enough.
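
For intuition, here is a rough sketch of the worker-pool pattern the flag refers to: a fixed number of goroutines draining a shared queue of download jobs. The helper names are hypothetical, not the project's actual code.

    package sketch

    import (
        "fmt"
        "io"
        "net/http"
        "os"
        "path/filepath"
        "sync"
    )

    // download starts the requested number of workers; each worker pulls URLs
    // from the shared channel and saves them to the destination folder.
    func download(urls []string, destination string, workers int) {
        jobs := make(chan string)
        var wg sync.WaitGroup

        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for u := range jobs {
                    if err := saveFile(u, destination); err != nil {
                        fmt.Println("download failed:", u, err)
                    }
                }
            }()
        }

        // More workers drain this queue faster, but also hit the server harder.
        for _, u := range urls {
            jobs <- u
        }
        close(jobs)
        wg.Wait()
    }

    // saveFile fetches one URL and writes it to disk under its base name.
    func saveFile(url, destination string) error {
        resp, err := http.Get(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        out, err := os.Create(filepath.Join(destination, filepath.Base(url)))
        if err != nil {
            return err
        }
        defer out.Close()

        _, err = io.Copy(out, resp.Body)
        return err
    }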

Q: I've run into an unexpected error while using the crawler. What can I do?
A: Please create a bug report describing the error, including your search request parameters and all flags that were used.
