Reactor Crawler
Simple CLI content crawler for Joyreactor. It finds all media content on the
page you provide and saves it. If the page has any kind of pagination, it will go through
all pages as well, unless you tell it not to.
Quick start
Here's the quickest way to download something and test the crawler:
- Download the latest build for your OS from here.
- Pick some URL from Joyreactor.
- Run the crawler:
$ reactor-crw html -p "http://joyreactor.cc/tag/digital+art"
Crawler types
This crawler supports both HTML and GraphQL API endpoints. As shown in the example above, the html
subcommand selects the HTML crawler.
HTML crawler
It accepts a URL and crawls the corresponding HTML response, as well as all related pages reachable through pagination.
It can accept virtually any URL: a tag, a fandom, a direct post URL, or a search results page.
Here is a full list of HTML crawler flags:
$ reactor-crw html --help
Allows to quickly download all content by its direct url or entire tag or fandom from joyreactor.cc.
Example: reactor-crw -d "." -p "http://joyreactor.cc/tag/someTag/all" -w 2 -c "cookie-string"
Usage:
reactor-crw html [flags]
Flags:
-c, --cookie string User's cookie. Some content may be unavailable without it
-d, --destination string Save path for content. Default value is a user's home folder
(example C:\Users\username for Windows) (default "/home/user")
-h, --help help for html
-p, --path string Provide a full page URL
--proxy string HTTP proxy on given host + port. Example http://176.9.63.62:3128
-s, --search string A comma separated list of content types that should be downloaded.
Possible values: image,gif,webm,mp4. Example: -s "image,webm" (default "image,gif")
-o, --single-page Crawl only one page
--socks5 string SOCKS5 proxy on given host + port. Example socks5://127.0.0.1:9050
-w, --workers int Amount of workers (default 1)
Of all the flags, only -p (--path) is required. All other flags can be omitted, in which case their default values are used.
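Putting several of these flags together, a typical invocation might look like the following sketch. All flags are taken from the help output above; the destination path and cookie string are placeholders you would replace with your own values:

```shell
# Crawl a tag with 2 workers, saving images and webm files
# to the current directory. "cookie-string" is a placeholder
# for your actual session cookie.
reactor-crw html \
  -p "http://joyreactor.cc/tag/digital+art" \
  -d "." \
  -s "image,webm" \
  -w 2 \
  -c "cookie-string"
```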
API crawler
This crawler uses Joyreactor's GraphQL schema and retrieves content via direct API requests. Unlike the HTML crawler, it accepts tag names only (for now).
Here is the previous example, but using the API:
$ reactor-crw api -t "digital art"
Here is a full list of API crawler flags:
$ reactor-crw api --help
Allows to quickly download all content by its tag name using joyreactor's graphQL API endpoint.
Apart from html crawler it uses a user's cookie not to fetch restricted content but for filtering it.
By default all tag's content will be fetched. Example: reactor-crw api -d "." -t "tagName" -w 2 -c "cookie-string"
Usage:
reactor-crw api [flags]
Flags:
-c, --cookie string User's cookie. Used to apply content filtration based on the user tags preferences
-d, --destination string Save path for content. Default value is a user's home folder
(example C:\Users\username for Windows) (default "/home/user")
-h, --help help for api
--nsfw Include NSFW content to the search (default true)
--proxy string HTTP proxy on given host + port. Example http://176.9.63.62:3128
-s, --search string A comma separated list of content types that should be downloaded.
Possible values: image,gif. Example: -s "image" (default "image,gif")
--socks5 string SOCKS5 proxy on given host + port. Example socks5://127.0.0.1:9050
-t, --tag string Tag name
--type string Content type. Possible values are: new, good, best, all (default "good")
-w, --workers int Amount of workers (default 1)
As you can see, html
and api
share many flags, but some of them serve different purposes.
For example, while the --cookie
flag in the html
crawler allows fetching restricted content, the api
crawler uses the cookie to filter results, since the API returns all posts without any restrictions.
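As a sketch, an API crawl that fetches only the newest posts of a tag and excludes NSFW content might look like this (the tag name is just an example, and the `--nsfw=false` form assumes the usual `--flag=false` syntax for boolean CLI flags):

```shell
# Fetch only images from the newest posts of a tag,
# excluding NSFW content (--nsfw defaults to true,
# --type defaults to "good").
reactor-crw api \
  -t "digital art" \
  --type "new" \
  -s "image" \
  --nsfw=false
```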
⚠ Please note: Joyreactor's API strictly limits requests per second from a single host.
To avoid being blocked, the API crawler waits 2 seconds between requests.
For example, if a tag has 20 pages, crawling alone will take
~40 seconds, and only after that will downloading start.
Examples
Download all of a tag's gifs to the user's Downloads folder:
$ reactor-crw html -p "http://joyreactor.cc/tag/tagname" -d "~/Downloads" -s "gif"
Using the user's cookie, download only images and mp4 files from a specific post to the current directory:
$ reactor-crw html -p "http://joyreactor.cc/post/postid" -d "." -s "image,mp4" -c "users-cookie"
Download images and gifs from search results to the user's home folder, speeding up
the download with additional workers:
$ reactor-crw html -p "http://joyreactor.cc/search?q=query&user=&tags=tag" -w 3
Note: some content can be parsed only with a user's cookie.
FAQ
Q: What pages can I pass to the crawler?
A: Any Joyreactor page except top posts pages. The crawler accepts links to tags, favorites, fandoms, posts, and search results.
Q: Why were some images not downloaded?
A: Some content may require a user's cookie (-c, --cookie),
since it is hidden from unregistered users.
Q: What does the -w (--workers)
flag do?
A: This flag sets the number of workers that download content for you. If you set the value too high,
Joyreactor's server will notice the frequent requests and block you for some time.
For a lot of content (~5000 images), set -w
up to 5. For a few images (~200), 2 workers will be enough.
Q: I've encountered an unexpected error while using the crawler. What can I do?
A: Please create a bug report describing the error, including your search request parameters and all flags that were used.