Reactor Crawler
Simple CLI content crawler for Joyreactor. It finds all media content on the
page you provide and saves it. If the page has any kind of pagination, it will go through
all pages as well, unless you tell it not to.
Quick start
Here's the quickest way to download something and test the crawler:
- Download the latest build for your OS from here.
- Pick some URL from Joyreactor.
- Run the crawler:
$ reactor-crw html -p "http://joyreactor.cc/tag/digital+art"
Crawler types
This crawler supports both HTML and GraphQL API endpoints. As shown in the example above, the html
subcommand selects the HTML crawler.
HTML crawler
It accepts a URL and crawls the corresponding HTML response, as well as all related pages reachable through pagination.
It can accept virtually any URL: a tag, a fandom, a direct post URL, or a search results page.
Here is a full list of HTML crawler flags:
$ reactor-crw html --help
Allows to quickly download all content by its direct url or entire tag or fandom from joyreactor.cc.
Example: reactor-crw -d "." -p "http://joyreactor.cc/tag/someTag/all" -w 2 -c "cookie-string"
Usage:
reactor-crw html [flags]
Flags:
-c, --cookie string User's cookie. Some content may be unavailable without it
-d, --destination string Save path for content. Default value is a user's home folder
(example C:\Users\username for Windows) (default "/home/user")
-h, --help help for html
-p, --path string Provide a full page URL
--proxy string HTTP proxy on given host + port. Example http://176.9.63.62:3128
-s, --search string A comma separated list of content types that should be downloaded.
Possible values: image,gif,webm,mp4. Example: -s "image,webm" (default "image,gif")
-o, --single-page Crawl only one page
--socks5 string SOCKS5 proxy on given host + port. Example socks5://127.0.0.1:9050
-w, --workers int Amount of workers (default 1)
Of all the flags, only -p (--path) is required. All other flags can be omitted, in which case their default values are used.
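Putting several of these flags together, a typical invocation might look like the following sketch. All flags are taken from the help output above; the destination path and cookie string are placeholders you would replace with your own values:

```shell
# Crawl a tag with 2 workers, saving images and webm files
# to the current directory. "cookie-string" is a placeholder
# for your actual session cookie.
reactor-crw html \
  -p "http://joyreactor.cc/tag/digital+art" \
  -d "." \
  -s "image,webm" \
  -w 2 \
  -c "cookie-string"
```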
API crawler
This crawler uses Joyreactor's GraphQL schema and retrieves content via direct API requests. Unlike the HTML crawler, it accepts tag names only (for now).
Here is the previous example, but using the API:
$ reactor-crw api -t "digital art"
Here is a full list of API crawler flags:
$ reactor-crw api --help
Allows to quickly download all content by its tag name using joyreactor's graphQL API endpoint.
Apart from html crawler it uses a user's cookie not to fetch restricted content but for filtering it.
By default all tag's content will be fetched. Example: reactor-crw api -d "." -t "tagName" -w 2 -c "cookie-string"
Usage:
reactor-crw api [flags]
Flags:
-c, --cookie string User's cookie. Used to apply content filtration based on the user tags preferences
-d, --destination string Save path for content. Default value is a user's home folder
(example C:\Users\username for Windows) (default "/home/user")
-h, --help help for api
--nsfw Include NSFW content to the search (default true)
--proxy string HTTP proxy on given host + port. Example http://176.9.63.62:3128
-s, --search string A comma separated list of content types that should be downloaded.
Possible values: image,gif. Example: -s "image" (default "image,gif")
--socks5 string SOCKS5 proxy on given host + port. Example socks5://127.0.0.1:9050
-t, --tag string Tag name
--type string Content type. Possible values are: new, good, best, all (default "good")
-w, --workers int Amount of workers (default 1)
As you can see, html
and api
share many flags, but some of them serve different purposes.
For example, while the --cookie
flag in the html
crawler allows fetching restricted content, the api
crawler uses the cookie to filter results, since the API returns all posts without any restrictions.
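As a sketch, an API crawl that fetches only the newest posts of a tag and excludes NSFW content might look like this (the tag name is just an example, and the `--nsfw=false` form assumes the usual `--flag=false` syntax for boolean CLI flags):

```shell
# Fetch only images from the newest posts of a tag,
# excluding NSFW content (--nsfw defaults to true,
# --type defaults to "good").
reactor-crw api \
  -t "digital art" \
  --type "new" \
  -s "image" \
  --nsfw=false
```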
⚠ Please note: Joyreactor's API strictly limits requests per second from a single host.
To avoid being blocked, the API crawler waits 2 seconds between requests.
For example, if a tag has 20 pages, crawling alone will take
~40 seconds, and only after that will downloading start.
Examples
Download all of a tag's gifs to the user's Downloads folder:
$ reactor-crw html -p "http://joyreactor.cc/tag/tagname" -d "~/Downloads" -s "gif"
Using the user's cookie, download only images and mp4 files from a specific post to the current directory:
$ reactor-crw html -p "http://joyreactor.cc/post/postid" -d "." -s "image,mp4" -c "users-cookie"
Download images and gifs from search results to the user's home folder, speeding up
the download with additional workers:
$ reactor-crw html -p "http://joyreactor.cc/search?q=query&user=&tags=tag" -w 3
Note: some content can be parsed only with a user's cookie.
FAQ
Q: What pages can I pass to the crawler?
A: Any Joyreactor page except top posts pages. The crawler accepts links to tags, favorites, fandoms, posts, and search results.
Q: Why were some images not downloaded?
A: Some content may require a user's cookie (-c, --cookie),
since it is hidden from unregistered users.
Q: What does the -w (--workers)
flag do?
A: This flag sets the number of workers that download content for you. If you set the value too high,
Joyreactor's server will notice the frequent requests and block you for some time.
For a lot of content (~5000 images), set -w
up to 5. For a few images (~200), 2 workers will be enough.
Q: I've encountered an unexpected error while using the crawler. What can I do?
A: Please create a bug report describing the error, including your search request parameters and all flags that were used.