ggmbox

command module
v0.0.0-...-0393f8f Latest
Published: Dec 10, 2018 License: MIT Imports: 15 Imported by: 0


A crawler and parser for raw Google Groups emails. Turbo speed and reliable! The downloaded messages are in RFC 822 format, taken verbatim from the Google servers.

Installation
Docker

Docker is the simplest option; the image is published on DockerHub. Prepend "docker run -it --rm vmarkovtsev/ggmbox" to all the commands in the "Usage" section.

Crawler

Requirements: Python 3 and Scrapy. Download the ggmbox.py file.

Parser

Requirements: Go.

go get -v github.com/vmarkovtsev/ggmbox
Usage
Crawler
scrapy runspider -a name=golang-nuts -o result.json -t json ggmbox.py

Replace "golang-nuts" with the actual group name. The raw emails will be saved by default to the corresponding directory.

scrapy runspider -a name=chromium-dev -a prefix=a/chromium.org -o result.json -t json ggmbox.py

Note the "prefix" argument: it sets the name of the parent, which some groups require.
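Since the crawler saves messages as verbatim RFC 822 files, they can be inspected with Python's standard email module before running the parser. The following is a minimal sketch, not part of ggmbox: it assumes every regular file under the output directory is one raw message, and the names iter_messages and summarize are hypothetical helpers introduced here for illustration.

```python
import email
import email.policy
import sys
from pathlib import Path


def iter_messages(root):
    """Yield (path, message) for every regular file under root.

    Assumption: each file holds exactly one raw RFC 822 message.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            with path.open("rb") as fh:
                msg = email.message_from_binary_file(
                    fh, policy=email.policy.default
                )
            yield path, msg


def summarize(root, limit=5):
    """Return (path, sender, subject) tuples for the first `limit` messages."""
    rows = []
    for index, (path, msg) in enumerate(iter_messages(root)):
        if index >= limit:
            break
        sender = str(msg.get("From", ""))
        subject = str(msg.get("Subject", ""))
        rows.append((str(path), sender, subject))
    return rows


if __name__ == "__main__":
    # Usage: python inspect_emails.py golang-nuts
    for path, sender, subject in summarize(sys.argv[1]):
        print(path, sender, subject, sep=" | ")
```

If the directory also contains files that are not messages, they will still be parsed leniently and will simply show empty headers.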

Parser
./parse golang-nuts > dataset.csv

Replace "golang-nuts" with the name of the directory that contains the raw emails. The plain text threads will be written to dataset.csv, one thread per line. Special characters are escaped.
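The resulting dataset.csv can be loaded with Python's csv module. This is a sketch under assumptions the README does not confirm: that the file is UTF-8 and uses standard CSV quoting, and that each non-empty record is one thread. The column layout is not documented here, so rows are returned as plain lists; count_threads and preview are hypothetical helper names.

```python
import csv
import sys

# Whole threads can exceed the csv module's default field size limit
# (131072 characters), so raise it before reading.
csv.field_size_limit(2**31 - 1)


def count_threads(csv_path):
    """Count non-empty records, assuming one thread per record."""
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as fh:
        return sum(1 for row in csv.reader(fh) if row)


def preview(csv_path, limit=3):
    """Return the first `limit` non-empty records as lists of fields."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as fh:
        for row in csv.reader(fh):
            if len(rows) >= limit:
                break
            if row:
                rows.append(row)
    return rows


if __name__ == "__main__":
    # Usage: python read_dataset.py dataset.csv
    print("threads:", count_threads(sys.argv[1]))
    for row in preview(sys.argv[1]):
        print(len(row), "field(s); first field starts with:", row[0][:80])
```

Escaped special characters are left as they appear in the file; undoing the escaping would require knowing the exact scheme the parser uses.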

Performance
Crawler

The golang-nuts group was fully fetched on 24/02/2018, with 30043 topics and 192654 messages, in 3 hours over a 1 Gbps connection. The raw emails occupied 1.6 GB on disk.

For comparison, icy/google-group-crawler ran for 1 day, fetched only 63%, and then stopped without reporting any errors; henryk/gggd fetched only 3% within one hour and then stopped unexpectedly too.

Parser

It takes 7 seconds to parse 1.6 GB of raw emails on a 32-core machine.
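The figures in this section can be turned into rough throughput numbers. The arithmetic below uses only the values quoted above and assumes 1 GB = 1000 MB; the results are back-of-the-envelope averages, not measurements.

```python
# Values quoted in the Performance section.
CRAWL_SECONDS = 3 * 3600   # 3 hours
TOPICS = 30043
MESSAGES = 192654
RAW_MB = 1.6 * 1000        # 1.6 GB, assuming 1 GB = 1000 MB
PARSE_SECONDS = 7


def per_second(amount, seconds):
    """Average rate of `amount` over `seconds`."""
    return amount / seconds


crawl_messages_per_second = per_second(MESSAGES, CRAWL_SECONDS)
crawl_mb_per_second = per_second(RAW_MB, CRAWL_SECONDS)
messages_per_topic = MESSAGES / TOPICS
parse_mb_per_second = per_second(RAW_MB, PARSE_SECONDS)

if __name__ == "__main__":
    print(f"crawler: {crawl_messages_per_second:.2f} messages/s")
    print(f"crawler: {crawl_mb_per_second:.3f} MB/s on disk")
    print(f"average thread length: {messages_per_topic:.2f} messages")
    print(f"parser: {parse_mb_per_second:.1f} MB/s")
```

This works out to roughly 18 messages per second for the crawler and roughly 229 MB per second for the parser across all 32 cores.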

Contributions

...are welcome! See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

License

MIT.
