scradstxt

Published: Feb 15, 2022. License: MIT.

README

Collects and parses ads.txt

This Go program scrapes sites for ads.txt and stores its significant details in a PostgreSQL database.

Give it a CSV file listing the sites to check (rank,site.url). I use the top 1M sites from https://tranco-list.eu/top-1m.csv.zip. A smaller top-1k.csv is supplied for demonstration.
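Each input line holds a rank and a hostname separated by a comma. A minimal sketch of parsing one such line (the helper parseLine is hypothetical, not a function from this program):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLine splits a "rank,site.url" CSV line into its two fields.
// Hypothetical helper for illustration only.
func parseLine(line string) (int, string, error) {
	parts := strings.SplitN(strings.TrimSpace(line), ",", 2)
	if len(parts) != 2 {
		return 0, "", fmt.Errorf("malformed line: %q", line)
	}
	rank, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, "", err
	}
	return rank, parts[1], nil
}

func main() {
	rank, site, err := parseLine("1,google.com")
	fmt.Println(rank, site, err)
}
```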

The scraper first tries the HTTPS scheme; if the connection fails, it falls back to HTTP. The User-Agent header is spoofed. The timeout is 5 seconds, defined by the constant crawlerTimeout.
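The scheme fallback can be sketched roughly as below. This is a minimal illustration under stated assumptions: the function names, the spoofed User-Agent string, and the exact request setup are not taken from the program.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

const crawlerTimeout = 5 * time.Second

// candidateURLs returns the URLs to try for a host, HTTPS first.
func candidateURLs(host string) []string {
	return []string{
		"https://" + host + "/ads.txt",
		"http://" + host + "/ads.txt",
	}
}

// fetchAdsTxt tries HTTPS first and falls back to HTTP on failure.
// The User-Agent is set to a browser-like string ("spoofed").
func fetchAdsTxt(host string) (string, error) {
	client := &http.Client{Timeout: crawlerTimeout}
	var lastErr error
	for _, u := range candidateURLs(host) {
		req, err := http.NewRequest("GET", u, nil)
		if err != nil {
			lastErr = err
			continue
		}
		req.Header.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")
		resp, err := client.Do(req)
		if err != nil {
			lastErr = err // e.g. HTTPS refused: try the next scheme
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			continue
		}
		return string(body), nil
	}
	return "", lastErr
}

func main() {
	fmt.Println(candidateURLs("example.com"))
}
```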

The user who runs this program must have a PostgreSQL ROLE allowing SELECT, INSERT, and DELETE queries on the working database. The program connects to the database via a Unix socket; adjust the dbConnectionString constant if TCP, another database name, or another authentication method is used. The PostgreSQL database is named adstxt.

sudo -u postgres psql -c 'CREATE DATABASE adstxt'
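With the standard database/sql package and a lib/pq-style driver, a Unix-socket connection string for this setup might look like the following. This is a sketch only: the socket directory and the exact form of dbConnectionString in the program are assumptions.

```go
package main

import "fmt"

// buildDSN assembles a lib/pq-style connection string. Connecting via
// a Unix socket means pointing "host" at the socket directory;
// /var/run/postgresql is a common default and an assumption here.
func buildDSN(socketDir, dbName string) string {
	return fmt.Sprintf("host=%s dbname=%s sslmode=disable", socketDir, dbName)
}

func main() {
	dsn := buildDSN("/var/run/postgresql", "adstxt")
	// With a driver registered, this string would be passed to
	// sql.Open("postgres", dsn).
	fmt.Println(dsn)
}
```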

Create tables in it with the mktables.sql script.

psql -d adstxt < mktables.sql

Run the program with:

go run main.go top-1k.csv

or build an executable first:

go build main.go
./main top-1k.csv

By default, 64 goroutines run to fetch ads.txt from sites. This number can be increased for fast machines on fast connections with an optional argument after the file name.

go run main.go top-1k.csv 1000
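The concurrent fetching can be sketched as a plain worker pool over channels. This is a minimal illustration of the pattern, not the program's actual structure:

```go
package main

import (
	"fmt"
	"sync"
)

// runPool starts workerCount goroutines that read sites from a channel
// and send results back. The program defaults to 64 workers.
func runPool(sites []string, workerCount int, fetch func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string, len(sites))

	var wg sync.WaitGroup
	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for site := range jobs {
				results <- fetch(site)
			}
		}()
	}

	for _, s := range sites {
		jobs <- s
	}
	close(jobs) // no more work; workers drain and exit
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	got := runPool([]string{"a.com", "b.com"}, 2, func(s string) string {
		return s + "/ads.txt"
	})
	fmt.Println(len(got))
}
```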

The third argument is a continuation flag. If a previous scraping run was not finished, it can be continued in the next run of the program by specifying the flag c (continue). Because arguments are positional, the goroutine-count parameter becomes mandatory when the continuation flag is used.

go run main.go top-1k.csv 64 c
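The positional arguments described above could be parsed roughly as follows. This is a hedged sketch; the program's actual parsing may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseArgs interprets the positional arguments:
// args[0] = CSV file (required), args[1] = goroutine count (optional,
// default 64), args[2] = continuation flag "c" (optional).
func parseArgs(args []string) (file string, workers int, cont bool, err error) {
	workers = 64 // default goroutine count
	if len(args) < 1 {
		return "", 0, false, fmt.Errorf("usage: main <sites.csv> [goroutines] [c]")
	}
	file = args[0]
	if len(args) >= 2 {
		workers, err = strconv.Atoi(args[1])
		if err != nil {
			return "", 0, false, err
		}
	}
	// The flag is positional, so it only works when workers is also given.
	if len(args) >= 3 && args[2] == "c" {
		cont = true
	}
	return file, workers, cont, nil
}

func main() {
	f, w, c, _ := parseArgs([]string{"top-1k.csv", "64", "c"})
	fmt.Println(f, w, c)
}
```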
