gh-archive-clickhouse

module
v0.0.0-...-b0480a5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 19, 2024 License: Apache-2.0

README

gh-archive-clickhouse

Save public GitHub event stream to ClickHouse as json.

CREATE TABLE github_events_raw
(
    id  Int64,
    ts  DateTime32,
    raw String CODEC (ZSTD(16))
) ENGINE = ReplacingMergeTree
      PARTITION BY toYYYYMMDD(ts)
      ORDER BY (ts, id);

Alternative to gharchive crawler with decreased probability to miss events.

  • Streaming to ClickHouse via native protocol instead of using files, so storage and fethching are decoupled.
  • Automatic pagination if more than one page of new events is available
  • Automatic fetch rate adjustment based on rate limit github headers and request duration
  • ETag support to skip cached results

gh-load

$ gh-load --help
Usage of gh-load:
  -b int
    	max token bytes (default 104857600)
  -db string
    	db
  -from string
    	start from (default "2022-03-14T21")
  -host string
    	host (default "localhost")
  -jobs int
    	jobs (default 1)
  -port int
    	port (default 9000)
  -table string
    	table (default "github_events_raw")
  -to string
    	end with (default "2022-04-14T21")

Example usage (consumes ~340MB RAM per job):

gh-load --db faster --from 2020-01-01T00 --to 2020-01-20T00 --jobs 20

Directories

Path Synopsis
cmd
internal
gh
Package gh is GitHub client.
Package gh is GitHub client.
zctx
Package zctx is a context-aware zap logger.
Package zctx is a context-aware zap logger.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL