spindel

package module
v0.0.0-...-78bf94d Latest
Warning

This package is not in the latest version of its module.

Published: Nov 11, 2021 License: MIT Imports: 19 Imported by: 0

README

Spindel

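Experimental API server: it takes a request for a given id and returns a result fused from OCI citations and index data.

Sample results, one line per request; the columns are id (truncated here), DOI, citing count, cited count and time taken (took):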

ai-49-aHR0cD...TAuMTEwNC9wc...    10.1104/pp.88.4.1411           0   33   0.011371553
ai-49-aHR0cD...TAuMTc1NzYva...    10.17576/jsm-2019-4808-23      0   3    0.002403981
ai-49-aHR0cD...TAuMTYxNC93d...    10.1614/wt-08-045.1            19  12   0.006658463
ai-49-aHR0cD...TAuMzg5Ny96b...    10.3897/zookeys.449.6813       0   1    0.000609854
ai-49-aHR0cD...TAuMTA4OC8xN...    10.1088/1757-899x/768/5/052105 2   0    0.000913447
ai-49-aHR0cD...TAuNTgxMS9jc...    10.5811/cpcem.2019.7.43632     1   0    0.047257667
ai-49-aHR0cD...TAuMTEwMy9wa...    10.1103/physrevc.49.3061       27  4    0.008262996
ai-49-aHR0cD...TAuMTM3MS9qb...    10.1371/journal.pone.0077786   38  15   0.018779194
ai-49-aHR0cD...TAuMTAwMi9sZ...    10.1002/ldr.3400040418         2   0    0.000982242
ai-49-aHR0cD...TAuMTEwMy9wa...    10.1103/physrevlett.81.3187    15  14   0.007743473
ai-49-aHR0cD...TAuMTAwMi9ub...    10.1002/nme.1620300822         7   6    0.004755116
ai-49-aHR0cD...TAuMTM3MS9qb...    10.1371/journal.pcbi.1002234   54  4    0.018582831
ai-49-aHR0cD...TAuMTAxNi8wM...    10.1016/0165-4896(94)00731-4   5   4    0.004127696
ai-49-aHR0cD...TAuMTA5My9qe...    10.1093/jxb/49.318.21          0   0    0.000267756
ai-49-aHR0cD...TAuMTE0Mi9zM...    10.1142/s0218126619500051      22  2    0.006445901
ai-49-aHR0cD...TAuNzg2MS9jb...    10.7861/clinmedicine.17-4-332  13  8    0.005840636
ai-49-aHR0cD...TAuMTM3My9jb...    10.1373/clinchem.2013.204446   20  11   0.011903923
ai-49-aHR0cD...TAuMTE0My9qa...    10.1143/jjap.9.958             0   7    0.002963267
ai-49-aHR0cD...TAuMTAyMS9hb...    10.1021/am8001605              29  64   0.022973696
ai-49-aHR0cD...TAuMTIwNy9zM...    10.1207/s15326934crj1401_1     0   21   0.056867545

Usage

usage: spindel [OPTION]

spindel is an experimental api server for labe; it works with three data stores.

* (1) an sqlite3 catalog id to doi translation table (11GB)
* (2) an sqlite3 version of OCI (145GB)
* (3) a key-value store mapping catalog ids to catalog entities (two
      implementations: 256GB microblob, 353GB sqlite3)

Each database may be updated separately, with separate processes; e.g.
currently we use the experimental mkocidb command to turn (k, v) TSV files into
sqlite3 lookup databases.

Examples

- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA3My9wbmFzLjg1LjguMjQ0NA
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTAwMS9qYW1hLjI4Mi4xNi4xNTE5
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTAwNi9qbXJlLjE5OTkuMTcxNQ
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTE3Ny8xMDQ5NzMyMzA1Mjc2Njg3
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxMC9qYy4yMDExLTAzODU
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxNC9hb3MvMTE3NjM0Nzk2Mw
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMjMwNy8yMDk1NTIx
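
The example ids above appear to consist of a prefix (ai-49) followed by an unpadded, URL-safe base64 encoding of a DOI resolver URL. The following is a minimal sketch for inspecting such an id; the layout is inferred from the examples, not taken from the spindel source, and decodeID is a hypothetical helper.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// decodeID splits an id like "ai-49-<payload>" into its prefix and the URL
// encoded in the payload. Assumption: the payload is unpadded URL-safe base64,
// as the example ids suggest.
func decodeID(id string) (prefix, target string, err error) {
	// Split on the first two dashes only; the payload may itself contain "-".
	parts := strings.SplitN(id, "-", 3)
	if len(parts) != 3 {
		return "", "", fmt.Errorf("unexpected id format: %q", id)
	}
	b, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return "", "", err
	}
	return parts[0] + "-" + parts[1], string(b), nil
}

func main() {
	prefix, target, err := decodeID("ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA3My9wbmFzLjg1LjguMjQ0NA")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(prefix, target) // ai-49 http://dx.doi.org/10.1073/pnas.85.8.2444
}
```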

Bulk requests

    $ curl -sL https://git.io/JzVmJ |
    parallel -j 40 "curl -s http://localhost:3000/id/{}" |
    jq -rc '[.id, .doi, .extra.citing_count, .extra.cited_count, .extra.took] | @tsv'
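
For clients written in Go, the jq filter above could be approximated as follows. This is a sketch only: the summary type and tsvLine are hypothetical names, the field names are the ones used in the jq expression, and the JSON body in main is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// summary mirrors only the fields selected by the jq filter.
type summary struct {
	ID    string `json:"id"`
	DOI   string `json:"doi"`
	Extra struct {
		CitingCount int     `json:"citing_count"`
		CitedCount  int     `json:"cited_count"`
		Took        float64 `json:"took"`
	} `json:"extra"`
}

// tsvLine turns one response body into a tab-separated line.
func tsvLine(body []byte) (string, error) {
	var s summary
	if err := json.Unmarshal(body, &s); err != nil {
		return "", err
	}
	fields := []string{
		s.ID,
		s.DOI,
		strconv.Itoa(s.Extra.CitingCount),
		strconv.Itoa(s.Extra.CitedCount),
		strconv.FormatFloat(s.Extra.Took, 'f', -1, 64),
	}
	return strings.Join(fields, "\t"), nil
}

func main() {
	// Made-up response body, not real server output.
	body := []byte(`{"id":"ai-49-example","doi":"10.123/123","extra":{"citing_count":2,"cited_count":3,"took":0.5}}`)
	line, err := tsvLine(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line)
}
```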

Flags

  -C    enable in-memory caching of expensive responses
  -Cg duration
        cache trigger duration (default 250ms)
  -Ct duration
        cache ttl (default 8h0m0s)
  -Cx duration
        cache default expiration (default 72h0m0s)
  -I string
        identifier database path (default "i.db")
  -L    enable logging
  -O string
        oci as a database path (default "o.db")
  -Q string
        sqlite3 blob index path
  -S string
        solr blob URL
  -W    enable stopwatch
  -bs string
        blob server URL
  -l string
        host and port to listen on (default "localhost:3000")
  -version
        show version
  -z    enable gzip compression

Fetch or FetchMany

With sqlite3, there seems to be little difference between one expensive IN query and many cheap single-key queries. To keep the API simple, only Fetch for single items is supported for now.

In [13]: df.took.describe() # SELECT .. WHERE k = ...
Out[13]:
count    64860.000000
mean         0.015844
std          0.038631
min          0.000371
25%          0.003467
50%          0.008700
75%          0.018176
max          6.080011
Name: took, dtype: float64

In [14]: dfin.took.describe() # SELECT .. IN ...
Out[14]:
count    64860.000000
mean         0.016218
std          0.038260
min          0.000321
25%          0.003560
50%          0.008910
75%          0.018594
max          5.718692
Name: took, dtype: float64
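
For reference, the two query shapes compared above might look like the following. This is a sketch under assumptions: the table name map is hypothetical, only the column names k and v come from the Map type's db tags, and the helpers are not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// singleQuery is the per-key lookup, issued once for every key.
const singleQuery = "SELECT k, v FROM map WHERE k = ?"

// inQuery builds one statement with n placeholders for a batch lookup.
func inQuery(n int) string {
	placeholders := strings.TrimSuffix(strings.Repeat("?, ", n), ", ")
	return fmt.Sprintf("SELECT k, v FROM map WHERE k IN (%s)", placeholders)
}

// inArgs converts keys into the argument list a database/sql style
// Query(query, args...) call expects.
func inArgs(keys []string) []interface{} {
	args := make([]interface{}, len(keys))
	for i, k := range keys {
		args[i] = k
	}
	return args
}

func main() {
	keys := []string{"10.1/a", "10.1/b", "10.1/c"}
	fmt.Println(singleQuery)
	fmt.Println(inQuery(len(keys)), inArgs(keys))
}
```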

There may be a better way to ask Solr for dozens or hundreds of ids at once.

Using a stopwatch

Experimental -W flag to trace duration of various operations.

$ spindel -W -bs http://localhost:8820


   _|_|_|            _|                  _|            _|
 _|        _|_|_|        _|_|_|      _|_|_|    _|_|    _|
   _|_|    _|    _|  _|  _|    _|  _|    _|  _|_|_|_|  _|
       _|  _|    _|  _|  _|    _|  _|    _|  _|        _|
 _|_|_|    _|_|_|    _|  _|    _|    _|_|_|    _|_|_|  _|
           _|
           _|

Examples

- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA3My9wbmFzLjg1LjguMjQ0NA
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTAwMS9qYW1hLjI4Mi4xNi4xNTE5
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTAwNi9qbXJlLjE5OTkuMTcxNQ
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTE3Ny8xMDQ5NzMyMzA1Mjc2Njg3
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxMC9qYy4yMDExLTAzODU
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxNC9hb3MvMTE3NjM0Nzk2Mw
- http://localhost:3000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMjMwNy8yMDk1NTIx

2021/09/29 17:35:19 spindel starting 3870a68 2021-09-29T15:34:00Z http://localhost:3000
2021/09/29 17:35:20 timings for XVlB

> XVlB    0    0s             0.00    started query for: ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5OC9yc3BhLjE5OTguMDE2NA
> XVlB    1    397.191µs      0.00    found doi for id: 10.1098/rspa.1998.0164
> XVlB    2    481.676µs      0.01    found 8 citing items
> XVlB    3    18.984627ms    0.23    found 456 cited items
> XVlB    4    13.421306ms    0.16    mapped 464 dois back to ids
> XVlB    5    494.163µs      0.01    recorded unmatched ids
> XVlB    6    44.093361ms    0.52    fetched 302 blob from index data store
> XVlB    7    6.422462ms     0.08    encoded JSON
> XVlB    -    -              -       -
> XVlB    S    84.294786ms    1.0     total

TODO

  • a better name, e.g. labesrv, labesvc, cdfuse, catfuse, labed, ...
  • a detailed performance report
  • tools or scripts to generate the input database from scratch

Documentation

Index

Constants

This section is empty.

Variables

var (
	// ErrBlobNotFound can be used for unfetchable blobs.
	ErrBlobNotFound   = errors.New("blob not found")
	ErrBackendsFailed = errors.New("all backends failed")
)

Functions

This section is empty.

Types

type BlobServer

type BlobServer struct {
	BaseURL string
}

BlobServer implements access to a running microblob instance.

func (*BlobServer) Fetch

func (bs *BlobServer) Fetch(id string) ([]byte, error)

Fetch constructs a URL from a template and retrieves the blob.

func (*BlobServer) Ping

func (bs *BlobServer) Ping() error

Ping is a healthcheck.

type Entry

type Entry struct {
	T       time.Time
	Message string
}

Entry is a stopwatch entry.

type FetchGroup

type FetchGroup struct {
	Backends []Fetcher
}

FetchGroup runs an index data fetch operation in a cascade over a couple of backends.

func (*FetchGroup) Fetch

func (g *FetchGroup) Fetch(id string) ([]byte, error)

Fetch constructs a URL from a template and retrieves the blob.

func (*FetchGroup) Ping

func (g *FetchGroup) Ping() error

Ping is a healthcheck. Solr typically responds with 404 on the URL without any handler; http://localhost:8085/solr/biblio/admin/ping

type Fetcher

type Fetcher interface {
	Fetch(id string) ([]byte, error)
}

Fetcher fetches one or more blobs given their identifiers.

type Map

type Map struct {
	Key   string `db:"k"`
	Value string `db:"v"`
}

Map is a generic lookup table. We use it together with sqlite3.

type Pinger

type Pinger interface {
	Ping() error
}

Pinger performs a simple health check.

type Response

type Response struct {
	ID        string            `json:"id"`
	DOI       string            `json:"doi"`
	Citing    []json.RawMessage `json:"citing,omitempty"`
	Cited     []json.RawMessage `json:"cited,omitempty"`
	Unmatched struct {
		Citing []json.RawMessage `json:"citing,omitempty"`
		Cited  []json.RawMessage `json:"cited,omitempty"`
	} `json:"unmatched"`
	Extra struct {
		Took                 float64 `json:"took"`
		UnmatchedCitingCount int     `json:"unmatched_citing_count"`
		UnmatchedCitedCount  int     `json:"unmatched_cited_count"`
		CitingCount          int     `json:"citing_count"`
		CitedCount           int     `json:"cited_count"`
		Cached               bool    `json:"cached"`
	} `json:"extra"`
}

Response contains a subset of index data fused with citation data. Citing and cited documents are unparsed. For unmatched docs, we only transmit the DOI, e.g. as {"doi": "10.123/123"}.

type Server

type Server struct {
	IdentifierDatabase *sqlx.DB
	OciDatabase        *sqlx.DB
	IndexData          Fetcher
	// Router to register routes on.
	Router *mux.Router
	// StopWatch is a builtin, simplistic tracer.
	StopWatchEnabled bool
	// Cache related configuration. We only want to cache expensive requests,
	// e.g. requests that took longer than CacheTriggerDuration to compute.
	CacheEnabled           bool
	CacheTriggerDuration   time.Duration
	CacheDefaultExpiration time.Duration
	CacheTTL               time.Duration
	// contains filtered or unexported fields
}

Server wraps three data sources required for index and citation data fusion. The IdentifierDatabase is a map from local identifier (e.g. 0-1238201) to DOI, the OciDatabase contains citing and cited relationships from the OCI/COCI citation corpus, and IndexData fetches a metadata blob from a service, e.g. a key-value store like microblob, sqlite3, solr, elasticsearch or an in-memory store.

TODO: The server should be able to work with multiple Fetcher instances, e.g. to roll over to a new version or to use one for different data stores.

    server
     |
     v
    fetcher
     |   |_________ ....
     v         |
fetcher[main]  `-> fetcher[ai]
     |                |
     v                v
    db[main]         db[ai]

    (daily)          (monthly)

func (*Server) Ping

func (s *Server) Ping() error

Ping returns an error, if any of the datastores are not available.

func (*Server) Routes

func (s *Server) Routes()

Routes sets up routes. TODO: we want a direct DOI route as well.

func (*Server) ServeHTTP

func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request)

ServeHTTP turns the server into an HTTP handler.

type SolrBlob

type SolrBlob struct {
	BaseURL string
}

SolrBlob implements access to a running Solr instance. The base URL would be something like http://localhost/solr/biblio (i.e. without the select part of the path).

func (*SolrBlob) Fetch

func (b *SolrBlob) Fetch(id string) ([]byte, error)

Fetch constructs a URL from a template and retrieves the blob.

func (*SolrBlob) Ping

func (b *SolrBlob) Ping() error

Ping is a healthcheck. Solr typically responds with 404 on the URL without any handler; http://localhost:8085/solr/biblio/admin/ping

type SqliteBlob

type SqliteBlob struct {
	DB *sqlx.DB
}

SqliteBlob serves index documents from an sqlite database.

func (*SqliteBlob) Fetch

func (b *SqliteBlob) Fetch(id string) (p []byte, err error)

Fetch document.

func (*SqliteBlob) Ping

func (b *SqliteBlob) Ping() error

Ping pings the database.

type StopWatch

type StopWatch struct {
	sync.Mutex
	// contains filtered or unexported fields
}

StopWatch records events over time and renders them in a pretty table. Example log output (via stopwatch.LogTable()):

2021/09/29 17:22:40 timings for hTHc

> XVlB    0    0s              0.00    started query for: ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxMC9qYy4yMDExLTAzODU
> XVlB    1    134.532µs       0.00    found doi for id: 10.1210/jc.2011-0385
> XVlB    2    67.918529ms     0.24    found 0 outbound and 4628 inbound edges
> XVlB    3    32.293723ms     0.12    mapped 4628 dois back to ids
> XVlB    4    3.358704ms      0.01    recorded unmatched ids
> XVlB    5    68.636671ms     0.25    fetched 2567 blob from index data store
> XVlB    6    105.771005ms    0.38    encoded JSON
> XVlB    -    -               -       -
> XVlB    S    278.113164ms    1.00    total

By default, a stopwatch is disabled, which means all functions are noops; use SetEnabled to toggle the mode.

func (*StopWatch) Elapsed

func (s *StopWatch) Elapsed() time.Duration

Elapsed returns the total elapsed time.

func (*StopWatch) LogTable

func (s *StopWatch) LogTable()

LogTable writes a table using standard library log facilities.

func (*StopWatch) Record

func (s *StopWatch) Record(msg string)

Record records a message.

func (*StopWatch) Recordf

func (s *StopWatch) Recordf(msg string, vs ...interface{})

Recordf records a formatted message.

func (*StopWatch) Reset

func (s *StopWatch) Reset()

Reset resets the stopwatch.

func (*StopWatch) SetEnabled

func (s *StopWatch) SetEnabled(enabled bool)

SetEnabled enables or disables the stopwatch. If disabled, any call will be a noop.

func (*StopWatch) Table

func (s *StopWatch) Table() string

Table formats the timings as a table.

Directories

Path           Synopsis
cmd/spindel    An experimental API server for catalog and citation data.
