database

package module
v0.1.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 26, 2023 License: BSD-3-Clause Imports: 14 Imported by: 7

README

go-libraryofcongress-database

Go package providing simple database and server interfaces for the CSV files produced by the sfomuseum/go-libraryofcongress package.

Important

This is work in progress and not documented properly yet. The code will continue to move around in the short-term. Everything you see here is still in the "proof-of-concept" phase. It should work but may still have bugs and probably lacks features.

Documentation

Documentation is incomplete at this time.

Motivation

The first goal is to have a simple, bare-bones HTTP server for querying data in the CSV files produced by the sfomuseum/go-libraryofcongress package.

The second goal is to be able to build, compile and deploy the web application and all its data (as SQLite databases) as a self-contained container image to a low-cost service like AWS App Runner or AWS Lambda Function URLs.

A third goal is to have a generic database interface such that the same code can be used with a variety of databases. As written the server tool only has a single database "driver" for querying SQLite databases but there are tools for indexing data in both Elasticsearch, OpenSeach, DynamoDB and SQLite databases.

Databases

Databases implement the LibraryOfCongressDatabase interface which exposes two method: One for indexing source and another for querying them.

This package used to bundle a bunch of different implementations of the LibraryOfCongressDatabase interface but most of them have been moved in to separate packages. This package now only bundles two implementations:

  • SQLDatabase which implements the LibraryOfCongressDatabase interface using Go's database/sql abstraction layer. This package automatically loads the modernc/sqlite package for use with SQLite databases.

  • StdoutDatabase which implements the indexing methods of the LibraryOfCongressDatabase interface and simply emits each record as a CSV-encoded row to STDOUT. This package is really only for testing and debugging purposes and to serve as a simple reference implementation for creating your own implementations.

Other database implementations include:

Tools

$> make cli
go build -mod vendor -ldflags="-s -w" --tags fts5 -o bin/server cmd/server/main.go
go build -mod vendor -ldflags="-s -w" --tags fts5 -o bin/query cmd/query/main.go
go build -mod vendor -ldflags="-s -w" --tags fts5 -o bin/index cmd/index/main.go
index
$> ./bin/index -h
  -database-uri string
    	A valid sfomuseum/go-libraryofcongress-database URI.
  -lcnaf-data string
    	The path to your LCNAF CSV data. If '-' then data will be read from STDIN.
  -lcsh-data string
    	The path to your LCSH CSV data. If '-' then data will be read from STDIN.

Note that the index tool expects to index CSV data produced by the sfomuseum/go-libraryofcongress parse-lcnaf and parse-lcsh tools.

SQL (SQLite)
$> ./bin/index -database-uri 'sql://sqlite?dsn=loc.db' -lcsh-data /usr/local/data/lcsh.csv.bz2
processed 5692 records in 1m0.001390571s (started 2021-10-27 15:52:35.790947 -0700 PDT m=+0.015128394)
processed 11161 records in 2m0.001847245s (started 2021-10-27 15:52:35.790947 -0700 PDT m=+0.015128394)
processed 16179 records in 3m0.000195064s (started 2021-10-27 15:52:35.790947 -0700 PDT m=+0.015128394)
processed 20693 records in 4m0.003592035s (started 2021-10-27 15:52:35.790947 -0700 PDT m=+0.015128394)
...time passes

processed 438053 records in 2h3m0.000624126s (started 2021-10-27 15:52:35.790947 -0700 PDT m=+0.015128394)
processed 441373 records in 2h4m0.002261248s (started 2021-10-27 15:52:35.790947 -0700 PDT m=+0.015128394)
processed 444805 records in 2h5m0.002327734s (started 2021-10-27 15:52:35.790947 -0700 PDT m=+0.015128394)
2021/10/27 17:58:09 Finished indexing lcsh

$> du -h -d 1 /usr/local/data/loc.db/
761M	loc.db/
STDIN

It is also possible to index data from STDIN by specifying the string "-" as the -lcsh-data or -lcnaf-data URI to read.

For example, this command will stream and parse the contents of https://id.loc.gov/download/lcsh.both.ndjson.zip (using the parse-lcsh tool in the sfomuseum/go-libraryofcongress package) and index each subject header in a SQLite database called 'loc.db

$> ./parse-lcsh https://id.loc.gov/download/lcsh.both.ndjson.zip | \
	./index \
	-database-uri 'sql://sqlite?dsn=loc.db' \
	-lcsh-data -
server

The server tool is a simple web interface providing humans and robots, both, the ability to query a database.

$> ./bin/server -h
Usage of ./bin/server:
  -database-uri string
    	A valid sfomuseum/go-libraryofcongress-database URI.
  -per-page int
    	The number of results to return per page (default 20)
  -server-uri string
    	A valid aaronland/go-http-server URI. (default "http://localhost:8080")

To start the server you might do something like this:

$> ./bin/server -database-uri 'sql://sqlite?dsn=data/lcsh.db' -per-page 10
2021/10/18 13:11:24 Listening on http://localhost:8080

And then if you opened http://localhost:8080/?q=River&page=2 in a web browser you'd see this:

There is also an API endpoint for querying the data as JSON:

$> curl -s 'http://localhost:8080/api/query?q=SQL' | jq
{
  "results": [
    {
      "id": "sh96008008",
      "label": "PL/SQL (Computer program language)",
      "source": "lcsh"
    },
    {
      "id": "sh86006628",
      "label": "SQL (Computer program language)",
      "source": "lcsh"
    },
    {
      "id": "sh90004874",
      "label": "SQL*PLUS (Computer program language)",
      "source": "lcsh"
    },
    {
      "id": "sh87001812",
      "label": "SQL/ORACLE (Computer program language)",
      "source": "lcsh"
    }
  ],
  "pagination": {
    "total": 4,
    "per_page": 10,
    "page": 1,
    "pages": 1,
    "next_page": 0,
    "previous_page": 0,
    "pages_range": []
  }
}
query
$> ./bin/query -h
  -cursor-pagination
    	Signal that pagination is cursor-based rather than countable.
  -database-uri string
    	A valid sfomuseum/go-libraryofcongress-database URI.

The query tool is a command-line application to perform fulltext queries against a database generated using data produced by the tools in sfomuseum/go-libraryofcongress package.

SQL (SQLite)
$> ./bin/query \
	-database-uri 'sql://sqlite?dsn=test.db' \
	Montreal
	
lcsh:sh85087079 Montreal River (Ont.)
lcsh:sh2010014761 Alfa Romeo Montreal automobile
lcsh:sh2017003022 Montreal Massacre, Montréal, Québec, 1989

Docker

Yes, there is a Dockerfile for the server tool. The simplest way to get started is to run the docker target in this package's Makefile:

$> make docker

And then to start the server:

$> docker run -it -p 8080:8080 \
	-e LIBRARYOFCONGRESS_DATABASE_URI='sql://sqlite?dsn=/usr/local/data/lcsh.db' \
	-e LIBRARYOFCONGRESS_SERVER_URI='http://0.0.0.0:8080' \
	libraryofcongress-server

And then visit http://localhost:8080 in a web browser.

Notes
  • As written the Dockerfile will copy all files ending in .db in the data folder in to the container's /usr/local/data folder.

See also

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func QueryPaginated added in v0.0.3

func QueryPaginated(ctx context.Context, db LibraryOfCongressDatabase, q string, pg_opts pagination.Options, cb QueryPaginatedCallbackFunc) error

func RegisterLibraryOfCongressDatabase

func RegisterLibraryOfCongressDatabase(ctx context.Context, scheme string, f LibraryOfCongressInitializeFunc) error

func Schemes

func Schemes() []string

Types

type LibraryOfCongressDatabase

type LibraryOfCongressDatabase interface {
	Index(context.Context, []*Source, timings.Monitor) error
	Query(context.Context, string, pagination.Options) ([]*QueryResult, pagination.Results, error)
}

func NewLibraryOfCongressDatabase

func NewLibraryOfCongressDatabase(ctx context.Context, uri string) (LibraryOfCongressDatabase, error)

type LibraryOfCongressInitializeFunc

type LibraryOfCongressInitializeFunc func(ctx context.Context, uri string) (LibraryOfCongressDatabase, error)

type QueryPaginatedCallbackFunc added in v0.0.3

type QueryPaginatedCallbackFunc func(context.Context, []*QueryResult) error

type QueryResult

type QueryResult struct {
	Id     string `json:"id"`
	Label  string `json:"label"`
	Source string `json:"source"`
}

func (*QueryResult) String added in v0.0.3

func (r *QueryResult) String() string

type Source

type Source struct {
	Label  string
	Reader io.ReadCloser
}

func SourcesFromPaths

func SourcesFromPaths(ctx context.Context, data_paths map[string]string) ([]*Source, error)

func (*Source) Index

func (src *Source) Index(ctx context.Context, cb SourceIndexCallback) error

type SourceIndexCallback

type SourceIndexCallback func(context.Context, map[string]string) error

Directories

Path Synopsis
app
cmd
http
api
www
templates

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL