dagstore

v0.0.0-...-4b82f5b Latest
Warning

This package is not in the latest version of its module.

Published: Feb 28, 2023 License: BSD-3-Clause Imports: 24 Imported by: 0

README

NOTE: THIS PROJECT IS NOT READY FOR USE AT THIS TIME.

DOCUMENTATION BELOW IS A PLACEHOLDER.


DAGStore

A system for branched versioning of data

Table of Contents
  1. About The Project
  2. DAGStore Architecture in a Nutshell
  3. Getting Started
  4. Usage
  5. Roadmap
  6. Contributing
  7. License
  8. Acknowledgments

About The Project

DAGStore is a storage system for large-scale branched versioning of data types including key-value pairs, a log, and N-dimensional chunks. It distills lessons learned over a decade developing and using DVID, a dataservice supporting branched versioning of reconstruction data in the field of Connectomics at HHMI Janelia. Domain-specific applications (like DVID) can build higher-level data operations on top of DAGStore by using the mix of supported data types. As described below, DAGStore's architecture will simplify devops issues that became evident through real-world use of branched versioning at terabyte scale, and allow a number of novel features like "chained" immutable stores.

The initial prototype will focus on the ordered key-value store and a Go language implementation (since DVID was written in Go). If DAGStore proves useful, the hope is to have implementations in other languages, at least to serve up immutable versions already stored in various formats.

Aside from scientific work that requires checkpointed versions of large-scale data, the general community might find DAGStore useful for their own work. To that end, a FUSE-based filesystem and branched versioning tool was written as a simple application layer on top of DAGStore. This tool can be tested by installing the dags command-line tool and running dags fs.

DAGStore Architecture in a Nutshell

DAGStore explicitly separates mutable data from immutable data through a manual “commit” operation, similar to how commits work in the git version control system. This explicit separation has several advantages: any suitable off-the-shelf database can be used, but only for newly written (uncommitted) data, while committed, immutable data is kept in a common, transferable format.

DAGStore physically separates immutable data into versions, which allows easy transfer of versions and “chained” stores, in which remote DAGStores provide immutable data for the ancestors of a version. Typically only changes are stored in a version, though a published version could be a flattened snapshot of the entire dataset. For Open Science, this physical separation can leverage published open data by delegating storage and access costs to large scientific institutions and repositories like Google Public Datasets and AWS OpenData. In Connectomics, each published dataset is now tens of terabytes, and sizes are expected to grow substantially over the next decade. DAGStore will allow users to modify public datasets yet limit their data handling burden to just their own alterations, which can be stored locally; reads combine those local changes with unchanged data from ancestor versions.
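The read path described above, where a version stores only its changes and everything else comes from its ancestors, can be sketched as a lookup that walks the version's ancestry. This is a minimal in-memory illustration and not DAGStore's implementation: the `version` type, its `parent` link, and the `lookup` function are hypothetical names introduced here.

```go
package main

import "fmt"

// version is a hypothetical, in-memory stand-in for one immutable version.
// It holds only the keys changed in that version plus a link to its parent.
type version struct {
	name    string
	parent  *version          // nil for the root version
	changes map[string]string // only the deltas written in this version
}

// lookup walks from v toward the root and returns the first value found,
// along with the name of the version that supplied it.
func lookup(v *version, key string) (value, from string, ok bool) {
	for cur := v; cur != nil; cur = cur.parent {
		if val, found := cur.changes[key]; found {
			return val, cur.name, true
		}
	}
	return "", "", false
}

func main() {
	// "published" plays the role of a large public dataset held remotely.
	published := &version{name: "published", changes: map[string]string{
		"a": "1", "b": "2",
	}}
	// "local" stores only the user's alterations.
	local := &version{name: "local", parent: published, changes: map[string]string{
		"b": "20",
	}}

	for _, k := range []string{"a", "b", "c"} {
		val, from, ok := lookup(local, k)
		fmt.Println(k, val, from, ok)
	}
}
```

In this sketch the local version carries a single changed key, while reads of untouched keys fall through to the published ancestor. In a chained store that ancestor could sit behind a remote DAGStore.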

Getting Started

This package can be used either as a Go library, providing an embedded datastore, or as a standalone server launched with the "dags" command-line tool. Both cases are described below.

Install for use as a Go library
  1. Install Go
  2. Get the package from GitHub.
go get github.com/dagstore/dagstore-go

Install as command-line tool

The simplest approach is to get the single binary executable for your platform and use the dags command-line tool.

TODO -- Provide single binary executables for popular platforms via GitHub Releases.

Via source-code:

  1. Install Go
  2. Clone the package from GitHub and build the dags cli tool:
git clone https://github.com/dagstore/dagstore-go
cd dagstore-go/dags
go build  # this creates a "dags" executable in that directory
./dags fs mount /path/to/DAGStore/dir /path/to/mount/dir # demo tool using FUSE filesystem interface

Usage

This space will be fleshed out later. For now, please consult the automatically generated GoDocs.

In the future, more examples will be available in the Documentation section of the project's website.

Roadmap

  • Framework for branched versioning metadata
  • Framework for human-friendly names
  • Command-line use
    • Scaffolding via Cobra
    • FUSE interface for demo and testing
    • Server for using DAGStore standalone instead of a library
  • Mutable store using Badger
  • Immutable store
    • Index using Badger
    • Cloud-friendly immutable data storage by version
  • Key-value support
    • Local-only
    • Support chained immutable stores.
    • Add rpc (probably gRPC) to support cli from even embedded library use.
  • Branched versioning for logs.
  • Branched versioning for N-Dimensional chunks.
    • Support static neuroglancer precomputed versions.
    • Support static zarr versions.
    • Support static N5 versions.
  • Mechanism for pull requests by transfer of immutable version data.
    • Ingestion of transferred versions
    • Nice workflow allowing manual inspection, perhaps using the GitHub API

See the open issues for a full list of proposed features (and known issues).

Contributing

Contributing to Go implementation

Contributions are appreciated. Please fork the repo and create a pull request. You can also open an issue with the tag "enhancement".

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Creating an implementation in another language

We hope to have implementations in other languages, at least to serve up immutable versions already stored in various formats. This would allow reuse of libraries focused on particular kinds of immutable data. For example, a C++ implementation could more easily build on top of tensorstore and leverage that system to efficiently read a variety of data formats.

Please open an issue with the tag "implementations" and a description of your intended system. This might help build momentum for that implementation. We could also host other language implementations under the "DAGStore" GitHub organization.

License

Distributed by HHMI Janelia under a BSD 3-Clause License. See LICENSE.txt for more information.

Acknowledgments

  • Thanks to dgraph and the creators of the Badger Go key-value database
  • This system wouldn't be possible without the fantastic open source community. A look at the go.mod file will show the first level of work we've built upon. Those not in that Go source code list include the Go language team, Best-README-Template, and others.

Documentation

Overview

Package dagstore provides a branched versioned store of data where committed versions can be stored in a distributed fashion.

Index

Constants

const (
	// DirMutableStore is the name of the mutable store subdirectory within the
	// DAGStore directory.
	DirMutableStore = "mutable"

	// DirImmutableStore is the name of the immutable store subdirectory within
	// the DAGStore directory.
	DirImmutableStore = "immutable"

	// FnameVersions is the file name of the JSON versions (DAG + local ids).
	// This is typically stored in the root directory.
	FnameVersions = "versions.json"

	// FnameDataCatalog is the file name of the JSON data catalog (inc. local ids).
	// This is typically stored in the root directory.
	FnameDataCatalog = "data_catalog.json"
)
const (
	FlagTombstone  = 0x01
	FlagValueEmpty = 0x02
	FlagValueNil   = 0x04
)
const (
	// DirImmutableIndex is the name of the index subdirectory within
	// the immutable store subdirectory.
	DirImmutableIndex = "index"
)
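The directory and file-name constants above imply an on-disk layout, which the sketch below derives with `filepath.Join`. The `layout` type and `newLayout` function are hypothetical helpers, not part of the package API. Placing the two JSON files directly under the DAGStore directory is an assumption based on the "typically stored in the root directory" comments.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Values copied from the constants documented above.
const (
	DirMutableStore   = "mutable"
	DirImmutableStore = "immutable"
	DirImmutableIndex = "index"
	FnameVersions     = "versions.json"
	FnameDataCatalog  = "data_catalog.json"
)

// layout is a hypothetical helper type, not part of the package API.
type layout struct {
	Mutable, Immutable, Index, Versions, Catalog string
}

// newLayout derives the paths for a DAGStore directory rooted at root.
// Putting the JSON files directly under root is an assumption.
func newLayout(root string) layout {
	immutable := filepath.Join(root, DirImmutableStore)
	return layout{
		Mutable:   filepath.Join(root, DirMutableStore),
		Immutable: immutable,
		Index:     filepath.Join(immutable, DirImmutableIndex),
		Versions:  filepath.Join(root, FnameVersions),
		Catalog:   filepath.Join(root, FnameDataCatalog),
	}
}

func main() {
	l := newLayout("repo")
	fmt.Println(l.Mutable)
	fmt.Println(l.Immutable)
	fmt.Println(l.Index)
	fmt.Println(l.Versions)
	fmt.Println(l.Catalog)
}
```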
const (
	// FnameMetadata is the file name of the persisted JSON for each data's metadata.
	FnameMetadata = "metadata.json"
)
const NilUUID = UUID("")

NilUUID is an invalid UUID.

Variables

var (
	ErrBadUserFlags = errors.New("user flags must be first 5 bits of a byte")
)

Functions

func Shutdown

func Shutdown()

Shutdown should be called when the program ends so DAGStore does any cleanup.

Types

type Config

type Config struct {
	Path      string // The file path to a parent directory for dagstore repos.
	RepoAlias string // A human friendly name for a particular repo like "CNS".
	RootUUID  UUID   // The full 32 character hex UUID of the root node.
}

type DAGStore

type DAGStore struct {
	// contains filtered or unexported fields
}

DAGStore is a store for branched versioning of large-scale data.

func Initialize

func Initialize(c Config, readonly, immutable bool) (ds *DAGStore, created bool, err error)

Initialize creates a DAGStore from the given configuration.

func (*DAGStore) Commit

func (ds *DAGStore) Commit(version UUID, data ...UUID) error

Commit marks a set of data as committed for the given version. TODO: See what people think of this approach

func (*DAGStore) Delete

func (ds *DAGStore) Delete(id ID, key []byte) error

Delete removes a key-value pair at this version by placing a tombstone. Prior committed values for the key can still be retrieved.
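The tombstone behavior described for Delete can be illustrated with a small in-memory model: the deletion is recorded only in the current version, so a read at an ancestor version still sees the earlier value. The `entry` and `node` types and the `get` and `del` functions are hypothetical and make no claim about how DAGStore stores tombstones internally.

```go
package main

import "fmt"

// entry is a hypothetical record: either a value or a tombstone.
type entry struct {
	value     string
	tombstone bool
}

// node is a hypothetical version holding only its own writes.
type node struct {
	parent  *node
	entries map[string]entry
}

// get resolves key at version n. The nearest entry on the path to the root
// wins, and a tombstone there means the key is deleted at this version.
func get(n *node, key string) (string, bool) {
	for cur := n; cur != nil; cur = cur.parent {
		if e, ok := cur.entries[key]; ok {
			if e.tombstone {
				return "", false
			}
			return e.value, true
		}
	}
	return "", false
}

// del records a deletion in n without touching any ancestor's data.
func del(n *node, key string) {
	n.entries[key] = entry{tombstone: true}
}

func main() {
	v1 := &node{entries: map[string]entry{"k": {value: "old"}}}
	v2 := &node{parent: v1, entries: map[string]entry{}}
	del(v2, "k")

	_, inV2 := get(v2, "k")
	val, inV1 := get(v1, "k")
	fmt.Println("visible at v2:", inV2)
	fmt.Println("visible at v1:", inV1, val)
}
```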

func (*DAGStore) Exists

func (ds *DAGStore) Exists(id ID, key []byte) (bool, error)

Exists returns true if the key exists at the given version.

func (*DAGStore) Get

func (ds *DAGStore) Get(id ID, key []byte) (*KVwithFlags, error)

Get returns the value and flags associated with a key at the given version.

func (*DAGStore) GetBranches

func (ds *DAGStore) GetBranches() map[string]UUID

GetBranches returns branches and their HEAD version.

func (*DAGStore) GetRange

func (ds *DAGStore) GetRange(ctx context.Context, id ID, keyStart, keyEnd []byte) ([]*KVwithFlags, error)

GetRange returns the key-value pairs at given version within a range of keys.

func (*DAGStore) GetTaggedVersion

func (ds *DAGStore) GetTaggedVersion(tag string) UUID

GetTaggedVersion returns the version associated with a tag.

func (*DAGStore) NewData

func (ds *DAGStore) NewData(spec *DataSpec, parentGroup UUID) error

NewData creates new data, such as a Group, under the given parent group. If a UUID is not specified in the GlobalID field of spec, a new UUID is assigned. The Metadata field is optional and can be an empty string.

func (*DAGStore) NewVersion

func (ds *DAGStore) NewVersion(parent, child UUID, branch string) (UUID, error)

NewVersion extends the DAG so the given branch has a new uncommitted node.

func (*DAGStore) Put

func (ds *DAGStore) Put(id ID, kv KVwithFlags) error

Put writes a value with given key.

func (*DAGStore) StreamAll

func (ds *DAGStore) StreamAll(ctx context.Context, id ID, keysOnly bool, out chan *KVwithFlags) error

StreamAll sends all keys and optionally values for a given dataset and version.

func (*DAGStore) StreamRange

func (ds *DAGStore) StreamRange(ctx context.Context, id ID, keyStart, keyEnd []byte, ordered, keysOnly bool, out chan *KVwithFlags) error

StreamRange sends keys and optionally values at a given version within a range of keys.
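A channel-based range read is typically consumed with a goroutine on the producing side and a `range` loop on the receiving side. The sketch below shows that pattern with a hypothetical `streamRange` that reads from an in-memory map. Several details are assumptions, since the documentation above does not specify them: the producer closes the channel when it finishes, the key range is inclusive at both ends, and results arrive in key order.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"sort"
)

// kv is a hypothetical key-value pair used only in this sketch.
type kv struct{ Key, Value []byte }

// streamRange sends every pair with keyStart <= key <= keyEnd on out, in key
// order, and closes out when done. Closing the channel here is an assumption.
func streamRange(ctx context.Context, data map[string]string, keyStart, keyEnd []byte, keysOnly bool, out chan<- *kv) error {
	defer close(out)
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		key := []byte(k)
		if bytes.Compare(key, keyStart) < 0 || bytes.Compare(key, keyEnd) > 0 {
			continue
		}
		item := &kv{Key: key}
		if !keysOnly {
			item.Value = []byte(data[k])
		}
		select {
		case out <- item:
		case <-ctx.Done(): // stop early if the caller cancels
			return ctx.Err()
		}
	}
	return nil
}

// collect runs the producer in a goroutine and drains the channel.
func collect(data map[string]string, keyStart, keyEnd string, keysOnly bool) ([]string, error) {
	out := make(chan *kv)
	errc := make(chan error, 1)
	go func() {
		errc <- streamRange(context.Background(), data, []byte(keyStart), []byte(keyEnd), keysOnly, out)
	}()
	var got []string
	for item := range out {
		got = append(got, string(item.Key)+"="+string(item.Value))
	}
	return got, <-errc
}

func main() {
	data := map[string]string{"a": "1", "b": "2", "c": "3", "d": "4"}
	got, err := collect(data, "b", "c", false)
	fmt.Println(got, err)
}
```

The buffered error channel lets the producer report its result after closing `out` without blocking on the consumer.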

func (*DAGStore) TagVersion

func (ds *DAGStore) TagVersion(version UUID, tag string) error

TagVersion bookmarks a version with a string identifier.

type DataSpec

type DataSpec struct {
	// Name should be human-friendly and can be altered
	Name string

	// GlobalID provides a unique identifier even in distributed systems.
	GlobalID UUID

	// Datatype specifies either a Group or a supported storage type like KeyValue.
	Datatype DataSpecType

	// Optional metadata that can be left as empty value
	Metadata string
}

type DataSpecType

type DataSpecType uint8
const (
	// DataUnknown is default empty value that signifies the type was not set.
	DataUnknown DataSpecType = iota

	// DataGroup can contain other Groups or datasets of various type.
	DataGroup

	// DataKeyValue supports storage and range reads of key-value pairs.
	DataKeyValue

	// DataLog supports storage of logs
	DataLog

	// DataND supports storage and range reads of n-dimensional data chunks.
	DataND
)

type Flags

type Flags struct {
	// contains filtered or unexported fields
}

Flags can specify tombstones, zero-byte values, and a 5-bit user field.

func (Flags) GetUserBits

func (kf Flags) GetUserBits() byte

func (Flags) IsEmptyValue

func (kf Flags) IsEmptyValue() bool

func (Flags) IsTombstone

func (kf Flags) IsTombstone() bool

func (*Flags) SetUserBits

func (kf *Flags) SetUserBits(bits byte) error
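The three flag constants occupy the low three bits of a byte, which leaves five bits for the user field. The sketch below shows one way such a byte could be packed. The `flags` type is hypothetical, and storing the user bits in the upper five bits is an assumption, since the fields of the real Flags type are unexported.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// Values copied from the constants documented above.
	FlagTombstone  = 0x01
	FlagValueEmpty = 0x02
	FlagValueNil   = 0x04

	userShift = 3    // assumption: user bits sit above the three system bits
	userMax   = 0x1F // largest value that fits in 5 bits
)

var errBadUserFlags = errors.New("user flags must be first 5 bits of a byte")

// flags is a hypothetical one-byte representation used only in this sketch.
type flags struct{ b byte }

func (f flags) isTombstone() bool  { return f.b&FlagTombstone != 0 }
func (f flags) isEmptyValue() bool { return f.b&FlagValueEmpty != 0 }
func (f flags) userBits() byte     { return f.b >> userShift }

// setUserBits stores a 5-bit value and leaves the system bits untouched.
func (f *flags) setUserBits(bits byte) error {
	if bits > userMax {
		return errBadUserFlags
	}
	f.b = f.b&0x07 | bits<<userShift
	return nil
}

func main() {
	f := flags{b: FlagTombstone}
	if err := f.setUserBits(0x15); err != nil {
		fmt.Println("unexpected error:", err)
	}
	fmt.Println(f.isTombstone(), f.isEmptyValue(), f.userBits())
	fmt.Println(f.setUserBits(0x20)) // 0x20 needs 6 bits, so it is rejected
}
```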

type ID

type ID struct {
	Data    UUID // Data = Group or Dataset
	Version UUID
}

ID provides global identification of both the dataset/group and the desired version.

type KV

type KV struct {
	Key   []byte
	Value []byte
}

KV represents a key-value pair.

type KVwithFlags

type KVwithFlags struct {
	KV
	Flags
}

KVwithFlags is a key-value pair with tombstone, empty value, and optional 5-bit user flags.

type KVwithVersion

type KVwithVersion struct {
	KV
	Flags
	Version RawUUID
}

KVwithVersion is a key-value pair with Flags and version. This is useful for low-level transmission of data, like importing DAGStore versioned data into an alternative datastore.

type RawUUID

type RawUUID [16]byte

RawUUID is the 16-byte RFC4122 UUID, thereby 2x smaller than the hex string version.

func (RawUUID) UUID

func (ru RawUUID) UUID() UUID

type UUID

type UUID string

UUID is a 32 character hexadecimal string (an RFC 4122 version 4 UUID) that uniquely identifies nodes in a DAG. We need universally unique identifiers to prevent collisions when distributed DAGStores create new versions: http://en.wikipedia.org/wiki/Universally_unique_identifier

func NewUUID

func NewUUID() UUID

NewUUID returns a UUID

func StringToUUID

func StringToUUID(s string) (UUID, error)

StringToUUID converts a string to a UUID, checking to make sure it is a 32 character hex string. If it isn't a valid UUID, a NilUUID is returned.
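The relationship between the 32 character hex form and the 16-byte raw form can be sketched with the standard `encoding/hex` package. The `parseUUID` and `hexString` names, the lowercase types, and the error value are hypothetical. The package's own validation rules, such as whether uppercase hex is accepted, are not documented above, so this sketch simply accepts whatever `hex.Decode` accepts.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// rawUUID mirrors the documented 16-byte form; the name is local to this sketch.
type rawUUID [16]byte

var errInvalidUUID = errors.New("UUID must be a 32 character hex string")

// parseUUID checks length and hex encoding, then returns the raw 16 bytes.
func parseUUID(s string) (rawUUID, error) {
	var raw rawUUID
	if len(s) != 32 {
		return raw, errInvalidUUID
	}
	if _, err := hex.Decode(raw[:], []byte(s)); err != nil {
		return rawUUID{}, errInvalidUUID
	}
	return raw, nil
}

// hexString converts the raw form back to 32 lowercase hex characters.
func (r rawUUID) hexString() string {
	return hex.EncodeToString(r[:])
}

func main() {
	raw, err := parseUUID("0123456789abcdef0123456789abcdef")
	fmt.Println(len(raw), raw.hexString(), err)

	_, err = parseUUID("not-a-uuid")
	fmt.Println(err)
}
```

The raw form takes 16 bytes where the hex string takes 32, which is the 2x saving noted for RawUUID.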

type UUIDType

type UUIDType string

UUIDType describes the kind of a UUID such as data or version.

const (
	UUIDTypeInvalid UUIDType = ""
	UUIDTypeData    UUIDType = "data"
	UUIDTypeVersion UUIDType = "version"
)

Directories

Path Synopsis
cmd
