lsmtree

package module
v0.0.0-...-d9c4d01 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 22, 2019 License: Apache-2.0 Imports: 11 Imported by: 0

README

lsmtree

Build Status

This is my attempt at creating an LSM-Tree in Go. This is primarily a research project to learn about the inner workings and limitations of LSM-Tree's the hard way. At the moment this library is not meant to be used.

Background

Here are some links to information or research that I'm using to build this library.

Design

This library's primary goal is to implement an LSM-Tree based Key-Value store.

The goals of this library:

  • Transactional
    • SNAPSHOT Isolation
      • When transaction X begins, all changes committed after that point in time are not visible to transaction X. But all changes committed before are visible.
      • If a value is read during transaction X and that value changes before transaction X is committed, then transaction X must fail due to a conflict. This is to make sure that changes made during X are not based off of stale data.
    • READ COMMITTED Isolation
      • When transaction X reads values, it will only be able to see values from other transactions that have been completely committed. If a transaction is not committed or is partially committed then it's values should not be visible to X.
      • If a value is read during transaction X and that value changes before transaction X is committed, then transaction X must fail due to a conflict. This is to make sure that changes made during X are not based off of stale data.
    • A transaction can have multiple iterators. But writes that happen within the current transaction after an iterator has been created will be invisible to the iterator.
  • Multiple individual managed LSM-Trees (referred to as Tables).
    • Writes to any table is still written to a single WAL.
    • Reads can only target a single table.
    • Each table has it's own set of Heap files for Key storage.
  • Values are stored separately from keys (called Value Files).
    • Values should be stored in their own files and should be broken up into X sized chunks.
    • Each value should have a checksum to ensure the value has not been corrupted.
  • Keys are stored in their own files (called Heap Files).
    • Heap files are specific to a single table.
    • Each heap file should be sorted (descending) by key and transaction timestamp.
    • Heap files are created when the number of keys in memory reaches a certain threshold. The keys are then flushed to the disk in the form of a heap file. The highest number heap file for a given table is the most recent.
    • Heap file names consist of:
      • 1 Byte indicating it is a heap file.
      • 2 Bytes indicating the tableId.
      • 8 Bytes indicating the heapId.
      • 4 Bytes indicating the number of times this heap has been merged.
    • (Compaction) When a multiple heaps are merged, they will be use the largest heapId of the heaps being merged. The resulting file will increment the number of times that older files have been merged into that heap.
      • If a merge results in a single heap, then the merge counter can be reset to 0.
    • (Compaction) Heaps will be merged asynchronously and will be merged when the number of heaps exceeds a certain threshold or when the oldest heap is more than a X hours old.
      • Heaps should be merged into a single resulting heap, at which point the pointer for a range of keys should be updated asynchronously and should not block reads or writes if possible.
  • Write Ahead Log
    • When writes are committed, they must first be written to the WAL file. If the writes fail to write to the WAL then the commit should fail.
    • Once a batch of writes has been persisted to the WAL the writes should be committed to the in memory data set. This should be done in such a way that transactions that begin during this window will not be able to see the changes at least until AFTER the memtable change has completed.
    • Items should only be written to heap and value files AFTER they have been written to the WAL and the memtable. Making written data available to the user is far more important than storing that data in it's on medium.
    • When the database is opened check the WAL index for the last items written to the disk. If there are items that have not been written to the disk yet then write them before building the memtables.

Write Ahead Log

The following diagram illustrates the desired format for the WAL file. When a new entry is added to the file it will be inserted into a free space starting at the end of the file and the transactionId along with the start and end offsets will be appended to the beginning of the file.

WAL File Layout

Documentation

Index

Constants

This section is empty.

Variables

View Source
var (
	// ErrCantReadFreeSpace is returned when a buffer has 8 bytes preceding it to indicate a
	// freeSpace map, but the freeSpace map could not be read.
	ErrCantReadFreeSpace = errors.New("could not read freeSpace")

	// ErrInsufficientSpace is returned when a buffer does not have enough space to insert the data
	// it is trying to allocate.
	ErrInsufficientSpace = errors.New("insufficient free space")
)
View Source
var (
	// ErrBadValueChecksum is returned when a value is read from the value file, but the checksum
	// stored with the value does not match the calculated checksum of the value read. This is used
	// as an indicator of file corruption.
	ErrBadValueChecksum = errors.New("bad value checksum")

	// ErrBrokenValue is returned when the entire value could not be read from from the value file.
	// Or when the entire value could not be written to the file.
	ErrIncompleteValue = errors.New("incomplete value")

	// ErrCreatingChecksum is returned when a value is being written to the value file but the
	// checksum could not be created.
	ErrCreatingChecksum = errors.New("could not create checksum for value")
)

Functions

This section is empty.

Types

type CanSync

type CanSync interface {
	Sync() error
}

CanSync is used to check if the current IO interface that a file wrapper is using has a method that allows its changes to be flushed to the disk.

type DB

type DB struct {
	// contains filtered or unexported fields
}

DB is the root object for the database. You can open/create your DB by calling Open().

func Open

func Open(options Options) (*DB, error)

Open will open or create the database using the provided configuration.

func (*DB) Close

func (db *DB) Close() error

Close will close any open files and stop any background writes. Any writes that have not been returned successfully will not have been written to the database.

type Item

type Item struct {
	Key     Key
	Value   []byte
	Version uint64
}

type Itr

type Itr interface {
	Seek(prefix []byte)
	Next()
	Item() Item
}

type Key

type Key []byte

Key represents an array that will NOT have an 8 byte suffix that is used to indicate the transactionId for the item.

type Options

type Options struct {
	// MaxWALSegmentSize (in bytes) is the largest a single WAL segment file will grow to before a
	// new segment is started. This does not include the last transaction to be appended to a single
	// WAL segment. If the last transaction puts the segment over this limit then it will still be
	// appended (resulting in a large segment) but then a new segment will be created for subsequent
	// transactions.
	// Default is 8kb.
	MaxWALSegmentSize uint64

	// MaxValueChunkSize (in byteS) is the largest a single Value file will grow to before a new
	// file is created. This does not include the last value appended to the value file.
	// Default is 32kb.
	MaxValueChunkSize uint64

	// WALDirectory is the folder where WAL segment files will be stored.
	// Default is db/wal.
	WALDirectory string

	// DataDirectory is the folder where heap and value files will be stored.
	// Default is db/data.
	DataDirectory string

	// Number of pending writes that can be queued up concurrently before transaction commits will
	// be blocked.
	PendingWritesBuffer int
}

Options is used to configure how the database will behave.

func DefaultOptions

func DefaultOptions() Options

DefaultOptions just provides a basic configuration which can be passed to open a database.

type ReaderWriterAt

type ReaderWriterAt interface {
	io.ReaderAt
	io.WriterAt
}

ReaderWriterAt is used as the interface for reading and writing data for the database. It can be used in nearly every IO portion of the database.

type TimestampedKey

type TimestampedKey []byte

TimestampedKey represents a byte array that will always have an 8 byte suffix to indicate the transactionId for the item. This is used to implement MVCC.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL