memiavl

package
v1.0.9 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 8, 2023 License: Apache-2.0 Imports: 11 Imported by: 0

README

Alternative IAVL Implementation

Changelog

  • 11 Jan 2023: Initial version
  • 13 Jan 2023: Change changeset encoding from protobuf to plain one
  • 17 Jan 2023:
    • Add delete field to change set to support empty value
    • Add section about compression on snapshot format
  • 27 Jan 2023:
    • Update metadata file format
    • Encode key length with 4 bytes instead of 2.

The Journey

It started for an use case of verifying the state change sets, we need to replay the change sets to rebuild IAVL tree and check the final IAVL root hash, compare the root hash with the on-chain hash to verify the integrity of the change sets.

The first implementation keeps the whole IAVL tree in memory, mutate nodes in-place, and don't update hashes for the intermediate versions, and one insight from the test run is it runs surprisingly fast. For the distribution store in our testnet, it can process from genesis to block 6698242 in 2 minutes, which is around 55818 blocks per second.

To support incremental replay, we further designed an IAVL snapshot format that's stored on disk, while supporting random access with mmap, which solves the memory usage issue, and reduce the time of replaying.

New Design

So the new idea is we can put the snapshot and change sets together, the change sets is the write-ahead-log for the IAVL tree.

It also integrates well with versiondb, because versiondb can also be derived from change sets to provide query service. IAVL tree is only used for consensus state machine and merkle proof generations.

Advantages
  • Better write amplification, we only need to write the change sets in real time which is much more compact than IAVL nodes, IAVL snapshot can be created in much lower frequency.
  • Better read amplification, the IAVL snapshot is a plain file, the nodes are referenced with offset, the read amplification is simply 1.
  • Better space amplification, the archived change sets are much more compact than current IAVL tree, in our test case, the ratio could be as large as 1:100. We don't need to keep too old IAVL snapshots, because versiondb will handle the historical key-value queries, IAVL tree only takes care of merkle proof generations for blocks within an unbonding period. In very rare cases that do need IAVL tree of very old version, you can always replay the change sets from the genesis.

File Formats

NOTICE: the integers are always encoded with little endianness.

Change Set File
version: int64
size: int64         // size of whole payload
payload:
  delete: int8
  keyLen: varint-uint64
  key
  [                 // if delete is false
    valueLen: varint-uint64
    value
  ]

repeat with next version
  • Change set files can be splited with certain block ranges for incremental backup and restoration.

  • Historical files can be compressed with zlib, because it doesn't need to support random access.

IAVL Snapshot

IAVL snapshot is composed by four files:

  • metadata, 16bytes:

    magic: uint32
    format: uint32
    version: uint64
    root node index: uint32
    
  • nodes, array of fixed size(64bytes) nodes, the node format is like this:

    height  : uint8          // padded to 4bytes
    version : uint32
    size    : uint64
    key     : uint64         // offset in keys file
    left    : uint32         // inner node only
    right   : uint32         // inner node only
    value   : uint64 offset  // offset in values file, leaf node only
    hash    : [32]byte
    

    The node has fixed length, can be indexed directly. The nodes reference each other with the index, nodes are written in post-order, so the root node is always placed at the end.

    Some integers are using uint32, should be enough in forseeable future, but could be changed to uint64 to be safer.

    The implementation will read the mmap-ed content in a zero-copy way, won't use extra node cache, it will only rely on the OS page cache.

  • keys, sequence of length prefixed leaf node keys, ordered and no duplication.

    size: uint32
    payload
    *repeat*
    

    Key size is encoded in uint32, so the maximum key length supported is 1<<32-1, around 4G.

  • values, sequence of length prefixed leaf node values.

    size: uint32
    payload
    *repeat*
    

    Value size is encoded in uint32, so maximum value length supported is 1<<32-1, around 4G.

Compression

The items in snapshot reference with each other by file offsets, we can apply some block compression techniques to compress keys and values files while maintain random accessbility by uncompressed file offset, for example zstd's experimental seekable format^1.

VersionDB

VersionDB is to support query and iterating historical versions of key-values pairs, currently implemented with rocksdb's experimental user-defined timestamp feature, support query and iterate key-value pairs by version, it's an alternative way to support grpc query service, and much more compact than IAVL trees, similar in size with the compressed change set files.

Documentation

Index

Constants

View Source
const (
	OffsetHeight  = 0
	OffsetVersion = OffsetHeight + 4
	OffsetSize    = OffsetVersion + 4
	OffsetKey     = OffsetSize + 8
	OffsetRight   = OffsetKey + 8
	OffsetLeft    = OffsetRight + 4
	OffsetValue   = OffsetKey + 8
	OffsetHash    = OffsetValue + 8

	SizeHash            = sha256.Size
	SizeNodeWithoutHash = OffsetHash
	SizeNode            = SizeNodeWithoutHash + SizeHash

	// encoding key/value length as 4 bytes with little endianness.
	SizeKeyLen   = 4
	SizeValueLen = 4
)
View Source
const (
	// SnapshotFileMagic is little endian encoded b"IAVL"
	SnapshotFileMagic = 1280721225

	// the initial snapshot format
	SnapshotFormat = 0

	// magic: uint32, format: uint32, version: uint64, root node index: uint32
	SizeMetadata = 20

	// EmptyRootNodeIndex is a special value of root node index to represent empty tree
	EmptyRootNodeIndex = math.MaxUint32
)

Variables

This section is empty.

Functions

func EncodeBytes

func EncodeBytes(w io.Writer, bz []byte) error

EncodeBytes writes a varint length-prefixed byte slice to the writer, it's used for hash computation, must be compactible with the official IAVL implementation.

func HashNode

func HashNode(node Node) []byte

HashNode computes the hash of the node.

func Mmap

func Mmap(f *os.File) ([]byte, *[mmap.MaxMapSize]byte, error)

func VerifyHash

func VerifyHash(node Node) bool

VerifyHash compare node's cached hash with computed one

Types

type MemNode

type MemNode struct {
	// contains filtered or unexported fields
}

func (*MemNode) Hash

func (node *MemNode) Hash() []byte

Computes the hash of the node without computing its descendants. Must be called on nodes which have descendant node hashes already computed.

func (*MemNode) Height

func (node *MemNode) Height() int8

func (*MemNode) Key

func (node *MemNode) Key() []byte

func (*MemNode) Left

func (node *MemNode) Left() Node

func (*MemNode) Mutate

func (node *MemNode) Mutate(version int64) *MemNode

Mutate clears hash and update version field to prepare for further modifications.

func (*MemNode) Right

func (node *MemNode) Right() Node

func (*MemNode) Size

func (node *MemNode) Size() int64

func (*MemNode) Value

func (node *MemNode) Value() []byte

func (*MemNode) Version

func (node *MemNode) Version() int64

type MmapFile

type MmapFile struct {
	// contains filtered or unexported fields
}

MmapFile manage the resources of a mmap-ed file

func NewMmap

func NewMmap(path string) (*MmapFile, error)

Open openes the file and create the mmap. the mmap is created with flags: PROT_READ, MAP_SHARED, MADV_RANDOM.

func (*MmapFile) Close

func (m *MmapFile) Close() error

Close closes the file and mmap handles

func (*MmapFile) Data

func (m *MmapFile) Data() []byte

Data returns the mmap-ed buffer

type Node

type Node interface {
	Height() int8
	Size() int64
	Version() int64
	Key() []byte
	Value() []byte
	Left() Node
	Right() Node
	Hash() []byte

	// PersistedNode clone a new node, MemNode modify in place
	Mutate(version int64) *MemNode
}

Node interface encapsulate the interface of both PersistedNode and MemNode.

type PersistedNode

type PersistedNode struct {
	// contains filtered or unexported fields
}

PersistedNode is backed by serialized byte array, usually mmap-ed from disk file. Encoding format (all integers are encoded in little endian): - height : int8 // padded to 4bytes - version : int32 - size : int64 - key : int64 - left : int32 // node index, inner node only - right : int32 // node index, inner node only - value : int64 offset // leaf node only - hash : [32]byte

func (PersistedNode) Hash

func (node PersistedNode) Hash() []byte

func (PersistedNode) Height

func (node PersistedNode) Height() int8

func (PersistedNode) Key

func (node PersistedNode) Key() []byte

func (PersistedNode) Left

func (node PersistedNode) Left() Node

Left result is not defined for leaf nodes.

func (PersistedNode) Mutate

func (node PersistedNode) Mutate(version int64) *MemNode

func (PersistedNode) Right

func (node PersistedNode) Right() Node

Right result is not defined for leaf nodes.

func (PersistedNode) Size

func (node PersistedNode) Size() int64

func (PersistedNode) Value

func (node PersistedNode) Value() []byte

Value result is not defined for non-leaf node.

func (PersistedNode) Version

func (node PersistedNode) Version() int64

type Snapshot

type Snapshot struct {
	// contains filtered or unexported fields
}

Snapshot manage the lifecycle of mmap-ed files for the snapshot, it must out live the objects that derived from it.

func NewEmptySnapshot added in v0.0.4

func NewEmptySnapshot(version uint64) *Snapshot

func OpenSnapshot

func OpenSnapshot(snapshotDir string) (*Snapshot, error)

OpenSnapshot parse the version number and the root node index from metadata file, and mmap the other files.

func (*Snapshot) Close

func (snapshot *Snapshot) Close() error

Close closes the file and mmap handles, clears the buffers.

func (*Snapshot) IsEmpty added in v0.0.4

func (snapshot *Snapshot) IsEmpty() bool

IsEmpty returns if the snapshot is an empty tree.

func (*Snapshot) Node

func (snapshot *Snapshot) Node(index uint32) PersistedNode

Node returns the node by index

func (*Snapshot) RootNode

func (snapshot *Snapshot) RootNode() PersistedNode

RootNode returns the root node

func (*Snapshot) ScanNodes added in v0.0.4

func (snapshot *Snapshot) ScanNodes(callback func(node PersistedNode) error) error

ScanNodes iterate over the nodes in the snapshot order (depth-first post-order)

func (*Snapshot) Version added in v0.0.4

func (snapshot *Snapshot) Version() uint64

Version returns the version of the snapshot

type Tree

type Tree struct {
	// contains filtered or unexported fields
}

verify change sets by replay them to rebuild iavl tree and verify the root hashes

func New

func New() *Tree

New creates an empty tree at genesis version

func NewEmptyTree

func NewEmptyTree(version int64) *Tree

NewEmptyTree creates an empty tree at an arbitrary version.

func NewFromSnapshot added in v0.0.4

func NewFromSnapshot(snapshot *Snapshot) *Tree

NewFromSnapshot mmap the blob files and create the root node.

func NewWithInitialVersion

func NewWithInitialVersion(initialVersion int64) *Tree

New creates a empty tree with initial-version, it happens when a new store created at the middle of the chain.

func (*Tree) Remove

func (t *Tree) Remove(key []byte)

func (*Tree) RootHash

func (t *Tree) RootHash() []byte

RootHash updates the hashes and return the current root hash

func (*Tree) SaveVersion

func (t *Tree) SaveVersion(updateHash bool) ([]byte, int64, error)

SaveVersion increases the version number and optionally updates the hashes

func (*Tree) Set

func (t *Tree) Set(key, value []byte)

func (*Tree) Version

func (t *Tree) Version() int64

Version returns the current tree version

func (*Tree) WriteSnapshot

func (t *Tree) WriteSnapshot(snapshotDir string) error

WriteSnapshot save the IAVL tree to a new snapshot directory.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL