hamt

Version: v2.0.0
Published: Sep 8, 2020 License: MIT Imports: 12 Imported by: 8

README

go-hamt-ipld

This package is a reference implementation of the IPLD HAMT used in the Filecoin blockchain. It includes some optional flexibility such that it may be used for other purposes outside of Filecoin.

HAMT is a "hash array mapped trie". This implementation extends the standard form by including buckets for the key/value pairs at storage leaves and CHAMP mutation semantics. The CHAMP invariant and mutation rules let us maintain a canonical form for any set of keys and their values, regardless of insertion order and of intermediate insertions and deletions. Therefore, for any given set of keys and their values, the root node of a HAMT using the same parameters and CHAMP semantics should always produce the same content identifier (CID).

See https://godoc.org/github.com/filecoin-project/go-hamt-ipld for more information and API details.

License

MIT © Whyrusleeping

Documentation

Overview

Package hamt provides a reference implementation of the IPLD HAMT used in the Filecoin blockchain. It includes some optional flexibility such that it may be used for other purposes outside of Filecoin.

HAMT is a "hash array mapped trie" (https://en.wikipedia.org/wiki/Hash_array_mapped_trie). This implementation extends the standard form by including buckets for the key/value pairs at storage leaves and CHAMP mutation semantics (https://michael.steindorfer.name/publications/oopsla15.pdf). The CHAMP invariant and mutation rules let us maintain a canonical form for any set of keys and their values, regardless of insertion order and of intermediate insertions and deletions. Therefore, for any given set of keys and their values, the root node of a HAMT using the same parameters and CHAMP semantics should always produce the same content identifier (CID).

Algorithm Overview

The HAMT algorithm hashes incoming keys and uses incrementing subsections of that hash digest at each level of its tree structure to determine the placement of either the entry or a link to a child node of the tree. A `bitWidth` determines the number of bits of the hash to use for index calculation at each level of the tree such that the root node takes the first `bitWidth` bits of the hash to calculate an index and as we move lower in the tree, we move along the hash by `depth x bitWidth` bits. In this way, a sufficiently randomizing hash function will generate a hash that provides a new index at each level of the data structure. An index comprising `bitWidth` bits will generate index values of `[ 0, 2^bitWidth )`. So a `bitWidth` of 8 will generate indexes of 0 to 255 inclusive.
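The index calculation described above can be sketched with the standard library alone. This is a minimal illustration rather than the package's code: `indexAtDepth` is a hypothetical helper, SHA-256 stands in for the default Murmur3 hash (which is not in the standard library), and consuming bits most-significant-bit first is an assumption.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// indexAtDepth returns the bitWidth-bit index for the given tree depth,
// read from the digest at bit offset depth*bitWidth.
// Assumption: bits are consumed most-significant-bit first.
func indexAtDepth(digest []byte, depth, bitWidth int) (int, error) {
	if bitWidth < 1 || bitWidth > 8 {
		return 0, errors.New("bitWidth must be between 1 and 8")
	}
	start := depth * bitWidth
	if depth < 0 || start+bitWidth > len(digest)*8 {
		return 0, errors.New("hash digest exhausted")
	}
	idx := 0
	for i := start; i < start+bitWidth; i++ {
		bit := (digest[i/8] >> (7 - uint(i%8))) & 1
		idx = idx<<1 | int(bit)
	}
	return idx, nil
}

func main() {
	digest := sha256.Sum256([]byte("example-key"))
	for depth := 0; depth < 3; depth++ {
		idx, err := indexAtDepth(digest[:], depth, 5)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("depth %d -> index %d\n", depth, idx)
	}
}
```

With a `bitWidth` of 5 every index falls in `[ 0, 32 )`, and the helper reports an error once the digest has no further bits to offer.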

Each node in the tree can therefore hold up to `2^bitWidth` elements of data, which we store in an array. In this HAMT and the IPLD HashMap we store entries in buckets. For a `Set(key, value)` mutation where the index generated at the root node for the hash of key denotes an array index that does not yet contain an entry, we create a new bucket and insert the key / value pair entry. In this way, a single node can theoretically hold up to `2^bitWidth x bucketSize` entries, where `bucketSize` is the maximum number of elements a bucket is allowed to contain ("collisions"). In practice, indexes do not distribute with perfect randomness so this maximum is theoretical. Entries stored in the node's buckets are stored in key-sorted order.
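The capacity arithmetic and the key-sorted bucket can be illustrated as follows. This is a sketch under stated assumptions: `maxEntries`, `entry` and `insertSorted` are hypothetical names, and keys are assumed to be ordered by plain byte comparison.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

const bucketSize = 3 // the fixed bucket size described in the text

type entry struct {
	key   []byte
	value string
}

// maxEntries is the theoretical capacity of a single node:
// 2^bitWidth buckets of bucketSize entries each.
func maxEntries(bitWidth int) int {
	return (1 << uint(bitWidth)) * bucketSize
}

// insertSorted places e into bucket, keeping keys in ascending byte order
// (assumed ordering). An existing key has its value replaced. The returned
// bool reports whether the bucket now holds more than bucketSize entries.
func insertSorted(bucket []entry, e entry) ([]entry, bool) {
	i := sort.Search(len(bucket), func(i int) bool {
		return bytes.Compare(bucket[i].key, e.key) >= 0
	})
	if i < len(bucket) && bytes.Equal(bucket[i].key, e.key) {
		bucket[i].value = e.value
		return bucket, false
	}
	bucket = append(bucket, entry{})
	copy(bucket[i+1:], bucket[i:])
	bucket[i] = e
	return bucket, len(bucket) > bucketSize
}

func main() {
	fmt.Println("capacity at bitWidth 8:", maxEntries(8))
	fmt.Println("capacity at bitWidth 5:", maxEntries(5))

	var bucket []entry
	for _, k := range []string{"c", "a", "b"} {
		bucket, _ = insertSorted(bucket, entry{key: []byte(k), value: k})
	}
	for _, e := range bucket {
		fmt.Printf("%s ", e.key)
	}
	fmt.Println()
}
```

Because the bucket is kept sorted, the same set of keys yields the same bucket contents whatever order they arrive in, which is one ingredient of the canonical form.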

Parameters

This HAMT implementation:

• Fixes the `bucketSize` to 3.

• Defaults the `bitWidth` to 8; within Filecoin, however, it uses 5.

• Defaults the hash algorithm to the 64-bit variant of Murmur3-x64.

Further Reading

The algorithm used here is identical to that of the IPLD HashMap algorithm specified at https://github.com/ipld/specs/blob/master/data-structures/hashmap.md. The specific parameters used by Filecoin and the DAG-CBOR block layout differ from the specification and are defined at https://github.com/ipld/specs/blob/master/data-structures/hashmap.md#Appendix-Filecoin-hamt-variant.

Variables

var ErrMalformedHamt = fmt.Errorf("HAMT node was malformed")

ErrMalformedHamt is returned whenever a block intended as a HAMT node does not conform to the expected form that a block may take. This can occur during block-load where initial validation takes place or during traversal where certain conditions are expected to be met.

var ErrMaxDepth = fmt.Errorf("attempted to traverse HAMT beyond max-depth")

ErrMaxDepth is returned when the HAMT spans further than the hash function is capable of representing. This can occur when sufficient hash collisions (e.g. from a weak hash function and attacker-provided keys) extend leaf nodes beyond the number of bits that a hash can represent. It can also occur on extremely large (likely impractical) HAMTs that are unable to be represented with the hash function used. Hash functions with larger byte output increase the maximum theoretical depth of a HAMT.
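As a rough illustration of why larger hash output raises the depth limit, the number of addressable levels is the digest size in bits divided by `bitWidth`. `maxLevels` is a hypothetical helper; whether the library counts depth in exactly this way is an assumption.

```go
package main

import "fmt"

// maxLevels returns how many tree levels a digest of hashBytes bytes can
// address when each level consumes bitWidth bits of the digest.
func maxLevels(hashBytes, bitWidth int) int {
	return (hashBytes * 8) / bitWidth
}

func main() {
	fmt.Println(maxLevels(8, 8))  // 64-bit digest, bitWidth 8: 8 levels
	fmt.Println(maxLevels(8, 5))  // 64-bit digest, bitWidth 5: 12 levels
	fmt.Println(maxLevels(32, 5)) // 256-bit digest, bitWidth 5: 51 levels
}
```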

var ErrNotFound = fmt.Errorf("not found")

ErrNotFound is returned when a Find operation fails to locate the specified key in the HAMT.
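Callers typically distinguish a sentinel error like this from other failures with `errors.Is`. The sketch below redeclares `ErrNotFound` locally and uses a hypothetical map-backed `find` in place of a real HAMT lookup, so it compiles without the hamt package.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound mirrors the documented sentinel; it is redeclared here only
// so that the sketch is self-contained.
var ErrNotFound = fmt.Errorf("not found")

// find is a hypothetical stand-in for a HAMT lookup.
func find(store map[string]string, k string) (string, error) {
	v, ok := store[k]
	if !ok {
		return "", ErrNotFound
	}
	return v, nil
}

// has reports whether k exists, treating ErrNotFound as "absent" and
// passing any other error through to the caller.
func has(store map[string]string, k string) (bool, error) {
	_, err := find(store, k)
	if errors.Is(err, ErrNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	store := map[string]string{"alpha": "1"}
	for _, k := range []string{"alpha", "beta"} {
		ok, err := has(store, k)
		fmt.Println(k, ok, err)
	}
}
```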

Types

type KV

type KV struct {
	Key   []byte
	Value *cbg.Deferred
}

KV represents leaf storage within a HAMT node. A Pointer may hold up to `bucketSize` KV elements, where each KV contains a key and value pair stored by the user.

Keys are represented as bytes.

The IPLD Schema representation of this data structure is as follows:

type KV struct {
	key Bytes
	value Any
} representation tuple

func (*KV) MarshalCBOR

func (t *KV) MarshalCBOR(w io.Writer) error

func (*KV) UnmarshalCBOR

func (t *KV) UnmarshalCBOR(r io.Reader) error

type Node

type Node struct {
	Bitfield *big.Int   `refmt:"bf"`
	Pointers []*Pointer `refmt:"p"`
	// contains filtered or unexported fields
}

Node is a single point in the HAMT, encoded as an IPLD tuple in DAG-CBOR of shape:

[bytes, [Pointer...]]

where 'bytes' is the output of big.Int#Bytes() for the Bitfield and the Pointers array contains between 1 and `2^bitWidth` elements.

The Bitfield provides us with a mechanism to store a compacted array of Pointers. Each bit in the Bitfield represents an element in a sparse array where `1` indicates the element is present in the Pointers array and `0` indicates it is omitted. To look up a specific index in the Pointers array you must first count the number of `1`s (popcount) up to the element you are looking for. For example, a Bitfield of `10010110000`, read left to right from index `[0]`, shows that we have a 4 element Pointers array. Indexes `[1]` and `[2]` are not present, but index `[3]` is at the second position of our Pointers array.

(Note: the `refmt` tags are ignored by cbor-gen which will generate an array type rather than map.)

The IPLD Schema representation of this data structure is as follows:

type Node struct {
	bitfield Bytes
	pointers [Pointer]
} representation tuple

func LoadNode

func LoadNode(ctx context.Context, cs cbor.IpldStore, c cid.Cid, options ...Option) (*Node, error)

LoadNode loads a HAMT Node from the IpldStore and configures it according to any specified Option parameters. Where the parameters of this HAMT vary from the defaults (hash function and bitWidth), those variations _must_ be supplied here via Options otherwise the HAMT will not be readable.

Where the data is expected to have a long shelf-life, users should consider how their HAMT parameters are stored or specified along with their HAMT, as future users will need to know the parameters of a HAMT being loaded in order to decode it. Users should also NOT rely on the default parameters of this library remaining the defaults long-term and should have strategies in place to manage variations.

func NewNode

func NewNode(cs cbor.IpldStore, options ...Option) *Node

NewNode creates a new IPLD HAMT Node with the given IPLD store and any additional options (bitWidth and hash function).

This function creates a new HAMT that you can use directly and is also used internally to create child nodes.

func (*Node) Copy

func (n *Node) Copy() *Node

Copy a HAMT node and all of its contents. May be useful for mutation operations where the original needs to be preserved in memory.

This operation will also recursively clone any child nodes that are attached as cached nodes.

func (*Node) Delete

func (n *Node) Delete(ctx context.Context, k string) error

Delete removes an entry entirely from the HAMT structure.

This operation will result in the modification of _at least_ one IPLD block via the IpldStore. Depending on the contents of the leaf node, this operation may result in a node collapse to shrink the HAMT into its canonical form for the remaining data. For an insufficiently random collection of keys at the relevant leaf nodes such a collapse may cascade to further nodes.

func (*Node) Find

func (n *Node) Find(ctx context.Context, k string, out interface{}) error

Find navigates through the HAMT structure to where key `k` should exist. If the key is not found, an ErrNotFound error is returned. If the key is found and the `out` parameter has an UnmarshalCBOR(Reader) method, the value is decoded into `out`. If the key is found and the `out` parameter is `nil`, no decoding takes place and a `nil` error is returned (this can be used to determine whether a key exists when you don't need the value, e.g. when using the HAMT as a Set).

Depending on the size of the HAMT, this method may load a large number of child nodes via the HAMT's IpldStore.

func (*Node) FindRaw

func (n *Node) FindRaw(ctx context.Context, k string) ([]byte, error)

FindRaw performs the same function as Find, but returns the raw bytes found at the key's location (which may or may not be DAG-CBOR, see also SetRaw).

func (*Node) Flush

func (n *Node) Flush(ctx context.Context) error

Flush saves and purges any cached Nodes recursively from this Node through its (cached) children. Cached nodes primarily exist through the use of Copy() operations where the entire graph is instantiated in memory and each child pointer exists in cached form.

func (*Node) ForEach

func (n *Node) ForEach(ctx context.Context, f func(k string, val interface{}) error) error

ForEach recursively calls function f on each k / val pair found in the HAMT. This performs a full traversal of the graph and for large HAMTs can cause a large number of loads from the IpldStore. This should not be used lightly as it can incur large costs.

func (*Node) MarshalCBOR

func (t *Node) MarshalCBOR(w io.Writer) error

func (*Node) Set

func (n *Node) Set(ctx context.Context, k string, v interface{}) error

Set key k to value v, where v has a MarshalCBOR(bytes.Buffer) method to encode it.

func (*Node) SetRaw

func (n *Node) SetRaw(ctx context.Context, k string, raw []byte) error

SetRaw is similar to Set but sets key k in the HAMT to raw bytes without performing a DAG-CBOR marshal. The bytes may or may not be encoded DAG-CBOR (see also FindRaw for fetching raw form).

func (*Node) UnmarshalCBOR

func (t *Node) UnmarshalCBOR(r io.Reader) error

type Option

type Option func(*Node)

Option is a function that configures the node.

See UseTreeBitWidth and UseHashFunction.

func UseHashFunction

func UseHashFunction(hash func([]byte) []byte) Option

UseHashFunction allows you to set the hash function used for internal indexing by the HAMT.

Passing in the returned Option to NewNode will generate a new HAMT that uses the specified hash function.

The default hash function is murmur3-x64 but you should use a cryptographically secure function such as SHA2-256 if an attacker may be able to pick the keys in order to avoid potential hash collision (tree explosion) attacks.

func UseTreeBitWidth

func UseTreeBitWidth(bitWidth int) Option

UseTreeBitWidth allows you to set a custom bitWidth of the HAMT in bits (from 1-8).

Passing in the returned Option to NewNode will generate a new HAMT that uses the specified bitWidth.

The default bitWidth is 8.
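The `Option` type follows Go's functional options pattern. The sketch below reproduces the pattern with stand-in names (`config`, `option`, `useTreeBitWidth`, `useHashFunction`, `newConfig`) rather than the library's own types. SHA-256 is only a placeholder default, since the library's Murmur3 default is not in the standard library, and silently ignoring an out-of-range `bitWidth` is an assumption.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// config is a hypothetical stand-in for the fields an Option would set.
type config struct {
	bitWidth int
	hash     func([]byte) []byte
}

// option mirrors the shape of the documented Option type.
type option func(*config)

// useTreeBitWidth sets the bitWidth when it lies in the range 1-8.
// Assumption: out-of-range values are ignored.
func useTreeBitWidth(bitWidth int) option {
	return func(c *config) {
		if bitWidth >= 1 && bitWidth <= 8 {
			c.bitWidth = bitWidth
		}
	}
}

// useHashFunction replaces the hash used for index calculation.
func useHashFunction(hash func([]byte) []byte) option {
	return func(c *config) { c.hash = hash }
}

// newConfig applies the defaults first, then each option in order.
func newConfig(opts ...option) *config {
	c := &config{
		bitWidth: 8,
		hash: func(b []byte) []byte {
			sum := sha256.Sum256(b) // placeholder default for this sketch
			return sum[:]
		},
	}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	fmt.Println("default bitWidth:", newConfig().bitWidth)
	fmt.Println("custom bitWidth:", newConfig(useTreeBitWidth(5)).bitWidth)
}
```

The same options have to be supplied whenever an existing structure is loaded, which is why the text recommends recording the parameters alongside the data.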

type Pointer

type Pointer struct {
	KVs  []*KV   `refmt:"v,omitempty"`
	Link cid.Cid `refmt:"l,omitempty"`
	// contains filtered or unexported fields
}

Pointer is an element in a HAMT node's Pointers array, encoded as an IPLD map in DAG-CBOR of shape:

{"0": CID} or {"1": [KV...]}

A map with a single key of "0" contains a Link; a map with a single key of "1" contains a KV bucket. The map may contain only one of these two possible keys.

There are between 1 and 2^bitWidth of these Pointers in any HAMT node.

A Pointer contains either a KV bucket of up to `bucketSize` (3) values or a link (CID) to a child node. When a KV bucket overflows beyond `bucketSize`, the bucket is replaced with a link to a newly created HAMT node which will contain the `bucketSize+1` elements in its own Pointers array.
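The overflow behaviour can be modelled in memory. This is a simplified sketch, not the library's implementation: child nodes are plain Go pointers rather than CID links, SHA-256 with an assumed `bitWidth` of 8 stands in for the real hash and index calculation, and every name (`node`, `pointer`, `set`, `collidingKeys`, and so on) is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

const bucketSize = 3 // the fixed bucket size described in the text

type kv struct{ key, value string }

// pointer holds either a bucket or a child node, never both.
type pointer struct {
	bucket []kv
	child  *node
}

// node models the sparse Pointers array as a map from index to pointer.
type node struct{ pointers map[int]*pointer }

func newNode() *node { return &node{pointers: map[int]*pointer{}} }

// indexAt assumes a bitWidth of 8, so each depth consumes one digest byte.
func indexAt(key string, depth int) int {
	sum := sha256.Sum256([]byte(key))
	if depth >= len(sum) {
		panic("hash digest exhausted")
	}
	return int(sum[depth])
}

func (n *node) set(key, value string, depth int) {
	idx := indexAt(key, depth)
	p, ok := n.pointers[idx]
	switch {
	case !ok: // empty slot: start a new bucket
		n.pointers[idx] = &pointer{bucket: []kv{{key, value}}}
	case p.child != nil: // slot already links to a child node
		p.child.set(key, value, depth+1)
	default:
		for i := range p.bucket {
			if p.bucket[i].key == key { // existing key: replace the value
				p.bucket[i].value = value
				return
			}
		}
		if len(p.bucket) < bucketSize {
			p.bucket = append(p.bucket, kv{key, value})
			sort.Slice(p.bucket, func(i, j int) bool { return p.bucket[i].key < p.bucket[j].key })
			return
		}
		// Overflow: move all bucketSize+1 entries into a new child node.
		child := newNode()
		for _, e := range append(p.bucket, kv{key, value}) {
			child.set(e.key, e.value, depth+1)
		}
		p.bucket, p.child = nil, child
	}
}

// count returns the number of entries reachable from n.
func (n *node) count() int {
	total := 0
	for _, p := range n.pointers {
		if p.child != nil {
			total += p.child.count()
		} else {
			total += len(p.bucket)
		}
	}
	return total
}

// collidingKeys searches for n keys that share one root index, so that
// inserting them all forces a bucket overflow.
func collidingKeys(n int) []string {
	target := indexAt("key-0", 0)
	var keys []string
	for i := 0; len(keys) < n; i++ {
		if k := fmt.Sprintf("key-%d", i); indexAt(k, 0) == target {
			keys = append(keys, k)
		}
	}
	return keys
}

func main() {
	root := newNode()
	keys := collidingKeys(bucketSize + 1)
	for _, k := range keys {
		root.set(k, "v", 0)
	}
	p := root.pointers[indexAt(keys[0], 0)]
	fmt.Println("replaced by child:", p.child != nil, "entries:", root.count())
}
```

In the child node the entries are re-indexed using the next section of their hash, so keys that collided at the root usually spread across several buckets one level down.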

(Note: the `refmt` tags are ignored by cbor-gen which will generate an array type rather than map.)

The IPLD Schema representation of this data structure is as follows:

type Pointer union {
	&Node "0"
	Bucket "1"
} representation keyed

type Bucket [KV]

func (*Pointer) MarshalCBOR

func (t *Pointer) MarshalCBOR(w io.Writer) error

func (*Pointer) UnmarshalCBOR

func (t *Pointer) UnmarshalCBOR(br io.Reader) error
