offheap

package module
v0.0.0-...-f122ace Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 9, 2024 License: MIT Imports: 9 Imported by: 0

README

offheap (formerly go-offheap-hashtable)

Docs:

http://godoc.org/github.com/glycerine/offheap

When GC pauses are long because you've got big hash tables in Go, you need an off-heap hash-table.

The purpose here is to have a hash table that can work away from Go's Garbage Collector, to avoid long GC pause times.

We accomplish this by writing our own Malloc() and Free() implementation (see malloc.go) which requests memory directly from the OS. The keys, the values, and the entire hash table itself is kept in this off-heap storage. This storage can also optionally be backed by a memory mapped file for speedy persistence and fast startup times.

Update: the cgo-malloc branch of this github repo has an implementation that uses CGO to call the malloc/calloc/free functions in the C stdlib. Using CGO gives up the save-to-disk instantly feature and creates a portability issue where you have linked against a specific version of the C stdlib. However if you are making/destroying alot of tables, the CGO approach may be faster.

See offheap.go for all the interesting code. Modify val_t to hold you values, and key_t to contain your keys. Current sample code for three types of keys (int64, []byte, and strings) is provided (see bytekey.go). For the hashing function itself, the incredibly fast xxhash64 is used to produce uint64 hashes of strings and []byte.

Note that all your key and values should be inline in the Cell. If you point back into the go-heap, such values maybe garbage collected by the Go runtime without notice.

On Save(), serialization of the HashTable itself is done using msgpack to write bytes to the first page (4k bytes) of the memory mapped file. This uses github.com/tinylib/msgp which is a blazing fast msgpack serialization library. It is fast because it avoids reflection and pre-computes the serializations (using go generate based inspection of your go source). If you need to serialize your values into the Val_t, I would suggest evaluating the msgp for serialization and deserialization. The author, Philip Hofer, has done a terrific job and put alot of effort into tuning it for performance. If you are still pressed for speed, consider also omitting the field labels using the '//msgp:tuple MyValueType' annotation. As Mr. Hofer says, "For smaller objects, tuple encoding can yield serious performance improvements." [https://github.com/tinylib/msgp/wiki/Preprocessor-Directives].

Initial HashTable implementation inspired by the public domain C++ code of

https://github.com/preshing/CompareIntegerMaps

See also

http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array/

for performance studies of the C++ code.

installation

 go get -t github.com/glycerine/go-offheap-hashtable

testing

go test -v

The implementation was test driven and includes over 4500 correctness checks.

Copyright (C) 2015 by Jason E. Aten, Ph.D.

License: MIT.

Documentation

Overview

Offheap

An off-heap hash-table for Go (golang). Originally called go-offheap-hashtable, but now shortened to just offheap.

The purpose here is to have a hash table that can work away from Go's Garbage Collector, to avoid long GC pause times.

We accomplish this by writing our own Malloc() and Free() implementation (see malloc.go) which requests memory directly from the OS.

The keys, values, and entire hash table is kept on off-heap storage. This storage can also optionally be backed by memory mapped file for speedy persistence and fast startup times.

Initial HashTable implementation inspired by the public domain C++ code of

https://github.com/preshing/CompareIntegerMaps

See also

http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array/

for performance studies of the C++ code.

HashTable

The implementation is mostly in offheap.go, read that to start.

Maps pointer-sized integers to Cell structures, which in turn hold Val_t as well as Key_t structures.

Uses open addressing with linear probing. This makes it very cache friendly and thus very fast.

In the t.Cells array, UnHashedKey = 0 is reserved to indicate an unused cell. Actual value for key 0 (if any) is stored in t.ZeroCell. The hash table automatically doubles in size when it becomes 75% full. The hash table never shrinks in size, even after Clear(), unless you explicitly call Compact().

Basic operations: Lookup(), Insert(), DeleteKey(). These are the equivalent of the builtin map[uint64]interface{}.

As an example of how to specialize for a map[string]*Cell equivalent, see the following functions in the bytekey.go file:

func (t *HashTable) InsertStringKey(strkey string, value interface{}) bool
func (t *HashTable) LookupStringKey(strkey string) (Val_t, bool)
func (t *HashTable) DeleteStringKey(strkey string) bool

Example use:

h := offheap.NewHashTable(2)

// basic three operations are:
h.InsertStringKey("My number", 43)
val, ok := h.LookupStringKey("My number")
h.DeleteStringKey("My number")

Note that this library is only a starting point of source code, and not intended to be used without customization. Users of the HashTable will have to customize it by changing the definitions of Key_t and Val_t to suite their needs.

On Save(), serialization of the HashTable itself is done using msgpack to write bytes to the first page (4k bytes) of the memory mapped file. This uses github.com/tinylib/msgp which is a blazing fast msgpack serialization library. It is fast because it avoids reflection and pre-computes the serializations (using go generate based inspection of your go source). If you need to serialize your values into the Val_t, I would suggest evaluating the msgp for serialization and deserialization. The author, Philip Hofer, has done a terrific job and put alot of effort into tuning it for performance. If you are still pressed for speed, consider also omitting the field labels using the '//msgp:tuple MyValueType' annotation. As Mr. Hofer says, "For smaller objects, tuple encoding can yield serious performance improvements." [https://github.com/tinylib/msgp/wiki/Preprocessor-Directives].

Related ideas:

https://gist.github.com/mish15/9822474 (using CGO)

CGO note: the cgo-malloc branch of this github repo has an implementation that uses CGO to call the malloc/calloc/free functions in the C stdlib. Using CGO gives up the save-to-disk instantly feature and creates a portability issue where you have linked against a specific version of the C stdlib. However if you are making/destroying alot of tables, the CGO approach may be faster. This is because calling malloc and free in the standard C library are much faster than making repeated system calls to mmap().

more related ideas:

https://groups.google.com/forum/#!topic/golang-nuts/kCQP6S6ZGh0

not fully off-heap, but using a slice instead of a map appears to help GC quite alot too:

https://github.com/cespare/kvcache/blob/master/refmap.go

Index

Constants

View Source
const MetadataHeaderMaxBytes = 4096

metadata serialization size can never grow bigger than MetadataHeaderMaxBytes (currently one 4K page), without impacting backwards compatibility. We reserve this many bytes at the beginning of the memory mapped file for the metadata.

Variables

This section is empty.

Functions

This section is empty.

Types

type ByteKeyHashTable

type ByteKeyHashTable HashTable

ByteKeyHashTable shows how to specialize HashTable to handle the []byte type as a key. At present, given the current definition of Key_t (you should redefine Key_t if needed -- see offheap.go), only the first 64 bytes are used in the hash of the key.

func NewByteKeyHashTable

func NewByteKeyHashTable(initialSize int64) *ByteKeyHashTable

NewByteKeyHashTable produces a new ByteKeyHashTable, one specialized for handling []byte as keys.

func (*ByteKeyHashTable) DeleteBK

func (t *ByteKeyHashTable) DeleteBK(bytekey []byte) bool

DeleteBK removes an entry with a []byte key. By default only len(Key_t) bytes are used in the key.

func (*ByteKeyHashTable) InsertBK

func (t *ByteKeyHashTable) InsertBK(bytekey []byte, value interface{}) bool

InsertBK is the insert function for []byte keys. By default only len(Key_t) bytes are used in the key.

func (*ByteKeyHashTable) LookupBK

func (t *ByteKeyHashTable) LookupBK(bytekey []byte) (Val_t, bool)

LookupBK is the lookup function for []byte keys. By default only len(Key_t) bytes are used in the key.

type Cell

type Cell struct {
	UnHashedKey uint64
	ByteKey     Key_t
	Value       Val_t // customize this to hold your value's data type entirely here.
}

Cell is the basic payload struct, stored inline in the HashTable. The cell is returned by the fundamental Lookup() function. The member Value is where the value that corresponds to the key (in ByteKey) is stored. Both the key (in ByteKey) and the value (in Value) are stored inline inside the hashtable, so that all storage for the hashtable is in the same offheap segment. The uint64 key given to fundamental Insert() method is stored in UnHashedKey. The hashed value of the UnHashedKey is not stored in the Cell, but rather computed as needed by the basic Insert() and Lookup() methods.

func (*Cell) DecodeMsg

func (z *Cell) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable

func (*Cell) EncodeMsg

func (z *Cell) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*Cell) GetInt

func (cell *Cell) GetInt() int

GetInt retreives an integer value from the cell.

func (*Cell) GetString

func (cell *Cell) GetString() string

GetString retreives a string value from the cell.Value.

func (*Cell) MarshalMsg

func (z *Cell) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*Cell) Msgsize

func (z *Cell) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*Cell) SetInt

func (cell *Cell) SetInt(n int)

SetInt stores an integer value in the cell.

func (*Cell) SetString

func (cell *Cell) SetString(s string)

SetString stores string s (up to val_t length, currently 56 bytes) in cell.Value.

func (*Cell) SetValue

func (cell *Cell) SetValue(v interface{})

SetValue stores any value v in the Cell. Note that users of the library will need to extend this for their type. Only strings of length less than 56, and integers are handled by default.

func (*Cell) UnmarshalMsg

func (z *Cell) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

func (*Cell) ZeroValue

func (cell *Cell) ZeroValue()

ZeroValue sets the cell's value to all zeros.

type HashTable

type HashTable struct {
	MagicNumber   int     // distinguish between reading from empty file versus an on-disk HashTable.
	Cells         uintptr `msg:"-"`
	CellSz        uint64
	ArraySize     uint64
	Population    uint64
	ZeroUsed      bool
	ZeroCell      Cell
	OffheapHeader []byte     `msg:"-"`
	OffheapCells  []byte     `msg:"-"`
	Mmm           MmapMalloc `msg:"-"`
}

HashTable represents the off-heap hash table. Create a new one with NewHashTable(), then use Lookup(), Insert(), and DeleteKey() on it. HashTable is meant to be customized by the user, to reflect your choice of key and value types. See StringHashTable and ByteKeyHashTable for examples of this specialization process.

func NewHashFileBacked

func NewHashFileBacked(initialSize int64, filepath string) *HashTable

func NewHashTable

func NewHashTable(initialSize int64) *HashTable

Create a new hash table, able to hold initialSize count of keys.

func (*HashTable) CellAt

func (t *HashTable) CellAt(pos uint64) *Cell

CellAt: fetch the cell at a given index. E.g. t.CellAt(pos) replaces t.Cells[pos]

func (*HashTable) Clear

func (t *HashTable) Clear()

Clear does not resize the table, but zeroes-out all entries.

func (*HashTable) Compact

func (t *HashTable) Compact()

Compact will compress the hashtable so that it is at most 75% full.

func (*HashTable) DecodeMsg

func (z *HashTable) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable

func (*HashTable) DeleteCell

func (t *HashTable) DeleteCell(cell *Cell)

DeleteCell deletes the cell pointed to by cell.

func (*HashTable) DeleteKey

func (t *HashTable) DeleteKey(key uint64)

DeleteKey will delete the contents of the cell associated with key.

func (*HashTable) DestroyHashTable

func (t *HashTable) DestroyHashTable()

DestroyHashTable frees the memory-mapping, returning the memory containing the hash table and its cells to the OS. By default the save-to-file-on-disk functionality in malloc.go is not used, but that can be easily activated. See malloc.go. Deferencing any cells/pointers into the hash table after destruction will result in crashing your process, almost surely.

func (*HashTable) Dump

func (t *HashTable) Dump()

Dump provides a diagnostic dump of the full HashTable contents.

func (*HashTable) EncodeMsg

func (z *HashTable) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*HashTable) Insert

func (t *HashTable) Insert(key uint64) (*Cell, bool)

Insert a key and get back the Cell for that key, so as to enable assignment of Value within that Cell, for the specified key. The 2nd return value is false if key already existed (and thus required no addition); if the key already existed you can inspect the existing value in the *Cell returned.

func (*HashTable) InsertBK

func (t *HashTable) InsertBK(bytekey []byte, value interface{}) bool

InsertBK is the insert function for []byte keys. By default only len(Key_t) bytes are used in the key.

func (*HashTable) InsertIntValue

func (t *HashTable) InsertIntValue(key uint64, value int) bool

InsertIntValue inserts value under key in the table.

func (*HashTable) Lookup

func (t *HashTable) Lookup(key uint64) *Cell

Lookup a cell based on a uint64 key value. Returns nil if key not found.

func (*HashTable) LookupBK

func (t *HashTable) LookupBK(bytekey []byte) (Val_t, bool)

func (*HashTable) LookupBKInt

func (t *HashTable) LookupBKInt(bytekey []byte) (int, bool)

func (*HashTable) MarshalMsg

func (z *HashTable) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*HashTable) Msgsize

func (z *HashTable) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*HashTable) NewIterator

func (tab *HashTable) NewIterator() *Iterator

NewIterator creates a new iterator for HashTable tab.

func (*HashTable) Repopulate

func (t *HashTable) Repopulate(desiredSize uint64)

Repopulate expands the hashtable to the desiredSize count of cells.

func (*HashTable) Save

func (t *HashTable) Save(background bool) error

Save syncs the memory mapped file to disk using MmapMalloc::BlockUntilSync(). If background is true, the save using BackgroundSync() instead of blocking.

func (*HashTable) UnmarshalMsg

func (z *HashTable) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

type Iterator

type Iterator struct {
	Tab *HashTable
	Pos int64
	Cur *Cell // will be set to nil when done with iteration.
}

Iterator

sample use: given a HashTable h, enumerate h's contents with:

for it := offheap.NewIterator(h); it.Cur != nil; it.Next() {
  found = append(found, it.Cur.UnHashedKey)
}

func (*Iterator) DecodeMsg

func (z *Iterator) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable

func (*Iterator) Done

func (it *Iterator) Done() bool

Done checks to see if we have already iterated through all cells in the table. Equivalent to checking it.Cur == nil.

func (*Iterator) EncodeMsg

func (z *Iterator) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*Iterator) MarshalMsg

func (z *Iterator) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*Iterator) Msgsize

func (z *Iterator) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*Iterator) Next

func (it *Iterator) Next() *Cell

Next advances the iterator so that it.Cur points to the next filled cell in the table, and returns that cell. Returns nil once there are no more cells to be visited.

func (*Iterator) UnmarshalMsg

func (z *Iterator) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

type Key_t

type Key_t [64]byte

Key_t is the basic type for keys. Users of the library will probably redefine this.

func (*Key_t) DecodeMsg

func (z *Key_t) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable

func (*Key_t) EncodeMsg

func (z *Key_t) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*Key_t) MarshalMsg

func (z *Key_t) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*Key_t) Msgsize

func (z *Key_t) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*Key_t) UnmarshalMsg

func (z *Key_t) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

type MmapMalloc

type MmapMalloc struct {
	Path         string
	File         *os.File
	Fd           int
	FileBytesLen int64
	BytesAlloc   int64
	MMap         gommap.MMap // equiv to Mem, just avoids casts everywhere.
	Mem          []byte      // equiv to Mmap
}

The MmapMalloc struct represents either an anonymous, private region of memory (if path was "", or a memory mapped file if path was supplied to Malloc() at creation.

Malloc() creates and returns an MmapMalloc struct, which can then be later Free()-ed. Malloc() calls request memory directly from the kernel via mmap(). Memory can optionally be backed by a file for simplicity/efficiency of saving to disk.

For use when the Go GC overhead is too large, and you need to move the hash table off-heap.

func Growmap

func Growmap(oldmap *MmapMalloc, newSize int64) (newmap *MmapMalloc, err error)

Growmap grows a memory mapping

func Malloc

func Malloc(numBytes int64, path string) *MmapMalloc

Malloc() creates a new memory region that is provided directly by OS via the mmap() call, and is thus not scanned by the Go garbage collector.

If path is not empty then we memory map to the given path. Otherwise it is just like a call to malloc(): an anonymous memory allocation, outside the realm of the Go Garbage Collector. If numBytes is -1, then we take the size from the path file's size. Otherwise the file is expanded or truncated to be numBytes in size. If numBytes is -1 then a path must be provided; otherwise we have no way of knowing the size to allocate, and the function will panic.

The returned value's .Mem member holds a []byte pointing to the returned memory (as does .MMap, for use in other gommap calls).

func (*MmapMalloc) BackgroundSync

func (mm *MmapMalloc) BackgroundSync()

BackgroundSync() schedules a sync to disk, but may return before it is done. Without a call to either BackgroundSync() or BlockUntilSync(), there is no guarantee that file has ever been written to disk at any point before the munmap() call that happens during Free(). See the man pages msync(2) and mmap(2) for details.

func (*MmapMalloc) BlockUntilSync

func (mm *MmapMalloc) BlockUntilSync()

BlockUntilSync() returns only once the file is synced to disk.

func (*MmapMalloc) Free

func (mm *MmapMalloc) Free()

Free releases the memory allocation back to the OS by removing the (possibly anonymous and private) memroy mapped file that was backing it. Warning: any pointers still remaining will crash the program if dereferenced.

func (*MmapMalloc) TruncateTo

func (mm *MmapMalloc) TruncateTo(newSize int64)

TruncateTo enlarges or shortens the file backing the memory map to be size newSize bytes. It only impacts the file underlying the mapping, not the mapping itself at this point.

type StringHashTable

type StringHashTable HashTable

StringHashTable shows how to specialize HashTable to handle strings as keys. At present, given the current definition of Key_t (you should redefine Key_t if needed -- see offheap.go), only the first 64 bytes are used in the hash of the key.

func NewStringHashTable

func NewStringHashTable(initialSize int64) *StringHashTable

NewStringHashTable produces a new StringHashTable, one specialized for handling keys of type string.

func (*StringHashTable) DeleteStringKey

func (t *StringHashTable) DeleteStringKey(strkey string) bool

DeleteStringKey deletes the cell (if there is one) that has been previously inserted with the given strkey string key.

func (*StringHashTable) DumpStringKey

func (t *StringHashTable) DumpStringKey()

DumpStringKey provides a diagnostic printout of the contents of a hashtable that is using strings as keys.

func (*StringHashTable) InsertStringKey

func (t *StringHashTable) InsertStringKey(strkey string, value interface{}) bool

InsertStringKey inserts a value with a key that is a string.

func (*StringHashTable) LookupStringKey

func (t *StringHashTable) LookupStringKey(strkey string) (Val_t, bool)

LookupStringKey looks up a value based on a key that is a string.

type Val_t

type Val_t [56]byte

Val_t is the basic type for values stored in the cells in the table. Users of the library will probably redefine this to be a different size at the very least.

func (*Val_t) DecodeMsg

func (z *Val_t) DecodeMsg(dc *msgp.Reader) (err error)

DecodeMsg implements msgp.Decodable

func (*Val_t) EncodeMsg

func (z *Val_t) EncodeMsg(en *msgp.Writer) (err error)

EncodeMsg implements msgp.Encodable

func (*Val_t) GetInt

func (v *Val_t) GetInt() int

GetInt gets an int value for Val_t v.

func (*Val_t) GetString

func (v *Val_t) GetString() string

GetString retreives a string value for Val_t v.

func (*Val_t) MarshalMsg

func (z *Val_t) MarshalMsg(b []byte) (o []byte, err error)

MarshalMsg implements msgp.Marshaler

func (*Val_t) Msgsize

func (z *Val_t) Msgsize() (s int)

Msgsize returns an upper bound estimate of the number of bytes occupied by the serialized message

func (*Val_t) SetInt

func (v *Val_t) SetInt(n int)

SetInt sets an int value for Val_t v.

func (*Val_t) SetString

func (v *Val_t) SetString(s string)

SetString sets a string value for Val_t v.

func (*Val_t) UnmarshalMsg

func (z *Val_t) UnmarshalMsg(bts []byte) (o []byte, err error)

UnmarshalMsg implements msgp.Unmarshaler

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL