hnsw

package module
v0.0.0-...-aab62e2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 15, 2023 License: MIT Imports: 12 Imported by: 0

README

Go (fast) HNSW

This project is a reference implementation of the Hierarchical Navigable Small World graph paper by Malkov & Yashunin (2018) as a companion to the AWS presentation by Ben Duncan What you need to know about Vector Databases. From use-cases to a deep dive on the technology

NOTE: This repository will be moved to github.com/aws-samples shortly

The project is designed for educational & benchmarking purposes. The project compares the speed of the HNSW graph for k-NN search, the impact of settings to create the HNSW graph (M, mmax, mmax0, efConstruction, efSearch) compared to a naive (brute-search) implementation.

Usage

Build (go > 1.21.0 recommended)

make build

Run the benchmark for 1 million vectors (16 dimensions)

./bin/vecbench -num 1000000 -m 16 -mmax 16 -mmax0 32 -ef 200 -size 16 -csvfile benchmarks/1m.csv

Example output:

Total searches 1000000
Total matches from ground Truth: 9654061
Average 10-NN precision: 0.965406
HNSW efSearch (70):

HNSW Stats:
h.M = 16
h.Mmax = 16
h.Mmax0 = 32
h.Efconstruction = 200
h.Ep = 447724
h.Maxlevel = 5
h.Heuristic = true
h.Ml = 0.360674

Number of nodes = 1000000
        Level 0, number of nodes 937441, number of connections 25710940, avg 25
        Level 1, number of nodes 58480, number of connections 1000928, avg 16
        Level 2, number of nodes 3828, number of connections 65248, avg 16
        Level 3, number of nodes 230, number of connections 4000, avg 16
        Level 4, number of nodes 18, number of connections 320, avg 16
        Level 5, number of nodes 2, number of connections 0, avg 0
Total number of node levels = 6
HNSW search complete in 122.544270 (secs)
HNSW search queries per second 8160.316251 (8 threaded)
HNSW search queries per second 1020.039531 (Single threaded)

Note, the benchmark tool will create the specified number of vectors in a HNSW graph, conduct a brute-search for every element to find the top k-NN (10) and used as a ground-truth reference.

Once complete a HNSW search will run for the entire dataset to find the k-NN with a stepped efSearch paramater (in 10 increments) to reach the HNSW ef paramater used to create the index. This is used to change the accuracy of the search and speed, demonstrating the queries per second (qps) that can be achieved.

Benchmark

To benchmark the results open the Jupyter Notebook benchmarks/gengraph.ipynb replacing

filename = "1m-m16-16d-200ef.csv"

With the CSV file created in the prior step.

1m-m16-16d-200ef.png

In addition, an interactive HTML version will be generated in the benchmarks directory.

Documentation

Index

Constants

View Source
const Efconstruction = 200 // Can be auto-configured using sample data
View Source
const M = 16

Set defaults for HNSW TODO: Auto select best values based on dataset and use ML model to determine

View Source
const Mmax = M
View Source
const Mmax0 = M * 2
View Source
const Version = 1.0

Variables

This section is empty.

Functions

This section is empty.

Types

type HNSW

type HNSW struct {
	Efconstruction int     // Size of the dynamic candidate list
	M              int     // Number of established connections (a reasonable range for M is from 5 to 48, smaller M generally produces better results for lower recalls and/or lower dimensional data. Bigger M is better for high recall and high dimensional data, and determines the memory consumption)
	Mmax           int     // Max number of connections per element/per layer (matches M)
	Mmax0          int     // Max for the 0 layer (Simulations suggest 2*M is a good choice, setting higher leads to performance degradation and excessive memory usage)
	Ml             float64 // Normalization factor for level generation
	Ep             int64   // Top layer of HNSW

	Maxlevel int // Track the current max level used

	Heuristic bool

	NodeList NodeList // Used to store the vectors within each node

	Wg sync.WaitGroup
	// contains filtered or unexported fields
}

func Load

func Load(filename string) (h HNSW, err error)

func New

func New(m int, mmax int, mmax0 int, efconstruction int, vecsize int) (h HNSW, err error)

func (*HNSW) AddConnections

func (h *HNSW) AddConnections(neighbourNode uint32, newNode uint32, level int)

Add links between nodes in the HNSW graph

func (*HNSW) BruteSearch

func (h *HNSW) BruteSearch(q *[]float32, K int) (topCandidates queue.PriorityQueue, err error)

Brute search

func (*HNSW) BruteSearchConcurrent

func (h *HNSW) BruteSearchConcurrent(size int, K int, numWorkers int) (resultChan chan SearchResults, jobs chan SearchQuery, err error)

Search concurrent

func (*HNSW) BruteSearchWorker

func (h *HNSW) BruteSearchWorker(id int, K int, jobs <-chan SearchQuery, resultChan chan<- SearchResults) error

func (*HNSW) FindEp

func (h *HNSW) FindEp(q *[]float32, currentObj *Node, layer int16) (match Node, currentDist float32, err error)

func (*HNSW) GetConnections

func (h *HNSW) GetConnections(ep *Node, level int) []uint32

Get links for a desired entry-point (ep) at a specified layer in the HNSW graph.

func (*HNSW) Insert

func (h *HNSW) Insert(q []float32) (uint32, error)

Input: Multi-layer graph hnsw, new element `q“, number of established connections `h.M“, max number of connections for each element per layer `h.Mmax“, size of the dynamic candidate list `h.efConstruction`, normalised factor for level generation `h.Ml` Output: update h inserting element q

func (*HNSW) InsertConcurrent

func (h *HNSW) InsertConcurrent(size int) (resultChan chan uint32, jobs chan []float32, err error)

Insert concurrent

func (*HNSW) InsertWorker

func (h *HNSW) InsertWorker(id int, jobs <-chan []float32, resultChan chan<- uint32) error

func (*HNSW) KnnSearch

func (h *HNSW) KnnSearch(q Node, K int, ef int) (nearestElements []Node)

Search layer Input: Query element `q`, number of nearest neighbours to return `K`, size of the dynamic candidate list `ef` Output: `nearestElements` to `q`

func (*HNSW) LegacySearchLayer

func (h *HNSW) LegacySearchLayer(q *[]float32, ep *[]float32, C *[]uint32, M int) queue.PriorityQueue

func (*HNSW) PeekNode

func (h *HNSW) PeekNode(id int) Node

Peek a node

func (*HNSW) Save

func (h *HNSW) Save(filename string) (err error)

func (*HNSW) Search

func (h *HNSW) Search(q *[]float32, topCandidates *queue.PriorityQueue, K int, efSearch int) (err error)

Find query point `q` and result `K` results (max-heap)

func (*HNSW) SearchConcurrent

func (h *HNSW) SearchConcurrent(size int, K int, efSearch int, numWorkers int) (resultChan chan SearchResults, jobs chan SearchQuery, err error)

Search concurrent

func (*HNSW) SearchLayer

func (h *HNSW) SearchLayer(q *[]float32, ep *queue.Item, topCandidates *queue.PriorityQueue, ef int, level uint) (err error)

Input: Query element `q`, enter point `ep`, `M` number of nearest to `q“ elements to return, layer number `layerNum` Output: `nearestElements` closest neighbours to `q`

func (*HNSW) SearchWorker

func (h *HNSW) SearchWorker(id int, K int, efSearch int, jobs <-chan SearchQuery, resultChan chan<- SearchResults) error

func (*HNSW) SelectNeighboursHeuristic

func (h *HNSW) SelectNeighboursHeuristic(topCandidates *queue.PriorityQueue, M int, order bool)

func (*HNSW) SelectNeighboursSimple

func (h *HNSW) SelectNeighboursSimple(topCandidates *queue.PriorityQueue, M int)

Input: Candidate elements `C`, number of neighbours to return `M` Output: `M` nearest elements in heap

func (*HNSW) Stats

func (h *HNSW) Stats()

type HNSW_Meta

type HNSW_Meta struct {
	Efconstruction int     // Size of the dynamic candidate list
	M              int     // Number of established connections (a reasonable range for M is from 5 to 48, smaller M generally produces better results for lower recalls and/or lower dimensional data. Bigger M is better for high recall and high dimensional data, and determines the memory consumption)
	Mmax           int     // Max number of connections per element/per layer (matches M)
	Mmax0          int     // Max for the 0 layer (Simulations suggest 2*M is a good choice, setting higher leads to performance degradation and excessive memory usage)
	Ml             float64 // Normalization factor for level generation
	Ep             int64   // Top layer of HNSW

	Maxlevel int // Track the current max level used

	Heuristic bool
}

type Node

type Node struct {
	Connections [][]uint32 // Links to other nodes
	Vectors     []float32
	Layer       int
	Id          uint32
}

type NodeList

type NodeList struct {
	Nodes []Node
	// contains filtered or unexported fields
}

type SearchQuery

type SearchQuery struct {
	Id int
	Qp []float32
}

type SearchResults

type SearchResults struct {
	Id             int
	BestCandidates queue.PriorityQueue
}

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL