Locality sensetive hashing
Proposal
One of the most important things in machine learning is a problem of fast search in high-dimensional vector spaces.
So the goal of this project is to build a simple approximate nearest neighbors (ANN for short) search service.
We have two basic groups of algorithms to perform the ANN search:
I've decided to go with the LSH algorithm since it's pretty simple to implement and you can store datapoints according to generated hashes in a simple key-value storage.
Local sensitive hashing reference
LSH algorithm implies generation of random plane equation coefs.
// TODO: how the algorithm works
Here are visual examples of the planes generation for angular and non-angular distance metrics:
// TODO: complexety analysis
Building and running
Installing hdf5:
mkdir -p /tmp/setup
apt-get install build-essential && \
wget -q ftp://ftp.unidata.ucar.edu/pub/netcdf/netcdf-4/hdf5-1.8.13.tar.gz && \
tar -xzf hdf5-1.8.13.tar.gz
cd /tmp/setup/hdf5-1.8.13
./configure --prefix=/usr/local && \
make && make install && \
rm -rf /tmp/setup/
// TODO: