goxmeans

package module
v0.0.0-...-a78e909 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 1, 2014 License: BSD-3-Clause Imports: 11 Imported by: 0

README

Overview

An implementation of the x-means algorithm in Go. See Dan Pelleg and Andrew Moore

  • X-Means: Extending K-means with Efficient Estimation of the Number of Clusters.

The goal of goxmeans is to provide a fast and efficient way to cluster unstructured data on commodity hardware. The use of concurrency speeds up the process of model construction and the use of the Bayesian Information Criterion gives a mathematically sound measure of quality. See the included doc/BIC_notes.pdf for an explanation of how we apply this formula.

At its core, goxmeans provides a faster implementation of the classic k-means algorithm, see http://en.wikipedia.org/wiki/Kmeans for an explanation.
kmeans is normally a sequential process of O(n^2) and is run multiple times.
Upon each invocation, the user specifies the number of centroids to use and a single model is returned.

We allow the user to specify the number of centroids, k, to start with and an upper bound, kmax, which when exceeded will cause the program to stop.

The algorithm consists of two operation repeated until completion.

  1. Improve parameters.

  2. Improve structure.

  3. If k > kmax then stop and return a slice of models with BIC scores, else go to 1.

An initial kmeans is run with k centroids and the model is assigned a BIC score. Each of the clusters is then bisected and a BIC score is assigned to this new model. Whichever has a better a BIC score, the parent with one centroid, or the child with two centroids, is kept.

Once k exceeds kmax, a slice of models is returned. Each model has k to kmax + 1 centroids and a BIC score which will tell you how well the clusters fit. See the file in the example sub-directory.

The major acceleration is accomplished via the use of the built in concurrency functions provided by Go. Instead of calculating the distance between each point and each centroid sequentially, we construct a job and send it down a request channel where there as many goroutines to perform the calculations as there are CPUs on the machine so the computations are performed concurrently and sent to a results channel.

Most of the time is currently spent on allocation and garbage collection, so we can improve performance by optimizing the gomatrix library to minimize allocations. Using kdtrees to cache statistical data based on hyperrectangles and their relationship to centroids will avoid redundant calculations. This library is planned for the next major release.

There may be some room for memoization or caching during centroid choice, but efficient caching in a concurrent environment may require a separate library.

Visualization is imperative. We are working on this and expect to release this functionality before the next major release.

We provide two example data sets in the examples sub-directory. On our hardware, floats with mantissas of length of six can take up to 10 times as long as the integer data set. This is a function of the way the CPUs process floating point numbers.

N.B. This has only been tested on Linux.

Installation

Install Go. See golang.org/doc/install

Once you are sure that Go is correctly installed and your GOPATH is correctly set, execute:

go get github.com/bobhancock/goxmeans

Test

cd to $GOPATH/github.com/bobhancock/goxmeans/examples and execute:

go run ./xmeans_ex.go 2 3

to test your installation. Depending on the speed of your machine, this could take up to three minutes. Your output will be something like:

Load complete <-- Finished loading data from file 0: #centroids=2 BIC=-10935854.304219 <-- Model 0 with 2 centroids 1: #centroids=3 BIC=-10918521.272880 <-- Model 1 with 3 centroids

Best fit:[ 1: #centroids=3 BIC=-10918521.272880] <-- Best model cluster-0: numpoints=77908 variance=259369997592716064.000000 <-- Individual clusters cluster-1: numpoints=94254 variance=381034319072047744.000000 <-- of model. cluster-2: numpoints=77838 variance=258988118576487776.000000

We write to stdout for demonstration purposes. In the real world, you would call this from a Go program that would take the contents of the individual clusters of the best model and continue processing. For example, you may want to display the points of each cluster on a scatter plot where each cluster is a different color.

Contributors

Bob Hancock

Dan Frank

Anthony Foglia, Ph.D

Ralph Yozzo

Versions

0.1 Initial release.

0.1a Performance ehnancements, test fixes

Documentation

Overview

Package goxmeans implements a library for the xmeans algorithm.

See Dan Pelleg and Andrew Moore - X-means: Extending K-means with Efficient Estimation of the Number of Clusters.

D = the input set of points

R = |D| the number of points in a model.

M = number of dimensions assuming spherical Gaussians.

The algorithm consists of two operations repeated until completion.

1. Improve parameters

2. Improve structure

3. If K > Kmax then stop and return a slice of Models with BIC scores, else goto 1.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Load

func Load(fname, sep string) (*matrix.DenseMatrix, error)

Load loads a tab delimited text file of floats into a matrix.

Types

type CentroidChooser

type CentroidChooser interface {
	ChooseCentroids(mat *matrix.DenseMatrix, k int) *matrix.DenseMatrix
}

CentroidChooser is the interface that wraps CentroidChooser function.

CetnroidChooser returns a matrix of K coordinates in M dimensions.

type CentroidPoint

type CentroidPoint struct {
	// contains filtered or unexported fields
}

CentroidPoint stores the row number in the centroids matrix and the distance squared between the centroid and the point.

type DataCentroids

type DataCentroids struct{}

DataCentroids picks k distinct points from the dataset as initial centroids.

func (DataCentroids) ChooseCentroids

func (c DataCentroids) ChooseCentroids(mat *matrix.DenseMatrix, k int) *matrix.DenseMatrix

DataCentroids picks k distinct points from the dataset. If k is > points in the matrix then k is set to the number of points.

type EllipseCentroids

type EllipseCentroids struct {
	Frac float64 // must be btw 0 and 1, this will be what fraction of a truly inscribing ellipse this is
}

EllipseCentroids lays out the initial centroids evenly along an elipse inscribed and centered within the boundaries of the dataset. It is only defined for M=2

  • Frac: This must be a float between 0 and 1. It determines the scale of the inscribing ellipse relative to the dataset, so Frac==1.0 produces an ellipse that spans the entire dataset, while Frac==0.5 produces an ellipse spanning half the dataset.

func (EllipseCentroids) ChooseCentroids

func (c EllipseCentroids) ChooseCentroids(mat *matrix.DenseMatrix, k int) *matrix.DenseMatrix

EllipseCentroids lays out the initial centroids evenly along an elipse inscribed and centered within the boundaries of the dataset. It is only defined for M=2

  • Frac: This must be a float between 0 and 1. It determines the scale of the inscribing ellipse relative to the dataset, so Frac==1.0 produces an ellipse that spans the entire dataset, while Frac==0.5 produces an ellipse spanning half the dataset.

type EuclidDist

type EuclidDist vectorDistance

func (EuclidDist) CalcDist

func (ed EuclidDist) CalcDist(p, q *matrix.DenseMatrix) float64

CalcDist finds the Euclidean distance between points.

type ManhattanDist

type ManhattanDist struct{}

func (ManhattanDist) CalcDist

func (md ManhattanDist) CalcDist(a, b *matrix.DenseMatrix) float64

CalcDist finds the ManhattanDistance which is the sum of the aboslute difference of the coordinates. Also known as rectilinear distance, city block distance, or taxicab distance.

type Model

type Model struct {
	Bic      float64
	Clusters []cluster
}

Model is a statistical model with a BIC score and a collection of clusters.

func Xmeans

func Xmeans(datapoints, centroids *matrix.DenseMatrix, k, kmax int, cc, bisectcc CentroidChooser, measurer VectorMeasurer) ([]Model, map[string]error)

Xmeans runs k-means for k lower bound to k upper bound on a data set. Once the k centroids have converged each cluster is bisected and the BIC of the orginal cluster (parent = a model with one centroid) to the the bisected model which consists of two centroids and whichever is greater is committed to the set of clusters for this larger model k.

func (Model) Numcentroids

func (m Model) Numcentroids() int

type PairPointCentroidJob

type PairPointCentroidJob struct {
	// contains filtered or unexported fields
}

PairPointCentroidJobs stores the elements that define the job that pairs a point with a centroid.

func (PairPointCentroidJob) PairPointCentroid

func (job PairPointCentroidJob) PairPointCentroid()

PairPointCentroid pairs a point with the closest centroids.

type PairPointCentroidResult

type PairPointCentroidResult struct {
	// contains filtered or unexported fields
}

PairPointCentroidResult stores the results of pairing a point with a centroid.

type VectorMeasurer

type VectorMeasurer interface {
	CalcDist(a, b *matrix.DenseMatrix) (dist float64)
}

Measurer finds the distance between the points in the columns

Directories

Path Synopsis
An example of how to invoke goxmeans.
An example of how to invoke goxmeans.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL