goxmeans

package module

v0.0.0-...-a78e909 Latest Latest Go to latest Published: Aug 1, 2014 License: BSD-3-Clause Imports: 11 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/bobhancock/goxmeans

Links

Open Source Insights

README ¶

Overview

An implementation of the x-means algorithm in Go. See Dan Pelleg and Andrew Moore

X-Means: Extending K-means with Efficient Estimation of the Number of Clusters.

The goal of goxmeans is to provide a fast and efficient way to cluster unstructured data on commodity hardware. The use of concurrency speeds up the process of model construction and the use of the Bayesian Information Criterion gives a mathematically sound measure of quality. See the included doc/BIC_notes.pdf for an explanation of how we apply this formula.

At its core, goxmeans provides a faster implementation of the classic k-means algorithm, see http://en.wikipedia.org/wiki/Kmeans for an explanation.
kmeans is normally a sequential process of O(n^2) and is run multiple times.
Upon each invocation, the user specifies the number of centroids to use and a single model is returned.

We allow the user to specify the number of centroids, k, to start with and an upper bound, kmax, which when exceeded will cause the program to stop.

The algorithm consists of two operation repeated until completion.

Improve parameters.
Improve structure.
If k > kmax then stop and return a slice of models with BIC scores, else go to 1.

An initial kmeans is run with k centroids and the model is assigned a BIC score. Each of the clusters is then bisected and a BIC score is assigned to this new model. Whichever has a better a BIC score, the parent with one centroid, or the child with two centroids, is kept.

Once k exceeds kmax, a slice of models is returned. Each model has k to kmax + 1 centroids and a BIC score which will tell you how well the clusters fit. See the file in the example sub-directory.

The major acceleration is accomplished via the use of the built in concurrency functions provided by Go. Instead of calculating the distance between each point and each centroid sequentially, we construct a job and send it down a request channel where there as many goroutines to perform the calculations as there are CPUs on the machine so the computations are performed concurrently and sent to a results channel.

Most of the time is currently spent on allocation and garbage collection, so we can improve performance by optimizing the gomatrix library to minimize allocations. Using kdtrees to cache statistical data based on hyperrectangles and their relationship to centroids will avoid redundant calculations. This library is planned for the next major release.

There may be some room for memoization or caching during centroid choice, but efficient caching in a concurrent environment may require a separate library.

Visualization is imperative. We are working on this and expect to release this functionality before the next major release.

We provide two example data sets in the examples sub-directory. On our hardware, floats with mantissas of length of six can take up to 10 times as long as the integer data set. This is a function of the way the CPUs process floating point numbers.

N.B. This has only been tested on Linux.

Installation

Install Go. See golang.org/doc/install

Once you are sure that Go is correctly installed and your GOPATH is correctly set, execute:

go get github.com/bobhancock/goxmeans

Test

cd to $GOPATH/github.com/bobhancock/goxmeans/examples and execute:

go run ./xmeans_ex.go 2 3

to test your installation. Depending on the speed of your machine, this could take up to three minutes. Your output will be something like:

Load complete <-- Finished loading data from file 0: #centroids=2 BIC=-10935854.304219 <-- Model 0 with 2 centroids 1: #centroids=3 BIC=-10918521.272880 <-- Model 1 with 3 centroids

Best fit:[ 1: #centroids=3 BIC=-10918521.272880] <-- Best model cluster-0: numpoints=77908 variance=259369997592716064.000000 <-- Individual clusters cluster-1: numpoints=94254 variance=381034319072047744.000000 <-- of model. cluster-2: numpoints=77838 variance=258988118576487776.000000

We write to stdout for demonstration purposes. In the real world, you would call this from a Go program that would take the contents of the individual clusters of the best model and continue processing. For example, you may want to display the points of each cluster on a scatter plot where each cluster is a different color.

Contributors

Bob Hancock

Dan Frank

Anthony Foglia, Ph.D

Ralph Yozzo

Versions

0.1 Initial release.

0.1a Performance ehnancements, test fixes

Documentation ¶

Overview ¶

Package goxmeans implements a library for the xmeans algorithm.

See Dan Pelleg and Andrew Moore - X-means: Extending K-means with Efficient Estimation of the Number of Clusters.

D = the input set of points

R = |D| the number of points in a model.

M = number of dimensions assuming spherical Gaussians.

The algorithm consists of two operations repeated until completion.

1. Improve parameters

2. Improve structure

3. If K > Kmax then stop and return a slice of Models with BIC scores, else goto 1.

Index ¶

func Load(fname, sep string) (*matrix.DenseMatrix, error)
type CentroidChooser
type CentroidPoint
type DataCentroids
- func (c DataCentroids) ChooseCentroids(mat *matrix.DenseMatrix, k int) *matrix.DenseMatrix
type EllipseCentroids
- func (c EllipseCentroids) ChooseCentroids(mat *matrix.DenseMatrix, k int) *matrix.DenseMatrix
type EuclidDist
- func (ed EuclidDist) CalcDist(p, q *matrix.DenseMatrix) float64
type ManhattanDist
- func (md ManhattanDist) CalcDist(a, b *matrix.DenseMatrix) float64
type Model
- func Xmeans(datapoints, centroids *matrix.DenseMatrix, k, kmax int, ...) ([]Model, map[string]error)
- func (m Model) Numcentroids() int
type PairPointCentroidJob
- func (job PairPointCentroidJob) PairPointCentroid()
type PairPointCentroidResult
type VectorMeasurer

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Load ¶

func Load(fname, sep string) (*matrix.DenseMatrix, error)

Load loads a tab delimited text file of floats into a matrix.

Types ¶

type CentroidChooser ¶

type CentroidChooser interface {
	ChooseCentroids(mat *matrix.DenseMatrix, k int) *matrix.DenseMatrix
}

CentroidChooser is the interface that wraps CentroidChooser function.

CetnroidChooser returns a matrix of K coordinates in M dimensions.

type CentroidPoint ¶

type CentroidPoint struct {
	// contains filtered or unexported fields
}

CentroidPoint stores the row number in the centroids matrix and the distance squared between the centroid and the point.

type DataCentroids ¶

type DataCentroids struct{}

DataCentroids picks k distinct points from the dataset as initial centroids.

func (DataCentroids) ChooseCentroids ¶

func (c DataCentroids) ChooseCentroids(mat *matrix.DenseMatrix, k int) *matrix.DenseMatrix

DataCentroids picks k distinct points from the dataset. If k is > points in the matrix then k is set to the number of points.

type EllipseCentroids ¶

type EllipseCentroids struct {
	Frac float64 // must be btw 0 and 1, this will be what fraction of a truly inscribing ellipse this is
}

EllipseCentroids lays out the initial centroids evenly along an elipse inscribed and centered within the boundaries of the dataset. It is only defined for M=2

Frac: This must be a float between 0 and 1. It determines the scale of the inscribing ellipse relative to the dataset, so Frac==1.0 produces an ellipse that spans the entire dataset, while Frac==0.5 produces an ellipse spanning half the dataset.

func (EllipseCentroids) ChooseCentroids ¶

func (c EllipseCentroids) ChooseCentroids(mat *matrix.DenseMatrix, k int) *matrix.DenseMatrix

EllipseCentroids lays out the initial centroids evenly along an elipse inscribed and centered within the boundaries of the dataset. It is only defined for M=2

Frac: This must be a float between 0 and 1. It determines the scale of the inscribing ellipse relative to the dataset, so Frac==1.0 produces an ellipse that spans the entire dataset, while Frac==0.5 produces an ellipse spanning half the dataset.

type EuclidDist ¶

type EuclidDist vectorDistance

func (EuclidDist) CalcDist ¶

func (ed EuclidDist) CalcDist(p, q *matrix.DenseMatrix) float64

CalcDist finds the Euclidean distance between points.

type ManhattanDist ¶

type ManhattanDist struct{}

func (ManhattanDist) CalcDist ¶

func (md ManhattanDist) CalcDist(a, b *matrix.DenseMatrix) float64

CalcDist finds the ManhattanDistance which is the sum of the aboslute difference of the coordinates. Also known as rectilinear distance, city block distance, or taxicab distance.

type Model ¶

type Model struct {
	Bic      float64
	Clusters []cluster
}

Model is a statistical model with a BIC score and a collection of clusters.

func Xmeans ¶

func Xmeans(datapoints, centroids *matrix.DenseMatrix, k, kmax int, cc, bisectcc CentroidChooser, measurer VectorMeasurer) ([]Model, map[string]error)

Xmeans runs k-means for k lower bound to k upper bound on a data set. Once the k centroids have converged each cluster is bisected and the BIC of the orginal cluster (parent = a model with one centroid) to the the bisected model which consists of two centroids and whichever is greater is committed to the set of clusters for this larger model k.

func (Model) Numcentroids ¶

func (m Model) Numcentroids() int

type PairPointCentroidJob ¶

type PairPointCentroidJob struct {
	// contains filtered or unexported fields
}

PairPointCentroidJobs stores the elements that define the job that pairs a point with a centroid.

func (PairPointCentroidJob) PairPointCentroid ¶

func (job PairPointCentroidJob) PairPointCentroid()

PairPointCentroid pairs a point with the closest centroids.

type PairPointCentroidResult ¶

type PairPointCentroidResult struct {
	// contains filtered or unexported fields
}

PairPointCentroidResult stores the results of pairing a point with a centroid.

type VectorMeasurer ¶

type VectorMeasurer interface {
	CalcDist(a, b *matrix.DenseMatrix) (dist float64)
}

Measurer finds the distance between the points in the columns

Source Files ¶

View all Source files

km.go

Directories ¶

Path	Synopsis
examples An example of how to invoke goxmeans.	An example of how to invoke goxmeans.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL