package cluster
Published: Jul 15, 2022 License: MIT Imports: 9 Imported by: 1

README

Clustering Algorithms (Supervised and Unsupervised)

import "github.com/cdipaolo/goml/cluster"

GoDoc

This part of the goml package implements clustering algorithms, both unsupervised and supervised, giving the user a choice of models to work with.

implemented models

  • k-means clustering
    • Uses k-means++ instantiation for more reliable clustering (this paper outlines the method)
    • Both online and batch versions of the algorithm
    • Online version implements the algorithm discussed in this paper
  • triangle inequality accelerated k-means clustering
    • Implements the algorithm described in this paper by Charles Elkan of the University of California, San Diego to use upper and lower bounds on distances to clusters across iterations to dramatically reduce the number of (potentially really expensive) distance calculations made by the algorithm.
    • Uses k-means++ instantiation for more reliable clustering (this paper outlines the method)
  • n-nearest-neighbors clustering
    • Can use any distance metric, with L-p Norm, Euclidean Distance, and Manhattan Distance pre-defined within the goml/base package (see the sketch just below this list)
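
A quick sketch of plugging in a distance metric: any function mapping two []float64 vectors to a float64 satisfies base.DistanceMeasure, so a hand-rolled metric works just as well as the pre-defined ones. This is a hedged illustration, not library code; trainingSet and labels are hypothetical placeholders, and math.Abs needs the standard math import.

// manhattan is a hand-rolled Manhattan distance; it satisfies
// base.DistanceMeasure, which is just a function mapping two
// float64 vectors to a float64
manhattan := func(u, v []float64) float64 {
	var sum float64
	for i := range u {
		sum += math.Abs(u[i] - v[i])
	}
	return sum
}

// trainingSet and labels are hypothetical placeholders
model := NewKNN(3, trainingSet, labels, manhattan)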

example k-means model usage

This code produces four clusters (as expected), which result in the following plot (made with ggplot2).

[Plot: training data colored by assigned cluster, "Clustered By K"]

gaussian := [][]float64{}
for i := 0; i < 40; i++ {
	x := rand.NormFloat64() + 4
	y := rand.NormFloat64()*0.25 + 5
	gaussian = append(gaussian, []float64{x, y})
}
for i := 0; i < 66; i++ {
	x := rand.NormFloat64()
	y := rand.NormFloat64() + 10
	gaussian = append(gaussian, []float64{x, y})
}
for i := 0; i < 100; i++ {
	x := rand.NormFloat64()*3 - 10
	y := rand.NormFloat64()*0.25 - 7
	gaussian = append(gaussian, []float64{x, y})
}
for i := 0; i < 23; i++ {
	x := rand.NormFloat64() * 2
	y := rand.NormFloat64() - 1.25
	gaussian = append(gaussian, []float64{x, y})
}

model := NewKMeans(4, 15, gaussian)

if model.Learn() != nil {
	panic("Oh NO!!! There was an error learning!!")
}

// now you can predict like normal!
guess, err := model.Predict([]float64{-3, 6})
if err != nil {
	panic("prediction error")
}

// or if you want to get the clustering
// results from the data
results := model.Guesses()

// you can also concat that with the
// training set and save it to a file
// (if you wanted to plot it or something)
err = model.SaveClusteredData("/tmp/.goml/KMeansResults.csv")
if err != nil {
	panic("file save error")
}

// you can also persist the model to a
// file
err = model.PersistToFile("/tmp/.goml/KMeans.json")
if err != nil {
	panic("file save error")
}

// and also restore from file (at a
// later time if you want)
err = model.RestoreFromFile("/tmp/.goml/KMeans.json")
if err != nil {
	panic("file save error")
}

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type KMeans

type KMeans struct {
	Centroids [][]float64 `json:"centroids"`

	// Output is the io.Writer to write
	// logging to. Defaults to os.Stdout
	// but can be changed to any io.Writer
	Output io.Writer
	// contains filtered or unexported fields
}

KMeans implements the k-means unsupervised clustering algorithm. The batch version of the model uses k-means++ as the instantiation of the model. The online version doesn't, because that wouldn't make sense!

https://en.wikipedia.org/wiki/K-means_clustering

Example KMeans Model Usage:

// initialize data with 2 clusters
double := [][]float64{}
for i := -10.0; i < -3; i += 0.1 {
	for j := -10.0; j < 10; j += 0.1 {
		double = append(double, []float64{i, j})
	}
}

for i := 3.0; i < 10; i += 0.1 {
	for j := -10.0; j < 10; j += 0.1 {
		double = append(double, []float64{i, j})
	}
}

model := NewKMeans(2, 30, double)

if model.Learn() != nil {
	panic("Oh NO!!! There was an error learning!!")
}

// now predict with the same training set and
// make sure the classes are the same within
// each block
c1, err := model.Predict([]float64{-7.5, 0})
if err != nil {
	panic("prediction error")
}

c2, err := model.Predict([]float64{7.5, 0})
if err != nil {
	panic("prediction error")
}

// now you can predict like normal!
guess, err := model.Predict([]float64{-3, 6})
if err != nil {
	panic("prediction error")
}

// or if you want to get the clustering
// results from the data
results := model.Guesses()

// you can also concat that with the
// training set and save it to a file
// (if you wanted to plot it or something)
err = model.SaveClusteredData("/tmp/.goml/KMeansResults.csv")
if err != nil {
	panic("file save error")
}

// you can also persist the model to a
// file
err = model.PersistToFile("/tmp/.goml/KMeans.json")
if err != nil {
	panic("file save error")
}

// and also restore from file (at a
// later time if you want)
err = model.RestoreFromFile("/tmp/.goml/KMeans.json")
if err != nil {
	panic("file save error")
}

func NewKMeans

func NewKMeans(k, maxIterations int, trainingSet [][]float64, params ...OnlineParams) *KMeans

NewKMeans returns a pointer to the k-means model, which clusters given inputs in an unsupervised manner. The algorithm only has one optimization method (unless learning with the online variant, which is more of a generalization than the same algorithm), so you aren't allowed to pass one in as an option.

params is an optional OnlineParams argument which (if given) sets the learning rate α and the length of the input vector for the online variant.

func (*KMeans) Distortion

func (k *KMeans) Distortion() float64

Distortion returns the distortion of the clustering currently given by the k-means model. This is the function the learning algorithm tries to minimize.

Distortion() = Σ |x[i] - μ[c[i]]|^2 over all training examples
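
For intuition, here is a hedged sketch that recomputes this quantity by hand through the exported API (Centroids and Guesses()); the trainingSet argument is a hypothetical placeholder, since the model's own copy is unexported:

// distortion recomputes Σ |x[i] - μ[c[i]]|^2 after Learn() has run
func distortion(model *KMeans, trainingSet [][]float64) float64 {
	var sum float64
	guesses := model.Guesses()
	for i, x := range trainingSet {
		mu := model.Centroids[guesses[i]]
		for j := range x {
			d := x[j] - mu[j]
			sum += d * d
		}
	}
	return sum
}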

func (*KMeans) Examples

func (k *KMeans) Examples() int

Examples returns the number of training examples (m) that the model currently is training from.

func (*KMeans) Guesses

func (k *KMeans) Guesses() []int

Guesses returns the hidden parameter for the unsupervised classification assigned during learning.

model.Guesses[i] = E[k.trainingSet[i]]
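
For example, to pair each training example with its assigned cluster after Learn() has run (the fmt import is assumed):

for i, class := range model.Guesses() {
	fmt.Printf("example %d assigned to cluster %d\n", i, class)
}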

func (*KMeans) Learn

func (k *KMeans) Learn() error

Learn takes the struct's dataset and runs the batch k-means algorithm over it, optimizing the centroids so you can predict cluster assignments from them.

This batch version of the model uses the k-means++ instantiation method to generate a consistently better model than regular, randomized instantiation of centroids. Paper: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
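
For reference, here is a hedged sketch of the k-means++ seeding idea from that paper (not the library's internal code): the first centroid is drawn uniformly at random, and each later centroid is drawn with probability proportional to the squared distance to its nearest already-chosen centroid. The math and math/rand imports are assumed.

// kMeansPlusPlusSeed picks k initial centroids from data
func kMeansPlusPlusSeed(data [][]float64, k int) [][]float64 {
	centroids := [][]float64{data[rand.Intn(len(data))]}
	for len(centroids) < k {
		// d2[i] holds the squared distance from data[i] to its
		// nearest centroid chosen so far
		d2 := make([]float64, len(data))
		var total float64
		for i, x := range data {
			best := math.Inf(1)
			for _, c := range centroids {
				var d float64
				for j := range x {
					diff := x[j] - c[j]
					d += diff * diff
				}
				if d < best {
					best = d
				}
			}
			d2[i] = best
			total += best
		}
		// sample an index with probability proportional to d2[i]
		r := rand.Float64() * total
		next := len(data) - 1 // fallback for floating-point rounding
		for i, w := range d2 {
			r -= w
			if r <= 0 {
				next = i
				break
			}
		}
		centroids = append(centroids, data[next])
	}
	return centroids
}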

func (*KMeans) LearningRate

func (k *KMeans) LearningRate() float64

LearningRate returns the learning rate α for gradient descent to optimize the model. Could vary as a function of something else later, potentially.

func (*KMeans) MaxIterations

func (k *KMeans) MaxIterations() int

MaxIterations returns the maximum number of iterations the model will go through during learning, in the worst case

func (*KMeans) OnlineLearn

func (k *KMeans) OnlineLearn(errors chan error, dataset chan base.Datapoint, onUpdate func([][]float64), normalize ...bool)

OnlineLearn implements a variant of the K-Means learning algorithm to work with streams of data. The basis of the model is discussed within this paper (http://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/projects/MIT15_097S12_proj1.pdf) by an MIT student, along with some theoretical assurances of the quality of learning.

The onUpdate callback will be called in a separate goroutine whenever the model updates a centroid of the cluster. The callback will pass two items within the array: an array containing the class number (only) of the cluster updated, and the new centroid vector for that class

Ex: [[2.0], [1.23, 4.271, 6.013, 7.20312]]
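
A hedged sketch of an onUpdate callback that unpacks this format (the fmt import is assumed):

onUpdate := func(theta [][]float64) {
	class := int(theta[0][0]) // class number of the updated cluster
	centroid := theta[1]      // new centroid vector for that class
	fmt.Printf("cluster %d moved to %v\n", class, centroid)
}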

The algorithm performs the following update (sketched in Go after the list):

  1. Get new point x
  2. Determine the closest cluster μ[i] to point x
  3. Update the cluster center: μ[i] := μ[i] + α(x - μ[i])
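
A minimal sketch of step 3, applied component-wise (mu is the closest centroid to x, alpha the learning rate):

// updateCentroid moves mu toward x: mu := mu + alpha*(x - mu)
func updateCentroid(mu, x []float64, alpha float64) {
	for j := range mu {
		mu[j] += alpha * (x[j] - mu[j])
	}
}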

NOTE that this is an unsupervised model! You DO NOT need to pass in the Y param of the datapoints!

Example Online K-Means Model:

// create the channels the learner will use: one streaming
// datapoints in, one reporting errors out (the buffer size
// here is an arbitrary choice)
stream := make(chan base.Datapoint, 100)
errors := make(chan error)

model := NewKMeans(4, 0, nil, OnlineParams{
    Alpha:    0.5,
    Features: 4,
})

go model.OnlineLearn(errors, stream, func(theta [][]float64) {})

go func() {
    // start passing data to our datastream
    //
    // we could have data already in our channel
    // when we instantiated the model, though
    for i := -40.0; i < -30; i += 4.99 {
        for j := -40.0; j < -30; j += 4.99 {
            for k := -40.0; k < -30; k += 4.99 {
                for l := -40.0; l < -30; l += 4.99 {
                    stream <- base.Datapoint{
                        X: []float64{i, j, k, l},
                    }
                }
            }
        }
    }
    for i := -40.0; i < -30; i += 4.99 {
        for j := 30.0; j < 40; j += 4.99 {
            for k := -40.0; k < -30; k += 4.99 {
                for l := 30.0; l < 40; l += 4.99 {
                    stream <- base.Datapoint{
                        X: []float64{i, j, k, l},
                    }
                }
            }
        }
    }
    for i := 30.0; i < 40; i += 4.99 {
        for j := -40.0; j < -30; j += 4.99 {
            for k := 30.0; k < 40; k += 4.99 {
                for l := -40.0; l < -30; l += 4.99 {
                    stream <- base.Datapoint{
                        X: []float64{i, j, k, l},
                    }
                }
            }
        }
    }
    for i := 30.0; i < 40; i += 4.99 {
        for j := -40.0; j < -30; j += 4.99 {
            for k := -40.0; k < -30; k += 4.99 {
                for l := 30.0; l < 40; l += 4.99 {
                    stream <- base.Datapoint{
                        X: []float64{i, j, k, l},
                    }
                }
            }
        }
    }

    // close the dataset
    close(stream)
}()

// this will block until the error
// channel is closed in the learning
// function (it will, don't worry!)
for {
    err, more := <-errors
    if err != nil {
        panic("THERE WAS AN ERROR!!! RUN!!!!")
    }
    if !more {
        break
    }
}

// Below here all the learning is completed

// predict like usual
guess, err := model.Predict([]float64{42, 6, 10, -32})
if err != nil {
    panic("AAAARGGGH! SHIVER ME TIMBERS! THESE ROTTEN SCOUNDRELS FOUND AN ERROR!!!")
}

func (*KMeans) PersistToFile

func (k *KMeans) PersistToFile(path string) error

PersistToFile takes in an absolute filepath and saves the centroid vector to the file, which can be restored later. The function will also accept relative paths from the current directory.

The data is stored as JSON because it's one of the most efficient storage methods (you only need one extra comma per feature, plus two brackets, total!) and it's extensible.

func (*KMeans) Predict

func (k *KMeans) Predict(x []float64, normalize ...bool) ([]float64, error)

Predict takes in a variable x (an array of floats) and finds the value of the hypothesis function given the current parameter vector θ.

If normalize is given as true, then the input will first be normalized to unit length. Only use this if you trained off of normalized inputs and are feeding an un-normalized input.
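
For example, to classify a point that needs scaling first (the input values are hypothetical):

// scale the incoming point to unit length before predicting
guess, err := model.Predict([]float64{3, 4}, true)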

func (*KMeans) RestoreFromFile

func (k *KMeans) RestoreFromFile(path string) error

RestoreFromFile takes in a path to a saved centroid vector and restores it into the model it's called on.

The path must be an absolute path or a path from the current directory.

This would be useful in persisting data between running a model on data, or for graphing a dataset with a fit in another framework like Julia/Gadfly.

func (*KMeans) SaveClusteredData

func (k *KMeans) SaveClusteredData(filepath string) error

SaveClusteredData operates on a k-means model, concatenating the given dataset with the assigned class from clustering and saving it to file.

Basically just a wrapper for the base.SaveDataToCSV with the K-Means data.

func (*KMeans) String

func (k *KMeans) String() string

String implements the fmt.Stringer interface for clean printing. Here we're using it to print the model as the equation h(θ)=... where h is the k-means hypothesis model.

func (*KMeans) UpdateLearningRate

func (k *KMeans) UpdateLearningRate(a float64)

UpdateLearningRate sets the learning rate of the model to the given float64.

func (*KMeans) UpdateTrainingSet

func (k *KMeans) UpdateTrainingSet(trainingSet [][]float64) error

UpdateTrainingSet takes in a new training set (variable x).

Will reset the hidden 'guesses' param of the KMeans model.

type KNN

type KNN struct {
	// Distance holds the distance
	// measure for the KNN algorithm,
	// which is just a function that
	// maps 2 float64 vectors to a
	// float64
	Distance base.DistanceMeasure

	// K is the number of nearest
	// neighbors to classify based
	// on in the KNN prediction
	// algorithm
	K int
	// contains filtered or unexported fields
}

KNN implements the KNN algorithm for classification, where an input is classified by finding the K nearest (by some distance metric) data points, and taking a vote based on those.

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

Example K-Nearest-Neighbors Model Usage:

// initialize data!
twoClusters := [][]float64{}
twoClustersY := []float64{}
for i := -10.0; i < -3; i += 0.1 {
	for j := -10.0; j < 10; j += 0.1 {
		twoClusters = append(twoClusters, []float64{i, j})
		twoClustersY = append(twoClustersY, 0.0)
	}
}

for i := 3.0; i < 10; i += 0.1 {
	for j := -10.0; j < 10; j += 0.1 {
		twoClusters = append(twoClusters, []float64{i, j})
		twoClustersY = append(twoClustersY, 1.0)
	}
}

// create the model using 3 nearest neighbors
// for prediction, using the Euclidean Distance
// as the distance metric.
model := NewKNN(3, twoClusters, twoClustersY, base.EuclideanDistance)

// make predictions like usual
guess, err := model.Predict([]float64{-10,1})
if err != nil {
	panic("THERE WAS AN ERROR")
}

// update the K used (use 10 neighbors now)
model.K = 10

func NewKNN

func NewKNN(k int, trainingSet [][]float64, expectedResults []float64, distanceMeasure base.DistanceMeasure) *KNN

NewKNN returns a pointer to the k-nearest-neighbors model, which classifies given inputs by a majority vote of the k nearest training examples under the given distance measure. Prediction works directly from the stored training set, so there is no separate learning step or optimization method to configure.

func (*KNN) Examples

func (k *KNN) Examples() int

Examples returns the number of training examples (m) that the model currently is holding

func (*KNN) Predict

func (k *KNN) Predict(x []float64, normalize ...bool) ([]float64, error)

Predict takes in a variable x (an array of floats) and finds the value of the hypothesis function given the current parameter vector θ.

If normalize is given as true, then the input will first be normalized to unit length. Only use this if you trained off of normalized inputs and are feeding an un-normalized input.

func (*KNN) UpdateTrainingSet

func (k *KNN) UpdateTrainingSet(trainingSet [][]float64, expectedResults []float64) error

UpdateTrainingSet takes in a new training set (variable x) along with its expected results.

type OnlineParams

type OnlineParams struct {
	Alpha    float64
	Features int
}

OnlineParams is used to pass optional parameters in to creating a new K-Means model if you want to learn using the online version of the model
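
For example, constructing a model for online learning (mirroring the OnlineLearn example above, where maxIterations is 0 and the training set nil because data arrives via the stream):

model := NewKMeans(4, 0, nil, OnlineParams{
	Alpha:    0.5,
	Features: 4,
})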

type TriangleKMeans

type TriangleKMeans struct {
	Centroids [][]float64 `json:"centroids"`

	// Output is the io.Writer to write logs
	// and output from training to
	Output io.Writer
	// contains filtered or unexported fields
}

TriangleKMeans implements the k-means unsupervised clustering algorithm, sped up using the Triangle Inequality to reduce the number of redundant distance calculations between datapoints and clusters. The revised algorithm does use O(n) auxiliary space (more like 3n + k^2), and computes a lot on these extra data structures, but

	"Our algorithm reduces the number of distance
	 calculations so dramatically that its overhead
	 time is often greater than the time spent on
	 distance calculations. However, the total
	 execution time is always much less than the
	 time required by standard k-means"
    (Elkan 2003, University of California, San Diego)

Note that this algorithm also uses k-means++ instantiation for more reliable clustering. The Triangle Inequality optimizations operate on a different sector of the algorithm.

http://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf https://en.wikipedia.org/wiki/K-means_clustering
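
The key pruning rule from Elkan's paper, as a hedged sketch: by the triangle inequality, if the distance between x's current closest center c and another center c' satisfies d(c, c') >= 2*d(x, c), then d(x, c') >= d(x, c), so x cannot be closer to c' and that distance is never computed.

// canSkip reports whether the distance from x to a candidate
// center can be skipped, given x's distance to its current center
// (dXC) and the distance between the two centers (dBetweenCenters)
func canSkip(dXC, dBetweenCenters float64) bool {
	return dBetweenCenters >= 2*dXC
}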

Example TriangleKMeans Model Usage:

// initialize data with 2 clusters
double := [][]float64{}
for i := -10.0; i < -3; i += 0.1 {
	for j := -10.0; j < 10; j += 0.1 {
		double = append(double, []float64{i, j})
	}
}

for i := 3.0; i < 10; i += 0.1 {
	for j := -10.0; j < 10; j += 0.1 {
		double = append(double, []float64{i, j})
	}
}

// note that this is the only
// different line from regular k-means
model := NewTriangleKMeans(2, 30, double)

if model.Learn() != nil {
	panic("Oh NO!!! There was an error learning!!")
}

// now predict with the same training set and
// make sure the classes are the same within
// each block
c1, err := model.Predict([]float64{-7.5, 0})
if err != nil {
	panic("prediction error")
}

c2, err := model.Predict([]float64{7.5, 0})
if err != nil {
	panic("prediction error")
}

// now you can predict like normal!
guess, err := model.Predict([]float64{-3, 6})
if err != nil {
	panic("prediction error")
}

// or if you just want to get the clustering
// results from the data
results := model.Guesses()

// you can also concat that with the
// training set and save it to a file
// (if you wanted to plot it or something)
err = model.SaveClusteredData("/tmp/.goml/KMeansResults.csv")
if err != nil {
	panic("file save error")
}

// you can also persist the model to a
// file
err = model.PersistToFile("/tmp/.goml/KMeans.json")
if err != nil {
	panic("file save error")
}

// and also restore from file (at a
// later time if you want)
err = model.RestoreFromFile("/tmp/.goml/KMeans.json")
if err != nil {
	panic("file save error")
}

func NewTriangleKMeans

func NewTriangleKMeans(k, maxIterations int, trainingSet [][]float64) *TriangleKMeans

NewTriangleKMeans returns a pointer to the k-means model, which clusters given inputs in an unsupervised manner. The differences between this algorithm and the standard k-means++ algorithm implemented in NewKMeans are described in the struct comments and in the paper URL found within those.

func (*TriangleKMeans) Distortion

func (k *TriangleKMeans) Distortion() float64

Distortion returns the distortion of the clustering currently given by the k-means model. This is the function the learning algorithm tries to minimize.

Distortion() = Σ |x[i] - μ[c[i]]|^2 over all training examples

func (*TriangleKMeans) Examples

func (k *TriangleKMeans) Examples() int

Examples returns the number of training examples (m) that the model currently is training from.

func (*TriangleKMeans) Guesses

func (k *TriangleKMeans) Guesses() []int

Guesses returns the hidden parameter for the unsupervised classification assigned during learning.

model.Guesses[i] = E[k.trainingSet[i]]

func (*TriangleKMeans) Learn

func (k *TriangleKMeans) Learn() error

Learn takes the struct's dataset and runs the batch k-means algorithm over it, optimizing the centroids so you can predict cluster assignments from them.

This batch version of the model uses the k-means++ instantiation method to generate a consistently better model than regular, randomized instantiation of centroids. Paper: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

As stated many times for TriangleKMeans (check out the struct comments), this model uses the Triangle Inequality to significantly decrease the number of required distance calculations. The original paper is here:

http://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf

func (*TriangleKMeans) MaxIterations

func (k *TriangleKMeans) MaxIterations() int

MaxIterations returns the maximum number of iterations the model will go through, in the worst case

func (*TriangleKMeans) PersistToFile

func (k *TriangleKMeans) PersistToFile(path string) error

PersistToFile takes in an absolute filepath and saves the centroid vector to the file, which can be restored later. The function will also accept relative paths from the current directory.

The data is stored as JSON because it's one of the most efficient storage methods (you only need one extra comma per feature, plus two brackets, total!) and it's extensible.

func (*TriangleKMeans) Predict

func (k *TriangleKMeans) Predict(x []float64, normalize ...bool) ([]float64, error)

Predict takes in a variable x (an array of floats) and finds the value of the hypothesis function given the current parameter vector θ.

If normalize is given as true, then the input will first be normalized to unit length. Only use this if you trained off of normalized inputs and are feeding an un-normalized input.

func (*TriangleKMeans) RestoreFromFile

func (k *TriangleKMeans) RestoreFromFile(path string) error

RestoreFromFile takes in a path to a saved centroid vector and restores it into the model it's called on.

The path must be an absolute path or a path from the current directory.

This would be useful in persisting data between running a model on data, or for graphing a dataset with a fit in another framework like Julia/Gadfly.

func (*TriangleKMeans) SaveClusteredData

func (k *TriangleKMeans) SaveClusteredData(filepath string) error

SaveClusteredData operates on a k-means model, concatenating the given dataset with the assigned class from clustering and saving it to file.

Basically just a wrapper for the base.SaveDataToCSV with the K-Means data.

func (*TriangleKMeans) String

func (k *TriangleKMeans) String() string

String implements the fmt.Stringer interface for clean printing. Here we're using it to print the model as the equation h(θ)=... where h is the k-means hypothesis model.

func (*TriangleKMeans) UpdateTrainingSet

func (k *TriangleKMeans) UpdateTrainingSet(trainingSet [][]float64) error

UpdateTrainingSet takes in a new training set (variable x).

Will reset the hidden 'guesses' param of the KMeans model.
