hdbscan

Version: v0.0.0-...-25a3a22
Published: Aug 3, 2020 License: Apache-2.0 Imports: 10 Imported by: 0

README

HDBSCAN - Density Clustering Algorithm

An implementation of the HDBSCAN algorithm in Go.

Written to run concurrently on CPU (uses all CPU cores by default).

A rewrite of code started by the brilliant developer Edouard Belval at https://github.com/Belval/hdbscan, although it has changed quite a lot from the original.

Download

go get -u github.com/humilityai/hdbscan

Use

package main

import (
    "github.com/humilityai/hdbscan"
)

func main() {
    data := [][]float64{
        {1, 2, 3},
        {3, 2, 1},
    }
    minimumClusterSize := len(data)
    minimumSpanningTree := true

    // create the clustering; this does not start the clustering process
    clustering, err := hdbscan.NewClustering(data, minimumClusterSize)
    if err != nil {
        panic(err)
    }

    // options must be applied before calling Run
    clustering = clustering.Verbose().OutlierDetection()

    // run, and check the returned error
    if err := clustering.Run(hdbscan.EuclideanDistance, hdbscan.VarianceScore, minimumSpanningTree); err != nil {
        panic(err)
    }

    // If using sampling, then can use the Assign() method afterwards on the total dataset.
}

Options

  • Verbose() will log the progress of the clustering to stdout.
  • Voronoi() will add all points not placed in a cluster in the final clustering (that is, the unassigned points or outliers) to their nearest cluster.
  • OutlierDetection() will mark all unassigned data points as outliers of their nearest cluster and provide a NormalizedDistance value for each outlier that can be interpreted as the probability that the data point is an outlier of that cluster.
  • NearestNeighbor() specifies that an unassigned point's "nearness" to a cluster should be based on its nearest assigned neighboring data point in that cluster (the default "nearness" is based on the distance to the centroid of the cluster).
  • Subsample(n int) specifies to only use the first n data points in the clustering process. This speeds up the clustering. The remaining data points can be added to clusters using the Assign(data [][]float64) method after a successful clustering.
  • OutlierClustering() will create a new cluster for the outliers of an existing cluster if the number of outliers is equal to or greater than the specified minimum-cluster-size.
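
The two notions of "nearness" mentioned above (distance to a cluster's centroid versus distance to the nearest assigned point) can disagree. The following is a minimal standalone sketch of that difference, not the package's internal code: the helpers `euclidean`, `centroid`, `nearestByCentroid`, and `nearestByNeighbor` are hypothetical names introduced for illustration, and clusters are represented simply as slices of points.

```go
package main

import (
	"fmt"
	"math"
)

// euclidean mirrors the package's EuclideanDistance for this standalone sketch.
func euclidean(a, b []float64) float64 {
	acc := 0.0
	for i, v := range a {
		d := v - b[i]
		acc += d * d
	}
	return math.Sqrt(acc)
}

// centroid returns the component-wise mean of the points (hypothetical helper).
func centroid(points [][]float64) []float64 {
	c := make([]float64, len(points[0]))
	for _, p := range points {
		for i, v := range p {
			c[i] += v
		}
	}
	for i := range c {
		c[i] /= float64(len(points))
	}
	return c
}

// nearestByCentroid returns the index of the cluster whose centroid is closest to p.
func nearestByCentroid(p []float64, clusters [][][]float64) int {
	best, bestDist := -1, math.Inf(1)
	for i, pts := range clusters {
		if d := euclidean(p, centroid(pts)); d < bestDist {
			best, bestDist = i, d
		}
	}
	return best
}

// nearestByNeighbor returns the index of the cluster that contains the single
// assigned point closest to p.
func nearestByNeighbor(p []float64, clusters [][][]float64) int {
	best, bestDist := -1, math.Inf(1)
	for i, pts := range clusters {
		for _, q := range pts {
			if d := euclidean(p, q); d < bestDist {
				best, bestDist = i, d
			}
		}
	}
	return best
}

func main() {
	clusters := [][][]float64{
		{{0, 0}, {10, 0}}, // elongated cluster, centroid at (5, 0)
		{{7, 3}, {7, 4}},  // compact cluster, centroid at (7, 3.5)
	}
	p := []float64{9, 1} // unassigned point

	// The two rules pick different clusters for this point.
	fmt.Println(nearestByCentroid(p, clusters), nearestByNeighbor(p, clusters)) // prints "1 0"
}
```

For elongated or irregularly shaped clusters the nearest-neighbor rule tends to follow the cluster's shape, while the centroid rule is cheaper because it compares against one point per cluster.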

Documentation

Constants

This section is empty.

Variables

var (
	// VarianceScore will select an optimal clustering
	// that minimizes the generalized variance across each cluster.
	VarianceScore = "variance_score"
	// StabilityScore will select an optimal clustering that
	// maximizes the stability across all clusters.
	StabilityScore = "stability_score"
)
var (
	// ErrMCS ...
	ErrMCS = errors.New("minimum cluster size is too small")
	// ErrDataLen ...
	ErrDataLen = errors.New("length of data is less than minimum cluster size")
	// ErrRowLength ...
	ErrRowLength = errors.New("row is incorrect length")
)
var EuclideanDistance = func(v1, v2 []float64) float64 {
	acc := 0.0
	for i, v := range v1 {
		acc += math.Pow((v - v2[i]), 2)
	}
	return math.Pow(acc, 0.5)
}

EuclideanDistance ...

Functions

func GeneralizedVariance

func GeneralizedVariance(rows, columns int, data []float64) float64

GeneralizedVariance will return the determinant of the covariance matrix of the supplied data. The supplied data is a list of 'rows' observations of length 'columns'.

Types

type Clustering

type Clustering struct {
	Clusters clusters
	// contains filtered or unexported fields
}

Clustering struct which holds all final results.

func NewClustering

func NewClustering(data [][]float64, minimumClusterSize int) (*Clustering, error)

NewClustering creates (a pointer to) a new clustering struct. This function does not automatically start the clustering process. The `Run` method needs to be called to do that. Make sure to apply all options *before* calling `Run`.

func (*Clustering) Assign

func (c *Clustering) Assign(data [][]float64) (*Clustering, error)

Assign will assign a list of data points to the clusters of an existing clustering. If the original clustering had the OutlierDetection option enabled, then it will perform outlier detection based on the existing outliers. The results are returned as a new clustering object containing only the indexes from the supplied data. All clusters returned have the same ID as they had in the original clustering. This method can be useful if a sample was used for the initial clustering and the data points outside of the sample need to be assigned to a cluster as well.

func (*Clustering) NearestNeighbor

func (c *Clustering) NearestNeighbor() *Clustering

NearestNeighbor specifies that nearest-neighbor distances should be used for outlier detection and for voronoi clustering instead of centroid-based distances. NearestNeighbor will find the closest assigned data point to an unassigned data point and consider the unassigned data point to be of that same cluster (as an outlier and/or a point).

func (*Clustering) OutlierClustering

func (c *Clustering) OutlierClustering() *Clustering

OutlierClustering is an option to group the outliers of a cluster into a new cluster if there are at least a minimum-cluster-size number of them. This option will automatically perform outlier detection on the clustering as well.

func (*Clustering) OutlierDetection

func (c *Clustering) OutlierDetection() *Clustering

OutlierDetection will track all unassigned points as outliers of their nearest cluster. It provides a `NormalizedDistance` value for each outlier which can be interpreted as the probability of the point being an outlier (relative to all other outliers).

func (*Clustering) Run

func (c *Clustering) Run(distanceFunc DistanceFunc, score string, mst bool) error

Run will run the clustering.

func (*Clustering) Subsample

func (c *Clustering) Subsample(n int) *Clustering

Subsample will take the first 'n' data points and perform clustering on those. 'n' is a provided argument and should be between 0 and the total data size. Voronoi clustering will be performed after the clusters have been found for all points that are not in the subsample.
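
Because Subsample uses the first 'n' data points, the order of the input matters: if the data is sorted or grouped, the first 'n' rows may not be representative. One possible way to handle this, sketched below under that assumption, is to shuffle a copy of the data before clustering while keeping the permutation so results can be mapped back to the original indexes. The helper `shuffledCopy` is hypothetical and not part of the package.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffledCopy returns a shuffled copy of data together with the permutation
// used, such that out[i] == data[perm[i]]. The input slice is left untouched.
func shuffledCopy(data [][]float64, seed int64) ([][]float64, []int) {
	perm := rand.New(rand.NewSource(seed)).Perm(len(data))
	out := make([][]float64, len(data))
	for i, j := range perm {
		out[i] = data[j]
	}
	return out, perm
}

func main() {
	data := [][]float64{{0}, {1}, {2}, {3}, {4}}
	shuffled, perm := shuffledCopy(data, 42)

	// Row i of the shuffled data came from row perm[i] of the original data.
	for i, row := range shuffled {
		fmt.Printf("shuffled[%d] = %v (original index %d)\n", i, row, perm[i])
	}
}
```

The shuffled slice could then be passed to `NewClustering`, with `Subsample(n)` applied before `Run`.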

func (*Clustering) Verbose

func (c *Clustering) Verbose() *Clustering

Verbose will set verbosity to true for the clustering process, so that the internals of a clustering run are logged to stdout.

func (*Clustering) Voronoi

func (c *Clustering) Voronoi() *Clustering

Voronoi will set voronoi-clustering to true, and after density clustering is performed, all points not assigned to a cluster will be placed into their nearest cluster (by centroid distance).

type DistanceFunc

type DistanceFunc func(x1, x2 []float64) float64

DistanceFunc ...
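
Since `Run` accepts any `DistanceFunc`, a metric other than Euclidean distance can in principle be supplied. The following is a minimal sketch of a Manhattan (L1) distance with the same signature. The type is redeclared locally so the example compiles on its own, and `manhattanDistance` is a hypothetical name, not something the package provides.

```go
package main

import (
	"fmt"
	"math"
)

// DistanceFunc mirrors the package's type so this sketch compiles standalone.
type DistanceFunc func(x1, x2 []float64) float64

// manhattanDistance sums the absolute coordinate differences of two vectors.
// Like the package's EuclideanDistance, it assumes both vectors have the same
// length.
var manhattanDistance DistanceFunc = func(x1, x2 []float64) float64 {
	acc := 0.0
	for i, v := range x1 {
		acc += math.Abs(v - x2[i])
	}
	return acc
}

func main() {
	fmt.Println(manhattanDistance([]float64{1, 2, 3}, []float64{3, 2, 1})) // prints "4"
}
```

With the real package, a function of this shape would be passed as the first argument to `Run` in place of `hdbscan.EuclideanDistance`.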

type Outlier

type Outlier struct {
	Index              int
	NormalizedDistance float64
}

Outlier struct is used to provide information about an outlier data point.

type Outliers

type Outliers []Outlier

Outliers is a slice of outlier points for a given cluster.

func (Outliers) MinProb

func (o Outliers) MinProb() Outlier

MinProb ...
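
MinProb is not documented here. Its name suggests that it returns the outlier with the smallest NormalizedDistance, but that is an assumption. The sketch below shows that selection on locally redeclared types; `minByNormalizedDistance` is a hypothetical helper, not the package's method.

```go
package main

import "fmt"

// Outlier and Outliers mirror the package's types so this sketch compiles
// standalone.
type Outlier struct {
	Index              int
	NormalizedDistance float64
}

type Outliers []Outlier

// minByNormalizedDistance returns the outlier with the smallest
// NormalizedDistance. The boolean is false when the slice is empty.
func minByNormalizedDistance(o Outliers) (Outlier, bool) {
	if len(o) == 0 {
		return Outlier{}, false
	}
	best := o[0]
	for _, x := range o[1:] {
		if x.NormalizedDistance < best.NormalizedDistance {
			best = x
		}
	}
	return best, true
}

func main() {
	outliers := Outliers{
		{Index: 3, NormalizedDistance: 0.9},
		{Index: 7, NormalizedDistance: 0.2},
		{Index: 11, NormalizedDistance: 0.5},
	}
	best, ok := minByNormalizedDistance(outliers)
	fmt.Println(best.Index, ok) // prints "7 true"
}
```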
