hdbscan

Version: v0.0.0-...-25a3a22 | Published: Aug 3, 2020 | License: Apache-2.0 | Imports: 10 | Imported by: 0

Repository

github.com/humilityai/hdbscan

README ¶

HDBSCAN - Density Clustering Algorithm

An implementation of the HDBSCAN algorithm in Go.

Written to run concurrently on CPU (uses all CPU cores by default).

A re-write of code started by the brilliant developer Edouard Belval at https://github.com/Belval/hdbscan ... although it has changed quite a lot from the original.

Download

go get -u github.com/humilityai/hdbscan

Use

package main

import (
    "github.com/humilityai/hdbscan"
)

func main() {
    // each row is one data point
    data := [][]float64{
        {1, 2, 3},
        {3, 2, 1},
    }
    minimumClusterSize := len(data)
    minimumSpanningTree := true

    // create
    clustering, err := hdbscan.NewClustering(data, minimumClusterSize)
    if err != nil {
        panic(err)
    }

    // options (apply all options before calling Run)
    clustering = clustering.Verbose().OutlierDetection()

    // run
    err = clustering.Run(hdbscan.EuclideanDistance, hdbscan.VarianceScore, minimumSpanningTree)
    if err != nil {
        panic(err)
    }

    // If Subsample() was used, the Assign() method can be called afterwards on the total dataset.
}

Options

Verbose() will log the progress of the clustering to stdout.
Voronoi() will add every data point that is left unassigned (an outlier) in the final clustering to its nearest cluster.
OutlierDetection() will mark all unassigned data points as outliers of their nearest cluster and provide a NormalizedDistance value for each outlier that can be interpreted as the probability that the data point is an outlier of that cluster.
NearestNeighbor() specifies that an unassigned point's "nearness" to a cluster should be based on its nearest assigned neighboring data point in that cluster (by default, "nearness" is based on the distance to the centroid of the cluster).
Subsample(n int) restricts the clustering process to the first n data points, which speeds up the clustering. The remaining data points can be added to clusters using the Assign(data [][]float64) method after a successful clustering.
OutlierClustering() will create a new cluster for the outliers of an existing cluster if the number of outliers is equal to or greater than the specified minimum-cluster-size.
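
The difference between the default centroid-based "nearness" and the NearestNeighbor() variant can be illustrated with a small standalone sketch. It does not use this package and is not its implementation; the helper names (euclidean, centroid, nearestByCentroid, nearestByNeighbor) and the sample data are hypothetical and chosen only to show a case where the two rules disagree.

```go
package main

import (
	"fmt"
	"math"
)

// euclidean returns the straight-line distance between two equal-length vectors.
func euclidean(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// centroid returns the coordinate-wise mean of a non-empty set of points.
func centroid(points [][]float64) []float64 {
	c := make([]float64, len(points[0]))
	for _, p := range points {
		for i, v := range p {
			c[i] += v
		}
	}
	for i := range c {
		c[i] /= float64(len(points))
	}
	return c
}

// nearestByCentroid returns the index of the cluster whose centroid is closest to p.
func nearestByCentroid(p []float64, clusters [][][]float64) int {
	best, bestDist := -1, math.Inf(1)
	for i, c := range clusters {
		if d := euclidean(p, centroid(c)); d < bestDist {
			best, bestDist = i, d
		}
	}
	return best
}

// nearestByNeighbor returns the index of the cluster that contains the
// single assigned point closest to p.
func nearestByNeighbor(p []float64, clusters [][][]float64) int {
	best, bestDist := -1, math.Inf(1)
	for i, c := range clusters {
		for _, q := range c {
			if d := euclidean(p, q); d < bestDist {
				best, bestDist = i, d
			}
		}
	}
	return best
}

func main() {
	clusters := [][][]float64{
		{{0, 0}, {10, 0}},  // elongated cluster, centroid at (5, 0)
		{{13, 0}, {14, 0}}, // compact cluster, centroid at (13.5, 0)
	}
	p := []float64{11, 0} // unassigned point

	fmt.Println("by centroid:", nearestByCentroid(p, clusters))
	fmt.Println("by neighbor:", nearestByNeighbor(p, clusters))
}
```

In this sketch the point (11, 0) is closer to the centroid of the second cluster but closer to an individual member of the first, so the two rules pick different clusters. Elongated or irregularly shaped clusters are where the choice tends to matter.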

Documentation ¶

Index ¶

Variables
func GeneralizedVariance(rows, columns int, data []float64) float64
type Clustering
- func NewClustering(data [][]float64, minimumClusterSize int) (*Clustering, error)
type DistanceFunc
type Outlier
type Outliers
- func (o Outliers) MinProb() Outlier

Constants ¶

This section is empty.

Variables ¶

var (
	// VarianceScore will select an optimal clustering
	// that minimizes the generalized variance across each cluster.
	VarianceScore = "variance_score"
	// StabilityScore will select an optimal clustering that
	// maximizes the stability across all clusters.
	StabilityScore = "stability_score"
)

var (
	// ErrMCS ...
	ErrMCS = errors.New("minimum cluster size is too small")
	// ErrDataLen ...
	ErrDataLen = errors.New("length of data is less than minimum cluster size")
	// ErrRowLength ...
	ErrRowLength = errors.New("row is incorrect length")
)

var EuclideanDistance = func(v1, v2 []float64) float64 {
	acc := 0.0
	for i, v := range v1 {
		acc += math.Pow((v - v2[i]), 2)
	}
	return math.Pow(acc, 0.5)
}

EuclideanDistance ...

Functions ¶

func GeneralizedVariance ¶

func GeneralizedVariance(rows, columns int, data []float64) float64

GeneralizedVariance will return the determinant of the covariance matrix of the supplied data. The supplied data is a list of 'rows' observations of length 'columns'.
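
As a worked illustration of the quantity described above, the following standalone sketch computes the determinant of a covariance matrix for the two-column case. It is not the package's implementation. It assumes a row-major layout of the flattened data and a sample covariance with an n-1 denominator, neither of which is stated in this documentation; the names covariance2 and generalizedVariance2 are hypothetical.

```go
package main

import "fmt"

// covariance2 returns the 2x2 sample covariance matrix (n-1 denominator)
// of row-major data with exactly two columns. It needs at least two rows.
func covariance2(rows int, data []float64) [2][2]float64 {
	var mean [2]float64
	for r := 0; r < rows; r++ {
		mean[0] += data[2*r]
		mean[1] += data[2*r+1]
	}
	mean[0] /= float64(rows)
	mean[1] /= float64(rows)

	var cov [2][2]float64
	for r := 0; r < rows; r++ {
		dx := data[2*r] - mean[0]
		dy := data[2*r+1] - mean[1]
		cov[0][0] += dx * dx
		cov[0][1] += dx * dy
		cov[1][1] += dy * dy
	}
	n := float64(rows - 1)
	cov[0][0] /= n
	cov[0][1] /= n
	cov[1][1] /= n
	cov[1][0] = cov[0][1] // the matrix is symmetric
	return cov
}

// generalizedVariance2 returns the determinant of the 2x2 covariance matrix.
func generalizedVariance2(rows int, data []float64) float64 {
	c := covariance2(rows, data)
	return c[0][0]*c[1][1] - c[0][1]*c[1][0]
}

func main() {
	// Four observations of two variables, flattened row by row:
	// (0,0), (2,0), (0,2), (2,2).
	square := []float64{0, 0, 2, 0, 0, 2, 2, 2}
	fmt.Println(generalizedVariance2(4, square))

	// Three collinear observations: (0,0), (1,1), (2,2).
	line := []float64{0, 0, 1, 1, 2, 2}
	fmt.Println(generalizedVariance2(3, line))
}
```

Under these assumptions the four corner points give a determinant of 16/9 (about 1.78), while the collinear points give 0, because they span no area. A tighter cluster therefore produces a smaller value, which is the intuition behind a score that minimizes generalized variance.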

Types ¶

type Clustering ¶

type Clustering struct {
	Clusters clusters
	// contains filtered or unexported fields
}

Clustering struct which holds all final results.

func NewClustering ¶

func NewClustering(data [][]float64, minimumClusterSize int) (*Clustering, error)

NewClustering creates (a pointer to) a new clustering struct. This function does not automatically start the clustering process. The `Run` method needs to be called to do that. Make sure to apply all options *before* calling `Run`.

func (*Clustering) Assign ¶

func (c *Clustering) Assign(data [][]float64) (*Clustering, error)

Assign will assign a list of data points to the existing clusters. If the original clustering had the OutlierDetection option enabled, then it will perform outlier detection based on the existing outliers. The results are returned as a new clustering object that contains only the indexes from the supplied data. All clusters returned have the same ID as they had in the original clustering. This method can be useful if a sample was used for the initial clustering and the data points outside of the sample need to be assigned to a cluster as well.

func (*Clustering) NearestNeighbor ¶

func (c *Clustering) NearestNeighbor() *Clustering

NearestNeighbor specifies if nearest-neighbor distances should be used for outlier detection and for voronoi clustering instead of centroid-based distances. NearestNeighbor will find the closest assigned data point to an unassigned data point and consider the unassigned data point to be of that same cluster (as an outlier and/or a point).

func (*Clustering) OutlierClustering ¶

func (c *Clustering) OutlierClustering() *Clustering

OutlierClustering is an option to group the outliers of a cluster into a new cluster if there are at least a minimum-cluster-size number of them. This option will automatically perform outlier detection on the clustering as well.

func (*Clustering) OutlierDetection ¶

func (c *Clustering) OutlierDetection() *Clustering

OutlierDetection will track all unassigned points as outliers of their nearest cluster. It provides a `NormalizedDistance` value for each outlier which can be interpreted as the probability of the point being an outlier (relative to all other outliers).
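
One way to read a distance that is normalized "relative to all other outliers" is to scale each outlier's raw distance by the largest distance in the group. The sketch below shows that reading only. It is an assumption made for illustration, not the package's documented formula; the Outlier struct is redeclared locally so the example compiles on its own, and the normalize helper is hypothetical.

```go
package main

import "fmt"

// Outlier mirrors the struct documented in this package. It is redeclared
// here so the sketch does not need to import the package.
type Outlier struct {
	Index              int
	NormalizedDistance float64
}

// normalize scales each raw distance by the largest distance in the slice,
// so the farthest outlier gets 1 and the others get a value in [0, 1].
// This max-scaling is an assumed normalization, chosen for illustration.
func normalize(indexes []int, distances []float64) []Outlier {
	largest := 0.0
	for _, d := range distances {
		if d > largest {
			largest = d
		}
	}
	out := make([]Outlier, len(distances))
	for i, d := range distances {
		nd := 0.0
		if largest > 0 {
			nd = d / largest
		}
		out[i] = Outlier{Index: indexes[i], NormalizedDistance: nd}
	}
	return out
}

func main() {
	// Data point indexes and their raw distances to the nearest cluster.
	indexes := []int{7, 8, 9}
	distances := []float64{1, 2, 4}
	for _, o := range normalize(indexes, distances) {
		fmt.Println(o.Index, o.NormalizedDistance)
	}
}
```

With this scaling, a value near 1 marks the points that sit farthest from their cluster compared with the other outliers, and a value near 0 marks points that are almost inside it.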

func (*Clustering) Run ¶

func (c *Clustering) Run(distanceFunc DistanceFunc, score string, mst bool) error

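Run runs the clustering with the supplied distance function and scoring method (VarianceScore or StabilityScore) and returns an error if the run fails. The README example passes true for mst, a value it names minimumSpanningTree.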

func (*Clustering) Subsample ¶

func (c *Clustering) Subsample(n int) *Clustering

Subsample will take the first 'n' data points and perform clustering on those. 'n' is a provided argument and should be between 0 and the total data size. Voronoi clustering will be performed after the clusters have been found for all points that are not in the subsample.

func (*Clustering) Verbose ¶

func (c *Clustering) Verbose() *Clustering

Verbose enables verbose mode for the clustering process, so that the internals of a clustering run are logged to stdout.

func (*Clustering) Voronoi ¶

func (c *Clustering) Voronoi() *Clustering

Voronoi will set voronoi-clustering to true, and after density clustering is performed, all points not assigned to a cluster will be placed into their nearest cluster (by centroid distance).

type DistanceFunc ¶

type DistanceFunc func(x1, x2 []float64) float64

DistanceFunc ...
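
Any function with this signature can serve as a distance measure. As a minimal sketch, the example below defines a Manhattan (L1) distance. The DistanceFunc type is redeclared locally so the snippet compiles without importing the package, and the manhattan function is a hypothetical example, not something this package provides.

```go
package main

import (
	"fmt"
	"math"
)

// DistanceFunc mirrors the signature documented above. It is redeclared
// locally so this sketch is self-contained.
type DistanceFunc func(x1, x2 []float64) float64

// manhattan sums the absolute coordinate differences (L1 distance).
// Both slices are expected to have the same length.
func manhattan(x1, x2 []float64) float64 {
	sum := 0.0
	for i := range x1 {
		sum += math.Abs(x1[i] - x2[i])
	}
	return sum
}

func main() {
	var dist DistanceFunc = manhattan
	fmt.Println(dist([]float64{1, 2, 3}, []float64{3, 2, 1})) // prints 4
}
```

With the real package, a function like this would presumably be passed as the first argument to Run in place of hdbscan.EuclideanDistance, since Run accepts a DistanceFunc.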

type Outlier ¶

type Outlier struct {
	Index              int
	NormalizedDistance float64
}

Outlier struct is used to provide information about an outlier data point.

type Outliers ¶

type Outliers []Outlier

Outliers is a slice of outlier points for a given cluster.

func (Outliers) MinProb ¶

func (o Outliers) MinProb() Outlier

MinProb ...
