hclust

package module

v1.0.2 Latest Latest Go to latest Published: Jan 13, 2020 License: MIT Imports: 7 Imported by: 6

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/knightjdr/hclust

Links

Open Source Insights

README ¶

hclust

Package for performing agglomerative hierarchical clustering in Golang.

Methods

Distance matrices can be calculated using the binary, Canberra, Euclidean, Jaccard, Manhattan or maximum metrics. The linkage methods available are: average, centroid, complete, McQuitty, median, single and Ward. The linkage method algorithms used are as recommended in Müllner. Briefly, the single method is implemented using MST, the average, complete, McQuitty and Ward methods are implemented using the nearest-neighbor chain algorithm and the centroid and median methods are implemented using the generic algorithm. Leaf optimization is performed using the improved optimization approach of Bar-Jospeh, et al.

Installation

go get github.com/knightjdr/hclust

Usage

import "github.com/knightjdr/hclust"

Distance

Setting the transpose argument to true will calculate distances between columns as apposed to rows. Euclidean distances will be calculated if an invalid metric is supplied. Valid metric values are: binary, canberra, euclidean, jaccard, manhattan or maximum.

hclust.Distance(matrix [][]float64, metric string, transpose bool) (dist [][]float64)

Cluster

Cluster requires a symmetric distance matrix and a linkage method. It will return a dendrogram with each element in the dendrogram corresponding to a node containing the leafs/subnodes and the length of the branches to the leafs/subnodes. Valid linkage values are: average, centroid, complete, mcquitty, median, single and ward.

type SubCluster struct {
	Leafa   int
	Leafb   int
	Lengtha float64
	Lengthb float64
	Node    int
}

hclust.Cluster(matrix [][]float64, method string) (dendrogram Dendrogram, err error)

Optimize

Optimize takes the dendrogram produced by hclust.Cluster and the distance matrix produced by hclust.Distance and optimizes the leaf ordering. The dendrogram from hclust.Cluster should be input as produced by that method without any modifications.

The running time will be at worst O(n⁴) with the optimzation approach of Bar-Jospeh. Certain (typically very large) datasets, and particular combinations of distance metric and linkage method can produce balanced nodes that require a huge number of comparisons to determine the optimal order. At these nodes the benefit of optimization is low and you can use the ignore argument to skip optimization for them.

To optimize every node in the tree, set the ignore argument to 0. Otherwise set the argument to the number of comparisons beyond which you wish to skip optimization. For example, nodes with 500 leaves on each side could require up to 250 000 test comparisons, so setting the ignore argument to 250 000 will skip these and larger nodes. The ordering that will be used at skipped nodes is the best ordering generated thus far.

The best practice is to run the optimization algorithm optimizing every node and if the run time is too long, adjust this value until an acceptable run time is reached.

hclust.Optimize(dendrogram Dendrogram, dist [][]float64, ignore int) (optimized Dendrogram)

Tree

Tree takes the dendrogram produced by hclust.Cluster or hclust.Optimize and a list of names for the leaves and generates a newick tree. It also returns the names vector sorted based on the clustering order. The input names vector should be in the same order as the rows/columns of the starting matrix/table used for generating the distance matrix. For example, if you are generating a distance matrix between rows of a matrix and those rows are ordered as:

	column1	column2
rowA	10	3.14
rowB	5.4	8
rowC	7	2.1

then the names vector should be []string{"rowA", "rowB", "rowC"}.

The dendrogram from hclust.Cluster or hclust.Optimize should be input as produced by those methods without any modifications.

type TreeLayout struct {
	Newick     string
	Order      []string
}

hclust.Tree(dendrogram Dendrogram, names []string) (tree TreeLayout, err error)

Sort

The hclust.Sort method can be used to sort the original data matrix that was input to hclust.Distance based on the clustering order. The method requires a vector containing the names of the rows/columns in their original order and a vector with the sorted order. The sorted order can be obtained from the hclust.Tree method. The dim argument must be one of "column" or "row". To sort a matrix by both column and row, simply call this method twice (once for columns and once for rows).

hclust.Sort(matrix [][]float64, names, sortOrder []string, dim string) (sorted [][]float64, err error)

Benchmarks

Benchmarking tests were performed using a single core on a 3.7 GHz Quad-Core Intel Xeon E5 processor with 32 GB RAM.

Distance matrix benchmarks were measured using an input table with 4157 rows and 199 columns, with the distances calculated between rows.

Distance metric	Execution time
Canberra	5.1s
Euclidean	1.6s
maximum	2.8s

Clustering benchmarks were measured using a symmetric distance matrix with dimensions 4157×4157.

Linkage method	Execution time
complete	1.7s
median	1.82s
single	1.78s

The Leaf optimization benchmark was measured using a dendrogram with 4157 leafs (4156 internal nodes).

Distance metric	Linkage method	Execution time
maximum	single	13.7s
Euclidean	complete	1m29.4s

Tests

go test

Documentation ¶

Overview ¶

Package hclust contains methods for performing agglomerative hierarchical clustering.

Constants ¶

This section is empty.

Variables ¶

View Source

var Cluster = cluster.Cluster

Cluster references the main cluster method in the cluster subpackage.

View Source

var Distance = distance.Distance

Distance references the main distance method in the distance subpackage.

View Source

var GetNodeHeight = dendrogram.GetNodeHeight

GetNodeHeights gets the height for each dendrogram node by summing child branch lengths.

View Source

var Optimize = optimize.Optimize

Optimize references the main leaf optimization method in the optimize subpackage.

View Source

var Sort = sort.Sort

Sort references the main sort method in the sort subpackage

View Source

var Tree = tree.Create

Tree references the main method for generating the newick tree in the tree subpackage.

Functions ¶

This section is empty.

Types ¶

type Dendrogram ¶ added in v1.0.2

type Dendrogram []SubCluster

Dendrogram is an array of SubClusters.

type SubCluster ¶

type SubCluster = typedef.SubCluster

SubCluster stores the node, distance and names of leafs for a subcluster.

type TreeLayout ¶ added in v1.0.1

type TreeLayout = tree.Tree

TreeLayout contains a tree in newick format and the leaf order.

Source Files ¶

View all Source files

hclust.go

Directories ¶

Path	Synopsis
cluster Package cluster contains methods for clustering.	Package cluster contains methods for clustering.
dendrogram Package dendrogram has helpers for analyzing dendrograms.	Package dendrogram has helpers for analyzing dendrograms.
distance Package distance contains methods to generate a distance matrix.	Package distance contains methods to generate a distance matrix.
matrixop Package matrixop contains methods for operating on matrices.	Package matrixop contains methods for operating on matrices.
optimize Package optimize optimizes the clustering order.	Package optimize optimizes the clustering order.
sort Package sort will sort a matrix by row or column.	Package sort will sort a matrix by row or column.
tree
typedef Package typedef has type definitions used throughout the hclust package.	Package typedef has type definitions used throughout the hclust package.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL