hclust

package module
v1.0.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 13, 2020 License: MIT Imports: 7 Imported by: 6

README

hclust

Package for performing agglomerative hierarchical clustering in Golang.

Methods

Distance matrices can be calculated using the binary, Canberra, Euclidean, Jaccard, Manhattan or maximum metrics. The linkage methods available are: average, centroid, complete, McQuitty, median, single and Ward. The linkage method algorithms used are as recommended in Müllner. Briefly, the single method is implemented using MST, the average, complete, McQuitty and Ward methods are implemented using the nearest-neighbor chain algorithm and the centroid and median methods are implemented using the generic algorithm. Leaf optimization is performed using the improved optimization approach of Bar-Jospeh, et al.

Installation

go get github.com/knightjdr/hclust

Usage

import "github.com/knightjdr/hclust"

Distance

Setting the transpose argument to true will calculate distances between columns as apposed to rows. Euclidean distances will be calculated if an invalid metric is supplied. Valid metric values are: binary, canberra, euclidean, jaccard, manhattan or maximum.

hclust.Distance(matrix [][]float64, metric string, transpose bool) (dist [][]float64)

Cluster

Cluster requires a symmetric distance matrix and a linkage method. It will return a dendrogram with each element in the dendrogram corresponding to a node containing the leafs/subnodes and the length of the branches to the leafs/subnodes. Valid linkage values are: average, centroid, complete, mcquitty, median, single and ward.

type SubCluster struct {
	Leafa   int
	Leafb   int
	Lengtha float64
	Lengthb float64
	Node    int
}

hclust.Cluster(matrix [][]float64, method string) (dendrogram Dendrogram, err error)
Optimize

Optimize takes the dendrogram produced by hclust.Cluster and the distance matrix produced by hclust.Distance and optimizes the leaf ordering. The dendrogram from hclust.Cluster should be input as produced by that method without any modifications.

The running time will be at worst O(n4) with the optimzation approach of Bar-Jospeh. Certain (typically very large) datasets, and particular combinations of distance metric and linkage method can produce balanced nodes that require a huge number of comparisons to determine the optimal order. At these nodes the benefit of optimization is low and you can use the ignore argument to skip optimization for them.

To optimize every node in the tree, set the ignore argument to 0. Otherwise set the argument to the number of comparisons beyond which you wish to skip optimization. For example, nodes with 500 leaves on each side could require up to 250 000 test comparisons, so setting the ignore argument to 250 000 will skip these and larger nodes. The ordering that will be used at skipped nodes is the best ordering generated thus far.

The best practice is to run the optimization algorithm optimizing every node and if the run time is too long, adjust this value until an acceptable run time is reached.

hclust.Optimize(dendrogram Dendrogram, dist [][]float64, ignore int) (optimized Dendrogram)

Tree

Tree takes the dendrogram produced by hclust.Cluster or hclust.Optimize and a list of names for the leaves and generates a newick tree. It also returns the names vector sorted based on the clustering order. The input names vector should be in the same order as the rows/columns of the starting matrix/table used for generating the distance matrix. For example, if you are generating a distance matrix between rows of a matrix and those rows are ordered as:

column1 column2
rowA 10 3.14
rowB 5.4 8
rowC 7 2.1

then the names vector should be []string{"rowA", "rowB", "rowC"}.

The dendrogram from hclust.Cluster or hclust.Optimize should be input as produced by those methods without any modifications.

type TreeLayout struct {
	Newick     string
	Order      []string
}

hclust.Tree(dendrogram Dendrogram, names []string) (tree TreeLayout, err error)
Sort

The hclust.Sort method can be used to sort the original data matrix that was input to hclust.Distance based on the clustering order. The method requires a vector containing the names of the rows/columns in their original order and a vector with the sorted order. The sorted order can be obtained from the hclust.Tree method. The dim argument must be one of "column" or "row". To sort a matrix by both column and row, simply call this method twice (once for columns and once for rows).

hclust.Sort(matrix [][]float64, names, sortOrder []string, dim string) (sorted [][]float64, err error)

Benchmarks

Benchmarking tests were performed using a single core on a 3.7 GHz Quad-Core Intel Xeon E5 processor with 32 GB RAM.

Distance matrix benchmarks were measured using an input table with 4157 rows and 199 columns, with the distances calculated between rows.

Distance metric Execution time
Canberra 5.1s
Euclidean 1.6s
maximum 2.8s

Clustering benchmarks were measured using a symmetric distance matrix with dimensions 4157×4157.

Linkage method Execution time
complete 1.7s
median 1.82s
single 1.78s

The Leaf optimization benchmark was measured using a dendrogram with 4157 leafs (4156 internal nodes).

Distance metric Linkage method Execution time
maximum single 13.7s
Euclidean complete 1m29.4s

Tests

go test

© 2018 James Knight.

Documentation

Overview

Package hclust contains methods for performing agglomerative hierarchical clustering.

Index

Constants

This section is empty.

Variables

View Source
var Cluster = cluster.Cluster

Cluster references the main cluster method in the cluster subpackage.

View Source
var Distance = distance.Distance

Distance references the main distance method in the distance subpackage.

View Source
var GetNodeHeight = dendrogram.GetNodeHeight

GetNodeHeights gets the height for each dendrogram node by summing child branch lengths.

View Source
var Optimize = optimize.Optimize

Optimize references the main leaf optimization method in the optimize subpackage.

View Source
var Sort = sort.Sort

Sort references the main sort method in the sort subpackage

View Source
var Tree = tree.Create

Tree references the main method for generating the newick tree in the tree subpackage.

Functions

This section is empty.

Types

type Dendrogram added in v1.0.2

type Dendrogram []SubCluster

Dendrogram is an array of SubClusters.

type SubCluster

type SubCluster = typedef.SubCluster

SubCluster stores the node, distance and names of leafs for a subcluster.

type TreeLayout added in v1.0.1

type TreeLayout = tree.Tree

TreeLayout contains a tree in newick format and the leaf order.

Directories

Path Synopsis
Package cluster contains methods for clustering.
Package cluster contains methods for clustering.
Package dendrogram has helpers for analyzing dendrograms.
Package dendrogram has helpers for analyzing dendrograms.
Package distance contains methods to generate a distance matrix.
Package distance contains methods to generate a distance matrix.
Package matrixop contains methods for operating on matrices.
Package matrixop contains methods for operating on matrices.
Package optimize optimizes the clustering order.
Package optimize optimizes the clustering order.
Package sort will sort a matrix by row or column.
Package sort will sort a matrix by row or column.
Package typedef has type definitions used throughout the hclust package.
Package typedef has type definitions used throughout the hclust package.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL