cgmlst-clustering

command module

v0.0.0-...-a195811 Latest Latest Go to latest Published: Apr 4, 2024 License: MIT Imports: 18 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/pathogenwatch-oss/cgmlst-clustering

Links

Open Source Insights

README ¶

Clustering

This module measures the distance between cgMLST profiles and performs Single Linkage Clustering using the SLINK algorithm. For N documents, SLINK calculates two arrays of length N called pi and lambda. The documents can be clustered quickly from these two arrays at arbitrary thresholds.

Inputs

This takes documents of three types

Request - always 1
Cache - 1 or none
Profiles - more than one and depends on the request

All inputs are encoded in BSON

The request lists the STs which we would like to cluster together. It also specifies the threshold below which we should record a cache of distances between STs (i.e. if the distance is greater than this value, the value is not recorded).

The cache is optional. It includes the known scores between a set of documents, a list of STs which those distances refer to, and the SLINK parameters (pi & lambda). Note that the order of the STs in the cache matter. We can reuse the parameters lambda and pi if all of the cache STs are in the list of request STs. This requires us to create a new ordering of STs in the output which starts with the cache STs and does not have any duplicates. We can call these the requestedSTs, cachedSTs and outputSTs respectivly.

Profiles are just the analysis documents from the cgMLST tasks. They may be supplied in any order.

Outputs

It outputs

Progress events - normally more than one
Score documents - T + 2 where T is the "threshold"

All outputs are encoded in JSON

Progress events are periodically sent and give an estimate of the progress as a percentage.

A score document is returned for each distance between 0 and T listing pairs of STs which are that distance from one another. The pairs are encoded as the index into the array of outputSTs. An additonal document is also sent which includes the SLINK parameters pi and lambda.

Internals

The cache includes the SLINK parameters pi and lambda as well as the order of the STs which they correspond to. They also include the distances between STs below a specified threshold. These are stored as a map of the distance and the pair of the edges which are that distance from one another. These are encoded as the index of the ST into the STs array. Other data structures might be simpler but this is probably one of the more scalable as the datasets get really big (i.e. 100k+ genomes).

The index is an efficient storage structure which holds the cgMLST records as bitarrays. These can then be compared really quickly.

The scores code holds an array of all the STs we'd like to compare in the order in which they will need to be given to the clustering code. This datastructure records whether the STs have already been scored (or taken from the cache) so that jobs can be distributed across the workers.

Testdata

Rather than storing hundreds of MB of testdata in binary JSON, I've included a JS script which you can use to deterministically generate testdata (somewhat randomly).

cd testdata
yarn
node createTestData.js
cd ..
go test

Documentation ¶

There is no documentation for this package.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL