dbscan

package module
v0.0.0-...-f30c2f0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 21, 2015 License: BSD-3-Clause Imports: 6 Imported by: 1

README

DBSCAN (go package) GoDoc

DBSCAN (Density-based spatial clustering) clustering optimized for multicore processing.

Idea

If the distance of two points in any dimension is more than eps, than the total distance is more than eps

  1. Calculate variance for each dimension (in parallel), find & store dimension with max variance

  2. Sort by the dimension with max variance

  3. Build a neighborhood map in parallel (Euclidean distance)

  • Slide through sorted data, from lowest to highest. Sliding performed until neighbor_value <= curr_value + eps

  • Use array / indices to store neighbors lists; ConcurrentLinkedQueue holds density reachable points.

  1. Use DFS (Depth First Search) to find clusters
Note: Unless go1.5 (or newer), need to set GOMAXPROCS=XX to utilize all cores
Example explaining how it works

Consider the following points:

Name => { X, Y }
"0" => { 2, 4 }
"1" => { 7, 3 }
"2" => { 3, 5 }
"3" => { 5, 3 }
"4" => { 7, 4 }
"5" => { 6, 8 }
"6" => { 6, 5 }
"7" => { 8, 4 }
"8" => { 2, 5 }
"9" => { 3, 7 }

Let eps = 2.0, minPts = 2:

  Y|
   |
  8|                       x <- Noise
   |
  7|           x
   |           |
  6|           | <- Points are density connected (from [3,5])
   |           |
  5|       x---x           x
   |         /               \
  4|       x                   x---x
   |                           |
  3|                   x-------x
   |                        /\
  2|                   Points are density reachable
   |
  1|
___|___________________________________
  0|   1   2   3   4   5   6   7   8  X

If we'll consider only X dimension, points sorted by this dimension will look like:

X  0  1  2  3  4  5  6  7  8  9
P        o  o     o  o  o  o

"0": 2.0, "8": 2.0, "2": 3.0, "9": 3.0, "3": 5.0, "5": 6.0, "6": 6.0, "1": 7.0, "4": 7.0, "7": 8.0
or
"0": [2.0, 4.0], "8": [2.0, 5.0], "2": [3.0, 5.0], "9": [3.0, 7.0], "3": [5.0, 3.0], "5": [6.0, 8.0], "6": [6.0, 5.0], "1": [7.0, 3.0], "4": [7.0, 4.0], "7": [8.0, 4.0]

We build neighborhood map (array containing concurrent non-blocking queues of density connected points' indices), by considering forward points, which are reachable by eps:

                       0    1    2    3    4    5    6    7    8    9
Sorted data points: [ "0", "8", "2", "9", "3", "5", "6", "1", "4", "7" ]

Point "0": 2.0, sees 3 points in range dimensionValue + eps - "8": 2.0, "2": 3.0, "9": 3.0
(e.g. 2.0 + 2.0 =4 - every point with X dimension value above 4 is definitely not reachable)
Then we check if they are reachable with Euclidean distance and add to neighbors.
Point "9": [3.0, 7.0] isn't density reachable to point "0": [2.0, 4.0]

Neighborhood map:
0: [ 1, 2 ]
1: [ 0 ]
2: [ 0 ]

Point "8": 2.0, sees "2": 3.0, "9": 3.0; but point "9": [3.0, 7.0] isn't density reachable
Neighborhood map:
0: [ 1, 2 ]
1: [ 0, 2 ]
2: [ 0, 1 ]

Point "2": 3.0, sees "9": 3.0, "3": 5.0; but point "3": [5.0, 3.0] isn't density reachable
Neighborhood map:
0: [ 1, 2 ]
1: [ 0, 2 ]
2: [ 0, 1, 3 ]
3: [ 2 ]

Note that we slide only forward, thus also add back-references to self index. Concurrent non-blocking queue allows to add in parallel (order doesn't matter).

Finally neighborhood map:

0: [0, 1, 2]
1: [0, 1, 2]
2: [0, 2, 3, 1]
3: [3, 2]
4: [4, 7]
5: [5]
6: [6, 8]
7: [7, 8, 9, 4]
8: [6, 8, 9, 7]
9: [8, 9, 7]

Now its easy to traverse and find clusters with Depth First Search (or BFS)

Cluster size: 4
 Point: 0, [2.0, 4.0]
 Point: 2, [3.0, 5.0]
 Point: 8, [2.0, 5.0]
 Point: 9, [3.0, 7.0]
Cluster size: 5
 Point: 3, [5.0, 3.0]
 Point: 1, [7.0, 3.0]
 Point: 7, [8.0, 4.0]
 Point: 4, [7.0, 4.0]
 Point: 6, [6.0, 5.0]

Documentation

Overview

DBSCAN (Density-based spatial clustering) clustering optimized for multicore processing.

Usage example:

var clusterer = NewDBSCANClusterer( 2.0, 2 )

var data = []ClusterablePoint{
		&NamedPoint{"0", []float64{2, 4}},
		&NamedPoint{"1", []float64{7, 3}},
		&NamedPoint{"2", []float64{3, 5}},
		&NamedPoint{"3", []float64{5, 3}},
		&NamedPoint{"4", []float64{7, 4}},
	}

clusterer.MinPts = 2
clusterer.SetEps( 2.0 )

// Automatic discovery of dimension with max variance
clusterer.AutoSelectDimension = false
// Set dimension manually
clusterer.SortDimensionIndex = 1

var result  [][]ClusterablePoint = clusterer.Cluster(data)

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Variance

func Variance(data []ClusterablePoint, dimension int) float64

Types

type ClusterablePoint

type ClusterablePoint interface {
	GetPoint() []float64
	String() string
}

func NamedPointToClusterablePoint

func NamedPointToClusterablePoint(in []*NamedPoint) (out []ClusterablePoint)

type ClusterablePointSlice

type ClusterablePointSlice struct {
	Data          []ClusterablePoint
	SortDimension int
}

Slice attaches the methods of Interface to []float64, sorting in increasing order.

func (ClusterablePointSlice) Len

func (self ClusterablePointSlice) Len() int

func (ClusterablePointSlice) Less

func (self ClusterablePointSlice) Less(i, j int) bool

func (ClusterablePointSlice) Sort

func (self ClusterablePointSlice) Sort()

Sort is a convenience method.

func (ClusterablePointSlice) Swap

func (self ClusterablePointSlice) Swap(i, j int)

type Clusterer

type Clusterer interface {
	Cluster([]ClusterablePoint) [][]ClusterablePoint
}

type ConcurrentQueue_InsertOnly

type ConcurrentQueue_InsertOnly struct {
	Head unsafe.Pointer
	Size uint64
}

func NewConcurrentQueue_InsertOnly

func NewConcurrentQueue_InsertOnly() *ConcurrentQueue_InsertOnly

func (*ConcurrentQueue_InsertOnly) Add

func (self *ConcurrentQueue_InsertOnly) Add(value uint)

func (*ConcurrentQueue_InsertOnly) Slice

func (self *ConcurrentQueue_InsertOnly) Slice() []uint

type DBSCANClusterer

type DBSCANClusterer struct {
	MinPts, SortDimensionIndex int
	AutoSelectDimension        bool
	// contains filtered or unexported fields
}

func NewDBSCANClusterer

func NewDBSCANClusterer(eps float64, minPts int) *DBSCANClusterer

func (*DBSCANClusterer) BuildNeighborhoodMap

func (this *DBSCANClusterer) BuildNeighborhoodMap(data []ClusterablePoint) []*ConcurrentQueue_InsertOnly

func (*DBSCANClusterer) CalcDistance

func (this *DBSCANClusterer) CalcDistance(aPoint, bPoint []float64) float64

func (*DBSCANClusterer) Cluster

func (this *DBSCANClusterer) Cluster(data []ClusterablePoint) [][]ClusterablePoint

* step 1: sort data by a dimension step 2: slide through sorted data (in parallel), and compute all points in range of eps (everything above eps is definitely isn't directly reachable) step 3: build neighborhood map & proceed DFS *

func (*DBSCANClusterer) GetEps

func (this *DBSCANClusterer) GetEps() float64

func (*DBSCANClusterer) PredictDimensionByMaxVariance

func (this *DBSCANClusterer) PredictDimensionByMaxVariance(data []ClusterablePoint) int

*

  • Calculate variance for each dimension (in parallel), returns dimension index with max variance

func (*DBSCANClusterer) SetEps

func (this *DBSCANClusterer) SetEps(eps float64)

type NamedPoint

type NamedPoint struct {
	Name  string
	Point []float64
}

func NewNamedPoint

func NewNamedPoint(name string, point []float64) *NamedPoint

func (*NamedPoint) Copy

func (self *NamedPoint) Copy() *NamedPoint

func (*NamedPoint) GetPoint

func (self *NamedPoint) GetPoint() []float64

func (*NamedPoint) String

func (self *NamedPoint) String() string

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL