dmmclust

package module
v1.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 12, 2017 License: MIT Imports: 4 Imported by: 2

README

DMMClust GoDoc Build Status Coverage Status Go Report Card

package dmmclust is a package that provides functions for clustering small texts as described by Yin and Wang (2014) in A Dirichlet Multinomial Mixture Model based Approach for Short Text Clustering.

The clustering algorithm is remarkably elegant and simple, leading to a very minimal implementation. This package also exposes some types to allow for extensibility.

Installing

go get -u github.com/go-nlp/dmmclust.

This package also provides a Gopkg.toml file for dep users.

This package uses SemVer 2.0 for versioning, and all releases are tagged.

How To Use

func main(){
	docs := getDocs()
	corp := getCorpus(docs)
	conf := dmmclust.Config{
		K:          10,                   // maximum 10 clusters expected
		Vocabulary: len(corp),            // simple example: the vocab is the same as the corpus size
		Iter:       100,                  // iterate 100 times
		Alpha:      0.0001,               // smaller probability of joining an empty group
		Beta:       0.1,                  // higher probability of joining groups like me
		Score:      dmmclust.Algorithm3,  // use Algorithm3 to score
		Sample:     dmmclust.Gibbs, // use Gibbs to sample
	}

	var clustered []dmmclust.Cluster // len(clustered) == len(docs)
	var err error
	if clustered, err = dmmclust.FindClusters(docs, conf); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Clusters:")
	for i, clust := range clustered {
		fmt.Printf("\t%d: %q\n", clust.ID(), data[i])
	}
}

Hyperparameters

  • K represents the maximum number of clusters expected. The final number of clusters can never exceed K.
  • Alpha represents the probability of joining an empty group. If Alpha is 0.0 then once a group is empty, it'll stay empty for the rest of the
  • Beta represents the probability of joining groups that are similar. If Beta is 0.0, then a document will never join a group if there are no common words between the groups and the documents. In some cases this is preferable (highly preprocessed inputs for example).

Playing Well With Other Packages

This package was originally built to play well with lingo. It's why it works on slices of integers. That's the only preprocessing necessary - converting a sentence into a slice of ints.

The Document interface is defined as:

type Document interface {
	TokenSet() TokenSet
	Len() int
}

TokenSet is simply a []int, where each ith element represents the word ID of a corpus. The order is not important in the provided algorithms (Algorithm3 and Algorithm4), but may be important in some other scoring function.

Extensibility

This package defines a Scoring Function as type ScoringFn func(doc Document, docs []Document, clusters []Cluster, conf Config) []float64. This allows for custom scoring functions to be used.

There are two scoring algorithms provided: Algorithm3 and Algorithm4. I've been successful at using other scoring algorithms as well.

The sampling function is also customizable. The default is to use Gibbs. I've not had much success at other sampling algorithms.

Contributing

To contribute to this package, simply file an issue, discuss and then send a pull request. Please ensure that tests are provided in any changes.

Licence

This package is MIT licenced.

Documentation

Overview

Example
package main

import (
	"fmt"
	"math/rand"
	"strings"

	. "github.com/go-nlp/dmmclust"
	"github.com/xtgo/set"
)

// data is some sample tweets from @chewxy.
// They have been "preprocessed" to aid the simple tokenizer.
// one hashtag has been changed from #gopherdata to #gopher
var data = []string{
	// coffee related tweet
	"A Java prefix that I don't hate .",
	"Colleagues must have thought I was crazy on Friday, but @MeccaCoffee ' s latest Xade Burqa is nothing short of orgasm inducing .",

	// JavaScript hate
	"Let me take this time while I wait for your JavaScript to download to tell you to stop using so much JavaScript on your web page .",

	// On Error Resume Next
	"all future programming languages I implement will have On Error Resume Next . Even if it's a functional , expressions-only language . Because I can. ",
	"On Error Resume Next",
	"When I was younger , I used VB. My crutch was On Error Resume Next . I find it weird reimplementing it for a probabilistic parser .",

	// Gophers/Golang
	"Questions for #gopher and #golang people out there : how do you debug a slow compile ? ",
	"In case you missed it , 10000 words on generics in #golang :",
	"Data Science in Go https://speakerdeck.com/chewxy/data-science-in-go … Slides by @chewxy #gopher #golang",
	"Big heap , many pointers . GC killing me . Help ? Tips? #golang . Most pointers unavoidable .",
}

func makeCorpus(a []string) map[string]int {
	retVal := make(map[string]int)
	var id int
	for _, s := range a {
		for _, f := range strings.Fields(s) {
			if _, ok := retVal[f]; !ok {
				retVal[f] = id
				id++
			}
		}
	}
	return retVal
}

func makeDocuments(a []string, c map[string]int, allowRepeat bool) []Document {
	retVal := make([]Document, 0, len(a))
	for _, s := range a {
		var ts []int
		for _, f := range strings.Fields(s) {
			id := c[f]
			ts = append(ts, id)
		}
		if !allowRepeat {
			ts = set.Ints(ts) // this uniquifies the sentence
		}
		retVal = append(retVal, TokenSet(ts))
	}
	return retVal
}

func main() {
	corp := makeCorpus(data)
	docs := makeDocuments(data, corp, false)
	r := rand.New(rand.NewSource(1337))
	conf := Config{
		K:          10,          // maximum 10 clusters expected
		Vocabulary: len(corp),   // simple example: the vocab is the same as the corpus size
		Iter:       1000,        // iterate 100 times
		Alpha:      0.0001,      // smaller probability of joining an empty group
		Beta:       0.1,         // higher probability of joining groups like me
		Score:      Algorithm3,  // use Algorithm3 to score
		Sampler:    NewGibbs(r), // use Gibbs to sample
	}
	var clustered []Cluster
	var err error
	if clustered, err = FindClusters(docs, conf); err != nil {
		fmt.Println(err)
	}
	fmt.Println("Clusters (Algorithm3):")
	for i, clust := range clustered {
		fmt.Printf("\t%d: %q\n", clust.ID(), data[i])
	}

	// Using Algorithm4, where repeat words are allowed
	docs = makeDocuments(data, corp, true)
	conf.Score = Algorithm4
	if clustered, err = FindClusters(docs, conf); err != nil {
		fmt.Println(err)
	}

	fmt.Println("\nClusters (Algorithm4):")
	for i, clust := range clustered {
		fmt.Printf("\t%d: %q\n", clust.ID(), data[i])
	}

}
Output:

Clusters (Algorithm3):
	0: "A Java prefix that I don't hate ."
	0: "Colleagues must have thought I was crazy on Friday, but @MeccaCoffee ' s latest Xade Burqa is nothing short of orgasm inducing ."
	1: "Let me take this time while I wait for your JavaScript to download to tell you to stop using so much JavaScript on your web page ."
	2: "all future programming languages I implement will have On Error Resume Next . Even if it's a functional , expressions-only language . Because I can. "
	2: "On Error Resume Next"
	2: "When I was younger , I used VB. My crutch was On Error Resume Next . I find it weird reimplementing it for a probabilistic parser ."
	3: "Questions for #gopher and #golang people out there : how do you debug a slow compile ? "
	3: "In case you missed it , 10000 words on generics in #golang :"
	3: "Data Science in Go https://speakerdeck.com/chewxy/data-science-in-go … Slides by @chewxy #gopher #golang"
	1: "Big heap , many pointers . GC killing me . Help ? Tips? #golang . Most pointers unavoidable ."

Clusters (Algorithm4):
	0: "A Java prefix that I don't hate ."
	0: "Colleagues must have thought I was crazy on Friday, but @MeccaCoffee ' s latest Xade Burqa is nothing short of orgasm inducing ."
	1: "Let me take this time while I wait for your JavaScript to download to tell you to stop using so much JavaScript on your web page ."
	2: "all future programming languages I implement will have On Error Resume Next . Even if it's a functional , expressions-only language . Because I can. "
	2: "On Error Resume Next"
	2: "When I was younger , I used VB. My crutch was On Error Resume Next . I find it weird reimplementing it for a probabilistic parser ."
	3: "Questions for #gopher and #golang people out there : how do you debug a slow compile ? "
	3: "In case you missed it , 10000 words on generics in #golang :"
	3: "Data Science in Go https://speakerdeck.com/chewxy/data-science-in-go … Slides by @chewxy #gopher #golang"
	2: "Big heap , many pointers . GC killing me . Help ? Tips? #golang . Most pointers unavoidable ."

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Algorithm3

func Algorithm3(doc Document, docs []Document, clusters []Cluster, conf Config) []float64

Algorithm3 is the implementation of Equation 3 in the original paper. This assumes that a word can only occur once in a string. If the requirement is that words can appear multiple times, use Algorithm4.

func Algorithm4

func Algorithm4(doc Document, docs []Document, clusters []Cluster, conf Config) []float64

Algorithm4 is the implementation of Equation 4 in the original paper. It allows for multiple words to be used in a document.

Types

type Cluster

type Cluster struct {
	// contains filtered or unexported fields
}

Cluster is a representation of a cluster. It doesn't actually store any of the data, only the metadata of the cluster

func FindClusters

func FindClusters(docs []Document, conf Config) ([]Cluster, error)

FindClusters is the main function to find clusters.

func (*Cluster) Docs

func (c *Cluster) Docs() int

Docs returns the number of documents in the cluster

func (*Cluster) Freq

func (c *Cluster) Freq(i int) float64

Freq returns the frequency of the ith word

func (*Cluster) ID

func (c *Cluster) ID() int

ID returns the ID of the cluster. This is not set or used until the results are returned

func (*Cluster) Wordcount

func (c *Cluster) Wordcount() int

Wordcount returns the number of words in the cluster

func (*Cluster) Words

func (c *Cluster) Words() []int

Words returns the word IDs in the cluster

type Config

type Config struct {
	// Maximum number of clusters expected
	K int

	// Vocabulary is the size of the vocabulary
	Vocabulary int

	// Number of iterations
	Iter int

	// Probability that controls whether a student will join an empty table
	Alpha float64

	// Betweenness affinity
	Beta float64

	// Score is a scoring function that will be used
	Score ScoringFn

	// Sampler is the sampler function
	Sampler Sampler
}

Config is a struct that configures the running of the algorithm.

type Document

type Document interface {
	// IDs returns a list of word IDs. It's preferable to return an ordered list, rather than a uniquified set.
	IDs() []int

	// Len returns the number of words in the document
	Len() int
}

Document is anything that can return a TokenSet

type Gibbs

type Gibbs struct {
	randomkit.BinomialGenerator
}

Gibbs is the standard sampling function, as per the paper.

func NewGibbs added in v1.1.0

func NewGibbs(rand *rand.Rand) *Gibbs

func (*Gibbs) Sample added in v1.1.0

func (s *Gibbs) Sample(p []float64) int

Gibbs returns the index sampled

type Sampler added in v1.1.0

type Sampler interface {
	Sample([]float64) int
}

Sampler is anything that can generate a index based on the given probability

type ScoringFn

type ScoringFn func(doc Document, docs []Document, clusters []Cluster, conf Config) []float64

ScoringFn is any function that can take a document and return the probabilities of it existing in those clusters

type TokenSet

type TokenSet []int

TokenSet is a vector of word IDs for a document. Depending on algorithm, it may be uniquified using a set function that does not preserve order

func (TokenSet) IDs added in v1.1.0

func (ts TokenSet) IDs() []int

func (TokenSet) Len

func (ts TokenSet) Len() int

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL