CloudForest


README

This will eventually be the home of a stable version of CloudForest. For now, see http://github.com/ryanbressler/CloudForest for the dev version.

Documentation

Overview

Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang). It includes implementations of Breiman and Cutler's Random Forest for classification and regression on heterogeneous numerical/categorical data with missing values, and several related algorithms including entropy and cost driven classification, L1 regression and feature selection with artificial contrasts, as well as hooks for modifying the algorithms for your needs.

Command line utilities to grow, apply and analyze forests are provided in subdirectories.

CloudForest is being developed in the Shmulevich Lab at the Institute for Systems Biology for use on genomic/biomedical data, with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute.

Documentation has been generated with godoc and can be viewed live at: http://godoc.org/github.com/ryanbressler/CloudForest

Pull requests and bug reports are welcome; the code repo and issue tracker can be found at: https://github.com/ryanbressler/CloudForest

Goals

CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code.

Data structures and file formats are chosen with use in multi threaded and cluster environments in mind.

Working with Trees

Go's support for function types is used to provide an interface for running code as data is percolated through a tree. This method is flexible enough that it can even extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method:

t.Root.Recurse(func(n *Node, innercases []int, depth int) {

	if (2 * leafSize) <= len(innercases) {
		SampleFirstN(&candidates, mTry)
		best, impDec := fm.BestSplitter(target, innercases, candidates[:mTry], allocs)
		if best != nil && impDec > minImp {
			//not a leaf node so define the splitter and left and right nodes
			//so recursion will continue
			n.Splitter = best
			n.Pred = ""
			n.Left = new(Node)
			n.Right = new(Node)

			return
		}
	}

	//a leaf node so define the prediction and stop recursing
	n.Splitter = nil
	n.Pred = target.FindPredicted(innercases)
}, fm, cases, 0)

This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc.) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects.

Alternative Impurities

Decision trees are grown with the goal of reducing "Impurity", which is usually defined as Gini impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface, which allows for alternative definitions of impurity. CloudForest includes several alternative targets:

EntropyTarget : For use in entropy minimizing classification
RegretTarget  : For use in classification driven by differing costs in miscategorization.
L1Target      : For use in L1 norm error regression (which may be less sensitive to outliers).
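
For example, a categorical feature can be wrapped for entropy driven classification before growing. A minimal sketch; the already parsed FeatureMatrix fm and the feature id "C:Class" are assumptions for illustration:

// "C:Class" is an assumed feature id in an already parsed FeatureMatrix fm;
// the wrapped feature can then be passed anywhere a Target is expected
target := NewEntropyTarget(&fm.Data[fm.Map["C:Class"]])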

Efficient Splitting

Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks.

Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In Go, slices pointing at the same underlying array make this sort of optimization transparent. For example, a function like:

func(s *Splitter) SplitInPlace(fm *FeatureMatrix, cases []int) (l []int, r []int)

can return left and right slices that point to the same underlying array as the original slice of cases, but these slices should not have their values changed.
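
The mechanics here are plain Go slicing; a toy sketch (not CloudForest code) of why no allocation is needed:

cases := []int{3, 1, 4, 1, 5, 9}
l := cases[:3] // "left" view, shares cases' backing array
r := cases[3:] // "right" view, no new allocation
// writing l[0] = -1 would also be visible through cases[0], which is
// why the slices returned by SplitInPlace must not be modified
_, _ = l, r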

Functions used while searching for the best split also accept pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like:

func (fm *FeatureMatrix) BestSplitter(target Target,
	cases []int,
	candidates []int,
	allocs *BestSplitAllocs) (s *Splitter, impurityDecrease float64)

func (f *Feature) BestSplit(target Target,
	cases *[]int,
	parentImp float64,
	allocs *BestSplitAllocs) (bestNum float64, bestCat int, bestBigCat *big.Int, impurityDecrease float64)

For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig.

All numerical predictors are handled by BestNumSplit which relies on Go's sort package.
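
A typical call site, sketched under the assumption that target, cases and candidates have been prepared as in the tree growing extract above:

allocs := NewBestSplitAllocs(len(fm.CaseLabels), target)
best, impDec := fm.BestSplitter(target, cases, candidates, allocs)
if best != nil && impDec > 0 {
	l, r := best.SplitInPlace(fm, cases)
	_, _ = l, r // l and r share cases' backing array
}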

Parallelism and Scaling

Training a Random Forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and reading data structures directly to and from disk.

Trees can be grown in separate goroutines. The growforest utility provides an example of this: it uses goroutines and channels to grow trees in parallel and write trees to disk as they are finished by the "worker" goroutines. The few summary statistics, like mean impurity decrease per feature (importance), can be calculated using thread safe data structures like RunningMean.
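
A stripped down sketch of that worker pattern. Here fm, target, the deck of candidate feature indexes, targetName, out (an io.Writer) and the nCores/nTrees/nSamples/mTry/leafSize parameters are all assumptions standing in for growforest's setup:

trees := make(chan *Tree)
imp := NewRunningMeans(len(fm.Data)) // RunningMean is thread safe, so workers can share it
for w := 0; w < nCores; w++ {
	go func() {
		// each worker copies the candidate deck (Grow shuffles it in
		// place via SampleFirstN) and owns its scratch allocations
		candidates := make([]int, len(deck))
		copy(candidates, deck)
		allocs := NewBestSplitAllocs(len(fm.CaseLabels), target)
		for i := 0; i < nTrees/nCores; i++ { // assumes nCores divides nTrees
			cases := SampleWithReplacment(nSamples, len(fm.CaseLabels))
			t := NewTree()
			t.Grow(fm, target, cases, candidates, mTry, leafSize, false, imp, allocs)
			trees <- t
		}
	}()
}
fw := NewForestWriter(out)
fw.WriteForestHeader(targetName, nTrees)
for i := 0; i < nTrees; i++ {
	fw.WriteTree(<-trees, i) // write each tree to disk as it finishes
}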

Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation, and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk.
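
A sketch of streaming such a file back tree by tree rather than loading the whole forest; f is an assumed open *os.File containing an .sf file and bb an assumed VoteTallyer:

fr := NewForestReader(f)
for {
	tree, _, err := fr.ReadTree()
	if err != nil {
		break // io.EOF once all trees have been read
	}
	tree.Vote(fm, bb)
}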

For data sets that are too big to fit in memory on a single machine, Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, a distributed database, etc.

Missing Values

By default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature with missing data, the missing cases are removed and the impurity value is corrected to use three way impurity, which reduces the bias towards features with lots of missing data:

I(split) = p(l)I(l)+p(r)I(r)+p(m)I(m)

Missing values in the target variable are left out of impurity calculations.

This provides generally good results at a fraction of the computational cost of imputing data.

Optionally, Feature.ImputeMissing or FeatureMatrix.ImputeMissing can be called before forest growth to impute missing values to the feature mean/mode, which Breiman [1] suggests as a fast method for imputing values.
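
For example:

fm.ImputeMissing() // impute every feature's missing values to its mean/mode in place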

The forest could then be analyzed for proximity (using leafcount or Tree.GetLeaves) to do the more accurate proximity weighted imputation Breiman describes.

Experimental support is provided for three way splitting, which splits missing cases onto a third branch [2]. This has so far yielded mixed results in testing.

At some point in the future, support may be added for local imputing of missing values during tree growth as described in [3].

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

[2] https://code.google.com/p/rf-ace/

[3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record

Importance and Contrasts

Variable Importance in CloudForest is calculated as the mean decrease in impurity over all of the splits made using a feature.

To provide a baseline for evaluating importance, artificial contrast features can be used by including shuffled copies of existing features.
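
Both modes are exposed on FeatureMatrix; a minimal sketch (the count of 20 is an arbitrary choice):

fm.AddContrasts(20) // append 20 shuffled copies of randomly chosen features
// or shuffle a copy of every feature:
fm.ContrastAll()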

Main Structures

In CloudForest data is stored using the FeatureMatrix struct which contains Features.

The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split.

The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification.

Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow method that implements Breiman and Cutler's method (see the extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method, including sampling cases, but it may be faster to grow the forest to disk as in the growforest utility.

Prediction and voting are done using Tree.Vote and CatBallotBox and NumBallotBox, which implement the VoteTallyer interface.
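
For a categorical target the whole predict/tally loop looks roughly like the sketch below; forest and fm are assumed to be loaded already:

bb := NewCatBallotBox(len(fm.CaseLabels))
for _, tree := range forest.Trees {
	tree.Vote(fm, bb)
}
for i, label := range fm.CaseLabels {
	fmt.Println(label, bb.Tally(i)) // case label and predicted category
}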

Compiling for Speed

When compiled with go1.1 CloudForest achieves running times similar to implementations in other languages. Using gccgo (4.8.0 at least) results in longer running times and is not recommended until full go1.1 support is implemented in gcc 4.8.1.

Roadmap

Development of CloudForest is being driven by our needs as we analyze large biomedical data sets. As such, new and modified analyses will be added as needed. The basic functionality has stabilized but we have discussed several possible changes that may require additional abstraction and/or changes in the API. These include:

Allowing additional types of candidate features. Some multidimensional data types may not be best served by decomposition into categorical and numerical features. It would be possible to allow arbitrary feature types by adding CandidateFeature (which should expose BestSplit), CodedSplitter and Splitter abstractions.

Allowing data to reside anywhere. This would involve abstracting FeatureMatrix to allow database etc driven implementations.

Growforest Utility

"growforest" trains a forest using the following parameters which can be listed with -h

Usage of growforest:
  -contrastall=false: Include a shuffled artificial contrast copy of every feature.
  -cost="": For categorical targets, a json string to float map of the cost of falsely identifying each category.
  -cpuprofile="": write cpu profile to file
  -entropy=false: Use entropy minimizing classification (target must be categorical).
  -importance="": File name to output importance.
  -impute=false: Impute missing values to feature mean/mode instead of filtering them out when splitting.
  -l1=false: Use l1 norm regression (target must be numeric).
  -leafSize=0: The minimum number of cases on a leaf node. If <=0 it will be inferred as 1 for classification, 4 for regression.
  -mTry=0: Number of candidate features for each split. Inferred as ceil(sqrt(nFeatures)) if <=0.
  -nContrasts=0: The number of randomized artificial contrast features to include in the feature matrix.
  -nCores=1: The number of cores to use.
  -nSamples=0: The number of cases to sample (with replacement) for each tree grown. If <=0 this is set to the total number of cases.
  -nTrees=100: Number of trees to grow in the predictor.
  -rfpred="rface.sf": File name to output predictor forest in sf format.
  -splitmissing=false: Split missing values onto a third branch at each node (experimental).
  -target="": The row header of the target in the feature matrix.
  -train="featurematrix.afm": AFM formated feature matrix containing training data.

Applyforest Utility

"applyforest" applies a forest to the specified feature matrix and outputs predictions as a two column (caselabel predictedvalue) tsv.

Usage of applyforest:
  -fm="featurematrix.afm": AFM formated feature matrix containing test data.
  -preds="predictions.tsv": The name of a file to write the predictions into.
  -rfpred="rface.sf": A predictor forest.

Errorrate Utility

"errorrate" calculates the error of a forest vs a testing data set and reports it to standard out.

Usage of errorrate:
  -fm="featurematrix.afm": AFM formated feature matrix containing test data.
  -rfpred="rface.sf": A predictor forest.

Leafcount Utility

"leafcount" outputs counts of case by case co-occurrence on leaf nodes (Breiman's proximity) and counts of the number of times a feature is used to split a node containing each case (a measure of relative/local importance).

Usage of leafcount:
  -branches="branches.tsv": a case by feature sparse matrix of leaf co-occurance in tsv format
  -fm="featurematrix.afm": AFM formated feature matrix to use.
  -leaves="leaves.tsv": a case by case sparse matrix of leaf co-occurance in tsv format
  -rfpred="rface.sf": A predictor forest.

Feature Matrix Files

CloudForest borrows the annotated feature matrix (.afm) and stochastic forest (.sf) file formats from Timo Erkkila's rf-ace, which can be found at https://code.google.com/p/rf-ace/

An annotated feature matrix (.afm) file is a tab delimited file with column and row headers. Columns represent cases and rows represent features. A row header/feature id includes a prefix to specify the feature type:

"N:" Prefix for numerical feature id.
"C:" Prefix for categorical feature id.
"B:" Prefix for boolean feature id.

Categorical and boolean features use strings for their category labels. Missing values are represented by "?","nan","na", or "null" (case insensitive). A short example:

featureid	case1	case2	case3
N:NumF1	0.0	.1	na
C:CatF2 red	red	green

Stochastic Forest Files

A stochastic forest (.sf) file contains a forest of decision trees. The main advantage of this format as opposed to an established format like json is that an sf file can be written iteratively tree by tree and multiple .sf files can be combined with minimal logic required allowing for massively parallel growth of forests with low memory use.

An .sf file consists of lines each of which is a comma separated list of key value pairs. Lines can designate either a FOREST, TREE, or NODE. Each tree belongs to the preceding forest and each node to the preceding tree. Nodes must be written in order of increasing depth.

CloudForest generates fewer fields than rf-ace but requires the following; other fields will be ignored.

Forest requires forest type (only RF currently), target and ntrees:

FOREST=RF|GBT|..,TARGET="$feature_id",NTREES=int

Tree requires only an int; the value is ignored, though the line is needed to designate a new tree:

TREE=int

Node requires a path encoded so that the root node is specified by "*" and each split left or right as "L" or "R". Leaf nodes should also define PRED such as "PRED=1.5" or "PRED=red". Splitter nodes should define SPLITTER with a feature id inside of double quotes, SPLITTERTYPE=[CATEGORICAL|NUMERICAL] and an LVALUES term which can be either a float inside of double quotes representing the highest value sent left or a ":" separated list of categorical values sent left.

NODE=$path,PRED=[float|string],SPLITTER="$feature_id",SPLITTERTYPE=[CATEGORICAL|NUMERICAL],LVALUES="[float|: separated list]"

An example .sf file:

FOREST=RF,TARGET="N:CLIN:TermCategory:NB::::",NTREES=12800
TREE=0
NODE=*,PRED=3.48283,SPLITTER="B:SURV:Family_Thyroid:F::::maternal",SPLITTERTYPE=CATEGORICAL,LVALUES="false"
NODE=*L,PRED=3.75
NODE=*R,PRED=1

CloudForest can parse and apply .sf files generated by at least some versions of rf-ace.

References

The idea for (and trademark of the term) Random Forests originated with Leo Breiman and Adele Cutler. Their code and papers can be found at:

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

All code in CloudForest is original, but some ideas for methods and optimizations were inspired by Timo Erkkila's rf-ace and by Andy Liaw and Matthew Wiener's randomForest R package, which is based on Breiman and Cutler's code:

https://code.google.com/p/rf-ace/
http://cran.r-project.org/web/packages/randomForest/index.html

The idea for Artificial Contrasts was found in: Eugene Tuv, Alexander Borisov, George Runger and Kari Torkkola's paper "Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination" http://www.researchgate.net/publication/220320233_Feature_Selection_with_Ensembles_Artificial_Variables_and_Redundancy_Elimination/file/d912f5058a153a8b35.pdf

The idea for growing trees to minimize categorical entropy comes from Ross Quinlan's ID3: http://en.wikipedia.org/wiki/ID3_algorithm

"The Elements of Statistical Learning" 2nd edition by Trevor Hastie, Robert Tibshirani and Jerome Friedman was also consulted during development.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewRunningMeans

func NewRunningMeans(size int) *[]*RunningMean

func SampleFirstN

func SampleFirstN(deck *[]int, n int)

SampleFirstN ensures that the first n entries in the supplied deck are randomly drawn from all entries without replacement for use in selecting candidate features to split on. It accepts a pointer to the deck so that it can be used repeatedly on the same deck avoiding reallocations.

func SampleWithReplacment

func SampleWithReplacment(nSamples int, totalCases int) (cases []int)

SampleWithReplacment samples nSamples random draws from [0,totalCases) with replacement for use in selecting cases to grow a tree from.

Types

type AdaBoostTarget

type AdaBoostTarget struct {
	*Feature
	Weights []float64
}

AdaBoostTarget wraps a numerical feature as a target for use in Adaptive Boosting (AdaBoost).

func NewAdaBoostTarget

func NewAdaBoostTarget(f *Feature) (abt *AdaBoostTarget)

func (*AdaBoostTarget) Boost

func (t *AdaBoostTarget) Boost(leaves *[][]int) (weight float64)

func (*AdaBoostTarget) Impurity

func (target *AdaBoostTarget) Impurity(cases *[]int, counter *[]int) (e float64)

AdaBoostTarget.Impurity is an AdaBoost weighted impurity that uses the weights specified in AdaBoostTarget.Weights.

func (*AdaBoostTarget) SplitImpurity

func (target *AdaBoostTarget) SplitImpurity(l []int, r []int, counter *[]int) (impurityDecrease float64)

AdaBoostTarget.SplitImpurity is an AdaBoosting version of SplitImpurity.

type BestSplitAllocs

type BestSplitAllocs struct {
	Left       *[]int
	Right      *[]int
	NonMissing *[]int
	Counter    *[]int
	Sorter     *SortableFeature
}

BestSplitAllocs contains reusable allocations for split searching.

func NewBestSplitAllocs

func NewBestSplitAllocs(nTotalCases int, target Target) (bsa *BestSplitAllocs)

NewBestSplitAllocs initializes all of the reusable allocations for split searching to the appropriate size. nTotalCases should be the total number of cases in the feature matrix being analyzed.

type BoostingTarget

type BoostingTarget interface {
	NCats() (n int)
	SplitImpurity(l []int, r []int, counter *[]int) (impurityDecrease float64)
	Impurity(cases *[]int, counter *[]int) (impurity float64)
	Boost(partition *[][]int) (weight float64)
	FindPredicted(cases []int) (pred string)
}

BoostingTarget is a Target that additionally provides a Boost method for use in boosted ensembles of trees.

type CatBallot

type CatBallot struct {
	Mutex sync.Mutex
	Map   map[int]float64
}

CatBallot is used inside of CatBallotBox to record categorical votes in a thread safe manner.

func NewCatBallot

func NewCatBallot() (cb *CatBallot)

NewCatBallot returns a pointer to an initialized CatBallot with a 0 size Map.

type CatBallotBox

type CatBallotBox struct {
	*CatMap
	// contains filtered or unexported fields
}

CatBallotBox keeps track of votes by trees in a thread safe manner.

func NewCatBallotBox

func NewCatBallotBox(size int) *CatBallotBox

NewCatBallotBox builds a new ballot box for the number of cases specified by "size".

func (*CatBallotBox) Tally

func (bb *CatBallotBox) Tally(i int) (predicted string)

Tally tallies the votes for the case specified by i as if it is a categorical or boolean feature, i.e. it returns the mode (the most frequent value) of all votes.

func (*CatBallotBox) TallyError

func (bb *CatBallotBox) TallyError(feature *Feature) (e float64)

TallyError returns the balanced classification error for categorical features:

1 - mean over i of ( |{x in X_i : Y'(x) = Y(x)}| / |X_i| )

where Y is the true label, Y' is the estimated label and X_i is the set of cases with the ith actual label.

Cases for which the true category is not known are ignored.

func (*CatBallotBox) Vote

func (bb *CatBallotBox) Vote(casei int, pred string, weight float64)

Vote registers a vote that case "casei" should be predicted to be the category "pred".

type CatMap

type CatMap struct {
	Map  map[string]int //map categories from string to Num
	Back []string       // map categories from Num to string
}

CatMap is for mapping categorical values to integers. It contains:

Map  : a map of ints by the string used for the category
Back : a slice of strings by the int that represents them

And is embedded by Feature and CatBallotBox.

func (*CatMap) CatToNum

func (cm *CatMap) CatToNum(value string) (numericv int)

CatToNum provides the int equivalent of the provided categorical value if it already exists or adds it to the map and returns the new value if it doesn't.

func (*CatMap) NCats

func (cm *CatMap) NCats() (n int)

type EntropyTarget

type EntropyTarget struct {
	*Feature
}

EntropyTarget wraps a categorical feature for use in entropy driven classification as in Ross Quinlan's ID3 (Iterative Dichotomizer 3).

func NewEntropyTarget

func NewEntropyTarget(f *Feature) *EntropyTarget

NewEntropyTarget creates an EntropyTarget wrapping the supplied categorical feature.

func (*EntropyTarget) Impurity

func (target *EntropyTarget) Impurity(cases *[]int, counts *[]int) (e float64)

EntropyTarget.Impurity implements categorical entropy as -sum(p_j*log2(p_j)) where p_j is the fraction of cases with the jth category.

func (*EntropyTarget) SplitImpurity

func (target *EntropyTarget) SplitImpurity(l []int, r []int, counter *[]int) (impurityDecrease float64)

EntropyTarget.SplitImpurity is a version of SplitImpurity that calls EntropyTarget.Impurity.

type Feature

type Feature struct {
	*CatMap
	NumData      []float64
	CatData      []int
	Missing      []bool
	Numerical    bool
	Name         string
	RandomSearch bool
}

Feature is a structure representing a single feature in a feature matrix. It contains an embedded CatMap (which may only be instantiated for categorical data) and:

NumData   : A slice of floats used for numerical data and nil otherwise
CatData   : A slice of ints for categorical data and nil otherwise
Missing   : A slice of bools indicating missing values; its length is the number of cases
Numerical : true if the feature is numerical
Name      : the name of the feature

func ParseFeature

func ParseFeature(record []string) Feature

ParseFeature parses a Feature from an array of strings. The type of the feature is inferred from the start of the first (header) field in record: "N:" indicating numerical, anything else (usually "C:" or "B:") indicating categorical.

func (*Feature) BestCatSplit

func (f *Feature) BestCatSplit(target Target,
	cases *[]int,
	parentImp float64,
	maxEx int,
	allocs *BestSplitAllocs) (bestSplit int, impurityDecrease float64)

BestCatSplit performs an exhaustive search for the split that minimizes impurity in the specified target for categorical features with fewer than 31 categories. It expects to be provided only the cases for which the feature is not missing.

This implementation follows Breiman's implementation; the R/Matlab implementations based on it use exhaustive search when there are fewer than 25/10 categories and random splits above that.

Searching is implemented via bitwise operations against an incrementing or random int (32 bit) for speed but will currently only work when there are fewer than 31 categories. Use one of the Big functions above that.

The best split is returned as an int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a Splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*Feature) BestCatSplitBig

func (f *Feature) BestCatSplitBig(target Target, cases *[]int, parentImp float64, maxEx int, allocs *BestSplitAllocs) (bestSplit *big.Int, impurityDecrease float64)

BestCatSplitBig performs a random/exhaustive search to find the split that minimizes impurity in the specified target. It expects to be provided only the cases for which the feature is not missing.

Searching is implemented via bitwise operations on big.Ints to handle large numbers of categories; BestCatSplit should be used for n < 31.

The best split is returned as a big.Int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a Splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*Feature) BestCatSplitIter

func (f *Feature) BestCatSplitIter(target Target, cases *[]int, parentImp float64, allocs *BestSplitAllocs) (bestSplit int, impurityDecrease float64)

BestCatSplitIter performs an iterative search to find the split that minimizes impurity in the specified target. It expects to be provided only the cases for which the feature is not missing.

Searching is implemented via bitwise operations on ints (32 bit) for speed but will currently only work when there are fewer than 31 categories. Use BestCatSplitIterBig above that.

The best split is returned as an int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a Splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*Feature) BestCatSplitIterBig

func (f *Feature) BestCatSplitIterBig(target Target, cases *[]int, parentImp float64, allocs *BestSplitAllocs) (bestSplit *big.Int, impurityDecrease float64)

BestCatSplitIterBig performs an iterative search to find the split that minimizes impurity in the specified target. It expects to be provided only the cases for which the feature is not missing.

Searching is implemented via bitwise operations on big.Ints to handle features with more categories than will fit in the bits of an int.

The best split is returned as a big.Int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a Splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*Feature) BestNumSplit

func (f *Feature) BestNumSplit(target Target,
	cases *[]int,
	parentImp float64,
	allocs *BestSplitAllocs) (bestSplit float64, impurityDecrease float64)

BestNumSplit searches over the possible splits of cases that can be made with f and returns the one that minimizes the impurity of the target, along with the impurity decrease.

It expects to be provided only the cases for which the feature is not missing.

It searches by sorting the cases by the potential splitter and then evaluating each "gap" between cases with non-equal values as a potential split.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*Feature) BestSplit

func (f *Feature) BestSplit(target Target,
	cases *[]int,
	parentImp float64,
	allocs *BestSplitAllocs) (bestNum float64, bestCat int, bestBigCat *big.Int, impurityDecrease float64)

BestSplit finds the best split of the feature that can be achieved using the specified target and cases. It returns the encoded split (a numerical threshold or a categorical bit set) and the decrease in impurity; DecodeSplit can be used to build a Splitter from these values.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*Feature) DecodeSplit

func (f *Feature) DecodeSplit(num float64, cat int, bigCat *big.Int) (s *Splitter)

DecodeSplit builds a splitter from the numeric values returned by BestNumSplit or BestCatSplit. Numeric splitters are decoded to send values <= num left. Categorical splitters are decoded to send categorical values for which the bit in cat is 1 left.
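
Together with BestSplit this gives the low level search loop; a minimal sketch assuming target, cases, parentImp and allocs are already prepared:

num, cat, bigCat, impDec := f.BestSplit(target, &cases, parentImp, allocs)
if impDec > 0 {
	s := f.DecodeSplit(num, cat, bigCat)
	l, r, m := s.Split(fm, cases) // left, right and missing cases
	_, _, _ = l, r, m
}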

func (*Feature) FilterMissing

func (f *Feature) FilterMissing(cases *[]int, filtered *[]int)

FilterMissing loops over the cases and appends those for which the feature is not missing into filtered. For most use cases filtered should have zero length before you begin, as it is not reset internally.

func (*Feature) FindPredicted

func (f *Feature) FindPredicted(cases []int) (pred string)

FindPredicted takes the indexes of a set of cases and returns the predicted value. For categorical features this is a string containing the most common category and for numerical features it is the mean of the values.

func (*Feature) Gini

func (target *Feature) Gini(cases *[]int) (e float64)

Gini returns the Gini impurity for the specified cases in the feature. Gini impurity is calculated as 1 - sum(f_i^2) where f_i is the fraction of cases in the ith category.

func (*Feature) GiniWithoutAlocate

func (target *Feature) GiniWithoutAlocate(cases *[]int, counts *[]int) (e float64)

GiniWithoutAlocate calculates Gini impurity using the supplied counter, which must be a slice with length equal to the number of categories in the feature. This allows you to reduce allocations; after the call the counter contains the per category counts.

func (*Feature) Impurity

func (target *Feature) Impurity(cases *[]int, counter *[]int) (e float64)

Impurity returns Gini impurity or mean squared error vs the mean for a set of cases, depending on whether the feature is categorical or numerical.

func (*Feature) ImputeMissing

func (f *Feature) ImputeMissing()

ImputeMissing imputes the missing values in a feature to the mean or mode of the feature.

func (*Feature) Mean

func (target *Feature) Mean(cases *[]int) (m float64)

Mean returns the mean of the feature for the cases specified.

func (*Feature) MeanSquaredError

func (target *Feature) MeanSquaredError(cases *[]int, predicted float64) (e float64)

MeanSquaredError returns the Mean Squared error of the cases specified vs the predicted value. Only non missing cases are considered.

func (*Feature) Mode

func (f *Feature) Mode(cases *[]int) (m string)

Mode returns the mode category of the feature for the cases specified.

func (*Feature) Modei

func (f *Feature) Modei(cases *[]int) (m int)

Modei returns the mode category of the feature, as its integer encoding, for the cases specified.

func (*Feature) NumImp

func (target *Feature) NumImp(cases *[]int) (e float64)

NumImp returns the mean squared error vs the mean, calculated with a two pass algorithm.

func (*Feature) ShuffledCopy

func (f *Feature) ShuffledCopy() (fake *Feature)

ShuffledCopy returns a shuffled version of f for use as an artificial contrast in evaluation of importance scores. The new feature will be named featurename:SHUFFLED

func (*Feature) SplitImpurity

func (target *Feature) SplitImpurity(l []int, r []int, counter *[]int) (impurityDecrease float64)

SplitImpurity calculates the impurity of a split into the specified left and right groups. This is defined as pL*i(tL) + pR*i(tR) where pL and pR are the probabilities of a case going left or right and i(tL), i(tR) are the left and right impurities.

Counter is only used for categorical targets and should have the same length as the number of categories in the target.

type FeatureMatrix

type FeatureMatrix struct {
	Data       []Feature
	Map        map[string]int
	CaseLabels []string
}

FeatureMatrix contains a slice of Features and a Map to look up the index of a feature by its string id.

func ParseAFM

func ParseAFM(input io.Reader) *FeatureMatrix

ParseAFM parses an AFM (annotated feature matrix) out of an io.Reader. The AFM format is a tsv with row and column headers where the row headers start with N: indicating numerical, C: indicating categorical or B: indicating boolean. This parser assumes features without an N: prefix are categorical.
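
A minimal loading sketch:

f, err := os.Open("featurematrix.afm")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
fm := ParseAFM(f)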

func (*FeatureMatrix) AddContrasts

func (fm *FeatureMatrix) AddContrasts(n int)

AddContrasts appends n artificial contrast features to a feature matrix. These features are generated by randomly selecting (with replacement) an existing feature and creating a shuffled copy named featurename:SHUFFLED.

These features can be used as a contrast to evaluate the importance scores assigned to actual features.

func (*FeatureMatrix) BestSplitter

func (fm *FeatureMatrix) BestSplitter(target Target,
	cases []int,
	candidates []int,
	allocs *BestSplitAllocs) (s *Splitter, impurityDecrease float64)

BestSplitter finds the best splitter from a number of candidate features to split on by looping over all candidate features and calling BestSplit.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*FeatureMatrix) ContrastAll

func (fm *FeatureMatrix) ContrastAll()

ContrastAll adds shuffled copies of every feature to the feature matrix. These features are generated by randomly selecting (with replacement) an existing feature and creating a shuffled copy named featurename:SHUFFLED.

These features can be used as a contrast to evaluate the importance scores assigned to actual features. ContrastAll is particularly useful vs AddContrasts when one wishes to identify [pseudo] unique identifiers that might lead to overfitting.

func (*FeatureMatrix) ImputeMissing

func (fm *FeatureMatrix) ImputeMissing()

ImputeMissing imputes missing values in all features to the mean or mode of the feature.

type Forest

type Forest struct {
	//Forest string
	Target string
	Trees  []*Tree
}

Forest represents a collection of decision trees grown to predict Target.

func GrowRandomForest

func GrowRandomForest(fm *FeatureMatrix,
	target *Feature,
	nSamples int,
	mTry int,
	nTrees int,
	leafSize int,
	splitmissing bool,
	importance *[]*RunningMean) (f *Forest)
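
GrowRandomForest is the in-memory convenience entry point; a sketch of a typical call, assuming fm and target have been loaded and mTry chosen, with the other values mirroring growforest's classification defaults:

imp := NewRunningMeans(len(fm.Data))
forest := GrowRandomForest(fm, target, len(fm.CaseLabels), mTry, 100, 1, false, imp)
_ = forest // apply, write to disk with a ForestWriter, etc.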

type ForestReader

type ForestReader struct {
	// contains filtered or unexported fields
}

ForestReader wraps an io.Reader to read a forest. It includes ReadForest for reading an entire forest or ReadTree for reading a forest tree by tree. The forest should be in .sf format; see the package docs in doc.go for full format details. It ignores fields that are not used by CloudForest.

func NewForestReader

func NewForestReader(r io.Reader) *ForestReader

NewForestReader wraps the supplied io.Reader as a ForestReader.

func (*ForestReader) ParseRfAcePredictorLine

func (fr *ForestReader) ParseRfAcePredictorLine(line string) map[string]string

ParseRfAcePredictorLine parses a single line of an rf-ace .sf "stochastic forest" file and returns a map[string]string of the key value pairs.

func (*ForestReader) ReadForest

func (fr *ForestReader) ReadForest() (forest *Forest, err error)

ForestReader.ReadForest reads the next forest from the underlying reader. If io.EOF or another error is encountered it returns that.

func (*ForestReader) ReadTree

func (fr *ForestReader) ReadTree() (tree *Tree, forest *Forest, err error)

ForestReader.ReadTree reads the next tree from the underlying reader. If the next tree is in a new forest it returns a forest object as well. If an io.EOF or other error is encountered it returns that as well as any partially parsed structs.

type ForestWriter

type ForestWriter struct {
	// contains filtered or unexported fields
}

ForestWriter wraps an io.Writer with functionality to write forests either with one call to WriteForest or incrementally using WriteForestHeader and WriteTree. ForestWriter saves a forest in .sf format; see the package docs in doc.go for full format details. It won't include fields that are not used by CloudForest.

func NewForestWriter

func NewForestWriter(w io.Writer) *ForestWriter

NewForestWriter returns a pointer to a new ForestWriter.

func (*ForestWriter) DescribeMap

func (fw *ForestWriter) DescribeMap(input map[string]bool) string

DescribeMap serializes the "left" map of a categorical splitter.

func (*ForestWriter) WriteForest

func (fw *ForestWriter) WriteForest(forest *Forest)

WriteForest writes an entire forest including all headers.

func (*ForestWriter) WriteForestHeader

func (fw *ForestWriter) WriteForestHeader(target string, ntrees int)

WriteForestHeader writes only the header of a forest.

func (*ForestWriter) WriteNode

func (fw *ForestWriter) WriteNode(n *Node, path string)

WriteNode writes a single node but not its children. WriteTree will be used more often but WriteNode can be used to grow a large tree directly to disk without storing it in memory.

func (*ForestWriter) WriteNodeAndChildren

func (fw *ForestWriter) WriteNodeAndChildren(n *Node, path string)

WriteNodeAndChildren recursively writes out the target node and all of its children. WriteTree is preferred for most use cases.

func (*ForestWriter) WriteTree

func (fw *ForestWriter) WriteTree(tree *Tree, ntree int)

WriteTree writes an entire Tree including the header.

func (*ForestWriter) WriteTreeHeader

func (fw *ForestWriter) WriteTreeHeader(ntree int, target string, weight float64)

WriteTreeHeader writes only the header line for a tree.

type GradBoostTarget

type GradBoostTarget struct {
	*Feature
	LearnRate float64
}

GradBoostTarget wraps a numerical feature as a target for use in gradient boosting.

func (*GradBoostTarget) Boost

func (f *GradBoostTarget) Boost(leaves *[][]int) (weight float64)

func (*GradBoostTarget) Update

func (f *GradBoostTarget) Update(cases *[]int)

Update updates the underlying numeric data by subtracting the mean*weight of the specified cases from the value for those cases.

type L1Target

type L1Target struct {
	*Feature
}

L1Target wraps a numerical feature as a target for use in L1 norm regression.

func (*L1Target) Impurity

func (target *L1Target) Impurity(cases *[]int, counter *[]int) (e float64)

L1Target.Impurity is an L1 version of impurity returning L1 instead of squared error.

func (*L1Target) MeanL1Error

func (target *L1Target) MeanL1Error(cases *[]int, predicted float64) (e float64)

L1Target.MeanL1Error returns the Mean L1 norm error of the cases specified vs the predicted value. Only non missing cases are considered.

func (*L1Target) SplitImpurity

func (target *L1Target) SplitImpurity(l []int, r []int, counter *[]int) (impurityDecrease float64)

L1Target.SplitImpurity is an L1 version of SplitImpurity.

type Leaf

type Leaf struct {
	Cases []int
	Pred  string
}

Leaf is a struct for storing the indexes of the cases at a terminal "leaf" node along with the predicted value.

type Node

type Node struct {
	Left     *Node
	Right    *Node
	Missing  *Node
	Pred     string
	Splitter *Splitter
}

A node of a decision tree. Pred is a string containing either the category or a representation of a float (less than ideal).

func (*Node) Recurse

func (n *Node) Recurse(r Recursable, fm *FeatureMatrix, cases []int, depth int)

Recurse is used to apply a Recursable function at every downstream node as the cases specified by cases []int are split using the data in fm *FeatureMatrix. Recursion down a branch stops when a node with n.Splitter == nil is reached. Recursion down the Missing branch is only used if n.Missing != nil. For example, votes can be tabulated using code like:

t.Root.Recurse(func(n *Node, cases []int, depth int) {
	if n.Left == nil && n.Right == nil {
		// I'm in a leaf node
		for i := 0; i < len(cases); i++ {
			bb.Vote(cases[i], n.Pred, 1.0)
		}
	}
}, fm, cases, 0)

type NumBallotBox

type NumBallotBox struct {
	// contains filtered or unexported fields
}

NumBallotBox keeps track of votes by trees. Voting is thread safe.

func NewNumBallotBox

func NewNumBallotBox(size int) *NumBallotBox

NewNumBallotBox builds a new ballot box for the number of cases specified by "size".

func (*NumBallotBox) Tally

func (bb *NumBallotBox) Tally(i int) (predicted string)

func (*NumBallotBox) TallyError

func (bb *NumBallotBox) TallyError(feature *Feature) (e float64)

TallyError returns the error of the votes vs the provided feature. For categorical features it returns the error rate; for numerical features it returns mean squared error. The provided feature must use the same index as the feature matrix the ballot box was constructed with. Missing values are ignored. Gini impurity is not used, so this is not for use in rf implementations.

func (*NumBallotBox) TallyNum

func (bb *NumBallotBox) TallyNum(i int) (predicted float64)

TallyNum tallies the votes for the case specified by i as if it is a numerical feature, i.e. it returns the mean of all votes.

func (*NumBallotBox) Vote

func (bb *NumBallotBox) Vote(casei int, pred string, weight float64)

Vote parses the float in the string and votes for it.

type Recursable

type Recursable func(*Node, []int, int)

Recursable defines a function signature for functions that can be called at every downstream node of a tree as Node.Recurse recurses down the tree. The function receives the current node, an array of ints specifying the cases that have not been split away, and the current depth.

type RegretTarget

type RegretTarget struct {
	*Feature
	Costs []float64
}

RegretTarget wraps a categorical feature for use in regret driven classification. The ith entry in costs should contain the cost of misclassifying a case that actually has the ith category.

func NewRegretTarget

func NewRegretTarget(f *Feature) *RegretTarget

NewRegretTarget creates a RegretTarget and initializes RegretTarget.Costs to the proper length.

func (*RegretTarget) Impurity

func (target *RegretTarget) Impurity(cases *[]int, counter *[]int) (e float64)

RegretTarget.Impurity implements a simple regret function that finds the average cost of a set using the misclassification costs in RegretTarget.Costs.

func (*RegretTarget) SetCosts

func (target *RegretTarget) SetCosts(costmap map[string]float64)

RegretTarget.SetCosts copies the costs from a map[string]float64, keyed by category label, into the proper entries in RegretTarget.Costs.

func (*RegretTarget) SplitImpurity

func (target *RegretTarget) SplitImpurity(l []int, r []int, counter *[]int) (impurityDecrease float64)

RegretTarget.SplitImpurity is a version of SplitImpurity that calls RegretTarget.Impurity.

type RunningMean

type RunningMean struct {
	Mean  float64
	Count float64
	// contains filtered or unexported fields
}

RunningMean is a thread safe struct for keeping track of running means, as used in importance calculations. (TODO: could this be made lock free?)

func (*RunningMean) Add

func (rm *RunningMean) Add(val float64)

RunningMean.Add adds the specified value to the running mean in a thread safe way.

func (*RunningMean) Read

func (rm *RunningMean) Read() (mean float64, count float64)

RunningMean.Read reads the mean and count

func (*RunningMean) WeightedAdd

func (rm *RunningMean) WeightedAdd(val float64, weight float64)

RunningMean.WeightedAdd adds the specified value to the running mean with the specified weight in a thread safe way.

type SortableFeature

type SortableFeature struct {
	Feature *Feature
	Cases   []int
}

SortableFeature is a wrapper for a feature and set of cases that satisfies the sort.Interface interface so that the case indexes in Cases can be sorted using sort.Sort.

func (SortableFeature) Len

func (sf SortableFeature) Len() int

Len returns the number of cases.

func (SortableFeature) Less

func (sf SortableFeature) Less(i int, j int) bool

Less determines if the ith case is less than the jth case.

func (SortableFeature) Swap

func (sf SortableFeature) Swap(i int, j int)

Swap exchanges the ith and jth cases.

type SparseCounter

type SparseCounter struct {
	Map map[int]map[int]int
}

SparseCounter uses maps to track sparse integer counts in a large matrix. The matrix is assumed to contain zero values where nothing has been added.

func (*SparseCounter) Add

func (sc *SparseCounter) Add(i int, j int, val int)

Add increases the count in i,j by val.

func (*SparseCounter) WriteTsv

func (sc *SparseCounter) WriteTsv(writer io.Writer)

WriteTsv writes the non zero counts out into a three column tsv containing i, j, and count in the columns.

type Splitter

type Splitter struct {
	Feature   string
	Numerical bool
	Value     float64
	Left      map[string]bool
}

Splitter contains fields that can be used to split cases by a single feature. The split can be either numerical, in which case it is defined by the Value field, or categorical, in which case it is defined by the Left field.

func (*Splitter) Split

func (s *Splitter) Split(fm *FeatureMatrix, cases []int) (l []int, r []int, m []int)

Splitter.Split splits a slice of cases into left, right and missing slices without allocating a new underlying array by sorting cases into left, missing, right order and returning slices that point to the left and right cases.

type Target

type Target interface {
	NCats() (n int)
	SplitImpurity(l []int, r []int, counter *[]int) (impurityDecrease float64)
	Impurity(cases *[]int, counter *[]int) (impurity float64)
	FindPredicted(cases []int) (pred string)
}

Target abstracts the methods needed for a feature to be predictable as either a categorical or numerical feature in a random forest.

type Tree

type Tree struct {
	//Tree int
	Root   *Node
	Target string
	Weight float64
}

Tree represents a single decision tree.

func NewTree

func NewTree() *Tree

func (*Tree) AddNode

func (t *Tree) AddNode(path string, pred string, splitter *Splitter)

AddNode adds a node at the specified path with the specified pred value and/or Splitter. Paths are specified in the same format as in rf-ace's .sf files, as a root "*" followed by a string of 'L' and 'R' characters. Nodes must be added from the root down, as the case where the path specifies a node whose parent does not already exist in the tree is not handled well.
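
For example, the tiny tree from the .sf example above could be rebuilt by hand:

t := NewTree()
t.AddNode("*", "3.48283", &Splitter{
	Feature: "B:SURV:Family_Thyroid:F::::maternal",
	Left:    map[string]bool{"false": true}, // categorical: send "false" left
})
t.AddNode("*L", "3.75", nil)
t.AddNode("*R", "1", nil)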

func (*Tree) GetLeaves

func (t *Tree) GetLeaves(fm *FeatureMatrix, fbycase *SparseCounter) []Leaf

GetLeaves is called by the leafcount utility to gather statistics about the nodes of a tree, including the sets of cases at "leaf" nodes that aren't split further and the number of times each feature is used to split away each case.

func (*Tree) GetSplits

func (t *Tree) GetSplits(fm *FeatureMatrix, fbycase *SparseCounter, relativeSplitCount *SparseCounter) []Splitter

GetSplits returns an array of all of the numerical Splitters of a tree.

func (*Tree) Grow

func (t *Tree) Grow(fm *FeatureMatrix,
	target Target,
	cases []int,
	candidates []int,
	mTry int,
	leafSize int,
	splitmissing bool,
	importance *[]*RunningMean,
	allocs *BestSplitAllocs)

Tree.Grow grows the receiver tree through recursion. It uses impurity decrease to select splitters at each node as in Breiman's Random Forest. It should be called on a tree with only a root node defined.

fm is a feature matrix of training data.

target is the feature to predict via regression or classification as determined by feature type.

cases specifies the cases to calculate impurity decrease over and can contain repeated values to allow for sampling of cases with replacement as in RF.

mTry specifies the number of candidate features to evaluate for each split.

leafSize specifies the minimum number of cases at a leafNode.

func (*Tree) Partition

func (t *Tree) Partition(fm *FeatureMatrix) *[][]int

func (*Tree) Vote

func (t *Tree) Vote(fm *FeatureMatrix, bb VoteTallyer)

Tree.Vote casts a vote for the predicted value of each case in fm into bb. Since BallotBox is not thread safe, trees should not vote into the same BallotBox in parallel.

func (*Tree) VoteCases

func (t *Tree) VoteCases(fm *FeatureMatrix, bb VoteTallyer, cases []int)

Tree.VoteCases casts a vote for the predicted value of each of the specified cases in fm into bb. Since BallotBox is not thread safe, trees should not vote into the same BallotBox in parallel.

type VoteTallyer

type VoteTallyer interface {
	Vote(casei int, pred string, weight float64)
	TallyError(feature *Feature) float64
	Tally(casei int) string
}

VoteTallyer is used to tabulate votes by trees and is implemented by feature type specific structs like NumBallotBox and CatBallotBox. Vote should register a vote that casei should be predicted as pred. TallyError returns the error vs the supplied feature.

