CloudForest

Fast, flexible, multi-threaded ensembles of decision trees for machine learning in pure Go (golang).

CloudForest allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical / categorical data with missing values. These include:

  • Breiman and Cutler's Random Forest for Classification and Regression
  • Adaptive Boosting (AdaBoost) Classification
  • Gradient Boosting Tree Regression and Two Class Classification
  • Hellinger Distance Trees for Classification
  • Entropy, Cost driven and Class Weighted classification
  • L1/Absolute Deviance Decision Tree regression
  • Improved Feature Selection via artificial contrasts with ensembles (ACE)
  • Roughly Balanced Bagging for Unbalanced Data
  • Improved robustness using out of bag cases and artificial contrasts.
  • Support for missing values via bias correction or three way splitting.
  • Proximity/Affinity Analysis suitable for manifold learning
  • A number of experimental splitting criteria

The Design Prioritizes:

  • Training speed
  • Performance on high-dimensional, heterogeneous datasets (e.g. genetic and clinical data).
  • An optimized set of core functionality.
  • The flexibility to quickly implement new impurities and algorithms using the common core.
  • The ability to natively handle non numerical data types and missing values.
  • Use in a multi core or multi machine environment.

It can achieve quicker training times than many other popular implementations on some datasets. This is the result of CPU cache friendly memory utilization, well suited to modern processors, and separate, optimized paths to learn splits from binary, numerical and categorical data.

Benchmarks

CloudForest offers good general accuracy, and the alternative and augmented algorithms it implements can offer reduced error rates for specific use cases, especially recovering a signal from noisy, high-dimensional data prone to over-fitting and predicting rare events and unbalanced classes (both typical in genetic studies of diseases). These methods should be included in parameter sweeps to maximize accuracy.

(Work on benchmarks and optimization is ongoing, if you find a slow use case please raise an issue.)

Command line utilities to grow, apply and analyze forests and do cross validation are provided, or CloudForest can be used as a library in Go programs.

This document covers command line usage, file formats and some algorithmic background.

Documentation for coding against CloudForest has been generated with godoc and can be viewed live at: http://godoc.org/github.com/ryanbressler/CloudForest

Pull requests, spelling corrections and bug reports are welcome; the code repository and issue tracker can be found at: https://github.com/ryanbressler/CloudForest

A Google discussion group can be found at: https://groups.google.com/forum/#!forum/cloudforest-dev

CloudForest was created in the Shmulevich Lab at the Institute for Systems Biology.

(Build status includes accuracy tests on iris and Boston housing price datasets and multiple go versions.)

Installation

With go installed:

go get github.com/ryanbressler/CloudForest
go install github.com/ryanbressler/CloudForest/growforest
go install github.com/ryanbressler/CloudForest/applyforest

#optional utilities
go install github.com/ryanbressler/CloudForest/leafcount
go install github.com/ryanbressler/CloudForest/utils/nfold
go install github.com/ryanbressler/CloudForest/utils/toafm

To update to the latest version use the -u flag with go get:

go get -u github.com/ryanbressler/CloudForest
go install github.com/ryanbressler/CloudForest/growforest
go install github.com/ryanbressler/CloudForest/applyforest

#optional utilities
go install github.com/ryanbressler/CloudForest/leafcount
go install github.com/ryanbressler/CloudForest/utils/nfold
go install github.com/ryanbressler/CloudForest/utils/toafm

Quick Start

Data can be provided in a tsv based annotated feature matrix or in arff or libsvm formats with ".arff" or ".libsvm" extensions. Details are discussed in the Data File Formats section below and a few example data sets are included in the "data" directory.

#grow a predictor forest with default parameters and save it to forest.sf
growforest -train train.fm -rfpred forest.sf -target B:FeatureName

#grow a 1000 tree forest using 16 cores and report out of bag error
#with minimum leafSize 8 
growforest -train train.fm -rfpred forest.sf -target B:FeatureName -oob \
-nCores 16 -nTrees 1000 -leafSize 8

#grow a 1000 tree forest evaluating half the features as candidates at each 
#split and reporting out of bag error after each tree to watch for convergence
growforest -train train.fm -rfpred forest.sf -target B:FeatureName -mTry .5 -progress 

#growforest with weighted random forest
growforest -train train.fm -rfpred forest.sf -target B:FeatureName \
-rfweights '{"true":2,"false":0.5}'

#report all growforest options
growforest -h

#Print the (balanced for classification, least squares for regression) error
#rate on test data to standard out
applyforest -fm test.fm -rfpred forest.sf

#Apply the forest, report error rate and save predictions
#Predictions are output in a tsv as:
#CaseLabel	Predicted	Actual
applyforest -fm test.fm -rfpred forest.sf -preds predictions.tsv

#Calculate counts of case vs case (leaves) and case vs feature (branches) proximity.
#Leaves are reported as:
#Case1 Case2 Count
#Branches Are Reported as:
#Case Feature Count
leafcount -train train.fm -rfpred forest.sf -leaves leaves.tsv -branches branches.tsv

#Generate training and testing folds
nfold -fm data.fm

#growforest with internal training and testing
growforest -train train_0.fm -target N:FeatureName -test test_0.fm

#growforest with internal training and testing, 10 ace feature selection permutations and
#testing performed only using significant features
growforest -train train_0.fm -target N:FeatureName -test test_0.fm -ace 10 -cutoff .05

Growforest Utility

growforest trains a forest using the following parameters, which can be listed with -h.

Parameters are implemented using Go's flag parser, so boolean parameters can be set to true with a simple flag:

#the following are equivalent
growforest -oob
growforest -oob=true

And equals signs and quotes are optional for other parameters:

#the following are equivalent
growforest -train featurematrix.afm
growforest -train="featurematrix.afm"
Basic options
  -target="": The row header of the target in the feature matrix.
  -train="featurematrix.afm": AFM formated feature matrix containing training data.
  -rfpred="rface.sf": File name to output predictor forest in sf format.
  -leafSize="0": The minimum number of cases on a leaf node. If <=0 will be inferred to 1 for classification 4 for regression.
  -maxDepth=0: Maximum tree depth. Ignored if 0.
  -mTry="0": Number of candidate features for each split as a count (ex: 10) or portion of total (ex: .5). Ceil(sqrt(nFeatures)) if <=0.
  -nSamples="0": The number of cases to sample (with replacement) for each tree as a count (ex: 10) or portion of total (ex: .5). If <=0 set to total number of cases.
  -nTrees=100: Number of trees to grow in the predictor.
 
  -importance="": File name to output importance.

  -oob=false: Calculate and report oob error.
 
Advanced Options
  -blacklist="": A list of feature id's to exclude from the set of predictors.
  -includeRE="": Filter features that DON'T match this RE.
  -blockRE="": A regular expression to identify features that should be filtered out.
  -force=false: Force at least one non constant feature to be tested for each split as in scikit-learn.
  -impute=false: Impute missing values to feature mean/mode before growth.
  -nCores=1: The number of cores to use.
  -progress=false: Report tree number and running oob error.
  -oobpreds="": Calculate and report oob predictions in the file specified.
  -cpuprofile="": write cpu profile to file
  -multiboost=false: Allow multi-threaded boosting which may have unexpected results. (highly experimental)
  -nobag=false: Don't bag samples for each tree.
  -evaloob=false: Evaluate potential splitting features on OOB cases after finding split value in bag.
  -selftest=false: Test the forest on the data and report accuracy.
  -splitmissing=false: Split missing values onto a third branch at each node (experimental).
  -test="": Data to test the model on after training.
Regression Options
  -gbt=0: Use gradient boosting with the specified learning rate.
  -l1=false: Use l1 norm regression (target must be numeric).
  -ordinal=false: Use ordinal regression (target must be numeric).
Classification Options
  -adaboost=false: Use Adaptive boosting for classification.
  -balanceby="": Roughly balanced bag the target within each class of this feature.
  -balance=false: Balance bagging of samples by target class for unbalanced classification.
  -cost="": For categorical targets, a json string to float map of the cost of falsely identifying each category.
  -entropy=false: Use entropy minimizing classification (target must be categorical).
  -hellinger=false: Build trees using hellinger distance.
  -positive="True": Positive class to output probabilities for.
  -rfweights="": For categorical targets, a json string to float map of the weights to use for each category in Weighted RF.
  -NP=false: Do approximate Neyman-Pearson classification.
  -NP_a=0.1: Constraint on precision in NP classification [0,1]
  -NP_k=100: Weight of constraint in NP classification [0,Inf+)
  -NP_pos="1": Class label to constrain precision in NP classification.

Note: rfweights and cost should use json to specify the weights and/or costs per class using the strings used to represent the class in the boolean or categorical feature:

   growforest -rfweights '{"true":2,"false":0.5}'
Randomizing Data and Artificial Contrasts

Randomly shuffling parts of the data or including shuffled "Artificial Contrasts" can be useful to establish baselines for comparison.

The "vet" option extends the principle to tree growth. When evaluating potential splitters it subtracts the impurity decrease from the best split candidate splitters can make on a shuffled target from the impurity decrease of the actual best split. This is intended to penalizes certain types of features that contribute to over-fitting including unique identifiers and sparse features

  -ace=0: Number of ace permutations to do. Output ace style importance and p values.
  -permute: Permute the target feature (to establish random predictive power).
  -contrastall=false: Include a shuffled artificial contrast copy of every feature.
  -nContrasts=0: The number of randomized artificial contrast features to include in the feature matrix.
  -shuffleRE="": A regular expression to identify features that should be shuffled.
  -vet=false: Penalize potential splitter impurity decrease by subtracting the best split of a permuted target.

Applyforest Utility

applyforest applies a forest to the specified feature matrix and outputs predictions as a two column (caselabel predictedvalue) tsv.

Usage of applyforest:
  -expit=false: Expit (inverse logit) transform data (for gradient boosting classification).
  -fm="featurematrix.afm": AFM formated feature matrix containing data.
  -mean=false: Force numeric (mean) voting.
  -mode=false: Force categorical (mode) voting.
  -preds="": The name of a file to write the predictions into.
  -rfpred="rface.sf": A predictor forest.
  -sum=false: Force numeric sum voting (for gradient boosting etc).
  -votes="": The name of a file to write categorical vote totals to.

Leafcount Utility

leafcount outputs counts of case vs case co-occurrence on leaf nodes (leaves.tsv, Breiman's proximity) and counts of the number of times a feature is used to split a node containing each case (branches.tsv, a measure of relative/local importance).

Usage of leafcount:
  -branches="branches.tsv": a case by feature sparse matrix of leaf co-occurrence in tsv format
  -fm="featurematrix.afm": AFM formated feature matrix to use.
  -leaves="leaves.tsv": a case by case sparse matrix of leaf co-occurrence in tsv format
  -rfpred="rface.sf": A predictor forest.

nfold utility

nfold is a utility for generating cross validation folds. It can read in and output any of the supported formats. You can specify a categorical target feature to do stratified sampling, which will balance the classes between the folds.

If no target feature is specified, a numerical target feature is specified, or the -unstratified option is provided, unstratified sampling will be used.

Usage of nfold:
  -fm="featurematrix.afm": AFM formated feature matrix containing data.
  -folds=5: Number of folds to generate.
  -target="": The row header of the target in the feature matrix.
  -test="test_%v.fm": Format string for testing fms.
  -train="train_%v.fm": Format string for training fms.
  -unstratified=false: Force unstratified sampling of categorical target.
  -writeall=false: Output all three formats.
  -writearff=false: Output arff.
  -writelibsvm=false: Output libsvm.

Importance

Variable importance in CloudForest is based on the mean decrease in impurity over all of the splits made using a feature. It is output in a tsv with the following columns:

column 0: Feature
column 1: Decrease Per Use
column 2: Use Count
column 3: Decrease Per Tree
column 4: Decrease Per Tree Used
column 5: Tree Used Count
column 6: Mean Minimal Depth

Decrease Per Tree (column 3) is the most common definition of importance in other implementations and is calculated over all trees, not just the ones the feature was used in.

Each of these scores has different properties:

  • Per-use and per-tree-used scores may be more resistant to feature redundancy.
  • Per-tree-used and per-tree scores may better pick out complex effects.
  • Mean Minimal Depth has been proposed (see "Random Survival Forests") as an alternative importance.

To provide a baseline for evaluating importance, artificial contrast features can be used by including shuffled copies of existing features (-nContrasts, -contrastAll).

A feature that performs well when randomized (or when the target has been randomized) may be causing over-fitting.

The option to permute the target (-permute) will establish a minimum random baseline. Using a regular expression (-shuffleRE) to shuffle part of the data can be useful in teasing out the contributions of different subsets of features.

Importance with P-Values Via Artificial Contrasts/ACE

P-values can be established for importance scores by comparing the importance score for each feature to that of a shuffled copy of itself (an artificial contrast) over a number of runs. This algorithm is described in Tuv's "Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination."

Feature selection based on these p-values can increase the model's resistance to issues including over-fitting from high cardinality features.

In CloudForest these p-values are produced with a Welch's t-test and the null hypothesis that the mean importance of a feature's contrasts is greater than that of the feature itself over all of the forests. To use this method, specify the number of forests/repeats to perform using the "-ace" option and provide a file name for importance scores via the -importance option. Importance scores will be the mean decrease per tree over all of the forests.

growforest -train housing.arff -target class -ace 10 -importance bostanimp.tsv

The output will be a tsv with the following columns:

column 0: target
column 1: predictor
column 2: p-value
column 3: mean importance

This method is often combined with the -evaloob method described below.

growforest -train housing.arff -target class -ace 10 -importance bostanimp.tsv -evaloob

Improved Feature Selection

Genomic data frequently has many noisy, high cardinality, uninformative features which can lead to in-bag over-fitting. To combat this, CloudForest implements some methods designed to better filter out uninformative features.

The -evaloob method evaluates potential best splitting features on the oob data after learning the split value for each splitter from the in bag/branch data as normal. Importance scores are also calculated using OOB cases. This idea is discussed in Eugene Tuv, Alexander Borisov, George Runger and Kari Torkkola's paper "Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination."

The -vet option penalizes the impurity decrease of each potential best split by subtracting the best split that can be made after the target values of the cases being evaluated have been shuffled.

In testing so far, evaloob provides better performance and is less computationally intensive. These options can be used together, which may provide the best performance in very noisy data; when used together, vetting is also done on the out of bag cases.

Data With Unbalanced Classes

Genomic studies also frequently have unbalanced target classes; i.e., you might be interested in a rare disease but have samples drawn from the general population. CloudForest implements three methods for dealing with such studies: roughly balanced bagging (-balance), cost weighted classification (-costs) and weighted gini impurity driven classification (-rfweights). See the references below for a discussion of these options.

Missing Values

By default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature with missing data, the missing cases are removed and the impurity value is corrected to use three way impurity, which reduces the bias towards features with lots of missing data:

            I(split) = p(l)I(l)+p(r)I(r)+p(m)I(m)
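
For example, a candidate split that sends 50 of 100 cases left with impurity 0.2, 40 right with impurity 0.3, and has 10 missing with impurity 0.5 scores I(split) = 0.5*0.2 + 0.4*0.3 + 0.1*0.5 = 0.27.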

Missing values in the target variable are left out of impurity calculations.

This provides generally good results at a fraction of the computational cost of imputing data.

Optionally, -impute can be used before forest growth to impute missing values to the feature mean/mode, which Breiman suggests as a fast method for imputing values.

This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Breiman describes.

Experimental support (-splitmissing) is provided for 3 way splitting which splits missing cases onto a third branch. This has so far yielded mixed results in testing.

Data File Formats

Data files in CloudForest are assumed to be in our annotated feature matrix tsv based format unless a .libsvm or .arff file extension is used.

Annotated Feature Matrix Tsv Files

CloudForest borrows the annotated feature matrix (.afm) and stochastic forest (.sf) file formats from Timo Erkkila's rf-ace which can be found at https://code.google.com/p/rf-ace/

An annotated feature matrix (.afm) file is a tab delimited file with column and row headers. By default columns represent cases and rows represent features/variables, though the transpose (rows as cases/observations) is also detected and supported.

A row header / feature id includes a prefix to specify the feature type. These prefixes are also used to detect column vs row orientation.

"N:" Prefix for numerical feature id.
"C:" Prefix for categorical feature id.
"B:" Prefix for boolean feature id.

Categorical and boolean features use strings for their category labels. Missing values are represented by "?","nan","na", or "null" (case insensitive). A short example:

featureid	case1	case2	case3
N:NumF1	0.0	.1	na
C:CatF2 red	red	green

Some sample feature matrix data files are included in the "data" directory.
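
When using CloudForest as a library, the same .afm files can be loaded into a FeatureMatrix. A minimal sketch; the LoadAFM helper and the Data field on FeatureMatrix are treated here as assumptions about the package's API:

package main

import (
	"fmt"

	"github.com/ryanbressler/CloudForest"
)

func main() {
	//load an annotated feature matrix from disk (LoadAFM assumed)
	fm, err := CloudForest.LoadAFM("train.fm")
	if err != nil {
		panic(err)
	}
	//fm.Data holds one Feature per row of the matrix (field name assumed)
	fmt.Println(len(fm.Data), "features loaded")
}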

ARFF Data Files

CloudForest also supports limited import of Weka's ARFF format. This format will be detected via the ".arff" file extension. Only numeric and nominal/categorical attributes are supported; all other attribute types will be assumed to be categorical and should usually be removed or blacklisted. There is no support for spaces in feature names, quoted strings or sparse data. Trailing space or comments after the data field may cause odd behavior.

The ARFF format also provides an easy way to annotate a csv file with information about the supplied fields:

@relation data

@attribute NumF1 numeric
@attribute CatF2 {red,green}

@data
0.0,red
.1,red
?,green

LibSvm/SVM Light Data Files

There is also basic support for sparse numerical data in libsvm's file format. This format will be detected by the ".libsvm" file extension and has some limitations. A simple libsvm file might look like:

24.0 1:0.00632 2:18.00 3:2.310 4:0
21.6 1:0.02731 2:0.00 3:7.070 7:78.90
34.7 1:0.02729 2:0.00 5:0.4690

The target field will be given the designation "0" and be in the "0" position of the matrix and you will need to use "-target 0" as an option with growforest. No other feature can have this designation.

The categorical or numerical nature of the target variable will be detected from the value on the first line. If it is an integer value like 0, 1 or 1200, the target will be parsed as categorical and classification performed. If it is a floating point value including a decimal place, like 1.0 or 1.7, the target will be parsed as numerical and regression performed. There is currently no way to override this behavior.

Models - Stochastic Forest Files

A stochastic forest (.sf) file contains a forest of decision trees. The main advantage of this format as opposed to an established format like json is that an sf file can be written iteratively tree by tree and multiple .sf files can be combined with minimal logic required allowing for massively parallel growth of forests with low memory use.

An .sf file consists of lines each of which is a comma separated list of key value pairs. Lines can designate either a FOREST, TREE, or NODE. Each tree belongs to the preceding forest and each node to the preceding tree. Nodes must be written in order of increasing depth.

CloudForest generates fewer fields than rf-ace but requires the following; other fields will be ignored.

Forest requires forest type (only RF currently), target and ntrees:

FOREST=RF|GBT|..,TARGET="$feature_id",NTREES=int

Tree requires only an int; the value is ignored, though the line is needed to designate a new tree:

TREE=int

Node requires a path encoded so that the root node is specified by "*" and each split left or right as "L" or "R". Leaf nodes should also define PRED such as "PRED=1.5" or "PRED=red". Splitter nodes should define SPLITTER with a feature id inside of double quotes, SPLITTERTYPE=[CATEGORICAL|NUMERICAL] and an LVALUES term which can be either a float inside of double quotes representing the highest value sent left or a ":" separated list of categorical values sent left.

NODE=$path,PRED=[float|string],SPLITTER="$feature_id",SPLITTERTYPE=[CATEGORICAL|NUMERICAL],LVALUES="[float|: separated list]"

An example .sf file:

FOREST=RF,TARGET="N:CLIN:TermCategory:NB::::",NTREES=12800
TREE=0
NODE=*,PRED=3.48283,SPLITTER="B:SURV:Family_Thyroid:F::::maternal",SPLITTERTYPE=CATEGORICAL,LVALUES="false"
NODE=*L,PRED=3.75
NODE=*R,PRED=1

CloudForest can parse and apply .sf files generated by at least some versions of rf-ace.

Compiling for Speed

When compiled with go1.1, CloudForest achieves running times similar to implementations in other languages. Using gccgo (4.8.0 at least) results in longer running times and is not recommended. This may change as gccgo adopts the go 1.1 way of implementing closures.

References

The idea for (and trademark of the term) Random Forests originated with Leo Breiman and Adele Cutler. Their code and papers can be found at:

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

All code in CloudForest is original, but some ideas for methods and optimizations were inspired by Timo Erkkila's rf-ace and Andy Liaw and Matthew Wiener's randomForest R package based on Breiman and Cutler's code:

https://code.google.com/p/rf-ace/ http://cran.r-project.org/web/packages/randomForest/index.html

The idea for Artificial Contrasts is based on: Eugene Tuv and Kari Torkkola's "Feature Filtering with Ensembles Using Artificial Contrasts" http://enpub.fulton.asu.edu/workshop/FSDM05-Proceedings.pdf#page=74 and Eugene Tuv, Alexander Borisov, George Runger and Kari Torkkola's "Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination" http://www.researchgate.net/publication/220320233_Feature_Selection_with_Ensembles_Artificial_Variables_and_Redundancy_Elimination/file/d912f5058a153a8b35.pdf

The idea for growing trees to minimize categorical entropy comes from Ross Quinlan's ID3: http://en.wikipedia.org/wiki/ID3_algorithm

"The Elements of Statistical Learning" 2nd edition by Trevor Hastie, Robert Tibshirani and Jerome Friedman was also consulted during development.

Methods for classification from unbalanced data are covered in several papers: http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/ http://www.biomedcentral.com/1471-2105/11/523 http://bib.oxfordjournals.org/content/early/2012/03/08/bib.bbs006 http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067863

Density Estimating Trees/Forests are discussed in: http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p627.pdf http://research.microsoft.com/pubs/158806/CriminisiForests_FoundTrends_2011.pdf The latter also introduces the idea of manifold forests, which can be learned using downstream analysis of the outputs of leafcount to find the Fiedler vectors of the graph Laplacian.

Documentation

Overview

Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include:

  • Breiman and Cutler's Random Forest for Classification and Regression

  • Adaptive Boosting (AdaBoost) Classification

  • Gradient Boosting Tree Regression

  • Entropy and Cost driven classification

  • L1 regression

  • Feature selection with artificial contrasts

  • Proximity and model structure analysis

  • Roughly balanced bagging for unbalanced classification

The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released.

Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest

Documentation of command line utilities and file formats can be found in README.md, which can be viewed formatted on github: http://github.com/ryanbressler/CloudForest

Pull requests and bug reports are welcome.

CloudForest was created by Ryan Bressler and is being developed in the Shmulevich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute.

Goals

CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code.

Data structures and file formats are chosen with use in multi threaded and cluster environments in mind.

Working with Trees

Go's support for function types is used to provide an interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method:

t.Root.Recurse(func(n *Node, innercases []int) {

	if (2 * leafSize) <= len(innercases) {
		SampleFirstN(&candidates, mTry)
		best, impDec := fm.BestSplitter(target, innercases, candidates[:mTry], false, allocs)
		if best != nil && impDec > minImp {
			//not a leaf node so define the splitter and left and right nodes
			//so recursion will continue
			n.Splitter = best
			n.Pred = ""
			n.Left = new(Node)
			n.Right = new(Node)

			return
		}
	}

	//otherwise this node is a leaf: no splitter, just store the prediction
	n.Pred = target.FindPredicted(innercases)
})

This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects.

Stackable Interfaces

Decision trees are grown with the goal of reducing "Impurity", which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface, which allows for alternative definitions of impurity. CloudForest includes several alternative targets:

EntropyTarget : For use in entropy minimizing classification
RegretTarget  : For use in classification driven by differing costs in mis-categorization.
L1Target      : For use in L1 norm error regression (which may be less sensitive to outliers).
OrdinalTarget : For ordinal regression

Additional targets can be stacked on top of these targets to add boosting functionality (see the sketch after this list):

GradBoostTarget : For Gradient Boosting Regression
AdaBoostTarget  : For Adaptive Boosting Classification
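
For example, a categorical feature can be wrapped for entropy driven classification and adaptive boosting can then be stacked on top of it. A minimal sketch; pulling the feature out of a FeatureMatrix via fm.Map and fm.Data and asserting it to CatFeature assumes internals not documented in this extract:

//look up the target column in a FeatureMatrix fm (field names assumed)
f := fm.Data[fm.Map["C:FeatureName"]].(CatFeature)

//wrap it to minimize disutility entropy instead of gini impurity
entropy := NewDEntropyTarget(f)

//stack adaptive boosting on top; DEntropyTarget embeds CatFeature,
//so it satisfies the CatFeature interface NewAdaBoostTarget expects
target := NewAdaBoostTarget(entropy) //target can now drive tree growth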

Efficient Splitting

Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks.

Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like:

func(s *Splitter) SplitInPlace(fm *FeatureMatrix, cases []int) (l []int, r []int)

can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed.

Functions used while searching for the best split also accept pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like:

func (fm *FeatureMatrix) BestSplitter(target Target,
	cases []int,
	candidates []int,
	extraRandom bool,
	allocs *BestSplitAllocs) (s *Splitter, impurityDecrease float64)

func (f *Feature) BestSplit(target Target,
	cases *[]int,
	parentImp float64,
	randomSplit bool,
	allocs *BestSplitAllocs) (bestNum float64, bestCat int, bestBigCat *big.Int, impurityDecrease float64)

For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig.

All numerical predictors are handled by BestNumSplit which relies on go's sorting package.

Parallelism and Scaling

Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk.

Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as they are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean.
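
A minimal, self-contained sketch of that worker pattern follows; tree and grow are stand-ins for CloudForest's Tree and tree growing code, not the package's actual API:

package main

import (
	"fmt"
	"sync"
)

type tree struct{ id int } //stand-in for CloudForest.Tree

func grow(id int) *tree { return &tree{id: id} } //stand-in for real tree growth

func main() {
	nTrees, nCores := 8, 4
	jobs := make(chan int)
	grown := make(chan *tree)

	//workers grow trees in parallel
	var wg sync.WaitGroup
	for w := 0; w < nCores; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				grown <- grow(id)
			}
		}()
	}

	//feed jobs and close the output channel when all workers finish
	go func() {
		for i := 0; i < nTrees; i++ {
			jobs <- i
		}
		close(jobs)
		wg.Wait()
		close(grown)
	}()

	//a single collector handles trees as they arrive; growforest
	//would write each one to the .sf file here
	for t := range grown {
		fmt.Println("finished tree", t.id)
	}
}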

Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk.

For data sets that are too big to fit in memory on a single machine, Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, a distributed database, etc.

Missing Values

By default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature with missing data, the missing cases are removed and the impurity value is corrected to use three way impurity, which reduces the bias towards features with lots of missing data:

I(split) = p(l)I(l)+p(r)I(r)+p(m)I(m)

Missing values in the target variable are left out of impurity calculations.

This provides generally good results at a fraction of the computational cost of imputing data.

Optionally, feature.ImputeMissing or featurematrix.ImputeMissing can be called before forest growth to impute missing values to the feature mean/mode, which Breiman [2] suggests as a fast method for imputing values.

This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Breiman describes.

Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing.

At some point in the future support may be added for local imputing of missing values during tree growth as described in [3].

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

[2] https://code.google.com/p/rf-ace/

[3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record

Main Structures

In CloudForest data is stored using the FeatureMatrix struct which contains Features.

The Feature struct implements storage and methods for both categorical and numerical data, calculations of impurity, etc., and the search for the best split.

The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification.

Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow method that implements Breiman and Cutler's method (see the extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method, including sampling cases, but it may be faster to grow the forest to disk as in the growforest utility.

Prediction and voting are done using Tree.Vote and CatBallotBox and NumBallotBox, which implement the VoteTallyer interface.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func BuildScikitTree

func BuildScikitTree(depth int, n *Node, sktree *ScikitTree)

BuildScikitTree currently only builds the split threshold and node structure of a scikit tree from a CloudForest tree specified by its root node.

func Expit

func Expit(x float64) (out float64)

func FriedmanScore

func FriedmanScore(allocs *BestSplitAllocs, l, r *[]int) (impurityDecrease float64)

func Logit

func Logit(x float64) float64
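
Expit is the inverse of Logit; the standard definitions (Logit(p) = ln(p/(1-p)), Expit(x) = 1/(1+e^-x)) are assumed here since the doc comments are empty:

x := CloudForest.Logit(0.8) //log odds of a probability of 0.8
p := CloudForest.Expit(x)   //recovers approximately 0.8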

func NewRunningMeans

func NewRunningMeans(size int) *[]*RunningMean

NewRunningMeans returns an initialized *[]*RunningMean.

func ParseAsIntOrFractionOfTotal

func ParseAsIntOrFractionOfTotal(term string, total int) (parsed int)

ParseAsIntOrFractionOfTotal parses strings that may specify a count or a fraction of the total for use in specifying parameters. It parses term as a float if it contains a "." and as an int otherwise. If term is parsed as a float frac, it returns int(math.Ceil(frac * float64(total))). It returns zero if term == "" or if a parsing error occurs.
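
For example, with the package imported as CloudForest:

CloudForest.ParseAsIntOrFractionOfTotal("10", 200) //no ".": parsed as the count 10
CloudForest.ParseAsIntOrFractionOfTotal(".5", 200) //has ".": Ceil(0.5*200) = 100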

func ParseFloat

func ParseFloat(s string) float64

func SampleFirstN

func SampleFirstN(deck *[]int, samples *[]int, n int, nconstants int)

SampleFirstN ensures that the first n entries in the supplied deck are randomly drawn from all entries without replacement for use in selecting candidate features to split on. It accepts a pointer to the deck so that it can be used repeatedly on the same deck avoiding reallocations.

func SampleWithReplacment

func SampleWithReplacment(nSamples int, totalCases int) (cases []int)

SampleWithReplacment samples nSamples random draws from [0,totalCases) with replacement for use in selecting cases to grow a tree from.
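
For example, drawing a standard bootstrap sample for one tree (sampling as many cases as exist in the training data):

//indexes of the in bag cases for one tree, drawn from [0,nCases)
inBag := CloudForest.SampleWithReplacment(nCases, nCases)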

func WriteArffCases

func WriteArffCases(data *FeatureMatrix, cases []int, relation string, outfile io.Writer) error

WriteArffCases writes the specified cases from the provided feature matrix into an arff file with the given relation string.

func WriteLibSvm

func WriteLibSvm(data *FeatureMatrix, targetn string, outfile io.Writer) error

func WriteLibSvmCases

func WriteLibSvmCases(data *FeatureMatrix, cases []int, targetn string, outfile io.Writer) error

Types

type AdaBoostTarget

type AdaBoostTarget struct {
	CatFeature
	Weights []float64
}

AdaBoostTarget wraps a categorical feature as a target for use in Adaptive Boosting (AdaBoost).

func NewAdaBoostTarget

func NewAdaBoostTarget(f CatFeature) (abt *AdaBoostTarget)

NewAdaBoostTarget creates a categorical adaptive boosting target and initializes its weights.

func (*AdaBoostTarget) Boost

func (t *AdaBoostTarget) Boost(leaves *[][]int) (weight float64)

Boost performs categorical adaptive boosting using the specified partition and returns the weight that the tree that generated the partition should be given.

func (*AdaBoostTarget) ImpFromCounts

func (target *AdaBoostTarget) ImpFromCounts(cases *[]int, counter *[]int) (e float64)

ImpFromCounts recalculates gini impurity from class counts for use in iterative updates.

func (*AdaBoostTarget) Impurity

func (target *AdaBoostTarget) Impurity(cases *[]int, counter *[]int) (e float64)

Impurity is an AdaBoosting impurity that uses the weights specified in Weights.

func (*AdaBoostTarget) SplitImpurity

func (target *AdaBoostTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

SplitImpurity is an AdaBoosting version of SplitImpurity.

func (*AdaBoostTarget) UpdateSImpFromAllocs

func (target *AdaBoostTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, as in learning from numerical variables. Here it just wraps SplitImpurity, but it can be implemented to provide further optimization.

type AdaCostTarget

type AdaCostTarget struct {
	CatFeature
	Weights []float64
	Costs   []float64
}

AdaCostTarget wraps a categorical feature as a target for use in Cost Sensitive Adaptive Boosting (AdaC2.M1)

"Boosting for Learning Multiple Classes with Imbalanced Class Distribution" Yanmin Sun, Mohamed S. Kamel and Yang Wang

See equations in slides here: http://people.ee.duke.edu/~lcarin/Minhua4.18.08.pdf

func NewAdaCostTarget

func NewAdaCostTarget(f CatFeature) (abt *AdaCostTarget)

NewAdaCostTarget creates a categorical adaptive boosting target and initializes its weights.

func (*AdaCostTarget) Boost

func (t *AdaCostTarget) Boost(leaves *[][]int) (weight float64)

Boost performs categorical adaptive boosting using the specified partition and returns the weight that the tree that generated the partition should be given.

func (*AdaCostTarget) ImpFromCounts

func (target *AdaCostTarget) ImpFromCounts(cases *[]int, counter *[]int) (e float64)

ImpFromCounts recalculates gini impurity from class counts for use in iterative updates.

func (*AdaCostTarget) Impurity

func (target *AdaCostTarget) Impurity(cases *[]int, counter *[]int) (e float64)

Impurity is an AdaCosting impurity that uses the weights specified in Weights.

func (*AdaCostTarget) SetCosts

func (target *AdaCostTarget) SetCosts(costmap map[string]float64)

SetCosts puts costs from a map[string]float64 keyed by class name into the proper entries in AdaCostTarget.Costs.

func (*AdaCostTarget) SplitImpurity

func (target *AdaCostTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

SplitImpurity is an AdaCosting version of SplitImpurity.

func (*AdaCostTarget) UpdateSImpFromAllocs

func (target *AdaCostTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, as in learning from numerical variables. Here it just wraps SplitImpurity, but it can be implemented to provide further optimization.

type Bagger

type Bagger interface {
	Sample(samples *[]int, n int)
}

type BalancedSampler

type BalancedSampler struct {
	Cases [][]int
}

BalancedSampler provides for random sampling of integers (usually case indexes) in a way that ensures a balanced presence of classes.

func NewBalancedSampler

func NewBalancedSampler(catf *DenseCatFeature) (s *BalancedSampler)

NewBalancedSampler initializes a balanced sampler that will evenly balance cases between the classes present in the provided DenseCatFeature.

func (*BalancedSampler) Sample

func (s *BalancedSampler) Sample(samples *[]int, n int)

Sample samples n integers in a balanced-with-replacement fashion into the provided array.

type BestSplitAllocs

type BestSplitAllocs struct {
	L              []int //Allocated to size
	R              []int
	M              []int
	LM             []int //Used to point at other array
	RM             []int
	MM             []int
	Left           *[]int  //left cases for potential splits
	Right          *[]int  //right cases for potential splits
	NonMissing     *[]int  //non missing cases for potential splits
	Counter        *[]int  //class counter for counting classes in splits, used alone or for missing
	LCounter       *[]int  //left class counter for summarizing (mean) splits
	RCounter       *[]int  //right class counter for summarizing (mean) splits
	Lsum           float64 //left sum for summarizing splits
	Rsum           float64 //right sum for summarizing splits
	Msum           float64 //missing sum for summarizing splits
	Lsum_sqr       float64 //left sum of squares for summarizing splits
	Rsum_sqr       float64 //right sum of squares for summarizing splits
	Msum_sqr       float64 //missing sum of squares for summarizing splits
	CatVals        []int
	SortVals       []float64
	Sorter         *SortableFeature //for learning from numerical features
	ContrastTarget Target
	Rnd            *rand.Rand //prevent contention on global rand source
}

BestSplitAllocs contains reusable allocations for split searching and evaluation. Separate instances should be used in each goroutine doing learning.

func NewBestSplitAllocs

func NewBestSplitAllocs(nTotalCases int, target Target) (bsa *BestSplitAllocs)

NewBestSplitAllocs initializes all of the reusable allocations for split searching to the appropriate size. nTotalCases should be the total number of cases in the feature matrix being analyzed.
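
Because the contained allocations are reused scratch space, each learning goroutine needs its own instance; a sketch, assuming nCases and a Target target are already in scope:

//one BestSplitAllocs per goroutine; never share an instance between goroutines
allocs := CloudForest.NewBestSplitAllocs(nCases, target)
//pass allocs only to BestSplitter/BestSplit calls made from this goroutine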

type BoostingTarget

type BoostingTarget interface {
	Target
	Boost(partition *[][]int, preds *[]string) (weight float64)
}

BoostingTarget augments Target with a "Boost" method that will be called after each tree is grown with the partition generated by that tree. It will return the weight the tree should be given and boost the target for the next tree.

type CatBallot

type CatBallot struct {
	Mutex sync.Mutex
	Map   map[int]float64
}

CatBallot is used inside of CatBallotBox to record categorical votes in a thread safe manner.

func NewCatBallot

func NewCatBallot() (cb *CatBallot)

NewCatBallot returns a pointer to an initialized CatBallot with a 0 size Map.

type CatBallotBox

type CatBallotBox struct {
	*CatMap
	Box []*CatBallot
}

CatBallotBox keeps track of votes by trees in a thread safe manner.

func NewCatBallotBox

func NewCatBallotBox(size int) *CatBallotBox

NewCatBallotBox builds a new ballot box for the number of cases specified by "size".

func (*CatBallotBox) Tally

func (bb *CatBallotBox) Tally(i int) (predicted string)

Tally tallies the votes for the case specified by i as if it is a categorical or boolean feature, i.e. it returns the mode (the most frequent value) of all votes.

func (*CatBallotBox) TallyError

func (bb *CatBallotBox) TallyError(feature Feature) (e float64)

TallyError returns the balanced classification error for categorical features.

e = 1 - Σi ( |{x in Xi : Y(x) = Y'(x)}| / |Xi| )

where Y are the true labels, Y' are the estimated labels, and Xi is the set of samples with the ith actual label.

Cases for which the true category is not known are ignored.

func (*CatBallotBox) Vote

func (bb *CatBallotBox) Vote(casei int, pred string, weight float64)

Vote registers a vote that case "casei" should be predicted to be the category "pred".
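
A short sketch of voting and tallying with the documented methods:

bb := CloudForest.NewCatBallotBox(1) //ballots for one case

//three trees vote on case 0 with equal weight
bb.Vote(0, "red", 1.0)
bb.Vote(0, "green", 1.0)
bb.Vote(0, "red", 1.0)

fmt.Println(bb.Tally(0)) //"red", the mode of the votes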

type CatFeature

type CatFeature interface {
	Feature
	CountPerCat(cases *[]int, counter *[]int)
	MoveCountsRtoL(allocs *BestSplitAllocs, movedRtoL *[]int)
	CatToNum(value string) (numericv int)
	NumToCat(i int) (value string)
	Geti(i int) int
	Puti(i int, v int)
	Modei(cases *[]int) int
	Mode(cases *[]int) string
	Gini(cases *[]int) float64
	GiniWithoutAlocate(cases *[]int, counts *[]int) (e float64)
	EncodeToNum() (fs []Feature)
	OneHot() (fs []Feature)
}

CatFeature contains the methods of Feature plus methods needed to implement different types of classification. It is usually embedded by classification targets to provide access to the underlying data.

type CatMap

type CatMap struct {
	Map  map[string]int //map categories from string to Num
	Back []string       // map categories from Num to string
}

CatMap is for mapping categorical values to integers. It contains:

Map  : a map of ints by the string used for the category
Back : a slice of strings by the int that represents them

And is embedded by Feature and CatBallotBox.

func (*CatMap) CatToNum

func (cm *CatMap) CatToNum(value string) (numericv int)

CatToNum provides the int equivalent of the provided categorical value if it already exists or adds it to the map and returns the new value if it doesn't.

func (*CatMap) NCats

func (cm *CatMap) NCats() (n int)

NCats returns the number of distinct categories.

func (*CatMap) NumToCat

func (cm *CatMap) NumToCat(i int) (value string)

NumToCat returns the category label that has been assigned to i.
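
A sketch of the round trip; building a CatMap by hand (and the sequential numbering) is an assumption, as CatMaps are normally populated while parsing data:

cm := &CloudForest.CatMap{Map: map[string]int{}}
cm.CatToNum("red")   //new category, assumed assigned 0
cm.CatToNum("green") //new category, assumed assigned 1
cm.CatToNum("red")   //already known, returns the same int
fmt.Println(cm.NumToCat(1), cm.NCats()) //"green" 2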

type CodedRecursable

type CodedRecursable func(*Node, *[]int, int, int) (int, interface{}, int)

type DEntropyTarget

type DEntropyTarget struct {
	CatFeature
	Costs []float64
}

DEntropyTarget wraps a categorical feature for use in entropy driven classification as in Ross Quinlan's ID3 (Iterative Dichotomizer 3), with the entropy modified to use "disutility entropy":

I = - k Sum ri * pi * log(pi)

func NewDEntropyTarget

func NewDEntropyTarget(f CatFeature) *DEntropyTarget

NewDEntropyTarget creates a DEntropyTarget and initializes DEntropyTarget.Costs to the proper length.

func (*DEntropyTarget) FindPredicted

func (target *DEntropyTarget) FindPredicted(cases []int) (pred string)

func (*DEntropyTarget) ImpFromCounts

func (target *DEntropyTarget) ImpFromCounts(total int, counts *[]int) (e float64)

func (*DEntropyTarget) Impurity

func (target *DEntropyTarget) Impurity(cases *[]int, counts *[]int) (e float64)

DEntropyTarget.Impurity implements categorical entropy as -sum(pj*log2(pj)) where pj is the number of cases with the j'th category over the total number of cases.

func (*DEntropyTarget) SetCosts

func (target *DEntropyTarget) SetCosts(costmap map[string]float64)

DEntropyTarget.SetCosts puts costs from a map[string]float64 keyed by class name into the proper entries in DEntropyTarget.Costs.

func (*DEntropyTarget) SplitImpurity

func (target *DEntropyTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

DEntropyTarget.SplitImpurity is a version of SplitImpurity that calls DEntropyTarget.Impurity.

func (*DEntropyTarget) UpdateSImpFromAllocs

func (target *DEntropyTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, as in learning from numerical variables. Here it just wraps SplitImpurity, but it can be implemented to provide further optimization.

type DenseCatFeature

type DenseCatFeature struct {
	*CatMap
	CatData      []int
	Missing      []bool
	Name         string
	RandomSearch bool
	HasMissing   bool
}

DenseCatFeature is a structure representing a single categorical feature in a feature matrix. It contains an embedded CatMap (only instantiated for categorical data) and:

CatData      : A slice of ints encoding the categorical data
Missing      : A slice of bools indicating missing values. Measure this for length.
Name         : the name of the feature
RandomSearch : whether to use random (vs exhaustive) searches for split points
HasMissing   : whether the feature contains any missing values

func (*DenseCatFeature) Append

func (f *DenseCatFeature) Append(v string)

Append will parse and append a single value to the end of the feature. It is generally only used during data parsing.

func (*DenseCatFeature) BestBinSplit

func (f *DenseCatFeature) BestBinSplit(target Target,
	cases *[]int,
	parentImp float64,
	maxEx int,
	leafSize int,
	a *BestSplitAllocs) (bestSplit int, impurityDecrease float64, constant bool)

BestBinSplit performs an exhaustive search for the split that minimizes impurity in the specified target for categorical features with 2 categories. It expects to be provided with cases for which the feature is not missing.

This implementation follows Breiman's implementation; the R/Matlab implementations based on it use exhaustive search when there are fewer than 25/10 categories and random splits above that.

Searching is implemented via bitwise operations vs an incrementing or random int (32 bit) for speed but will currently only work when there are fewer than 31 categories. Use one of the Big functions above that.

The best split is returned as an int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*DenseCatFeature) BestCatSplit

func (f *DenseCatFeature) BestCatSplit(target Target,
	cases *[]int,
	parentImp float64,
	maxEx int,
	leafSize int,
	allocs *BestSplitAllocs) (bestSplit int, impurityDecrease float64, constant bool)

BestCatSplit performs an exhaustive search for the split that minimizes impurity in the specified target for categorical features with fewer than 31 categories. It expects to be provided with cases for which the feature is not missing.

This implementation follows Breiman's implementation; the R/Matlab implementations based on it use exhaustive search when there are fewer than 25/10 categories and random splits above that.

Searching is implemented via bitwise operations vs an incrementing or random int (32 bit) for speed but will currently only work when there are fewer than 31 categories. Use one of the Big functions above that.

The best split is returned as an int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*DenseCatFeature) BestCatSplitBig

func (f *DenseCatFeature) BestCatSplitBig(target Target, cases *[]int, parentImp float64, maxEx int, leafSize int, allocs *BestSplitAllocs) (bestSplit *big.Int, impurityDecrease float64, constant bool)

BestCatSplitBig performs a random/exhaustive search to find the split that minimizes impurity in the specified target. It expects to be provided with cases for which the feature is not missing.

Searching is implemented via bitwise operations on big.Ints to handle features with large numbers of categories; BestCatSplit should be used for n < 31.

The best split is returned as a big.Int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*DenseCatFeature) BestCatSplitIter

func (f *DenseCatFeature) BestCatSplitIter(target Target, cases *[]int, parentImp float64, leafSize int, allocs *BestSplitAllocs) (bestSplit int, impurityDecrease float64, constant bool)

BestCatSplitIter performs an iterative search to find the split that minimizes impurity in the specified target. It expects to be provided with cases for which the feature is not missing.

Searching is implemented via bitwise ops on ints (32 bit) for speed but will currently only work when there are fewer than 31 categories. Use BestCatSplitIterBig above that.

The best split is returned as an int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*DenseCatFeature) BestCatSplitIterBig

func (f *DenseCatFeature) BestCatSplitIterBig(target Target, cases *[]int, parentImp float64, leafSize int, allocs *BestSplitAllocs) (bestSplit *big.Int, impurityDecrease float64, constant bool)

BestCatSplitIterBig performs an iterative search to find the split that minimizes impurity in the specified target. It expects to be provided with cases for which the feature is not missing.

Searching is implemented via bitwise operations on big.Ints so it can handle features with more categories than the number of bits in an int.

The best split is returned as a big.Int in which the bits corresponding to categories that should be sent left have been flipped. This can be decoded into a splitter using DecodeSplit on the training feature and should not be applied to testing data without doing so, as the order of categories may have changed.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*DenseCatFeature) BestSplit

func (f *DenseCatFeature) BestSplit(target Target,
	cases *[]int,
	parentImp float64,
	leafSize int,
	randomSplit bool,
	allocs *BestSplitAllocs) (codedSplit interface{}, impurityDecrease float64, constant bool)

BestSplit finds the best split of the feature that can be achieved using the specified target and cases. It returns an encoded split (decode it with DecodeSplit) and the decrease in impurity.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*DenseCatFeature) Copy

func (f *DenseCatFeature) Copy() Feature

Copy returns a copy of f.

func (*DenseCatFeature) CopyInTo

func (f *DenseCatFeature) CopyInTo(copyf Feature)

CopyInTo copies the values of the feature into an existing feature; it doesn't copy CatMap, name etc.

func (*DenseCatFeature) CountPerCat

func (target *DenseCatFeature) CountPerCat(cases *[]int, counts *[]int)

CountPerCat puts per category counts in the supplied counter. It is designed for use in a target and doesn't check for missing values.

func (*DenseCatFeature) DecodeSplit

func (f *DenseCatFeature) DecodeSplit(codedSplit interface{}) (s *Splitter)

DecodeSplit builds a splitter from the numeric values returned by BestNumSplit or BestCatSplit. Numeric splitters are decoded to send values <= num left. Categorical splitters are decoded to send categorical values for which the bit in cat is 1 left.
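
A sketch of searching and then decoding a split on a DenseCatFeature f, assuming target, cases, parentImp, allocs and a Node n are prepared as described above:

codedSplit, impDec, constant := f.BestSplit(target, &cases, parentImp, 1, false, allocs)
if !constant && impDec > 0 {
	n.Splitter = f.DecodeSplit(codedSplit) //attach the decoded Splitter to the Node
}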

func (*DenseCatFeature) EncodeToNum

func (f *DenseCatFeature) EncodeToNum() (fs []Feature)

EncodeToNum returns numerical features doing simple binary encoding of each of the distinct categories in the feature.

func (*DenseCatFeature) FilterMissing

func (f *DenseCatFeature) FilterMissing(cases *[]int, filtered *[]int)

FilterMissing loops over the cases and appends the non missing ones into filtered. For most use cases filtered should have zero length before you begin, as it is not reset internally.

func (*DenseCatFeature) FindPredicted

func (f *DenseCatFeature) FindPredicted(cases []int) (pred string)

FindPredicted takes the indexes of a set of cases and returns the predicted value. For categorical features this is a string containing the most common category and for numerical it is the mean of the values.

func (*DenseCatFeature) GetName

func (f *DenseCatFeature) GetName() string

GetName returns the name of the feature.

func (*DenseCatFeature) GetStr

func (f *DenseCatFeature) GetStr(i int) string

GetStr returns the class label for the i'th case.

func (*DenseCatFeature) Geti

func (f *DenseCatFeature) Geti(i int) int

Geti returns the int encoding of the class label for the i'th case.

func (*DenseCatFeature) Gini

func (target *DenseCatFeature) Gini(cases *[]int) (e float64)

Gini returns the Gini impurity for the specified cases in the feature. Gini impurity is calculated as 1 - Sum(fi^2) where fi is the fraction of cases in the i'th category.

func (*DenseCatFeature) GiniWithoutAlocate

func (target *DenseCatFeature) GiniWithoutAlocate(cases *[]int, counts *[]int) (e float64)

GiniWithoutAlocate calculates Gini impurity using the supplied counter, which must be a slice with length equal to the number of categories. This allows you to reduce allocations; the counter will also contain the per category counts.
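
For illustration, a minimal sketch of the counts-based calculation (not the package's implementation):

// Sketch: Gini impurity 1 - Sum(fi^2) from per category counts.
func giniFromCounts(counts []int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	if total == 0 {
		return 0.0
	}
	e := 1.0
	for _, c := range counts {
		f := float64(c) / float64(total)
		e -= f * f
	}
	return e
}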

func (*DenseCatFeature) GoesLeft

func (f *DenseCatFeature) GoesLeft(i int, splitter *Splitter) bool

GoesLeft tests if the i'th case goes left according to the supplied Splitter.

func (*DenseCatFeature) ImpFromCounts

func (target *DenseCatFeature) ImpFromCounts(total int, counts *[]int) (e float64)

ImpFromCounts recalculates Gini impurity from class counts for use in iterative updates.

func (*DenseCatFeature) Impurity

func (target *DenseCatFeature) Impurity(cases *[]int, counter *[]int) (e float64)

Impurity returns Gini impurity or mean squared error vs the mean for a set of cases, depending on whether the feature is categorical or numerical.

func (*DenseCatFeature) ImputeMissing

func (f *DenseCatFeature) ImputeMissing()

ImputeMissing imputes the missing values in a feature to the mean or mode of the feature.

func (*DenseCatFeature) IsMissing

func (f *DenseCatFeature) IsMissing(i int) bool

IsMissing returns whether the given case is missing in the feature.

func (*DenseCatFeature) Length

func (f *DenseCatFeature) Length() int

Length returns the number of cases in the feature.

func (*DenseCatFeature) MissingVals

func (f *DenseCatFeature) MissingVals() bool

MissingVals returns whether the feature has any missing values.

func (*DenseCatFeature) Mode

func (f *DenseCatFeature) Mode(cases *[]int) (m string)

Mode returns the most common category of the feature for the cases specified.

func (*DenseCatFeature) Modei

func (f *DenseCatFeature) Modei(cases *[]int) (m int)

Modei returns the int encoding of the most common category for the cases specified.

func (*DenseCatFeature) MoveCountsRtoL

func (target *DenseCatFeature) MoveCountsRtoL(allocs *BestSplitAllocs, movedRtoL *[]int)

MoveCountsRtoL moves the per case counts from right to left for use in iterative updates.

func (*DenseCatFeature) OneHot

func (f *DenseCatFeature) OneHot() (fs []Feature)

func (*DenseCatFeature) PutMissing

func (f *DenseCatFeature) PutMissing(i int)

PutMissing sets the given case i to be missing.

func (*DenseCatFeature) PutStr

func (f *DenseCatFeature) PutStr(i int, v string)

PutStr sets the i'th case to the class label v.

func (*DenseCatFeature) Puti

func (f *DenseCatFeature) Puti(i int, v int)

Puti puts the class label encoded as v into the i'th case.

func (*DenseCatFeature) Shuffle

func (f *DenseCatFeature) Shuffle()

Shuffle does an in place shuffle of the feature's values.

func (*DenseCatFeature) ShuffleCases

func (f *DenseCatFeature) ShuffleCases(cases *[]int, allocs *BestSplitAllocs)

ShuffleCases does an in place shuffle of the specified cases.

func (*DenseCatFeature) ShuffledCopy

func (f *DenseCatFeature) ShuffledCopy() Feature

ShuffledCopy returns a shuffled version of f for use as an artificial contrast in evaluation of importance scores. The new feature will be named featurename:SHUFFLED

func (*DenseCatFeature) Span

func (target *DenseCatFeature) Span(cases *[]int, counts *[]int) float64

Span counts the number of distinct categories present in the specified cases.

func (*DenseCatFeature) Split

func (f *DenseCatFeature) Split(codedSplit interface{}, cases []int) (l []int, r []int, m []int)

Split does an in place split from a coded split value, which should be an int or big.Int with bit flags representing which class labels go left.

func (*DenseCatFeature) SplitImpurity

func (target *DenseCatFeature) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

SplitImpurity calculates the impurity of a split into the specified left and right groups. This is defined as pL*i(tL)+pR*i(tR) where pL and pR are the probabilities of a case going left or right and i(tL) and i(tR) are the left and right impurities.

Counter is only used for categorical targets and should have the same length as the number of categories in the target.
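
A minimal sketch of the weighted combination (the group impurities are hypothetical inputs here; the package computes them internally):

// Sketch: pL*i(tL) + pR*i(tR) from group sizes and impurities.
func weightedSplitImpurity(nL, nR int, impL, impR float64) float64 {
	n := float64(nL + nR)
	return float64(nL)/n*impL + float64(nR)/n*impR
}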

func (*DenseCatFeature) SplitPoints

func (f *DenseCatFeature) SplitPoints(codedSplit interface{}, cs *[]int) (int, int)

SplitPoints reorders cs and returns the indexes at which the left cases end and the right cases begin. The codedSplit should be an int or big.Int with bits set to indicate which classes go left.

func (*DenseCatFeature) UpdateSImpFromAllocs

func (target *DenseCatFeature) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, as when learning from numerical variables. Here it just wraps SplitImpurity, but it can be reimplemented to provide further optimization.

type DenseNumFeature

type DenseNumFeature struct {
	NumData    []float64
	Missing    []bool
	Name       string
	HasMissing bool
}

DenseNumFeature contains dense float64 training data, possibly with missing values.

func (*DenseNumFeature) Append

func (f *DenseNumFeature) Append(v string)

Append will parse and append a single value to the end of the feature. It is generally only used during data parsing.

func (*DenseNumFeature) BestNumSplit

func (f *DenseNumFeature) BestNumSplit(target Target,
	cases *[]int,
	parentImp float64,
	leafSize int,
	randomSplit bool,
	allocs *BestSplitAllocs) (codedSplit interface{}, impurityDecrease float64, constant bool)

BestNumSplit searches over the possible splits of cases that can be made with f and returns the one that minimizes the impurity of the target and the impurity decrease.

It expects to be provided cases for which the feature is not missing.

It searches by sorting the cases by the potential splitter and then evaluating each "gap" between cases with non equal values as a potential split.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.
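
A rough sketch of the sort-and-scan idea follows; the package's actual search reuses allocations and updates counts iteratively, and impurityAt is a hypothetical stand-in for the target's impurity evaluation:

// Sketch: evaluate each "gap" between adjacent non-equal sorted values.
// Requires: import "sort"
func bestGapSplit(vals []float64, parentImp float64,
	impurityAt func(split float64) float64) (best, bestDec float64) {
	sort.Float64s(vals)
	for i := 1; i < len(vals); i++ {
		if vals[i] == vals[i-1] {
			continue // equal values leave no gap to split in
		}
		split := (vals[i] + vals[i-1]) / 2.0
		if dec := parentImp - impurityAt(split); dec > bestDec {
			best, bestDec = split, dec
		}
	}
	return best, bestDec
}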

func (*DenseNumFeature) BestSplit

func (f *DenseNumFeature) BestSplit(target Target,
	cases *[]int,
	parentImp float64,
	leafSize int,
	randomSplit bool,
	allocs *BestSplitAllocs) (codedSplit interface{}, impurityDecrease float64, constant bool)

BestSplit finds the best split of the feature that can be achieved using the specified target and cases. It returns a coded split and the decrease in impurity.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*DenseNumFeature) Copy

func (f *DenseNumFeature) Copy() Feature

Copy returns a copy of f.

func (*DenseNumFeature) CopyInTo

func (f *DenseNumFeature) CopyInTo(copyf Feature)

CopyInTo copies the values and missing state from one numerical feature into another.

func (*DenseNumFeature) DecodeSplit

func (f *DenseNumFeature) DecodeSplit(codedSplit interface{}) (s *Splitter)

DecodeSplit builds a splitter from the numeric values returned by BestNumSplit or BestCatSplit. Numeric splitters are decoded to send values <= num left. Categorical splitters are decoded to send categorical values for which the bit in cat is 1 left.

func (*DenseNumFeature) Error

func (target *DenseNumFeature) Error(cases *[]int, predicted float64) (e float64)

Error returns the mean squared error of the cases specified vs the predicted value. Only non missing cases are considered.

func (*DenseNumFeature) FilterMissing

func (f *DenseNumFeature) FilterMissing(cases *[]int, filtered *[]int)

FilterMissing loops over the cases and appends the non missing ones into filtered. For most use cases filtered should have zero length before you begin, as it is not reset internally.

func (*DenseNumFeature) FindPredicted

func (f *DenseNumFeature) FindPredicted(cases []int) (pred string)

FindPredicted takes the indexes of a set of cases and returns the predicted value. For categorical features this is a string containing the most common category and for numerical features it is the mean of the values.

func (*DenseNumFeature) Get

func (f *DenseNumFeature) Get(i int) float64

Get returns the value in the i'th position. It doesn't check for missing values.

func (*DenseNumFeature) GetName

func (f *DenseNumFeature) GetName() string

GetName returns the name of the feature.

func (*DenseNumFeature) GetStr

func (f *DenseNumFeature) GetStr(i int) (value string)

GetStr returns the string representing the value in the i'th position. It returns NA if the value is missing.

func (*DenseNumFeature) GoesLeft

func (f *DenseNumFeature) GoesLeft(i int, splitter *Splitter) bool

GoesLeft checks if the i'th case goes left according to the supplied Splitter.

func (*DenseNumFeature) Impurity

func (target *DenseNumFeature) Impurity(cases *[]int, counter *[]int) (e float64)

Impurity returns Gini impurity or mean squared error vs the mean for a set of cases, depending on whether the feature is categorical or numerical.

func (*DenseNumFeature) ImputeMissing

func (f *DenseNumFeature) ImputeMissing()

ImputeMissing imputes the missing values in a feature to the mean or mode of the feature.

func (*DenseNumFeature) IsMissing

func (f *DenseNumFeature) IsMissing(i int) bool

IsMissing checks if the value for the i'th case is missing.

func (*DenseNumFeature) Length

func (f *DenseNumFeature) Length() int

Length returns the length of the feature.

func (*DenseNumFeature) Less

func (f *DenseNumFeature) Less(i int, j int) bool

Less checks if the value of case i is less than the value of case j.

func (*DenseNumFeature) Mean

func (target *DenseNumFeature) Mean(cases *[]int) (m float64)

Mean returns the mean of the feature for the cases specified.

func (*DenseNumFeature) MissingVals

func (f *DenseNumFeature) MissingVals() bool

MissingVals checks if the feature has any missing values.

func (*DenseNumFeature) Mode

func (f *DenseNumFeature) Mode(cases *[]int) (m float64)

Mode returns the most common value of the feature for the cases specified.

func (*DenseNumFeature) NCats

func (f *DenseNumFeature) NCats() int

NCats returns the number of categories, 0 for numerical values.

func (*DenseNumFeature) Norm

func (f *DenseNumFeature) Norm(i int, v float64) float64

Norm defines the norm used to tell how far the i'th case is from the value v.

func (*DenseNumFeature) Predicted

func (f *DenseNumFeature) Predicted(cases *[]int) float64

Predicted returns the prediction (the mean) that should be made for the supplied cases.

func (*DenseNumFeature) Put

func (f *DenseNumFeature) Put(i int, v float64)

Put inserts the value v into the i'th position of the feature.

func (*DenseNumFeature) PutMissing

func (f *DenseNumFeature) PutMissing(i int)

PutMissing sets the i'th value to be missing.

func (*DenseNumFeature) PutStr

func (f *DenseNumFeature) PutStr(i int, v string)

PutStr parses a string and puts the value in the i'th position.

func (*DenseNumFeature) Shuffle

func (f *DenseNumFeature) Shuffle()

Shuffle does an in place shuffle of the feature's values.

func (*DenseNumFeature) ShuffleCases

func (f *DenseNumFeature) ShuffleCases(cases *[]int, allocs *BestSplitAllocs)

ShuffleCases does an in place shuffle of the specified cases.

func (*DenseNumFeature) ShuffledCopy

func (f *DenseNumFeature) ShuffledCopy() Feature

ShuffledCopy returns a shuffled version of f for use as an artificial contrast in evaluation of importance scores. The new feature will be named featurename:SHUFFLED

func (*DenseNumFeature) Span

func (f *DenseNumFeature) Span(cases *[]int, counter *[]int) (span float64)

Span returns the length along the real line spanned by the specified cases.

func (*DenseNumFeature) Split

func (f *DenseNumFeature) Split(codedSplit interface{}, cases []int) (l []int, r []int, m []int)

Split does an in place split from a coded split (a float64) and returns slices pointing into the original cases slice.

func (*DenseNumFeature) SplitImpurity

func (target *DenseNumFeature) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

SplitImpurity calculates the impurity of a split into the specified left and right groups. This is defined as pL*i(tL)+pR*i(tR) where pL and pR are the probabilities of a case going left or right and i(tL) and i(tR) are the left and right impurities.

Counter is only used for categorical targets and should have the same length as the number of categories in the target.

func (*DenseNumFeature) SplitPoints

func (f *DenseNumFeature) SplitPoints(codedSplit interface{}, cs *[]int) (int, int)

SplitPoints returns the last left and first right index after reordering the cases slice from a float64 coded split.

func (*DenseNumFeature) SumAndSumSquares

func (target *DenseNumFeature) SumAndSumSquares(cases *[]int) (sum float64, sum_sqr float64)

func (*DenseNumFeature) UpdateSImpFromAllocs

func (target *DenseNumFeature) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, as when learning from numerical variables. Here it just wraps SplitImpurity, but it can be reimplemented to provide further optimization.

type DensityTarget

type DensityTarget struct {
	Features *[]Feature
	N        int
}

DensityTarget is used for density estimating trees. It contains a set of features and the count of cases.

func (*DensityTarget) FindPredicted

func (target *DensityTarget) FindPredicted(cases []int) string

DensityTarget.FindPredicted returns the string representation of the density in the region spanned by the specified cases.

func (*DensityTarget) GetName

func (target *DensityTarget) GetName() string

func (*DensityTarget) Impurity

func (target *DensityTarget) Impurity(cases *[]int, counter *[]int) (e float64)

DensityTarget.Impurity uses the impurity measure defined in "Density Estimating Trees" by Parikshit Ram and Alexander G. Gray

func (*DensityTarget) NCats

func (target *DensityTarget) NCats() int

func (*DensityTarget) SplitImpurity

func (target *DensityTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

DensityTarget.SplitImpurity is a density estimating version of SplitImpurity.

func (*DensityTarget) UpdateSImpFromAllocs

func (target *DensityTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, as when learning from numerical variables. Here it just wraps SplitImpurity, but it can be reimplemented to provide further optimization.

type EntropyTarget

type EntropyTarget struct {
	CatFeature
}

EntropyTarget wraps a categorical feature for use in entropy driven classification as in Ross Quinlan's ID3 (Iterative Dichotomizer 3).

func NewEntropyTarget

func NewEntropyTarget(f CatFeature) *EntropyTarget

NewEntropyTarget creates an EntropyTarget wrapping the supplied categorical feature.

func (*EntropyTarget) ImpFromCounts

func (target *EntropyTarget) ImpFromCounts(total int, counts *[]int) (e float64)

func (*EntropyTarget) Impurity

func (target *EntropyTarget) Impurity(cases *[]int, counts *[]int) (e float64)

EntropyTarget.Impurity implements categorical entropy as -sum(pj*log2(pj)) where pj is the number of cases with the j'th category over the total number of cases.

func (*EntropyTarget) SplitImpurity

func (target *EntropyTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

EntropyTarget.SplitImpurity is a version of SplitImpurity that calls EntropyTarget.Impurity.

func (*EntropyTarget) UpdateSImpFromAllocs

func (target *EntropyTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, as when learning from numerical variables. Here it just wraps SplitImpurity, but it can be reimplemented to provide further optimization.

type Feature

type Feature interface {
	Span(cases *[]int, counter *[]int) float64
	NCats() (n int)
	Length() (l int)
	GetStr(i int) (value string)
	IsMissing(i int) bool
	MissingVals() bool
	GoesLeft(i int, splitter *Splitter) bool
	PutMissing(i int)
	PutStr(i int, v string)
	SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
	UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
	Impurity(cases *[]int, counter *[]int) (impurity float64)
	FindPredicted(cases []int) (pred string)
	BestSplit(target Target,
		cases *[]int,
		parentImp float64,
		leafSize int,
		randomSplit bool,
		allocs *BestSplitAllocs) (codedSplit interface{}, impurityDecrease float64, constant bool)
	DecodeSplit(codedSplit interface{}) (s *Splitter)
	ShuffledCopy() (fake Feature)
	Copy() (copy Feature)
	CopyInTo(copy Feature)
	Shuffle()
	ShuffleCases(cases *[]int, allocs *BestSplitAllocs)
	ImputeMissing()
	GetName() string
	Append(v string)
	Split(codedSplit interface{}, cases []int) (l []int, r []int, m []int)
	SplitPoints(codedSplit interface{}, cases *[]int) (lastl int, firstr int)
}

Feature contains all methods needed for a predictor feature.

func ParseFeature

func ParseFeature(record []string) Feature

ParseFeature parses a Feature from an array of strings and a capacity. Capacity is the number of cases and will usually be len(record)-1, but doesn't need to be calculated for every row of a large file. The type of the feature is inferred from the start of the first (header) field in record: "N:" indicating numerical, anything else (usually "C:" or "B:") indicating categorical.

type FeatureMatrix

type FeatureMatrix struct {
	Data       []Feature
	Map        map[string]int
	CaseLabels []string
}

FeatureMatrix contains a slice of Features and a Map to look up the index of a feature by its string id.

func LoadAFM

func LoadAFM(filename string) (fm *FeatureMatrix, err error)

LoadAFM loads a, possibly zipped, FeatureMatrix specified by filename.

func ParseAFM

func ParseAFM(input io.Reader) *FeatureMatrix

ParseAFM parses an AFM (annotated feature matrix) out of an io.Reader. AFM format is a tsv with row and column headers where the row headers start with N: indicating numerical, C: indicating categorical or B: indicating boolean. For this parser, features without N: are assumed to be categorical.
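
For example, a small AFM might look like the following (hypothetical data; treating the corner cell of the header row as ignored is an assumption):

.	case1	case2	case3
N:height	1.7	1.6	NA
C:color	red	blue	red
B:sick	true	false	true

and could be loaded and queried with a minimal sketch like:

// Sketch: load an AFM and look up a feature by name.
// Requires: import "log"
fm, err := LoadAFM("data.afm")
if err != nil {
	log.Fatal(err)
}
target := fm.Data[fm.Map["B:sick"]]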

func ParseARFF

func ParseARFF(input io.Reader) *FeatureMatrix

ParseARFF reads a file in weka's arff format: http://www.cs.waikato.ac.nz/ml/weka/arff.html The relation is ignored and only categorical and numerical variables are supported.

func ParseLibSVM

func ParseLibSVM(input io.Reader) *FeatureMatrix

func (*FeatureMatrix) AddContrasts

func (fm *FeatureMatrix) AddContrasts(n int)

AddContrasts appends n artificial contrast features to a feature matrix. These features are generated by randomly selecting (with replacement) an existing feature and creating a shuffled copy named featurename:SHUFFLED.

These features can be used as a contrast to evaluate the importance scores assigned to actual features.
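
For example, one might add one contrast per original feature before growing a forest and later compare each real feature's importance against its :SHUFFLED copy (a minimal sketch):

// Sketch: add artificial contrasts for ACE style importance vetting.
fm.AddContrasts(len(fm.Data))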

func (*FeatureMatrix) BestSplitter

func (fm *FeatureMatrix) BestSplitter(target Target,
	cases *[]int,
	candidates *[]int,
	mTry int,
	oob *[]int,
	leafSize int,
	force bool,
	vet bool,
	evaloob bool,
	extraRandom bool,
	allocs *BestSplitAllocs,
	nConstantsBefore int) (bestFi int, bestSplit interface{}, impurityDecrease float64, nConstants int)

BestSplitter finds the best splitter from a number of candidate features to split on by looping over all features and calling BestSplit.

leafSize specifies the minimum leaf size that can be produced by the split.

Vet specifies whether feature splits should be penalized with a randomized version of themselves.

allocs contains pointers to reusable structures for use while searching for the best split and should be initialized to the proper size with NewBestSplitAllocs.

func (*FeatureMatrix) ContrastAll

func (fm *FeatureMatrix) ContrastAll()

ContrastAll adds shuffled copies of every feature to the feature matrix. Each copy is named featurename:SHUFFLED.

These features can be used as a contrast to evaluate the importance scores assigned to actual features. ContrastAll is particularly useful vs AddContrasts when one wishes to identify [pseudo] unique identifiers that might lead to over fitting.

func (*FeatureMatrix) EncodeToNum

func (fm *FeatureMatrix) EncodeToNum() *FeatureMatrix

func (*FeatureMatrix) ImputeMissing

func (fm *FeatureMatrix) ImputeMissing()

ImputeMissing imputes missing values in all features to the mean or mode of the feature.

func (*FeatureMatrix) LoadCases

func (fm *FeatureMatrix) LoadCases(data *csv.Reader, rowlabels bool)

LoadCases will load data stored case by case from a csv reader into a feature matrix that has already been filled with the corresponding empty features. It is a lower level method, generally called after initial setup, to parse an afm, arff or csv file.

func (*FeatureMatrix) OneHot

func (fm *FeatureMatrix) OneHot() *FeatureMatrix

func (*FeatureMatrix) StripStrings

func (fm *FeatureMatrix) StripStrings(target string)

func (*FeatureMatrix) WriteCases

func (fm *FeatureMatrix) WriteCases(w io.Writer, cases []int) (err error)

WriteCases writes a new feature matrix with the specified cases to the provided writer.

type Forest

type Forest struct {
	//Forest string
	Target    string
	Trees     []*Tree
	Intercept float64
}

Forest represents a collection of decision trees grown to predict Target.

func GrowRandomForest

func GrowRandomForest(fm *FeatureMatrix,
	target Target,
	candidates []int,
	nSamples int,
	mTry int,
	nTrees int,
	leafSize int,
	maxDepth int,
	splitmissing bool,
	force bool,
	vet bool,
	evaloob bool,
	importance *[]*RunningMean) (f *Forest)
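
A minimal usage sketch follows; the feature name, parameter values and the assumption that a maxDepth of 0 means unlimited depth are illustrative, not recommendations:

// Sketch: grow a forest predicting a named target feature.
targetI := fm.Map["B:sick"] // hypothetical feature name
var target Target = fm.Data[targetI]
candidates := []int{}
for i := range fm.Data {
	if i != targetI {
		candidates = append(candidates, i)
	}
}
forest := GrowRandomForest(fm, target, candidates,
	fm.Data[0].Length(), // nSamples: bag as many cases as exist
	len(candidates)/3,   // mTry
	100,                 // nTrees
	1,                   // leafSize
	0,                   // maxDepth (assumed unlimited)
	false,               // splitmissing
	false,               // force
	false,               // vet
	false,               // evaloob
	nil)                 // importance not tracked here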

type ForestReader

type ForestReader struct {
	// contains filtered or unexported fields
}

ForestReader wraps an io.Reader to read a forest. It includes ReadForest for reading an entire forest and ReadTree for reading a forest tree by tree. The forest should be in .sf format; see the package docs in doc.go for full format details. It ignores fields that are not used by CloudForest.

func NewForestReader

func NewForestReader(r io.Reader) *ForestReader

NewForestReader wraps the supplied io.Reader as a ForestReader.

func (*ForestReader) ParseRfAcePredictorLine

func (fr *ForestReader) ParseRfAcePredictorLine(line string) map[string]string

ParseRfAcePredictorLine parses a single line of an rf-ace sf "stochastic forest" and returns a map[string]string of the key value pairs.

func (*ForestReader) ReadForest

func (fr *ForestReader) ReadForest() (forest *Forest, err error)

ForestReader.ReadForest reads the next forest from the underlying reader. If io.EOF or another error is encountered it returns that.

func (*ForestReader) ReadTree

func (fr *ForestReader) ReadTree() (tree *Tree, forest *Forest, err error)

ForestReader.ReadTree reads the next tree from the underlying reader. If the next tree is in a new forest it returns a forest object as well. If an io.EOF or other error is encountered it returns that as well as any partially parsed structs.

type ForestWriter

type ForestWriter struct {
	// contains filtered or unexported fields
}

ForestWriter wraps an io.Writer with functionality to write forests either with one call to WriteForest or incrementally using WriteForestHeader and WriteTree. ForestWriter saves a forest in .sf format; see the package docs in doc.go for full format details. It won't include fields that are not used by CloudForest.

func NewForestWriter

func NewForestWriter(w io.Writer) *ForestWriter

NewForestWriter returns a pointer to a new ForestWriter.

func (*ForestWriter) DescribeMap

func (fw *ForestWriter) DescribeMap(input map[string]bool) string

DescribeMap serializes the "left" map of a categorical splitter.

func (*ForestWriter) WriteForest

func (fw *ForestWriter) WriteForest(forest *Forest)

WriteForest writes an entire forest including all headers.

func (*ForestWriter) WriteForestHeader

func (fw *ForestWriter) WriteForestHeader(nforest int, target string, intercept float64)

WriteForestHeader writes only the header line for a forest.

func (*ForestWriter) WriteNode

func (fw *ForestWriter) WriteNode(n *Node, path string)

WriteNode writes a single node but not its children. WriteTree will be used more often, but WriteNode can be used to grow a large tree directly to disk without storing it in memory.

func (*ForestWriter) WriteNodeAndChildren

func (fw *ForestWriter) WriteNodeAndChildren(n *Node, path string)

WriteNodeAndChildren recursively writes out the target node and all of its children. WriteTree is preferred for most use cases.

func (*ForestWriter) WriteTree

func (fw *ForestWriter) WriteTree(tree *Tree, ntree int)

WriteTree writes an entire Tree including the header.

func (*ForestWriter) WriteTreeHeader

func (fw *ForestWriter) WriteTreeHeader(ntree int, target string, weight float64)

WriteTreeHeader writes only the header line for a tree.

type GradBoostClassTarget

type GradBoostClassTarget struct {
	*GradBoostTarget
	Actual    NumFeature
	Pred      NumFeature
	LearnRate float64
	Prior     float64
	Pos_class string
}

GradBoostClassTarget wraps a numerical feature as a target for use in two class gradient boosting trees.

It should be used with SumBallotBox and expit transformed to get class probabilities.
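
For example, a class probability might be recovered from the summed votes with a logistic (expit) transform; the exact scaling used in practice is an assumption here:

// Sketch: expit transform of a summed boosting score.
// Requires: import "math"
func expit(score float64) float64 {
	return 1.0 / (1.0 + math.Exp(-score))
}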

func NewGradBoostClassTarget

func NewGradBoostClassTarget(f CatFeature, learnrate float64, pos_class string) (gbc *GradBoostClassTarget)

func (*GradBoostClassTarget) Boost

func (f *GradBoostClassTarget) Boost(leaves *[][]int, preds *[]string) (weight float64)

BUG(ryan) does GradBoostingTarget need separate residuals and values?

func (*GradBoostClassTarget) FindPredicted

func (f *GradBoostClassTarget) FindPredicted(cases []int) (pred string)

func (*GradBoostClassTarget) Intercept

func (f *GradBoostClassTarget) Intercept() float64

func (*GradBoostClassTarget) Predicted

func (f *GradBoostClassTarget) Predicted(cases *[]int) float64

func (*GradBoostClassTarget) Update

func (f *GradBoostClassTarget) Update(cases *[]int, predicted float64)

Update updates the underlying numeric data by subtracting the mean*weight of the specified cases from the value for those cases.

type GradBoostTarget

type GradBoostTarget struct {
	NumFeature
	LearnRate float64
	Mean      float64
}

GradBoostTarget wraps a numerical feature as a target for use in gradient boosting trees.

It should be used with the SumBallotBox.

func NewGradBoostTarget

func NewGradBoostTarget(f NumFeature, learnrate float64) (gbc *GradBoostTarget)

func (*GradBoostTarget) Boost

func (f *GradBoostTarget) Boost(leaves *[][]int, preds *[]string) (weight float64)

BUG(ryan) does GradBoostingTarget need separate residuals and values?

func (*GradBoostTarget) Impurity

func (target *GradBoostTarget) Impurity(cases *[]int, counter *[]int) (e float64)

Impurity returns Gini impurity or mean squared error vs the mean for a set of cases, depending on whether the feature is categorical or numerical.

func (*GradBoostTarget) Intercept

func (f *GradBoostTarget) Intercept() float64

func (*GradBoostTarget) SplitImpurity

func (target *GradBoostTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

Friedman MSE split improvement score from equation 35 in "Greedy Function Approximation: A Gradient Boosting Machine". TODO: what should the parent impurity be?

func (*GradBoostTarget) Sum

func (target *GradBoostTarget) Sum(cases *[]int) (sum float64)

func (*GradBoostTarget) Update

func (f *GradBoostTarget) Update(cases *[]int, predicted float64)

Update updates the underlying numeric data by subtracting the mean*weight of the specified cases from the value for those cases.

func (*GradBoostTarget) UpdateSImpFromAllocs

func (target *GradBoostTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

type HDistanceTarget

type HDistanceTarget struct {
	CatFeature
	Pos_class string
}

HDistanceTarget wraps a categorical feature for use in Hellinger Distance tree growth.

func NewHDistanceTarget

func NewHDistanceTarget(f CatFeature, pos_class string) *HDistanceTarget

NewHDistanceTarget creates an HDistanceTarget wrapping the supplied categorical feature with the specified positive class.

func (*HDistanceTarget) FindPredicted

func (target *HDistanceTarget) FindPredicted(cases []int) (pred string)

func (*HDistanceTarget) HDist

func (target *HDistanceTarget) HDist(lcounts *[]int, rcounts *[]int) (d float64)

func (*HDistanceTarget) Impurity

func (target *HDistanceTarget) Impurity(cases *[]int, counts *[]int) (e float64)

HDistanceTarget.Impurity

func (*HDistanceTarget) SplitImpurity

func (target *HDistanceTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) float64

HDistanceTarget.SplitImpurity is a version of SplitImpurity that calls HDistanceTarget.Impurity.

func (*HDistanceTarget) UpdateSImpFromAllocs

func (target *HDistanceTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) float64

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, as when learning from numerical variables. Here it just wraps SplitImpurity, but it can be reimplemented to provide further optimization.

type L1Target

type L1Target struct {
	NumFeature
}

L1Target wraps a numerical feature as a target for use in L1 norm regression.

func (*L1Target) Error

func (target *L1Target) Error(cases *[]int, predicted float64) (e float64)

L1Target.Error returns the mean L1 norm error of the cases specified vs the predicted value. Only non missing cases are considered.

func (*L1Target) Impurity

func (target *L1Target) Impurity(cases *[]int, counter *[]int) (e float64)

L1Target.Impurity is an L1 version of impurity returning L1 instead of squared error.

func (*L1Target) SplitImpurity

func (target *L1Target) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

L1Target.SplitImpurity is an L1 version of SplitImpurity.

func (*L1Target) UpdateSImpFromAllocs

func (target *L1Target) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, as when learning from numerical variables. Here it just wraps SplitImpurity, but it can be reimplemented to provide further optimization.

type Leaf

type Leaf struct {
	Cases []int
	Pred  string
}

Leaf is a struct for storing the indexes of the cases at a terminal "Leaf" node along with the predicted value.

type NPTarget

type NPTarget struct {
	CatFeature
	Posi  int
	Alpha float64
	Kappa float64
}

NPTarget wraps a categorical feature for use in experimental approximate Neyman-Pearson (NP) classification, where constraints and optimization are done on the false positive/negative rates.

It uses an impurity measure with a soft constraint from the second family presented in

"Comparison and Design of Neyman-Pearson Classifiers" Clayton Scott, October 2005

http://www.stat.rice.edu/~cscott/pubs/npdesign.pdf

N(f) = κ max((R0(f) − α), 0) + R1(f)

Where f is the classifier, R0 is the false positive rate, R1 is the false negative rate, α is the false positive constraint, κ controls the cost of violating this constraint and β is a constant we can ignore as it subtracts out in differences.

The vote assigned to each leaf node is a corrected mode where the count of the positive/constrained label is corrected by 1/α. Without this modification, constraints > .5 won't work since nodes with that many false positives won't vote positive.
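
A minimal sketch of the measure as written above:

// Sketch: N(f) = kappa*max(R0(f)-alpha, 0) + R1(f).
// Requires: import "math"
func npMeasure(r0, r1, alpha, kappa float64) float64 {
	return kappa*math.Max(r0-alpha, 0.0) + r1
}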

func NewNPTarget

func NewNPTarget(f CatFeature, Pos string, Alpha, Kappa float64) *NPTarget

NewNPTarget wraps a categorical feature for NP classification. It accepts a string representing the constrained label and floats Alpha and Kappa representing the constraint and constraint weight.

func (*NPTarget) FindPredicted

func (target *NPTarget) FindPredicted(cases []int) (pred string)

FindPredicted does a mode calculation with the count of the positive/constrained class corrected.

func (*NPTarget) ImpFromCounts

func (target *NPTarget) ImpFromCounts(t int, counter *[]int) (e float64)

ImpFromCounts recalculates Gini impurity from class counts for use in iterative updates.

func (*NPTarget) Impurity

func (target *NPTarget) Impurity(cases *[]int, counter *[]int) (e float64)

NPTarget.Impurity implements an impurity that minimizes false negatives subject to a soft constraint on false positives.

func (*NPTarget) SplitImpurity

func (target *NPTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

SplitImpurity is a version of SplitImpurity that calls NPTarget.Impurity.

func (*NPTarget) UpdateSImpFromAllocs

func (target *NPTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, to avoid recalculating the entire split impurity.

type Node

type Node struct {
	CodedSplit interface{}
	Featurei   int
	Left       *Node
	Right      *Node
	Missing    *Node
	Pred       string
	Splitter   *Splitter
}

A node of a decision tree. Pred is a string containing either the category or a representation of a float (less than ideal).

func (*Node) Climb

func (n *Node) Climb(c func(*Node))

Climb visits each child node with the supplied function.

func (*Node) CodedRecurse

func (n *Node) CodedRecurse(r CodedRecursable, fm *FeatureMatrix, cases *[]int, depth int, nconstantsbefore int)

func (*Node) Recurse

func (n *Node) Recurse(r Recursable, fm *FeatureMatrix, cases []int, depth int)

Recurse is used to apply a Recursable function at every downstream node as the cases specified by cases []int are split using the data in fm *FeatureMatrix. Recursion down a branch stops when a node with n.Splitter == nil is reached. Recursion down the Missing branch is only used if n.Missing != nil. For example, votes can be tabulated using code like:

t.Root.Recurse(func(n *Node, cases []int, depth int) {
	if n.Left == nil && n.Right == nil {
		// I'm in a leaf node
		for i := 0; i < len(cases); i++ {
			bb.Vote(cases[i], n.Pred, 1.0) // unit weight
		}
	}
}, fm, cases, 0)

type NumAdaBoostTarget

type NumAdaBoostTarget struct {
	NumFeature
	Weights    []float64
	NormFactor float64
}

NumAdaBoostTarget wraps a numerical feature as a target for use in (experimental) adaptive boosting regression.

func NewNumAdaBoostTarget

func NewNumAdaBoostTarget(f NumFeature) (abt *NumAdaBoostTarget)

func (*NumAdaBoostTarget) Boost

func (t *NumAdaBoostTarget) Boost(leaves *[][]int) (weight float64)

NumAdaBoostTarget.Boost performs numerical adaptive boosting using the specified partition and returns the weight that the tree that generated the partition should be given. Trees with error greater than the impurity of the total feature (NormFactor) times the number of partitions are given zero weight. Other trees have tree weight set to:

weight = math.Log(1 / norm)

and weights updated to:

t.Weights[c] = t.Weights[c] * math.Exp(t.Error(&[]int{c}, m)*weight)

These functions are chosen to provide a rough analog to categorical adaptive boosting for numerical data with unbounded error.

func (*NumAdaBoostTarget) Impurity

func (target *NumAdaBoostTarget) Impurity(cases *[]int, counter *[]int) (e float64)

NumAdaBoostTarget.Impurity is an AdaBoosting impurity that uses the weights specified in NumAdaBoostTarget.Weights.

func (*NumAdaBoostTarget) SplitImpurity

func (target *NumAdaBoostTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

NumAdaBoostTarget.SplitImpurity is an AdaBoosting version of SplitImpurity.

func (*NumAdaBoostTarget) UpdateSImpFromAllocs

func (target *NumAdaBoostTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs willl be called when splits are being built by moving cases from r to l as in learning from numerical variables. Here it just wraps SplitImpurity but it can be implemented to provide further optimization.

type NumBallotBox

type NumBallotBox struct {
	// contains filtered or unexported fields
}

NumBallotBox keeps track of votes by trees. Voting is thread safe.

func NewNumBallotBox

func NewNumBallotBox(size int) *NumBallotBox

NewNumBallotBox builds a new ballot box for the number of cases specified by "size".

func (*NumBallotBox) Tally

func (bb *NumBallotBox) Tally(i int) (predicted string)

func (*NumBallotBox) TallyError

func (bb *NumBallotBox) TallyError(feature Feature) (e float64)

TallyError returns the squared error (unexplained variance) divided by the data variance.

func (*NumBallotBox) TallyNum

func (bb *NumBallotBox) TallyNum(i int) (predicted float64)

TallyNum tallies the votes for the case specified by i as if it is a numerical feature, i.e. it returns the mean of all votes.

func (*NumBallotBox) TallyR2Score

func (bb *NumBallotBox) TallyR2Score(feature Feature) (e float64)

TallyR2Score returns the R2 score or coefficient of determination.

func (*NumBallotBox) TallySquaredError

func (bb *NumBallotBox) TallySquaredError(feature Feature) (e float64)

TallySquaredError returns the error of the votes vs the provided feature. For categorical features it returns the error rate. For numerical features it returns mean squared error. The provided feature must use the same index as the feature matrix the ballot box was constructed with. Missing values are ignored. Gini impurity is not used, so this is not for use in rf implementations.
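
A minimal evaluation sketch using the calls above; forest, fm and target are assumed to be set up elsewhere:

// Sketch: serial voting followed by a squared error tally.
bb := NewNumBallotBox(fm.Data[0].Length())
for _, tree := range forest.Trees {
	tree.Vote(fm, bb)
}
mse := bb.TallySquaredError(target)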

func (*NumBallotBox) Vote

func (bb *NumBallotBox) Vote(casei int, pred string, weight float64)

Vote parses the float in the string and votes for it.

type NumFeature

type NumFeature interface {
	Feature
	Get(i int) float64
	Put(i int, v float64)
	Predicted(cases *[]int) float64
	Mean(cases *[]int) float64
	Norm(i int, v float64) float64
	Error(cases *[]int, predicted float64) (e float64)
	Less(i int, j int) bool
	SumAndSumSquares(cases *[]int) (float64, float64)
}

NumFeature contains the methods of Feature plus methods needed to implement different types of regression. It is usually embedded by regression targets to provide access to the underlying data.

type OrdinalTarget

type OrdinalTarget struct {
	NumFeature
	// contains filtered or unexported fields
}

OrdinalTarget wraps a numerical feature as a target for use in ordinal regression. Data should be represented as positive integers and the Error is inherited from the embedded NumFeature.

func NewOrdinalTarget

func NewOrdinalTarget(f NumFeature) (abt *OrdinalTarget)

NewOrdinalTarget wraps a numerical feature as an ordinal regression target.

func (*OrdinalTarget) FindPredicted

func (target *OrdinalTarget) FindPredicted(cases []int) (pred string)

func (*OrdinalTarget) Impurity

func (target *OrdinalTarget) Impurity(cases *[]int, counter *[]int) (e float64)

OrdinalTarget.Impurity is an ordinal version of impurity using Mode instead of Mean for prediction.

func (*OrdinalTarget) Mode

func (f *OrdinalTarget) Mode(cases *[]int) (m float64)

func (*OrdinalTarget) Predicted

func (f *OrdinalTarget) Predicted(cases *[]int) float64

func (*OrdinalTarget) SplitImpurity

func (target *OrdinalTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

OrdinalTarget.SplitImpurity is an ordinal version of SplitImpurity.

func (*OrdinalTarget) UpdateSImpFromAllocs

func (target *OrdinalTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs willl be called when splits are being built by moving cases from r to l as in learning from numerical variables. Here it just wraps SplitImpurity but it can be implemented to provide further optimization.

type Recursable

type Recursable func(*Node, []int, int)

Recursable defines a function signature for functions that can be called at every downstream node of a tree as Node.Recurse walks down the tree. The function takes the current node, an array of ints specifying the cases that have not been split away, and the current depth.

type RegretTarget

type RegretTarget struct {
	CatFeature
	Costs []float64
}

RegretTarget wraps a categorical feature for use in regret driven classification. The ith entry in costs should contain the cost of misclassifying a case that actually has the ith category.

It is roughly equivalent to the ideas presented in:

http://machinelearning.wustl.edu/mlpapers/paper_files/icml2004_LingYWZ04.pdf

"Decision Trees with Minimal Costs" Charles X. Ling,Qiang Yang,Jianning Wang,Shichao Zhang

func NewRegretTarget

func NewRegretTarget(f CatFeature) *RegretTarget

NewRegretTarget creates a RegretTarget and initializes RegretTarget.Costs to the proper length.

func (*RegretTarget) FindPredicted

func (target *RegretTarget) FindPredicted(cases []int) (pred string)

FindPredicted does a mode calculation with the count of the positive/constrained class corrected.

func (*RegretTarget) ImpFromCounts

func (target *RegretTarget) ImpFromCounts(t int, counter *[]int) (e float64)

ImpFromCounts recalculates Gini impurity from class counts for use in iterative updates.

func (*RegretTarget) Impurity

func (target *RegretTarget) Impurity(cases *[]int, counter *[]int) (e float64)

Impurity implements an impurity based on misclassification costs.

func (*RegretTarget) SetCosts

func (target *RegretTarget) SetCosts(costmap map[string]float64)

RegretTarget.SetCosts copies the costs in a map[string]float64, keyed by category label, into the proper entries in RegretTarget.Costs.
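
For example (class labels, costs and the catTarget feature are hypothetical):

// Sketch: penalize misclassifying the rare positive class more heavily.
rt := NewRegretTarget(catTarget)
rt.SetCosts(map[string]float64{
	"case":    10.0,
	"control": 1.0,
})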

func (*RegretTarget) SplitImpurity

func (target *RegretTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

RegretTarget.SplitImpurity is a version of SplitImpurity that calls RegretTarget.Impurity.

func (*RegretTarget) UpdateSImpFromAllocs

func (target *RegretTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, to avoid recalculating the entire split impurity.

type RunningMean

type RunningMean struct {
	Mean  float64
	Count float64
	// contains filtered or unexported fields
}

RunningMean is a thread safe struct for keeping track of running means as used in importance calculations. (TODO: could this be made lock free?)

func (*RunningMean) Add

func (rm *RunningMean) Add(val float64)

Add adds the supplied value to the running mean with a weight of 1.0 in a thread safe way.

func (*RunningMean) Read

func (rm *RunningMean) Read() (mean float64, count float64)

Read reads the mean and count.

func (*RunningMean) WeightedAdd

func (rm *RunningMean) WeightedAdd(val float64, weight float64)

WeightedAdd adds the specified value to the running mean with the specified weight in a thread safe way.

type ScikitNode

type ScikitNode struct {
	LeftChild            int     `json:"left_child"`
	RightChild           int     `json:"right_child"`
	Feature              int     `json:"feature"`
	Threshold            float64 `json:"threshold"`
	Impurity             float64 `json:"impurity"`                //TODO(ryan): support this?
	NNodeSamples         int     `json:"n_node_samples"`          //TODO(ryan): support this?
	WeightedNNodeSamples float64 `json:"weighted_n_node_samples"` //TODO(ryan): support this?
}

type ScikitTree

type ScikitTree struct {
	NFeatures   int           `json:"n_features"`
	NClasses    []int         `json:"n_classes"`
	NOutputs    int           `json:"n_outputs"`     //TODO(ryan): support other values
	MaxNClasses int           `json:"max_n_classes"` //TODO(ryan): support other values
	MaxDepth    int           `json:"max_depth"`
	NodeCount   int           `json:"node_count"`
	Capacity    int           `json:"capacity"`
	Nodes       []ScikitNode  `json:"nodes"`
	Value       [][][]float64 `json:"value"` //TODO(ryan): support actual values
	ValueStride int           `json:"value_stride"`
}

The inner structures mirror scikit-learn's tree representation, where values are stored separately from the node structure since size is determined at runtime: MaxDepth is the max depth of the tree, NodeCount is the counter for node IDs, Capacity is the capacity of the tree in terms of nodes, Nodes is the array of nodes, Value is a (capacity, n_outputs, max_n_classes) array of values and ValueStride = n_outputs * max_n_classes.

func NewScikitTree

func NewScikitTree(nFeatures int) *ScikitTree

type SecondaryBalancedSampler

type SecondaryBalancedSampler struct {
	Total    int
	Counts   []int
	Samplers [][][]int
}

SecondaryBalancedSampler roughly balances the target feature within the classes of another categorical feature while roughly preserving the original rate of the secondary feature.

func NewSecondaryBalancedSampler

func NewSecondaryBalancedSampler(target *DenseCatFeature, balanceby *DenseCatFeature) (s *SecondaryBalancedSampler)

NewSecondaryBalancedSampler returns an initialized balanced sampler.

func (*SecondaryBalancedSampler) Sample

func (s *SecondaryBalancedSampler) Sample(samples *[]int, n int)

type SortableFeature

type SortableFeature struct {
	//Feature NumFeature
	Vals  []float64
	Cases []int
}

SortableFeature is a wrapper for a feature and set of cases that satisfies the sort.Interface interface so that the case indexes in Cases can be sorted using sort.Sort.

func (*SortableFeature) Len

func (sf *SortableFeature) Len() int

Len returns the number of cases.

func (*SortableFeature) Less

func (sf *SortableFeature) Less(i int, j int) bool

Less determines if the ith case is less than the jth case.

func (*SortableFeature) Load

func (sf *SortableFeature) Load(vals *[]float64, cases *[]int)

Load loads the values of the cases into a cache friendly array.

func (*SortableFeature) Sort

func (sf *SortableFeature) Sort()

Sort performs introsort + heapsort using the sortby sub package.

func (*SortableFeature) Swap

func (sf *SortableFeature) Swap(i int, j int)

Swap exchanges the ith and jth cases.

type SparseCounter

type SparseCounter struct {
	Map map[int]map[int]int
	// contains filtered or unexported fields
}

SparseCounter uses maps to track sparse integer counts in a large matrix. The matrix is assumed to contain zero values where nothing has been added.

func (*SparseCounter) Add

func (sc *SparseCounter) Add(i int, j int, val int)

Add increases the count in i,j by val.

func (*SparseCounter) WriteTsv

func (sc *SparseCounter) WriteTsv(writer io.Writer)

WriteTsv writes the non zero counts out into a three column tsv containing i, j, and count in the columns.

type Splitter

type Splitter struct {
	Feature   string
	Numerical bool
	Value     float64
	Left      map[string]bool
}

Splitter contains fields that can be used to split cases by a single feature. The split can be either numerical, in which case it is defined by the Value field, or categorical, in which case it is defined by the Left and Right fields.

func (*Splitter) Split

func (s *Splitter) Split(fm *FeatureMatrix, cases []int) (l []int, r []int, m []int)

Split splits a slice of cases into left, right and missing slices without allocating a new underlying array by sorting cases into left, missing, right order and returning slices that point to the left and right cases.

type SumBallot

type SumBallot struct {
	Mutex sync.Mutex
	Sum   float64
}

SumBallot is used inside of SumBallotBox to record sum votes in a thread safe manner.

func NewSumBallot

func NewSumBallot() (cb *SumBallot)

NewSumBallot returns a pointer to an initialized SumBallot.

type SumBallotBox

type SumBallotBox struct {
	Box []*SumBallot
}

SumBallotBox keeps track of votes by trees in a thread safe manner. It should be used with gradient boosting when a sum instead of an average or mode is desired.

func NewSumBallotBox

func NewSumBallotBox(size int) *SumBallotBox

NewSumBallotBox builds a new ballot box for the number of cases specified by "size".

func (*SumBallotBox) Tally

func (bb *SumBallotBox) Tally(i int) (predicted string)

Tally tallies the votes for the case specified by i and returns the sum of all votes formatted as a string.

func (*SumBallotBox) TallyError

func (bb *SumBallotBox) TallyError(feature Feature) (e float64)

TallyError is non functional here.

func (*SumBallotBox) TallyNum

func (bb *SumBallotBox) TallyNum(i int) (predicted float64)

func (*SumBallotBox) Vote

func (bb *SumBallotBox) Vote(casei int, pred string, weight float64)

Vote registers a vote that case "casei" should have pred added to its sum.

type Target

type Target interface {
	GetName() string
	NCats() (n int)
	SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)
	UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)
	Impurity(cases *[]int, counter *[]int) (impurity float64)
	FindPredicted(cases []int) (pred string)
}

Target abstracts the methods needed for a feature to be predictable as either a categorical or numerical feature in a random forest.

type TargetWithIntercept

type TargetWithIntercept interface {
	Target
	Intercept() float64
}

type TransTarget

type TransTarget struct {
	CatFeature
	Features  *[]Feature
	Unlabeled int
	Alpha     float64
	Beta      float64
	N         int
	MaxCats   int
}

TransTarget is used for semi supervised transduction trees [1] that combine supervised impurity with a purely density based term.

I = I_supervised + alpha * I_unsupervised

I_supervised is called from the embedded CatFeature so that it can be Gini, Entropy, Weighted or any other of the existing non-boosting impurities. Boosting impurities could be implemented with minimal work.

I_unsupervised uses a density estimating term that differs from the one described in [1] and is instead based on the technique described in [2] which avoids some assumptions and allows a simple implementation.

[1] A. Criminisi, J. Shotton, and E. Konukoglu, "Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning" Microsoft Research technical report TR-2011-114

[2] Parikshit Ram, Alexander G. Gray, Density Estimation Trees http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf

One difference from [1] is that the unlabelled class is considered a standard class for I_supervised to allow one class problems.

func NewTransTarget

func NewTransTarget(t CatFeature, fm *[]Feature, unlabeled string, alpha, beta float64, ncases int) *TransTarget

NewTransTarget returns a TransTarget using the supervised Impurity from the provided CatFeature t, Density in the specified Features fm (excluding any with the same name as t), considering the class label provided in "unlabeled" as unlabeled for transduction. Alpha is the weight of the unsupervised term relative to the supervised term and ncases is the number of cases that will be called at the root of the tree (may be deprecated as not needed).

func (*TransTarget) Density

func (target *TransTarget) Density(cases *[]int, counter *[]int) (e float64)

TransTarget.Density uses an impurity designed to maximize the density within each side of the split based on the method in "Density Estimating Trees" by Parikshit Ram and Alexander G. Gray. It loops over all of the non target features and for the ones with non zero span calculates product(span_i)/(t*t) where t is the number of cases.

Refinements to this method might include t*t->t^n where n is the number of features with non zero span, or other changes to how zero span features are handled. I also suspect that this method handles numerical features, for which different splits will have different total spans based on the distance between the points on either side of the split point, better than categorical features, for which the total span of a split will always be the number of categories.

The original paper also included N, which is not used here.
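
A minimal sketch of the density term as described above:

// Sketch: product(span_i)/(t*t) over features with non zero span.
func densityTerm(spans []float64, t int) float64 {
	e := 1.0 / float64(t*t)
	for _, s := range spans {
		if s > 0 {
			e *= s
		}
	}
	return e
}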

func (*TransTarget) FindPredicted

func (target *TransTarget) FindPredicted(cases []int) string

TransTarget.FindPredicted returns the prediction for the specified cases, which is the majority class that is not the unlabeled class. A set of cases will only be predicted to be the unlabeled class if it has no labeled points.

func (*TransTarget) Impurity

func (target *TransTarget) Impurity(cases *[]int, counter *[]int) (e float64)

func (*TransTarget) SplitImpurity

func (target *TransTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

TransTarget.SplitImpurity is a density estimating version of SplitImpurity.

func (*TransTarget) UpdateSImpFromAllocs

func (target *TransTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from right to left, as when learning from numerical variables. Here it just wraps SplitImpurity, but it can be reimplemented to provide further optimization.

type Tree

type Tree struct {
	//Tree int
	Root   *Node
	Target string
	Weight float64
}

Tree represents a single decision tree.

func NewTree

func NewTree() *Tree

NewTree initializes a one node tree.

func (*Tree) AddNode

func (t *Tree) AddNode(path string, pred string, splitter *Splitter)

AddNode adds a node at the specified path with the specified pred value and/or Splitter. Paths are specified in the same format as in rf-ace's sf files, as a string of 'L' and 'R'. Nodes must be added from the root up, as the case where the path specifies a node whose parent does not already exist in the tree is not handled well.

func (*Tree) GetLeaves

func (t *Tree) GetLeaves(fm *FeatureMatrix, fbycase *SparseCounter) []Leaf

GetLeaves is called by the leaf count utility to gather statistics about the nodes of a tree including the sets of cases at "leaf" nodes that aren't split further and the number of times each feature is used to split away each case.

func (*Tree) Grow

func (t *Tree) Grow(fm *FeatureMatrix,
	target Target,
	cases []int,
	candidates []int,
	oob []int,
	mTry int,
	leafSize int,
	maxDepth int,
	splitmissing bool,
	force bool,
	vet bool,
	evaloob bool,
	extraRandom bool,
	importance *[]*RunningMean,
	depthUsed *[]int,
	allocs *BestSplitAllocs)

Grow grows the receiver tree through recursion. It uses impurity decrease to select splitters at each node as in Breiman's Random Forest. It should be called on a tree with only a root node defined.

fm is a feature matrix of training data.

target is the feature to predict via regression or classification as determined by feature type.

cases specifies the cases to calculate impurity decrease over and can contain repeated values to allow for sampling of cases with replacement as in RF.

candidates specifies the potential features to use as splitters.

mTry specifies the number of candidate features to evaluate for each split.

leafSize specifies the minimum number of cases at a leaf node.

splitmissing indicates if missing values should be split onto a third branch.

vet indicates if splits should be penalized against a randomized version of themselves.
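Putting the parameters together, a hedged sketch of growing one tree on pre-sampled cases; NewBestSplitAllocs and the zero-value conventions noted in the comments are assumptions based on the surrounding API:

// growOne grows a single tree to purity with no vetting, missing-value
// branches, out of bag evaluation or importance tracking.
func growOne(fm *FeatureMatrix, target Target, cases, candidates []int, mTry int) *Tree {
	t := NewTree()
	t.Grow(fm, target, cases, candidates,
		nil, // oob cases: none tracked here
		mTry,
		1,     // leafSize
		0,     // maxDepth: assumed to mean unlimited
		false, // splitmissing
		false, // force
		false, // vet
		false, // evaloob
		false, // extraRandom
		nil,   // importance
		nil,   // depthUsed
		NewBestSplitAllocs(len(cases), target))
	return t
}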

func (*Tree) GrowJungle

func (t *Tree) GrowJungle(fm *FeatureMatrix,
	target Target,
	cases []int,
	candidates []int,
	oob []int,
	mTry int,
	leafSize int,
	maxDepth int,
	splitmissing bool,
	force bool,
	vet bool,
	evaloob bool,
	extraRandom bool,
	importance *[]*RunningMean,
	depthUsed *[]int,
	allocs *BestSplitAllocs)

func (*Tree) Partition

func (t *Tree) Partition(fm *FeatureMatrix) (*[][]int, *[]string)

Partition partitions all of the cases in a FeatureMatrix.
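Hypothetical usage, printing each partition's prediction and size (t and fm as in the sketches above, with fmt imported):

parts, preds := t.Partition(fm)
for i, cases := range *parts {
	fmt.Printf("partition %d predicts %q over %d cases\n", i, (*preds)[i], len(cases))
}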

func (*Tree) StripCodes

func (t *Tree) StripCodes()

StripCodes removes all of the coded splits from a tree so that it can be used on new categorical data.

func (*Tree) Vote

func (t *Tree) Vote(fm *FeatureMatrix, bb VoteTallyer)

Vote casts a vote for the predicted value of each case in fm into bb. Since BallotBox is not thread safe, trees should not vote into the same BallotBox in parallel.
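A minimal sketch of applying a whole forest sequentially, per the thread-safety note above; NewCatBallotBox and the Forest.Trees field are assumed from the package API:

// applyForest has every tree vote into one shared box, one tree at a
// time, then tallies a prediction for each case.
func applyForest(forest *Forest, fm *FeatureMatrix) []string {
	n := fm.Data[0].Length() // case count; matrix layout assumed
	bb := NewCatBallotBox(n)
	for _, tree := range forest.Trees {
		tree.Vote(fm, bb)
	}
	preds := make([]string, n)
	for i := range preds {
		preds[i] = bb.Tally(i)
	}
	return preds
}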

func (*Tree) VoteCases

func (t *Tree) VoteCases(fm *FeatureMatrix, bb VoteTallyer, cases []int)

VoteCases casts a vote for the predicted value of each of the specified cases in fm into bb. Since BallotBox is not thread safe, trees should not vote into the same BallotBox in parallel.

type VoteTallyer

type VoteTallyer interface {
	Vote(casei int, pred string, weight float64)
	TallyError(feature Feature) float64
	Tally(casei int) string
}

VoteTallyer is used to tabulate votes cast by trees and is implemented by feature-type-specific structs like NumBallotBox and CatBallotBox. Vote should register a vote that casei should be predicted as pred. TallyError returns the error versus the supplied feature.
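A hedged sketch of a custom VoteTallyer that accumulates weighted categorical votes per case; TallyError is stubbed since the error definition is feature-type specific:

// countBox tallies weighted categorical votes for each case.
type countBox struct {
	counts []map[string]float64
}

// Vote registers pred for casei with the given weight.
func (b *countBox) Vote(casei int, pred string, weight float64) {
	if b.counts[casei] == nil {
		b.counts[casei] = make(map[string]float64)
	}
	b.counts[casei][pred] += weight
}

// Tally returns the prediction with the greatest accumulated weight.
func (b *countBox) Tally(casei int) string {
	pred, best := "", 0.0
	for p, w := range b.counts[casei] {
		if w > best {
			pred, best = p, w
		}
	}
	return pred
}

// TallyError would compare tallies to the feature's actual values; a
// real implementation returns an error rate or similar statistic.
func (b *countBox) TallyError(feature Feature) float64 { return 0.0 }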

type WRFTarget

type WRFTarget struct {
	CatFeature
	Weights []float64
}

WRFTarget wraps a categorical feature as a target for use in weighted random forest (WRF).

func NewWRFTarget

func NewWRFTarget(f CatFeature, weights map[string]float64) (abt *WRFTarget)

NewWRFTarget creates a weighted random forest target and initializes its weights.
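Hypothetical setup that up-weights a rare class before growing; the feature name, weight values and the FeatureMatrix.Map lookup are illustrative assumptions:

target := fm.Data[fm.Map["C:Disease"]].(CatFeature) // assertion assumed to hold
wrf := NewWRFTarget(target, map[string]float64{
	"case":    10.0, // rare class weighted up
	"control": 1.0,
})

The resulting wrf can then be passed as the target when growing trees.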

func (*WRFTarget) FindPredicted

func (target *WRFTarget) FindPredicted(cases []int) (pred string)

FindPredicted finds the predicted target as the weighted categorical mode.

func (*WRFTarget) ImpFromCounts

func (target *WRFTarget) ImpFromCounts(counter *[]int) (e float64)

ImpFromCounts recalculates Gini impurity from class counts for use in iterative updates.

func (*WRFTarget) Impurity

func (target *WRFTarget) Impurity(cases *[]int, counter *[]int) (e float64)

Impurity is Gini impurity that uses the weights specified in WRFTarget.Weights.
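A sketch consistent with that description, scaling class counts by their class weights before the usual 1 - sum(p_i^2) calculation; this is an illustration, not the package's internal implementation:

// weightedGini computes Gini impurity over weighted class counts.
func weightedGini(counts []int, weights []float64) (e float64) {
	total := 0.0
	for i, c := range counts {
		total += float64(c) * weights[i]
	}
	if total == 0.0 {
		return 0.0
	}
	e = 1.0
	for i, c := range counts {
		p := float64(c) * weights[i] / total
		e -= p * p
	}
	return e
}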

func (*WRFTarget) SplitImpurity

func (target *WRFTarget) SplitImpurity(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs) (impurityDecrease float64)

SplitImpurity is a weighted-RF version of SplitImpurity.

func (*WRFTarget) UpdateSImpFromAllocs

func (target *WRFTarget) UpdateSImpFromAllocs(l *[]int, r *[]int, m *[]int, allocs *BestSplitAllocs, movedRtoL *[]int) (impurityDecrease float64)

UpdateSImpFromAllocs will be called when splits are being built by moving cases from r to l, to avoid recalculating the entire split impurity.

Notes

Bugs

  • does GradBoostingTarget need separate residuals and values?

Directories

Path	Synopsis
sortby	Package sortby is a hybrid, non-stable sort based on Go's standard sort, but with all less-function calls and many swaps inlined to sort a list of ints by an accompanying list of floats, as needed in random forest training.
stats	Package stats currently only implements Welch's t-test for importance score analysis in CloudForest.
utils
