vft

package module
v0.0.0-...-252af9d
Published: Aug 21, 2023 License: MIT Imports: 20 Imported by: 0

README

vft: virtual feature trees

vft (Virtual Feature Trees): gradient-boosted decision trees (à la XGBoost) that allow (lazy) virtual features.

Status: Incomplete. Work in Progress.

  • We started porting cortado from Python to Go.

  • The cortado project is Adam's GBDT classifier in Python. It uses lazy sequences and the numba JIT compiler for numeric Python code.

  • I (Jason) made an initial attempt at porting cortado to Go. That work is here. It compiles and runs, but since we get a different answer for the AUROC on the airline training data set, I'm not sure that the two implementations are the same. The critical test case is this one:

    go test -v -run Test005_airline_data_test

  • More lower-level unit testing of both the Python and the Go implementations is required; we expect to see identical computations.

  • Since I'm not clear on some of the algorithm details (such as how split thresholds are established and how the node-splitting logic works), I cannot yet offer any statement about the correctness of the Go port.

  • Most parts need documentation and further, lower-level test coverage.


Copyright (C) 2023 by Jason E. Aten and Adam Mlocek.

License: MIT

Documentation

Index

Constants

const MISSINGLEVEL string = "."

MISSINGLEVEL matches the reference implementation's encoding of a missing factor level; it is always level == 0.

const RFC3339MsecTz0 = "2006-01-02T15:04:05.000Z07:00"

Variables

var Chicago *time.Location
var ErrNoData = fmt.Errorf("no data")
var ErrNoHeader = fmt.Errorf("no header")
var ErrRowNotFound = fmt.Errorf("row not found")
var OurStdout io.Writer = os.Stdout

OurStdout exists so we can multi-write easily; use our own Printf.

var VerboseVerbose bool = false

for tons of debug output

Functions

func AlwaysPrintf

func AlwaysPrintf(format string, a ...interface{})

func Caller

func Caller(upStack int) string

func CountLines

func CountLines(fd *os.File) (nline int, mmap []byte)

memory mapped counting of newlines: very fast even on a single core, because it uses bytes.Count().

func CsvShowMain

func CsvShowMain()

func DirExists

func DirExists(name string) bool

func FileExists

func FileExists(name string) bool

func FileLine

func FileLine(depth int) string

func FileSize

func FileSize(name string) (int64, error)

func Main

func Main()

func MemoryMapFile

func MemoryMapFile(fd *os.File) (mmap []byte)

func NewLevelPartitionAllTrue

func NewLevelPartitionAllTrue(n int) []bool

func PP

func PP(format string, a ...interface{})

func Printf

func Printf(format string, a ...interface{}) (n int, err error)

Printf formats according to a format specifier and writes to standard output. It returns the number of bytes written and any write error encountered.

func R1

func R1(pred1, pred0, power, margin float64) float64

R1 is the minimization inner objective function (equation 7) from "Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic" by Yan, Dodier, Mozer and Wolniewicz, Proceedings of ICML 2003.

It is designed to be differentiable without too much difficulty.

pred1 = a prediction for a known target of 1
pred0 = a prediction for a known target of 0

The two meta-parameters for the optimization are margin and power: a) 0 < margin <= 1 improves generalization; and b) power > 1 controls how strongly the margin minus the difference between pred1 and pred0 is amplified; the margin-difference is raised to power before being returned.

In examples they try margin = 0.3 and power = 2; or margin = 0.2 and power = 3; but these should be optimized in an outer loop.

Yan: "In general, we can choose a value between 0.1 and 0.7 for margin. Also, we have found that power = 2 or 3 achieves similar, and generally the best results."

INVAR: power > 1.0
INVAR: 0 < margin <= 1

These invariants should be checked by the caller, for speed and to allow potential inlining. They are not checked within R1().
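
For reference, here is a minimal sketch of the quantity described above. The clamp to zero when pred1 - pred0 >= margin follows equation 7 of the Yan et al. paper and is an assumption about R1's exact behavior; r1Sketch is a hypothetical illustration, not the package's R1 body.

package vft // sketch only; assumed to sit alongside the real R1

import "math"

// r1Sketch penalizes a positive-class prediction that fails to exceed a
// negative-class prediction by at least margin, raising the shortfall to power.
func r1Sketch(pred1, pred0, power, margin float64) float64 {
    d := pred1 - pred0
    if d >= margin {
        return 0 // no penalty once pred1 beats pred0 by the margin (per the paper)
    }
    return math.Pow(margin-d, power)
}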

func SumSlice

func SumSlice[T Addable](x []T) (tot T)

func TSPrintf

func TSPrintf(format string, a ...interface{})

time-stamped printf

func VV

func VV(format string, a ...interface{})

Types

type AUCroc

type AUCroc struct {
	TargetRange TargetRange
	Auc1        float64
	Auc0        float64
	U1          float64
	U0          float64

	N1 float64
	N0 float64

	// Objective function value, set if TargetRange.ComputeObjective is true.
	Obj float64
}

AUCroc is returned by the AreaUnderROC() call.

func AreaUnderROC

func AreaUnderROC(predictor, target []float64, targetRanges []*TargetRange) (res []*AUCroc, err error)

AreaUnderROC() is for classifying real target values as <= targetRanges[i].Thresh versus target > targetRanges[i].Thresh, for each i over the supplied targetRanges. The AUC for ROC is equivalent to the Wilcoxon or Mann-Whitney U test statistic via the relation:

AUC = U/(n0 * n1)

There is no NaN handling at present, so handle those beforehand. It could be added without too much difficulty: VposSlice will sort them to the end, but another pass would be needed to keep pair-wise complete target/predictor pairs.

Returns: Auc1 is the area under the curve for classifying target > targetRanges[i].Thresh (typically what one wants); Auc0 is the area under the curve for classifying target <= targetRanges[i].Thresh.
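
A minimal usage sketch, written as if inside the vft package itself (the module import path is not shown in this doc); the scores and targets are made-up illustration data.

package vft

import "fmt"

func exampleAreaUnderROC() {
    pred := []float64{0.9, 0.8, 0.35, 0.2} // made-up scores
    target := []float64{1, 1, 0, 0}        // known 0/1 targets

    // classify target > 0.5 versus target <= 0.5
    tr := &TargetRange{Thresh: 0.5}

    rocs, err := AreaUnderROC(pred, target, []*TargetRange{tr})
    if err != nil {
        panic(err)
    }
    // Auc1 is the AUC for target > Thresh; with this perfectly
    // separated toy data it should come out to 1.0.
    fmt.Println(rocs[0].Auc1)
}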

type Addable

type Addable interface {
	~complex128 | ~complex64 | ~float64 | ~float32 | ~byte | ~uint16 | ~uint32 | ~uint64 | ~int8 | ~int16 | ~int32 | ~int64 | ~int
}

Addable is the type constraint for Matrix. Since we have an Add() method, the elements of the Matrix must support addition.
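
For example, SumSlice works over any Addable element type (a tiny sketch, written as if inside the vft package):

package vft

import "fmt"

func exampleSumSlice() {
    fmt.Println(SumSlice([]int{1, 2, 3}))           // 6
    fmt.Println(SumSlice([]float64{0.5, 1.5, 2.0})) // 4
}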

type BoolMatrix

type BoolMatrix struct {
	Nrow int
	Ncol int

	Colnames []string
	Rownames []string

	IsColMajor bool // row major by default
	Dat        []bool

	// track metadata by column/row
	Cmeta []*FeatMeta
	Rmeta []*FeatMeta
}

BoolMatrix is a matrix of bool. Since bool is not Addable, we cannot use Matrix[T].

func NewBoolMatrix

func NewBoolMatrix(nrow, ncol int) *BoolMatrix

NewBoolMatrix allocates room for nrow * ncol elements.

func (*BoolMatrix) AddRow

func (m *BoolMatrix) AddRow(rowlabel string) (i int)

AddRow extends the matrix by one row and returns the index of the new row. The new row is all false (the zero value). This can be pretty fast if m is row major, and can be pretty slow if not. Pass an empty string for rowlabel if not using row labels.
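
A small BoolMatrix usage sketch (written as if inside the vft package; the dimensions and values are illustrative):

package vft

import "fmt"

func exampleBoolMatrix() {
    m := NewBoolMatrix(2, 3) // 2 rows, 3 columns, all false
    m.Set(0, 2, true)
    fmt.Println(m.At(0, 2)) // true

    i := m.AddRow("") // extend by one row; "" means no row label
    m.Set(i, 0, true)
    fmt.Println(m.Nrow) // 3
}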

func (*BoolMatrix) At

func (m *BoolMatrix) At(i, j int) bool

At reads out the [i,j]-th element.

func (*BoolMatrix) Cbind

func (m *BoolMatrix) Cbind(m2 *BoolMatrix)

Cbind will append the columns of m2 on to the right side of m, updating m in-place.

func (*BoolMatrix) Clone

func (m *BoolMatrix) Clone() (clone *BoolMatrix)

Clone returns a fresh copy of m, with no shared state.

func (*BoolMatrix) Col

func (m *BoolMatrix) Col(j int) (res []bool)

Col will return the underlying slice from .Dat of column j if the matrix is in column-major order; otherwise it will return a coalesced copy, and changing res will have no impact on .Dat.

func (*BoolMatrix) DeleteCols

func (m *BoolMatrix) DeleteCols(wcol []int)

DeleteCols deletes from m the 0-based column numbers listed in wcol.

func (*BoolMatrix) FillColMajor

func (m *BoolMatrix) FillColMajor(slc []bool, makeCopy bool)

FillColMajor installs slc as Dat and sets IsColMajor to true. If makeCopy, we'll make our own copy of slc; otherwise we just point to it.

func (*BoolMatrix) FillRowMajor

func (m *BoolMatrix) FillRowMajor(slc []bool, makeCopy bool)

FillRowMajor installs slc as Dat and sets IsColMajor to false. If makeCopy, we'll make our own copy of slc; otherwise we just point to it.

func (*BoolMatrix) ReformatToColumnMajor

func (m *BoolMatrix) ReformatToColumnMajor()

ReformatToColumnMajor will actually re-write the data in .Dat, if need be, to be column major: to have each column's data adjacent, so advancing the index of .Dat by 1 goes to the next row, or to the top of the next column if at the last row.

Be aware that the Row() fetches from m will be slower; but reading a whole column will be faster of course.

This is a no-op if the Matrix already has IsColMajor true.

func (*BoolMatrix) ReformatToRowMajor

func (m *BoolMatrix) ReformatToRowMajor()

ReformatToRowMajor will actually re-write the data in .Dat, if need be, to be row major: to have each row's data adjacent, so advancing the index of .Dat by 1 goes to the next column, or to the beginning of the next row if at the last column. This is a no-op if the Matrix already has IsColMajor false.

func (*BoolMatrix) Row

func (m *BoolMatrix) Row(i int) (res []bool)

Row will return the underlying slice from .Dat of row i if the matrix is in row-major order; otherwise it will return a coalesced copy, and changing res will have no impact on .Dat.

In other words, it will try to do as little work as possible to return a readable copy of the data. But if you need to write into it, make sure that you have a row-major matrix, or use WriteRow to write it back at the end (and comment out the panic that warns about this).

func (*BoolMatrix) RowChunk

func (m *BoolMatrix) RowChunk(beg, endx int) (r *BoolMatrix)

RowChunk is like Row, but returns multiple rows in row-major form. All columns are returned.

func (*BoolMatrix) Set

func (m *BoolMatrix) Set(i, j int, v bool)

Set v as the value of the [i,j]-th element.

func (*BoolMatrix) String

func (m *BoolMatrix) String() (r string)

String satisfies the common Stringer interface. It provides a view of the contents of the Matrix m.

func (*BoolMatrix) Transpose

func (m *BoolMatrix) Transpose()

Transpose flips the Matrix without changing Dat. It turns m into its transpose efficiently. Only meta data describing how to access the rows and columns is adjusted, and this is very quick.

func (*BoolMatrix) WriteCol

func (m *BoolMatrix) WriteCol(j int, writeme []bool)

WriteCol will replace column j with writeme, which must have length m.Nrow.

func (*BoolMatrix) WriteRow

func (m *BoolMatrix) WriteRow(i int, writeme []bool)

WriteRow will replace row i with writeme, which must have length m.Ncol.

type ColumnKind

type ColumnKind int
const (
	FACTOR  ColumnKind = 1
	NUMERIC ColumnKind = 2
)

type CsvLoader2

type CsvLoader2 struct {
	Path   string
	File   *os.File
	Gz     *gzip.Reader
	Csv    *csv.Reader
	Header []string
}

func NewCsvLoader2

func NewCsvLoader2(path string) (*CsvLoader2, error)

NewCsvLoader2 detects a .gz suffix and reads using gunzip. If path is "-", we read from stdin.

func (*CsvLoader2) Close

func (s *CsvLoader2) Close() error

func (*CsvLoader2) ReadOne

func (s *CsvLoader2) ReadOne() ([]string, error)

type CutCovFactor

type CutCovFactor struct {
	Covariate []float64
	Factor    []int

	Meta FeatMeta
}

func FactorFromCovariate

func FactorFromCovariate(colj int, covname string, covariate []float64, cuts []float64, rightclosed bool, myMat *Matrix[int]) *CutCovFactor

func NewCutCovFactor

func NewCutCovFactor(colname string, colj int, covariate, cuts []float64, rightclosed bool, myMat *Matrix[int]) (r *CutCovFactor)

func (*CutCovFactor) SetLevels

func (f *CutCovFactor) SetLevels()

type FeatMeta

type FeatMeta struct {
	Name        string
	Colj        int
	IsFactor    bool
	Cuts        []float64
	Levels      []string
	LevelCount  int
	Rightclosed bool
	IsOrdinal   bool

	FactorMap    map[string]int
	InvFactorMap map[int]string

	MyMat any // our underlying matrix
}

FeatMeta gives feature metadata for each column, especially details about factors and their levels, cuts, and IsOrdinal.

func NewFeatMeta

func NewFeatMeta() *FeatMeta

func (*FeatMeta) String

func (f *FeatMeta) String() (r string)

type Float32Slice

type Float32Slice []float32

standard stuff to allow sort.Sort() to sort a slice.

func (Float32Slice) Len

func (p Float32Slice) Len() int

func (Float32Slice) Less

func (p Float32Slice) Less(i, j int) bool

func (Float32Slice) Swap

func (p Float32Slice) Swap(i, j int)

type LexCodeSlice

type LexCodeSlice []lexcode

LexCodeSlice facilitates sorting by factor name lexically

func (LexCodeSlice) Len

func (p LexCodeSlice) Len() int

func (LexCodeSlice) Less

func (p LexCodeSlice) Less(i, j int) bool

func (LexCodeSlice) String

func (p LexCodeSlice) String() (r string)

func (LexCodeSlice) Swap

func (p LexCodeSlice) Swap(i, j int)

type Matrix

type Matrix[T Addable] struct {
	Nrow int
	Ncol int

	Colnames []string
	Rownames []string

	IsColMajor bool // row major by default
	Dat        []T

	// track metadata by column/row; and don't share
	// with pointers, use values here, so each Matrix
	// gets its own copy, and we can update Colj without
	// impacting the original Matrix.
	Cmeta []FeatMeta
	Rmeta []FeatMeta
}

Matrix is used to represent our Histograms among other things.

func FactorFromCovarMatrix

func FactorFromCovarMatrix(real *Matrix[float64]) (fac *Matrix[int])

func NewMatrix

func NewMatrix[T Addable](nrow, ncol int) (m *Matrix[T])

NewMatrix allocates room for nrow * ncol elements.

func (*Matrix[T]) Add

func (m *Matrix[T]) Add(i, j int, v T)

Add v to the [i,j] element of the Matrix.
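
A small usage sketch of the generic Matrix (written as if inside the vft package; the values are illustrative):

package vft

import "fmt"

func exampleMatrix() {
    m := NewMatrix[float64](2, 2)
    m.Set(0, 0, 1.5)
    m.Set(1, 1, 2.5)
    m.Add(0, 0, 0.5) // the [0,0] element is now 2.0

    fmt.Println(m.At(0, 0)) // 2
    fmt.Println(m.SumAll()) // 4.5
}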

func (*Matrix[T]) AddRow

func (m *Matrix[T]) AddRow(rowlabel string) (i int)

AddRow extends the matrix by one row and returns the index of the new row. The new row is all 0. This can be pretty fast if m is row major, and can be pretty slow if not. Pass an empty string for rowlabel if not using row labels.

func (*Matrix[T]) At

func (m *Matrix[T]) At(i, j int) T

At reads out the [i,j]-th element.

func (*Matrix[T]) Cbind

func (m *Matrix[T]) Cbind(m2 *Matrix[T])

Cbind will append the columns of m2 on to the right side of m, updating m in-place.

func (*Matrix[T]) Clone

func (m *Matrix[T]) Clone() (clone *Matrix[T])

Clone returns a fresh copy of m, with no shared state.

func (*Matrix[T]) CmetaDisplay

func (m *Matrix[T]) CmetaDisplay() (feaDisplay []string)

CmetaDisplay returns just the essentials of m.Cmeta for diagnostics

func (*Matrix[T]) Col

func (m *Matrix[T]) Col(j int) (res []T)

Col will return the underlying slice from .Dat of column j if the matrix is in column-major order; otherwise it will return a coalesced copy, and changing res will have no impact on .Dat.

func (*Matrix[T]) DeleteCols

func (m *Matrix[T]) DeleteCols(wcol []int)

DeleteCols deletes from m the 0-based column numbers listed in wcol.

func (*Matrix[T]) ExtractFeatAsMatrix

func (m *Matrix[T]) ExtractFeatAsMatrix(factors []FeatMeta) (r *Matrix[T])

func (*Matrix[T]) ExtractRowsColsAsMatrix

func (m *Matrix[T]) ExtractRowsColsAsMatrix(rowbeg, rowendx int, wcol []int) (r *Matrix[T])

func (*Matrix[T]) FillColMajor

func (m *Matrix[T]) FillColMajor(slc []T, makeCopy bool)

FillColMajor installs slc as Dat and sets IsColMajor to true. If makeCopy, we'll make our own copy of slc; otherwise we just point to it.

func (*Matrix[T]) FillRowMajor

func (m *Matrix[T]) FillRowMajor(slc []T, makeCopy bool)

FillRowMajor installs slc as Dat and sets IsColMajor to false. If makeCopy, we'll make our own copy of slc; otherwise we just point to it.

func (*Matrix[T]) GetRowIter

func (m *Matrix[T]) GetRowIter(begrow, endxrow, chunk int) (r *RowIter[T])

GetRowIter returns an iterator that will read rows [begrow, endxrow) of m, by requesting Rowset()s of chunk rows at a time. The endxrow parameter allows us to read fewer than m.Nrow rows in all, if desired.

Single-column vectors are supported, so that Matrix can be used to chunk out simple vectors too. The only current restriction is that we will return *all* the columns in our rowset, so to omit columns you may need to DeleteCols to adjust the shape of m beforehand; say to remove any target column, for example.
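
A sketch of chunked iteration with GetRowIter (written as if inside the vft package; the chunk size is arbitrary):

package vft

func exampleRowIter(m *Matrix[float64]) {
    ri := m.GetRowIter(0, m.Nrow, 1024) // all rows, 1024 at a time
    for {
        chunk, done := ri.FetchAdv()
        if done {
            break // always check done before using chunk
        }
        _ = chunk // chunk is a *Matrix[float64] holding up to 1024 rows
    }
}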

func (*Matrix[T]) NewRowColIter

func (m *Matrix[T]) NewRowColIter(factors []FeatMeta, begrow, length, chunk int, name string) (rci *RowColIter[T])

func (*Matrix[T]) ReformatToColumnMajor

func (m *Matrix[T]) ReformatToColumnMajor()

ReformatToColumnMajor will actually re-write the data in .Dat, if need be, to be column major: to have each column's data adjacent, so advancing the index of .Dat by 1 goes to the next row, or to the top of the next column if at the last row.

Be aware that the Row() fetches from m will be slower; but reading a whole column will be faster of course.

This is a no-op if the Matrix already has IsColMajor true.
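
To illustrate the tradeoff described above, a sketch relying on the documented behavior that Col returns the underlying storage once the matrix is column-major (written as if inside the vft package):

package vft

import "fmt"

func exampleColMajor(m *Matrix[float64]) {
    m.ReformatToColumnMajor()
    col := m.Col(0)         // the underlying slice of column 0, per the Col doc
    col[0] = 42             // writes through to m.Dat
    fmt.Println(m.At(0, 0)) // 42
}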

func (*Matrix[T]) ReformatToRowMajor

func (m *Matrix[T]) ReformatToRowMajor()

ReformatToRowMajor will actually re-write the data in .Dat, if need be, to be row major: to have each row's data adjacent, so advancing the index of .Dat by 1 goes to the next column, or to the beginning of the next row if at the last column. This is a no-op if the Matrix already has IsColMajor false.

func (*Matrix[T]) Reshape

func (m *Matrix[T]) Reshape(newNrow, newNcol int)

Reshape does not change Dat, but re-assigns Nrow = newNrow and Ncol = newNcol. It also discards m.Colnames and m.Rownames. It will reinitialize Cmeta to be newNcol long; but that loses all Cmeta[i].Name values and any other meta information they contained. So avoid Reshape unless you can re-create any needed Cmeta information.

func (*Matrix[T]) Row

func (m *Matrix[T]) Row(i int) (res []T)

Row will return the underlying slice from .Dat of row i if the matrix is in row-major order; otherwise it will return a coalesced copy, and changing res will have no impact on .Dat.

In other words, it will try to do as little work as possible to return a readable copy of the data. But if you need to write into it, make sure that you have a row-major matrix, or use WriteRow to write it back at the end (and comment out the panic that warns about this).

func (*Matrix[T]) RowChunk

func (m *Matrix[T]) RowChunk(beg, endx int) (r *Matrix[T])

RowChunk is like Row, but returns multiple rows in row-major form. All columns are returned.

func (*Matrix[T]) Set

func (m *Matrix[T]) Set(i, j int, v T)

Set v as the value of the [i,j]-th element.

func (*Matrix[T]) String

func (m *Matrix[T]) String() (r string)

String satisfies the common Stringer interface. It provides a view of the contents of the Matrix m.

func (*Matrix[Addable]) SumAll

func (m *Matrix[Addable]) SumAll() (tot Addable)

SumAll returns the sum of all elements in m.

func (*Matrix[T]) Transpose

func (m *Matrix[T]) Transpose()

Transpose flips the Matrix without changing Dat. It turns m into its transpose efficiently. Only meta data describing how to access the rows and columns is adjusted, and this is very quick.

func (*Matrix[T]) WriteCol

func (m *Matrix[T]) WriteCol(j int, writeme []T)

WriteCol will replace column j with writeme, which must have length m.Nrow.

func (*Matrix[T]) WriteRow

func (m *Matrix[T]) WriteRow(i int, writeme []T)

WriteRow will replace row i with writeme, which must have length m.Ncol.

type Node

type Node struct {
	IsLeaf bool // else issplit

	// leaf attributes
	Gsum float64
	Hsum float64
	//GHsum    []float64 // []float64{rightgsum, righthsum} for instance; always 2 long... same as Gsum, Hsum
	CanSplit bool

	// key is the feature name now, not the *FeatMeta
	Partitions map[string][]bool

	Loss float64
	Gain float64

	// middle/split node attributes
	Factor    FeatMeta
	Leftnode  *Node
	Rightnode *Node
	IsActive  bool
}

func NewNode

func NewNode() (n *Node)

func (*Node) String

func (n *Node) String() (r string)

type RowColIter

type RowColIter[T Addable] struct {
	// contains filtered or unexported fields
}

func (*RowColIter[T]) FetchAdv

func (rci *RowColIter[T]) FetchAdv() (r *Matrix[T], done bool)

func (*RowColIter[T]) FetchAdv1

func (rci *RowColIter[T]) FetchAdv1() (r *Matrix[T])

type RowIter

type RowIter[T Addable] struct {
	// contains filtered or unexported fields
}

RowIter refers to a row-slice of a Matrix; see the GetRowIter() method on the Matrix below.

func NewRowIter

func NewRowIter[T Addable](m *Matrix[T], beg, length, chunk int) *RowIter[T]

NewRowIter makes a row iterator. See also the method GetRowIter on Matrix.

func (*RowIter[T]) Adv

func (ri *RowIter[T]) Adv() (done bool)

Adv advances the row iterator

func (*RowIter[T]) Fetch

func (ri *RowIter[T]) Fetch() (r *Matrix[T], done bool)

Fetch returns the current row set, without advancing

func (*RowIter[T]) FetchAdv

func (ri *RowIter[T]) FetchAdv() (r *Matrix[T], done bool)

FetchAdv returns the current row set and then advances, so the next Fetch or FetchAdv will read starting at the new beg row.

func (*RowIter[T]) FetchAdv1

func (ri *RowIter[T]) FetchAdv1() (r *Matrix[T])

FetchAdv1 just returns nil if done, without a separate done flag. Otherwise identical to FetchAdv() which it calls.

func (*RowIter[T]) FetchAdvBX

func (ri *RowIter[T]) FetchAdvBX() (beg, endx int, done bool)

FetchAdvBX does FetchBX() and then advances the iterator to the next chunk.

Specifically, FetchAdvBX returns the current chunk of rows, pointed to by the [beg, endx) return values, and then advances the iterator to the next chunk of rows to be read.

The length of the returned range is always endx - beg; so [0, 0) is an empty range. The size of the range will be ri.chunk unless there are insufficient elements left before hitting the endxrow point.

The returned range is empty iff done is returned true; so always check done first. See also FetchBX to get the current row range without advancing the iterator.

If done is returned true, then beg and endx are undefined and should be ignored.

func (*RowIter[T]) FetchBX

func (ri *RowIter[T]) FetchBX() (beg, endx int, done bool)

FetchBX returns the current chunk of rows, pointed to by the [beg, endx) returned values. The returned range is empty iff done is returned true; so always check done first.

Concretely, the length of the returned range is always endx - beg; so [0, 0) is an empty range. The size of the range will be ri.chunk unless there are insufficient elements left before hitting the endxrow point.

See also FetchAdvBX to read the current chunk and then advance to the next.

If done is returned true, then beg and endx are undefined and should be ignored.
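
A sketch of driving the iterator by [beg, endx) index ranges rather than materialized row chunks (written as if inside the vft package):

package vft

func exampleFetchAdvBX(m *Matrix[float64]) {
    ri := m.GetRowIter(0, m.Nrow, 512)
    for {
        beg, endx, done := ri.FetchAdvBX()
        if done {
            break // beg and endx are undefined once done is true
        }
        for i := beg; i < endx; i++ {
            _ = m.Row(i) // process rows [beg, endx)
        }
    }
}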

func (*RowIter[T]) FetchBegEndx

func (ri *RowIter[T]) FetchBegEndx() (beg, endx int, done bool)

FetchBegEndx just supplies the beg and endx row index that Fetch would return. This can be used to coordinate/compare with other row iterators or the VectorSlicer.

type SlurpDataFrame

type SlurpDataFrame struct {

	// Nheader = number of fields in the header; Ncol will be 2 fewer for the matrix,
	// since the matrix lacks the first 2 fields, which are strings.
	Nheader int

	// the full header, as a single string. Fields separated by commas.
	Header string

	// the header broken out into fields.
	// includes tm,sym as the first two, so is 2 more than nCol, typically;
	// assuming they were present in the original header.
	Colnames []string

	// matching exactly the columns of Matrix, Ncol long
	MatrixColnames []string

	// the numeric, float64 data.
	Matrix []float64

	// number of numeric data columns in matrix (not counting tm,sym)
	Ncol int

	// number of rows (not counting the header)
	Nrow int

	// Just the symbol (2nd column), from the first row.
	// They are probably all the same anyway.
	Sym string

	// the timestamps on the rows
	Tm []time.Time

	Frompath string

	// if the 2 string columns are missing
	Missing2strings bool

	// Instead of being all numeric features, instead we have
	// two parts, numeric features in NumericMat, and factors
	// in FactorMat, and if they were originally interlaced, they
	// are separated out into their own kind of matrix now.
	HasFactors bool
	Kindvec    []ColumnKind

	NumericMat *Matrix[float64]

	// Because we do not know how many factors we will need, and
	// because the initial pass converts all real features to factors,
	// we will initially deploy the FactorMat with a full int (64-bit integer)
	// worth of factor room. Later, perhaps, this can be reduced to uint16 or uint8,
	// but that requires domain knowledge of the features at hand
	// on a case-by-case basis. For now we give ourselves a fighting
	// chance of handling any real-valued feature with the full
	// generality of int-numbered factors.
	FactorMat *Matrix[int]
	// contains filtered or unexported fields
}

SlurpDataFrame handles two types of data frames: those that are all float64, and those with string columns. String columns are encoded into a uint16 factor matrix.

The all-float64 reading takes in a comma-separated-value (csv) file that has a special structure. After the header, the first two columns are expected to contain strings (a timestamp and a symbol string, typically); and then all of the rest of the columns must be float64 values.

Since most of the work is parsing the float64 values, we try to do that in parallel, using large blocks of contiguous memory to allow the CPU caches and pipelining to be effective. We memory-map the file to effect this.

When a .gz file path is supplied, this cannot be memory mapped; so we read it using the csv libraries, which can be slower.
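
A minimal loading sketch (written as if inside the vft package; "data.csv" is a placeholder path whose layout matches the description above):

package vft

import "fmt"

func exampleSlurp() error {
    df := NewSlurpDataFrame(false) // false: the two leading string columns are present
    if err := df.Slurp("data.csv"); err != nil { // placeholder path
        return err
    }
    fmt.Println(df.Nrow, df.Ncol)  // rows and numeric columns
    fmt.Println(df.MatrixAt(0, 0)) // first numeric cell; 0-based, skipping tm,sym
    return nil
}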

func NewSlurpDataFrame

func NewSlurpDataFrame(missing2strings bool) *SlurpDataFrame

func (*SlurpDataFrame) Disgorge

func (df *SlurpDataFrame) Disgorge(path string) (err error)

Disgorge writes the matrix/data-frame back to disk. As you might guess, this is really slow. It is useful, however, to show that we parsed the original correctly, and can reconstruct it precisely if need be.

func (*SlurpDataFrame) ExtractCols

func (sdf *SlurpDataFrame) ExtractCols(xi0, xi1 int, wcol []int) (n, nvar int, xx []float64, colnames []string)

ExtractCols extracts the wcol columns from sdf. The rows are from [xi0:xi1). See also ExtractXXYY if the X cols desired are a contiguous range.

xi0 : row index of first X data
xi1 : the excluded endx index of a row that is just after the last included row

n returns the number of rows back in xx; nvar returns the number of variables back in xx; xx is the matrix (or vector, if nvar == 1) of data extracted from sdf.


func (*SlurpDataFrame) ExtractXXYY

func (sdf *SlurpDataFrame) ExtractXXYY(xi0, xi1, xj0, xj1, yj int) (n, nvar int, xx, yy []float64, colnames []string, targetname string)

ExtractXXYY extracts a contiguous X range and one Y variable from sdf. See also ExtractCols if the X cols desired are not a contiguous range.

xi0 : row index of first X data
xi1 : endx row to use (excluded)

xj0 : first X column to use
xj1 : endx X column to use (excluded)

yj : target Y column to use, just another column index into the same data frame (targets at the end)

sdf : the data frame to grab the X and Y data from

Note: we need to copy the X and Y data anyway, generally.
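
A sketch of pulling a contiguous predictor block and one target column out of a loaded frame (written as if inside the vft package; the column indices are illustrative and assume the frame has at least six numeric columns):

package vft

func exampleExtractXXYY(sdf *SlurpDataFrame) {
    // rows [0, sdf.Nrow), predictor columns [0, 5), target in column 5
    n, nvar, xx, yy, colnames, targetname := sdf.ExtractXXYY(0, sdf.Nrow, 0, 5, 5)
    _, _ = n, nvar              // rows and predictor variables extracted
    _, _ = xx, yy               // predictors (n*nvar values) and target column (n values)
    _, _ = colnames, targetname // names of the predictor columns and the target
}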


func (*SlurpDataFrame) FindTm

func (df *SlurpDataFrame) FindTm(tm time.Time, si time.Duration) (rowi int, err error)

FindTm locates the row at or prior to tm. It can return -1 if tm is before us, or -2 if tm is after us. It also checks within 1 minute, or <= 2*si, near the first and last rows, since that is a common case where the actual sample row will be close but maybe not exactly at the boundaries.

func (*SlurpDataFrame) MatFullRow

func (df *SlurpDataFrame) MatFullRow(irow int) []float64

func (*SlurpDataFrame) MatPartRow

func (df *SlurpDataFrame) MatPartRow(irow, leftCount int) []float64

MatPartRow returns the first leftCount elements of row irow; useful to pick out just the training data if it is all on the left side of the matrix.

func (*SlurpDataFrame) MatrixAt

func (df *SlurpDataFrame) MatrixAt(irow, jcol int) float64

MatrixAt gets an element from the matrix, ignoring the first 2 string columns if they exist. irow and jcol are 0-based.

func (*SlurpDataFrame) ReadGzipped

func (df *SlurpDataFrame) ReadGzipped(path string) (err error)

ReadGzipped is used when we have a compressed csv file we cannot directly memory map.

func (*SlurpDataFrame) Row

func (df *SlurpDataFrame) Row(i int) (tm time.Time, dat []float64)

func (*SlurpDataFrame) RowSlice

func (df *SlurpDataFrame) RowSlice(i int, nSpan int) (rowslice []float64)

RowSlice can use nSpan to request just the past-looking predictors at the beginning of the row; i.e., without the future targets, which are typically at the end. nSpan must be >= 1, else the returned rowslice will be empty. The returned rowslice will be nSpan in length, being row i sliced to [0:nSpan].

func (*SlurpDataFrame) Slurp

func (df *SlurpDataFrame) Slurp(path string) (err error)

type TargetRange

type TargetRange struct {
	// We have just one threshold for now, but it could be a range,
	// or one range vs. another for ordinal/multiclass discrimination.
	Thresh float64

	// should we compute the U_R objective function, a differentiable
	// approximation to the Area under ROC, but designed to be minimized.
	//
	// Not implemented: actual optimization by using the derivative
	// of the objective with respect to inputs X. Hence think of
	// this simply as a proof of concept to allow comparison of
	// the actual ROC and the approximation function to check for
	// applicability on a given data context.
	//
	ComputeObjective bool
	// If ComputeObjective is true, then Margin and Power must be set.
	Margin float64 // Margin must be > 0, and Margin must be <= 1. Starting guess might be 0.2
	Power  float64 // Power must be > 1. Starting guess might be 3.0
}

TargetRange defines how to map a real target feature to a category/class.

type Tree

type Tree struct {
	Nodes []*Node
}

type TreeGrowState

type TreeGrowState struct {
	NodeIDs []int
	Nodes   []*Node

	Ensemble XGTreeEnsemble

	Factors *Matrix[int]

	// gradients
	Gcovariate []float64

	// hessians; only the diagonal though, presumably
	Hcovariate []float64

	Gamma       float64
	Lambda      float64
	MinH        float64
	SliceLength int

	Ordstumps    bool
	OptSplit     bool
	Pruning      bool
	Leafwise     bool
	Singlethread bool
}

type VectorSlicer

type VectorSlicer[T any] struct {
	// contains filtered or unexported fields
}

func NewVectorSlicer

func NewVectorSlicer[T any](vec []T, beg, length, chunk int, name string) (r *VectorSlicer[T])

func (*VectorSlicer[T]) Adv

func (vs *VectorSlicer[T]) Adv() (done bool)

Adv advances the iterator by at least chunk. If there were any available rows, done will be false. If there were no more available rows, done will return true.

func (*VectorSlicer[T]) Fetch

func (vs *VectorSlicer[T]) Fetch() (r []T, done bool)

Fetch returns the current slice of vs.vec in r, without advancing. If there were any available elements, done will return false. Otherwise, when done comes back true, r will be length 0. So always check the done flag before using r.

func (*VectorSlicer[T]) FetchAdv

func (vs *VectorSlicer[T]) FetchAdv() (r []T, done bool)

FetchAdv returns current slice (of length <= chunk); and then advances beg by chunk (or less if we run out of elements). If done returns true, because we ran out of elements on the last Fetch() or FetchAdv() call, then the r returned will be length 0.
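
A sketch of walking a plain slice in chunks (written as if inside the vft package; length is assumed to mean the number of elements to cover, and the name argument is assumed to be just a diagnostic label):

package vft

func exampleVectorSlicer(vec []float64) {
    vs := NewVectorSlicer(vec, 0, len(vec), 256, "example") // beg 0, whole slice, chunks of 256
    for {
        chunk, done := vs.FetchAdv()
        if done {
            break // chunk has length 0 once done is true
        }
        _ = chunk // up to 256 elements of vec
    }
}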

func (*VectorSlicer[T]) FetchAdv1

func (vs *VectorSlicer[T]) FetchAdv1() (r []T)

FetchAdv1 has one return, no done. Otherwise same as FetchAdv.

func (*VectorSlicer[T]) FetchAdvBX

func (vs *VectorSlicer[T]) FetchAdvBX() (beg, endx int, done bool)

FetchAdvBX returns the coordinates of the next chunk in [beg, endx) and then advances the iterator to the next chunk.

func (*VectorSlicer[T]) FetchBX

func (vs *VectorSlicer[T]) FetchBX() (beg, endx int, done bool)

FetchBX returns the coordinates of the next chunk in [beg, endx) without advancing the iterator.

type VposSlice

type VposSlice []vpos

VposSlice facilitates sorting.

func (VposSlice) Len

func (p VposSlice) Len() int

func (VposSlice) Less

func (p VposSlice) Less(i, j int) bool

func (VposSlice) String

func (p VposSlice) String() (r string)

func (VposSlice) Swap

func (p VposSlice) Swap(i, j int)

type XGModel

type XGModel struct {
	Ensemble XGTreeEnsemble
	Lambda   float64
	Gamma    float64
	Eta      float64
	Minh     float64
	Maxdepth int
	Pred     []float64
}

type XGTreeEnsemble

type XGTreeEnsemble struct {
	Trees []*Tree
}

Directories

Path Synopsis
cmd
vft
