ml-essentials

module

v0.1.0 Latest Latest Go to latest Published: Feb 2, 2021 License: Apache-2.0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/rom1mouret/ml-essentials

Links

Open Source Insights

README ¶

ml-essentials is a data frame library for Go in the same vein as qota and qframe.

It draws inspiration from pandas and numpy.

Unlike qota and qframe, ml-essentials doesn't cater for data scientists, e.g. with functions to load Excel files, SQL databases or functions to help with EDA. It is best suited for machine learning engineers who want to serve their models in a safe and predictable manner. It is also smaller, with a focus on simplicity, stability and clarity.

I hope that ml-essentials is transparent enough for users to glance at their code and get a sense of what ml-essentials does under the hood and how much it is going to cost in CPU and RAM usage. To illustrate my point, I am enumerating below all the view-returning functions. Those features are only available through views, so the user has no choice but to spell out what his/her code should do. For instance, shuffled := df.ShuffleView().Copy() does exactly what it looks like.

(df *DataFrame) IndexView(indices []int) *DataFrame
(df *DataFrame) SliceView(from int, to int) *DataFrame
(df *DataFrame) MaskView(mask []bool) *DataFrame
(df *DataFrame) ColumnView(columns ...string) *DataFrame
(df *DataFrame) ShuffleView() *DataFrame
(df *DataFrame) SampleView(n int, replacement bool) *DataFrame
(df *DataFrame) SplitNView(n int) []*DataFrame
(df *DataFrame) SplitView(batchSize int) []*DataFrame
(df *DataFrame) SplitTrainTestViews(testingRatio float64) (*DataFrame, *DataFrame)
(df *DataFrame) SortedView(byColumn string) *DataFrame
(df *DataFrame) TopView(byColumn string, n int, ascending bool, sorted bool) *DataFrame
(df *DataFrame) ReverseView() *DataFrame
(df *DataFrame) HashStringsView(columns ...string) *DataFrame
(df *DataFrame) DetachedView(columns ...string) *DataFrame
(df *DataFrame) ResetIndexView() *DataFrame
(df *DataFrame) ShallowCopy() *DataFrame
(df *DataFrame) ColumnConcatView(dfs ...*DataFrame) (*DataFrame, error)

View-returning functions are guaranteed not to copy any large chunk of data.

Documentation and examples

Benchmarks

dataset: kddcup98

task: linear regression

	ml-essentials CPU=1	ml-essentials CPU=16	python (pandas + pytorch)
reading CSV	18.3	3.3	4.3
shuffling and splitting	0.003	0.003	0.4
preprocessing fit_transform	2.4	0.8	2.2
linreg training (1 epoch)	6.9	6.9	3.4
preprocessor on test data	1	0.5	0.77
writing predictions	33	4.7	426
reading written rows	170	71	410

The reason it takes so long to read/write predictions is because one-hot encoding creates over 20,000 columns.

Reproduction

cd examples
go run linreg.go -momentum=0.2 -epochs=1 -testratio=0.33 -batchsize 256 cup98LRN.txt TARGET_B CONTROLN
python3 linreg.py -momentum=0.2 -epochs=1 -testratio=0.33 -batchsize 256 cup98LRN.txt TARGET_B CONTROLN

Design choices

Native types

Here are the benchmarks that have motivated my decision to use 3 native types alongside interface{}. Those benchmarks measure the time to copy a slice at specific indices (from a slice of indices).

type	speed	storage choice	missing value
`[]interface{}`	4.51 ns/op	`[]interface{}`	nil
`[]string`	4.26 ns/op	`[]interface{}`	nil
`[]float64`	1.97 ns/op	`[]float64`	NaN
`[]int`	1.80 ns/op	`[]int`	-1
`[]bool`	1.38 ns/op	`[]bool`	not applicable

Float64 were chosen over float32 for the sake of compatibility with gonum.

`interface{}` type for all the columns

Storing all the data slices as interface{} is sound. For one thing, this requires only one map[string]interface{}. By contrast, ml-essentials allocates 5 map[string]T, even when empty. Also, some functions get to be very succinct, for instance rename can move the data from one column to another without ever knowing what type the data is of.

Ultimately, it was decided not to use interface{} for everything. Most functions do rely on knowing the precise type and casting the values anyway. The first version used interface{} everywhere and lots of type assertion errors popped up. Although they were easy to fix, the new implementation brings more peace of mind.

Roadmap

functions to store/retrieve gonum's blas vectors in the df.objects map
functions to store/retrieve/sort datetime objects in the df.objects map
functions to create masks, e.g. mask := df.Test("age").Lower(15).Mask()
smarter ColumnSmartConcat function
ordinal encoder as an alternative to Hash Encoder
more methods to RawData, like some sort of concat
optimization of TopView
more options to CSV reader and writer, such as BOM parsing
inverse transform for OneHot
RepeatView(n int, bool interleaved)
more evaluation metrics, such as cross entropy
reading/writing data in JSON
release as a Go module

External Contributions

ml-essentials is not affiliated with any organization. Contributions are welcome.

Directories ¶

Path	Synopsis
algorithms
dataframe
examples
preprocessing
utils This package is intended to be used by other ml-essentials' packages.	This package is intended to be used by other ml-essentials' packages.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL