dataframe

package
v0.1.0 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 2, 2021 License: Apache-2.0 Imports: 19 Imported by: 0

README

This README covers the basics.

Imports

import "github.com/rom1mouret/ml-essentials/dataframe"

DataFrame construction

DataFrames accept 4 types of columns.

type         missing value  comment
float64      NaN
int          -1             meant to store categorical values
bool         not supported
interface{}  nil            called "object" columns

Strings are stored in the interface{} columns. ml-essentials distinguishes between regular object columns and string columns by keeping around the names of the string columns. Some functions are specialized for string columns, e.g. Encode(newEncoding encoding.Encoding).

Storing categorical values is the preferred use of integer columns. That said, you are free to use them to store any kind of integers, including negative integers. Negative integers won't be treated as missing values unless you run IntImputer.

Construction with a DataBuilder
builder := dataframe.DataBuilder{RawData: dataframe.NewRawData()}
builder.AddFloats("height", 170, 180, 165)
builder.AddStrings("name", "Karen", "John", "Sophie")
df := builder.ToDataFrame()
df.PrintSummary().PrintHead(-1, "%.3f")
Construction from a CSV file
spec := dataframe.CSVReadingSpec{
  MaxCPU: -1,
  MissingValues: []string{"", " ", "NA","-"},
  IntAsFloat: true,
  BoolAsFloat: false,
  BinaryAsFloat: true,
}
rawdata, err := dataframe.FromCSVFile("/path/to/csvfile.csv", spec)

or

rawdata, err := dataframe.FromCSVFilePattern("/path/to/csvdir/*.csv", spec)

Column names

You can manipulate column names via the ColumnHeader structure.

h := df.FloatHeader().And(df.IntHeader()).Except("target", "id").NameList()

Iterate over a dataframe

Option 1: ColumnAccess
height := df.Floats("height")
for i := 0; i < height.Size(); i++ {
  height.Set(i, height.Get(i) / 2)
}
Option 2: Gonum Batching
batching := dataframe.NewDense64Batching([]string{"age", "height", "gender"})
for _, batch := range df.SplitView(params.BatchSize) {
  // get a gonum matrix with columns age, height and gender (in that order)
  rows := batching.DenseMatrix(batch)
}
Option 3: Row Iterator
iterator := NewFloat32Iterator(df, []string{"age", "height", "gender"})
for row, rowIdx, _ := iterator.NextRow(); row != nil; row, rowIdx, _ = iterator.NextRow() {
  // row is a float32 slice
}

Views

Views are dataframes that share data with other dataframes. There is no separate View type: views and regular dataframes are both of type DataFrame. Quick example:

view := df.ShuffleView()
view.OverwriteInts("level", []int{4, 1, 2, 1})

Here view shares its data with df. This is useful in two ways. First, ShuffleView doesn't copy the data, thus it is fast and memory-efficient. Second, it allows you to overwrite df's data from anywhere in your program. The "side effects" section explains why this is an advantage when it comes to handling indexed data.

If you want to avoid such side effects, you can detach the view from its parent dataframe.

view := df.ShuffleView().DetachedView("level")
view.OverwriteInts("level", []int{4, 1, 2, 1})

Now, OverwriteInts does not alter df because view has its own level data. Other columns of df remain shared.
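
The sharing behaviour described above can be illustrated with a standalone sketch that does not use ml-essentials. The view type and its overwrite and detached methods are hypothetical stand-ins, not the library's internals: a view is modeled as a list of row indices over a backing slice, and detaching copies the visible rows into private storage.

```go
package main

import "fmt"

// view is a hypothetical stand-in for a dataframe view: a list of row
// indices into backing data that may be shared with other views.
type view struct {
	data    []int // backing data, possibly shared
	indices []int // rows of the backing data visible through this view
}

// overwrite writes values through the indices, so every view sharing the
// same backing data observes the change.
func (v view) overwrite(values []int) {
	for i, val := range values {
		v.data[v.indices[i]] = val
	}
}

// detached copies the visible rows into fresh storage so that later writes
// no longer reach the parent.
func (v view) detached() view {
	data := make([]int, len(v.indices))
	indices := make([]int, len(v.indices))
	for i, idx := range v.indices {
		data[i] = v.data[idx]
		indices[i] = i
	}
	return view{data: data, indices: indices}
}

func main() {
	parent := view{data: []int{0, 0, 0, 0}, indices: []int{0, 1, 2, 3}}

	// A reordered view shares the parent's data: writes reach the parent.
	shuffled := view{data: parent.data, indices: []int{2, 0, 3, 1}}
	shuffled.overwrite([]int{4, 1, 2, 1})
	fmt.Println(parent.data) // [1 1 4 2]

	// A detached view owns its data: the parent is left untouched.
	own := shuffled.detached()
	own.overwrite([]int{9, 9, 9, 9})
	fmt.Println(parent.data) // [1 1 4 2]
}
```

In this model the detached copy is as large as the view, which is a simplification: the real DetachedView only detaches the columns you name.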

ml-essentials provides a variety of functions to manage data copies at a fine-grained level.

View < TransferRawDataFrom < ShallowCopy < Unshare < DetachedView < Copy

On one side of the spectrum, View only copies pointers. On the opposite side, Copy copies almost everything. View, DetachedView and Copy cover 99% of the cases.

View is handy if you want to execute an in-place operation without altering the original dataframe, as in this example:

view := df.View()
view.Rename("level", "degree")

Now, view and df still share their data, but their columns are named differently.

Side effects

Side effects are normally considered anti-patterns but they do facilitate manipulating indexed data. For instance, consider this scenario:

  1. at the top level, the data is separated into "features" and "metadata". Example of metadata: unique identifier, timestamps.
  2. the model makes predictions from the features and predictions with low confidence are thrown away.
  3. back to the top level, we combine "metadata" columns with predictions using the indices of high-confidence rows.

Step 3 is error-prone. With ml-essentials, the idiomatic way is to avoid separating "features" and "metadata" in the first place. Instead, we would rely on views to enforce that the metadata always aligns with the features and predicted values.

Pandas offers several ways to solve this problem. One of them is to combine "features" and "metadata" in an index-aware fashion, but this makes pandas.concat error-prone in other scenarios, for instance when it fills dataframes with NaN where indices don't align, that is, if ignore_index is left to its default value.

Filtering, masking and indexing

Unlike Pandas and Numpy, there is no syntactic sugar to create masks and index arrays. Sugar aside, this section will look familiar to Pandas and Numpy users.

If you want to keep only the rows where "age" is 18 or above, you can do so with MaskView:

ages := df.Floats("age")
mask := df.EmptyMask()
for i := 0; i < ages.Size(); i++ {
  mask[i] = ages.Get(i) >= 18
}
view := df.MaskView(mask)

Getting a mask from EmptyMask() is advantageous because it recycles []bool slices across dataframes, but it is not mandatory.

Equivalent filtering with IndexView:

ages := df.Floats("age")
indices := make([]int, 0, ages.Size())
for i := 0; i < ages.Size(); i++ {
  if ages.Get(i) >= 18 {
    indices = append(indices, i)
  }
}
view := df.IndexView(indices)

In the future, we may add syntactic sugar for common scenarios, e.g. Condition("age").Higher(18).

Write in a dataframe

You can use the Set function as shown above. Alternatively, you might find it more convenient to write an entire column in one line of code:

df.OverwriteFloats64("height", []float64{170, 180, 165})

This is almost the same as:

height := df.Floats("height")
height.Set(0, 170)
height.Set(1, 180)
height.Set(2, 165)

The only difference is that OverwriteFloats64 will create a new column if it doesn't already exist.

Complete example

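This is an example taken from linear_regression.go. The excerpt refers to a reg value (the regression model, with its Weights, Features and TargetScaler fields) that is not defined in the excerpt itself.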

import (
  "gonum.org/v1/gonum/mat"
  "github.com/rom1mouret/ml-essentials/dataframe"
)

func Predict(df *dataframe.DataFrame, batchSize int, resultColumn string) *dataframe.DataFrame {
  df = df.ResetIndexView() // makes batching.DenseMatrix faster

  // pre-allocation
  weights := mat.NewVecDense(len(reg.Weights), reg.Weights)
  pred := make([]float64, df.NumRows())

  // prediction
  batching := dataframe.NewDense64Batching(reg.Features)
  for i, batch := range df.SplitView(batchSize) {
    rows := batching.DenseMatrix(batch)
    offset := i * batchSize
    yData := pred[offset:offset+batch.NumRows()]
    yVec := mat.NewVecDense(len(yData), yData)
    yVec.MulVec(rows, weights)
  }

  // write the result in the output dataframe
  result := df.View()
  result.OverwriteFloats64("_target", pred)
  reg.TargetScaler.InverseTransformInplace(result)
  result.Rename("_target", resultColumn)

  return result
}

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CheckNoColumnOverlap

func CheckNoColumnOverlap(dfs []*DataFrame) error

CheckNoColumnOverlap returns an error if two or more dataframes have one or more columns in common, regardless of their type. It returns nil if there is no overlap.

Types

type BoolAccess

type BoolAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

BoolAccess is a random-access iterator for boolean columns.

func (BoolAccess) Get

func (access BoolAccess) Get(row int) bool

Get returns the boolean value at the given index.

func (BoolAccess) Set

func (access BoolAccess) Set(row int, val bool)

Set overwrites the boolean value at the given index.

type CSVReadingSpec

type CSVReadingSpec struct {
	// This is to multi-thread the type conversions.
	// Zero and negative values mean ALL cpus on your machine.
	// The created RawData will also inherit from this value.
	MaxCPU int

	// Optional header if the CSV has no header.
	Header []string

	// Columns to exclude.
	Exclude []string

	// List of string literals that will be interpreted as missing values.
	MissingValues []string

	// Read integers and/or bool as floats.
	IntAsFloat    bool
	BoolAsFloat   bool // 'true', 'false', '0' and '1' converted to 0.0 and 1.0
	BinaryAsFloat bool // '0' and '1' converted to 0.0 and 1.0

	// How the CSV is encoded.
	// if not provided, it will ignore the encoding and fallback to UTF-8 if a
	// conversion is needed.
	// Note:
	// - the CSV is not decoded at reading.
	// - you can run nearly every function of ml-essentials without ever knowing
	//   the encoding
	Encoding encoding.Encoding

	// Options from https://golang.org/src/encoding/csv/reader.go
	Comma            rune
	Comment          rune
	LazyQuotes       bool
	TrimLeadingSpace bool
}
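
Several of these fields mirror options of Go's standard encoding/csv reader. The sketch below uses only the standard library to show what Comma, Comment, TrimLeadingSpace and a list of missing-value literals amount to when reading a float column. The parseFloatColumn helper is hypothetical and is not how ml-essentials parses files.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseFloatColumn reads one column of a CSV document as float64 values.
// The first record is treated as the header. Cells listed in missing are
// mapped to NaN.
func parseFloatColumn(doc string, col int, comma rune, missing []string) ([]float64, error) {
	reader := csv.NewReader(strings.NewReader(doc))
	reader.Comma = comma
	reader.Comment = '#'
	reader.TrimLeadingSpace = true

	records, err := reader.ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	isMissing := make(map[string]bool, len(missing))
	for _, m := range missing {
		isMissing[m] = true
	}
	values := make([]float64, 0, len(records)-1)
	for _, record := range records[1:] { // skip the header row
		cell := record[col]
		if isMissing[cell] {
			values = append(values, math.NaN())
			continue
		}
		v, err := strconv.ParseFloat(cell, 64)
		if err != nil {
			return nil, err
		}
		values = append(values, v)
	}
	return values, nil
}

func main() {
	doc := "name;height\n# comment line\nKaren;170\nJohn;NA\nSophie; 165\n"
	heights, err := parseFloatColumn(doc, 1, ';', []string{"", "NA"})
	fmt.Println(heights, err) // [170 NaN 165] <nil>
}
```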

type CSVWritingSpec

type CSVWritingSpec struct {
	// missing values will be replaced with this string. Default: ""
	StringMissingMarker string
	// Value used by ToCSVDir when splitting the dataframe into multiple files.
	// If there are more rows than MinRowsPerFile, it will be split depending on
	// the MaxCPU attached to the dataframe.
	// Default: 512 rows
	MinRowsPerFile int
	// If maintaining the row order is not important, I encourage you to set this
	// value to False.
	MaintainOrder bool
	// Options from https://golang.org/src/encoding/csv/writer.go
	Comma   rune // Field delimiter (set to ',' by NewWriter)
	UseCRLF bool // True to use \r\n as the line terminator

}

type ColumnAccess

type ColumnAccess struct {
	// contains filtered or unexported fields
}

func (ColumnAccess) SharedIndex

func (access ColumnAccess) SharedIndex(localIndex int) int

SharedIndex returns the index to the backing data, given an index to the column. You probably don't need this function.

func (ColumnAccess) Size

func (access ColumnAccess) Size() int

Size returns the length of the column.

type ColumnHeader

type ColumnHeader struct {
	// contains filtered or unexported fields
}

ColumnHeader helps you manipulate column names.

func Columns

func Columns(names ...string) ColumnHeader

Columns creates a ColumnHeader from a list of column names.

func (ColumnHeader) And

func (h ColumnHeader) And(others ...ColumnHeader) ColumnHeader

And adds all the columns from the given other ColumnHeaders. It returns a shallow copy of itself, not an entirely new ColumnHeader.

func (ColumnHeader) Copy

func (h ColumnHeader) Copy() ColumnHeader

Copy returns a deep copy of the ColumnHeader.

func (ColumnHeader) Except

func (h ColumnHeader) Except(columns ...string) ColumnHeader

Except removes all the columns given as arguments. It returns a shallow-copy of itself, not an entirely new ColumnHeader.

func (ColumnHeader) ExceptHeader

func (h ColumnHeader) ExceptHeader(others ...ColumnHeader) ColumnHeader

ExceptHeader removes all the columns found in the other ColumnHeaders. It returns a shallow copy of itself, not an entirely new ColumnHeader.

func (ColumnHeader) NameList

func (h ColumnHeader) NameList() []string

NameList returns the list of columns in the header. Altering the returned slice won't alter ColumnHeader.

func (ColumnHeader) NameSet

func (h ColumnHeader) NameSet() map[string]bool

NameSet returns the set of columns in the header for read-only access. This is faster than NameList().

func (ColumnHeader) Num

func (h ColumnHeader) Num() int

Num returns the number of columns in the ColumnHeader.

type DataBuilder

type DataBuilder struct {
	RawData *RawData
}

DataBuilder is a helper structure to build dataframes. Use dataframe.DataBuilder{RawData: dataframe.EmptyRawData()} to initialize it.

func (DataBuilder) AddBools

func (builder DataBuilder) AddBools(col string, values ...bool) DataBuilder

AddBools adds a list of bools to the given boolean column. It returns a shallow copy of itself.

func (DataBuilder) AddFloats

func (builder DataBuilder) AddFloats(col string, values ...float64) DataBuilder

AddFloats adds a list of floats to the given float column. It returns a shallow copy of itself.

func (DataBuilder) AddInts

func (builder DataBuilder) AddInts(col string, values ...int) DataBuilder

AddInts adds a list of ints to the given int column. It returns a shallow copy of itself.

func (DataBuilder) AddObjects

func (builder DataBuilder) AddObjects(col string, values ...interface{}) DataBuilder

AddObjects adds a list of objects to the given object column. It returns a shallow copy of itself. You can use this function to add strings too.

func (DataBuilder) AddStrings

func (builder DataBuilder) AddStrings(col string, values ...string) DataBuilder

AddStrings adds a list of strings to the given object column. It returns a shallow copy of itself. If you need to add nils (= missing values), use AddObjects(col, ...) followed by MarkAsString(col).

func (DataBuilder) MarkAsString

func (builder DataBuilder) MarkAsString(col string) DataBuilder

MarkAsString tags a given object column as a string-only column. This gives access to functionalities that generic object columns don't have.

func (DataBuilder) SetBools

func (builder DataBuilder) SetBools(col string, values []bool) DataBuilder

SetBools adds or replaces the values of the given boolean column. Values are not copied, so if you change them it will change them everywhere. It returns a shallow copy of itself.

func (DataBuilder) SetFloats

func (builder DataBuilder) SetFloats(col string, values []float64) DataBuilder

SetFloats adds or replaces the values of the given float column. Values are not copied, so if you change them it will change them everywhere. It returns a shallow copy of itself.

func (DataBuilder) SetInts

func (builder DataBuilder) SetInts(col string, values []int) DataBuilder

SetInts adds or replaces the values of the given integer column. Values are not copied, so if you change them it will change them everywhere. It returns a shallow copy of itself.

func (DataBuilder) SetObjects

func (builder DataBuilder) SetObjects(col string, values []interface{}) DataBuilder

SetObjects adds or replaces the values of the given object column. Values are not copied, so if you change them it will change them everywhere. It returns a shallow copy of itself. If you want to set a slice of strings, you'll need to convert the slice to a slice of interfaces and call MarkAsString(col).

func (DataBuilder) TextEncoding

func (builder DataBuilder) TextEncoding(encoding encoding.Encoding) DataBuilder

TextEncoding informs ml-essentials that the strings that you have provided are encoded in the given encoding. If this function is never called or if nil is passed as argument, it will be assumed that all the strings are UTF-8 encoded. Even if the strings are not UTF-8 encoded, it is not mandatory to call this function since the encoding is rarely used by ml-essentials. TextEncoding returns a shallow copy of itself.

func (DataBuilder) ToDataFrame

func (builder DataBuilder) ToDataFrame() *DataFrame

ToDataFrame creates a dataframe out of the RawData object. It will panic if the columns are of different sizes. The returned dataframe shares its data and structure with the encapsulated rawdata.

type DataFrame

type DataFrame struct {
	RawData
	// contains filtered or unexported fields
}

DataFrame is a structure that lets you manipulate both original data and views on other dataframes' data by sharing the underlying data. The data is ordered by column.

func ColumnConcatView

func ColumnConcatView(dfs ...*DataFrame) (*DataFrame, error)

ColumnConcatView merges the columns from multiple dataframes. For example, if columns(df1)=[col1, col2] and columns(df2)=[col3], then columns(ColumnConcatView(df1, df2)) = [col1, col2, col3]. It returns an error if the number of rows or the inner indices don't match. The returned dataframe shares data with the dataframes given as arguments, so changing the input dataframes will also change the returned dataframe. Numpy equivalent: concat(dfs, axis=1)

func ColumnCopyConcat

func ColumnCopyConcat(dfs ...*DataFrame) (*DataFrame, error)

ColumnCopyConcat merges the columns from multiple dataframes. For example, if columns(df1)=[col1, col2] and columns(df2)=[col3], then columns(ColumnCopyConcat(df1, df2)) = [col1, col2, col3]. It returns an error if the number of rows doesn't match. The returned dataframe doesn't share any data with the input dataframes, so the returned dataframe is safe to change. Numpy equivalent: concat(dfs, axis=1)

func ColumnSmartConcat

func ColumnSmartConcat(dfs ...*DataFrame) (*DataFrame, error)

ColumnSmartConcat merges the columns from multiple dataframes. For example, if columns(df1)=[col1, col2] and columns(df2)=[col3], then columns(ColumnSmartConcat(df1, df2)) = [col1, col2, col3]. It returns an error if the number of rows doesn't match. The returned dataframe shares data with the dataframes given as arguments, unless said dataframes' inner indices are not congruent. Use this function if you are not going to change the returned dataframe and want to avoid unnecessary copies when possible. Numpy equivalent: concat(dfs, axis=1)

func EmptyDataFrame

func EmptyDataFrame(nRows int, maxCPU int) *DataFrame

EmptyDataFrame creates a new dataframe with no columns. maxCPU indicates how many CPUs are allowed to be utilized by the functions operating on the dataframe.

func RowConcat

func RowConcat(dfs ...*DataFrame) (*DataFrame, error)

RowConcat concatenates the rows of the given dataframes. All the data is copied, i.e. the returned dataframe does not share any data or structure with the dataframes given as arguments. It returns an error if the dataframes don't have the same columns. Numpy equivalent: concat(dfs, axis=0)

func (*DataFrame) AreIndicesAltered

func (df *DataFrame) AreIndicesAltered() bool

AreIndicesAltered returns true if the internal list of indices is not range(0, df.NumRows()).

func (*DataFrame) Bools

func (df *DataFrame) Bools(colName string) BoolAccess

Bools returns an iterator on a given boolean column.

func (*DataFrame) ColumnView

func (df *DataFrame) ColumnView(columns ...string) *DataFrame

ColumnView selects a subset of columns.

func (*DataFrame) Copy

func (df *DataFrame) Copy() *DataFrame

Copy returns a deep copy of everything inside the dataframe. The object slices are copied as well, but the objects they contain are not. Call this function if you want to transform a view into a compact dataframe. Compact dataframes are more efficient, but making a copy can be expensive.

func (*DataFrame) CopyValuesToInterfaces

func (df *DataFrame) CopyValuesToInterfaces(colName string) []interface{}

CopyValuesToInterfaces returns a copy of a column's data packed into an interface slice, regardless of the column's type.

func (*DataFrame) Debug

func (df *DataFrame) Debug(enable bool) *DataFrame

Debug enables or disables the debugging mode. The debugging mode will print out some troubleshooting information via Go's builtin logger. It returns the dataframe itself.

func (*DataFrame) DetachedView

func (df *DataFrame) DetachedView(columns ...string) *DataFrame

DetachedView makes sure that the given columns can be altered without altering the original data from some parent dataframe. It will perform a copy only if the data is shared. This is useful when you execute a function that changes the data in-place:

view := df.DetachedView("height")
view.OverwriteFloats64("height", []float64{173, 174, 162, 185})

Caveat: this can be an expensive action if the data that backs the dataframe is large, even if the dataframe at hand has few rows.

func (*DataFrame) EmptyMask

func (df *DataFrame) EmptyMask() []bool

EmptyMask returns a possibly pre-allocated mask for the MaskView function. The values of the mask are not initialized and can be either true or false. Intended use:

m := df.EmptyMask()
for i := 0; i < df.NumRows(); i++ {
  m[i] = i%10 != 0 // keep every row except each tenth one
}
df = df.MaskView(m)

Do not concurrently use this function unless you call ThreadSafeMasking(true) first.

func (*DataFrame) Encode

func (df *DataFrame) Encode(newEncoding encoding.Encoding) error

Encode changes the encoding of all the string columns. It returns an error if the strings cannot be encoded into the desired encoding, or decoded using the current encoding. If newEncoding is nil, strings will be encoded in UTF-8.

func (*DataFrame) Floats

func (df *DataFrame) Floats(colName string) FloatAccess

Floats returns an iterator on a given float column.

func (*DataFrame) GoodShortNames

func (df *DataFrame) GoodShortNames(minLength int) map[string]string

GoodShortNames returns short versions of column names for PrintRecords(). minLength is the minimum length of shortened names. If minLength is zero or negative, it will default to minLength=3. This function is not deterministic.

func (*DataFrame) HashStringsView

func (df *DataFrame) HashStringsView(columns ...string) *DataFrame

HashStringsView hashes the string columns given as argument, thereby transforming string columns into integer columns. The hashing algorithm always returns the same int if given the same string. Missing strings will be converted to -1. This function is primarily meant to be used as a first step before categorical encoding. HashStringsView is multi-threaded.
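
The documentation does not say which hashing algorithm HashStringsView uses. As an illustration of the contract only (same string, same int; missing value, -1), here is a standalone sketch that uses FNV-1a from the standard library as a stand-in. The hashStrings helper is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashStrings maps each string to a non-negative int. nil entries (or any
// non-string value) stand for missing strings and are mapped to -1.
func hashStrings(values []interface{}) []int {
	out := make([]int, len(values))
	for i, v := range values {
		s, ok := v.(string)
		if !ok {
			out[i] = -1
			continue
		}
		h := fnv.New32a()
		h.Write([]byte(s)) // Write on a hash.Hash never returns an error
		// Drop one bit so the result is non-negative even where int is
		// 32 bits wide, which keeps -1 free for missing values.
		out[i] = int(h.Sum32() >> 1)
	}
	return out
}

func main() {
	codes := hashStrings([]interface{}{"red", nil, "red", "blue"})
	fmt.Println(codes[0] == codes[2], codes[1]) // true -1
}
```

Hashing can map two different strings to the same integer, which is one reason to treat it as a first step before categorical encoding rather than as an encoding in itself.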

func (*DataFrame) IndexView

func (df *DataFrame) IndexView(indices []int) *DataFrame

IndexView builds a view from a selection of rows. The given slice of indices is typically a subset of range(0, df.NumRows()), but it can also be a different order of range(0, df.NumRows()) or a repetition of some indices, thereby making the view larger than its parent dataframe. It is equivalent to x[indices] where x is a Python numpy array, except that IndexView doesn't do any copy.
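
The three uses of an index slice mentioned above (subset, reordering, repetition) can be seen in a standalone sketch. The take helper is hypothetical and, unlike IndexView, it copies the selected values.

```go
package main

import "fmt"

// take returns values[indices[0]], values[indices[1]], ... in that order.
// Indices may be reordered or repeated, so the result can be longer than
// the input.
func take(values []float64, indices []int) []float64 {
	out := make([]float64, len(indices))
	for i, idx := range indices {
		out[i] = values[idx]
	}
	return out
}

func main() {
	heights := []float64{170, 180, 165}
	fmt.Println(take(heights, []int{2, 0}))       // [165 170]
	fmt.Println(take(heights, []int{2, 0, 0, 1})) // [165 170 170 180]
}
```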

func (*DataFrame) Ints

func (df *DataFrame) Ints(colName string) IntAccess

Ints returns an iterator on a given integer column.

func (*DataFrame) LabelToInt

func (df *DataFrame) LabelToInt(colName string) ([]int, map[string]int)

LabelToInt maps one column's string values to a range of integers starting from zero, and returns both the converted values and the mapping. For example: conversion(['a', 'b', 'a', 'c']) -> [0, 1, 0, 2] via the mapping a: 0, b: 1, c: 2. If a string is nil, it will be converted to -1. Use this function to convert classification labels into integers.
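
A standalone sketch of this kind of mapping is given below. It assumes that integers are assigned in order of first appearance, which matches the example above but is not confirmed by the documentation. The labelToInt function is a hypothetical free function, not the library method.

```go
package main

import "fmt"

// labelToInt assigns 0, 1, 2, ... to labels in order of first appearance.
// nil entries (or any non-string value) are converted to -1 and do not
// appear in the mapping.
func labelToInt(labels []interface{}) ([]int, map[string]int) {
	mapping := make(map[string]int)
	codes := make([]int, len(labels))
	for i, label := range labels {
		s, ok := label.(string)
		if !ok {
			codes[i] = -1
			continue
		}
		code, seen := mapping[s]
		if !seen {
			code = len(mapping)
			mapping[s] = code
		}
		codes[i] = code
	}
	return codes, mapping
}

func main() {
	codes, mapping := labelToInt([]interface{}{"a", "b", nil, "a", "c"})
	fmt.Println(codes)        // [0 1 -1 0 2]
	fmt.Println(len(mapping)) // 3
}
```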

func (*DataFrame) MaskView

func (df *DataFrame) MaskView(mask []bool) *DataFrame

MaskView builds a view by masking some rows of the dataframe. To avoid unnecessary allocations, please get a pre-allocated mask from DataFrame.EmptyMask() or DataFrame.ZeroMask(). MaskView is functionally equivalent to:

indices := make([]int, 0, len(mask))
for i, b := range mask {
  if b {
    indices = append(indices, i)
  }
}
maskedView := df.IndexView(indices)

func (*DataFrame) NumRows

func (df *DataFrame) NumRows() int

NumRows returns the number of rows in the dataframe.

func (*DataFrame) Objects

func (df *DataFrame) Objects(colName string) ObjectAccess

Objects returns an iterator on a given object column, including string columns.

func (*DataFrame) OverwriteBools

func (df *DataFrame) OverwriteBools(colName string, values []bool)

OverwriteBools (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Bools(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

func (*DataFrame) OverwriteFloats32

func (df *DataFrame) OverwriteFloats32(colName string, values []float32)

OverwriteFloats32 (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Floats(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, float64(values[i]))
}

func (*DataFrame) OverwriteFloats64

func (df *DataFrame) OverwriteFloats64(colName string, values []float64)

OverwriteFloats64 (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Floats(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

func (*DataFrame) OverwriteInts

func (df *DataFrame) OverwriteInts(colName string, values []int)

OverwriteInts (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Ints(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

func (*DataFrame) OverwriteObjects

func (df *DataFrame) OverwriteObjects(colName string, values []interface{},
	objectType ObjectType)

OverwriteObjects (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Objects(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

The third argument is only used if the column doesn't exist and has to be created. It is the only way to mix strings with nil values and yet benefit from dataframe operations specialized for strings such as HashStringsView.

func (*DataFrame) OverwriteStrings

func (df *DataFrame) OverwriteStrings(colName string, values []string)

OverwriteStrings (over)writes the given column with the given values. The given values are always copied, so the given slice can be safely altered after calling this function. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Objects(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

If you need to overwrite strings with missing values, use OverwriteObjects instead.

func (*DataFrame) PrintHead

func (df *DataFrame) PrintHead(n int, floatFormat string) *DataFrame

PrintHead prints the first n rows of the dataframe. If n is negative or if n is greater than the number of rows, it will print all the rows. floatFormat describes how you want floats to be printed, e.g. %.4f. floatFormat defaults to %.3f. Everything is printed on stdout. Nothing on stderr. PrintHead returns the dataframe itself so you can write

df.PrintSummary().PrintHead(n, "") or df.PrintHead(n, "").PrintSummary()

func (*DataFrame) PrintRecords

func (df *DataFrame) PrintRecords(n int, floatFormat string, shorthands map[string]string) *DataFrame

PrintRecords does the same as PrintHead but prints one line for each row. shorthands maps column names to shorter column names in order to avoid cluttering the output. You can leave it empty, nil, or call GoodShortNames() to get optimally small truncated names.

func (*DataFrame) PrintSummary

func (df *DataFrame) PrintSummary() *DataFrame

PrintSummary prints information about the content of the dataframe, such as the name of the columns and the number of rows. It doesn't print the data. Everything is printed on stdout. Nothing on stderr. PrintSummary returns the dataframe itself so you can write

df.PrintSummary().PrintHead(n, "") or df.PrintHead(n, "").PrintSummary()

func (*DataFrame) ResetIndexView

func (df *DataFrame) ResetIndexView() *DataFrame

ResetIndexView sorts the indices in order to speed up sequential access to the columns, including row iterators and gonum matrices. Speed is the only reason to reset the indices. It's different than pandas' eponymous function because pandas uses indices when concatenating columns, whereas ml-essentials does not. Do not use this function if order matters, as when you rely on the data being shuffled.

func (*DataFrame) ReverseView

func (df *DataFrame) ReverseView() *DataFrame

ReverseView flips the order of the rows.

func (*DataFrame) SampleView

func (df *DataFrame) SampleView(n int, replacement bool) *DataFrame

SampleView randomly samples n rows from the dataframe. Sampling with replacement is not yet supported. Sampling without replacement is functionally equivalent to:

df.ShuffleView().SliceView(0, n)

func (*DataFrame) ShallowCopy

func (df *DataFrame) ShallowCopy() *DataFrame

ShallowCopy copies the dataframe's structure but not the data. In 90% of cases, you would rather use View(), which doesn't even copy the structure up until the structure is modified. ShallowCopy returns a view on the dataframe.

func (*DataFrame) ShuffleView

func (df *DataFrame) ShuffleView() *DataFrame

ShuffleView randomizes the dataframe. This is functionally equivalent to this pseudo-code:

indices = range(0, df.NumRows())
shuffle(indices)
shuffledView = df.IndexView(indices)

If you want ShuffleView to behave deterministically, you need to call rand.Seed(seed) somewhere in your program prior to calling ShuffleView.
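
The pseudo-code above translates to runnable Go as follows. This sketch is independent of ml-essentials and makes one different choice, stated here as an assumption: it seeds a local generator instead of the global one, so the helper (hypothetically named shuffledIndices) is reproducible on its own.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffledIndices returns a random permutation of 0..n-1. The same seed
// always produces the same permutation.
func shuffledIndices(n int, seed int64) []int {
	indices := make([]int, n)
	for i := range indices {
		indices[i] = i
	}
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(n, func(i, j int) {
		indices[i], indices[j] = indices[j], indices[i]
	})
	return indices
}

func main() {
	first := shuffledIndices(5, 42)
	second := shuffledIndices(5, 42)
	fmt.Println(first, second) // the two slices are identical
}
```

Passing the resulting slice to IndexView would then give a shuffled view, as in the pseudo-code.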

func (*DataFrame) SliceView

func (df *DataFrame) SliceView(from int, to int) *DataFrame

SliceView builds a view from a slice of the dataframe from index "from" (included) to index "to" (excluded). If "from" or "to" is negative, the index is relative to the end of the dataframe. For example, -1 points to the last index of the dataframe. If "from" is higher than "to", the row order will be reversed.

func (*DataFrame) SortedView

func (df *DataFrame) SortedView(byColumn string) *DataFrame

SortedView sorts the dataframe by ascending order of the given column. The column can be a float, an int or a bool column. It will panic if the given column is none of those. Missing values in integer columns will be treated as '-1'. If called on a bool column, it will put false values first. To sort in descending order, call SortedView(byColumn).ReverseView().
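
Since views are defined by row indices, sorting a view amounts to sorting indices rather than moving data. The standalone sketch below shows that idea with the standard sort package. The sortedIndices and reversed helpers are hypothetical, and the sketch makes no attempt to order NaN values.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedIndices returns the row indices that visit values in ascending
// order. The values themselves are not moved.
func sortedIndices(values []float64) []int {
	indices := make([]int, len(values))
	for i := range indices {
		indices[i] = i
	}
	sort.SliceStable(indices, func(a, b int) bool {
		return values[indices[a]] < values[indices[b]]
	})
	return indices
}

// reversed returns a copy of indices in the opposite order, which turns an
// ascending order into a descending one.
func reversed(indices []int) []int {
	out := make([]int, len(indices))
	for i, idx := range indices {
		out[len(indices)-1-i] = idx
	}
	return out
}

func main() {
	heights := []float64{180, 165, 170}
	asc := sortedIndices(heights)
	fmt.Println(asc)           // [1 2 0]
	fmt.Println(reversed(asc)) // [0 2 1]
}
```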

func (*DataFrame) SplitNView

func (df *DataFrame) SplitNView(n int) []*DataFrame

SplitNView evenly divides the dataframe into n parts. It will panic if n is negative and returns nil if n equals zero. It always returns *exactly* n dataframes. As a result, some dataframes might be empty.
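
To see why some parts can be empty, consider how row counts could be distributed over n parts. The sketch below is a guess at one reasonable scheme (the remainder goes to the first parts); the documentation does not specify how ml-essentials distributes it. The partSizes helper is hypothetical.

```go
package main

import "fmt"

// partSizes splits nRows into exactly n sizes that differ by at most one.
// It panics if n is negative and returns nil if n is zero.
func partSizes(nRows, n int) []int {
	if n < 0 {
		panic("n must not be negative")
	}
	if n == 0 {
		return nil
	}
	sizes := make([]int, n)
	for i := range sizes {
		sizes[i] = nRows / n
		if i < nRows%n {
			sizes[i]++ // the first nRows%n parts take one extra row
		}
	}
	return sizes
}

func main() {
	fmt.Println(partSizes(10, 3)) // [4 3 3]
	fmt.Println(partSizes(2, 4))  // [1 1 0 0]
}
```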

func (*DataFrame) SplitTrainTestViews

func (df *DataFrame) SplitTrainTestViews(testingRatio float64) (*DataFrame, *DataFrame)

SplitTrainTestViews returns a training set and a testing set. testingRatio is a number between 0 and 1 such that testSet.NumRows() = testingRatio * df.NumRows(), up to rounding. It will panic if testingRatio is not between 0 and 1 inclusive. SplitTrainTestViews does not shuffle the input dataframe. It is the user's responsibility to shuffle the dataframe prior to splitting it.

func (*DataFrame) SplitView

func (df *DataFrame) SplitView(batchSize int) []*DataFrame

SplitView divides the dataframe into dataframes of *exactly* batchSize rows, except the last batch, which will be smaller if NumRows() % batchSize != 0. It will panic if batchSize is zero or negative.
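
The batch layout described above can be computed with a few lines of plain Go. The batchBounds helper below is hypothetical and only returns row ranges, whereas SplitView returns dataframes.

```go
package main

import "fmt"

// batchBounds returns the [start, end) row range of each batch. Every batch
// has exactly batchSize rows except possibly the last one.
func batchBounds(nRows, batchSize int) [][2]int {
	if batchSize <= 0 {
		panic("batchSize must be positive")
	}
	bounds := make([][2]int, 0, (nRows+batchSize-1)/batchSize)
	for start := 0; start < nRows; start += batchSize {
		end := start + batchSize
		if end > nRows {
			end = nRows // the last batch may be smaller
		}
		bounds = append(bounds, [2]int{start, end})
	}
	return bounds
}

func main() {
	fmt.Println(batchBounds(10, 4)) // [[0 4] [4 8] [8 10]]
}
```

Because all batches but the last have the same size, the i-th batch starts at row i * batchSize, which is what the offset computation in the README's Predict example relies on.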

func (*DataFrame) Strings

func (df *DataFrame) Strings(colName string) StringAccess

Strings returns an iterator on a given string column.

func (*DataFrame) ThreadSafeMasking

func (df *DataFrame) ThreadSafeMasking(enable bool) *DataFrame

ThreadSafeMasking makes the current dataframe and its views safe for masking. It returns the dataframe itself.

func (*DataFrame) To1CSV

func (df *DataFrame) To1CSV(r io.Writer, options CSVWritingSpec) error

To1CSV writes the dataframe in CSV format into the writer given as argument. It returns an error if the writer doesn't allow writing. It also forwards any error raised by golang's builtin CSV writer. To1CSV flushes the writer before returning. This function is not multi-threaded.
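
For readers unfamiliar with Go's CSV writer, the sketch below shows the standard-library calls that the Comma option and the flush-before-return behaviour correspond to. It is not the library's implementation: the writeCSV helper is hypothetical, and representing a missing cell by an empty string is an assumption made for the example.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// writeCSV renders a header and rows as CSV text. Empty cells are treated
// as missing and replaced with missingMarker.
func writeCSV(header []string, rows [][]string, comma rune, missingMarker string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.Comma = comma

	if err := w.Write(header); err != nil {
		return "", err
	}
	for _, row := range rows {
		record := make([]string, len(row))
		for i, cell := range row {
			if cell == "" {
				cell = missingMarker
			}
			record[i] = cell
		}
		if err := w.Write(record); err != nil {
			return "", err
		}
	}
	// csv.Writer buffers its output: flush, then check for a deferred error.
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	out, err := writeCSV(
		[]string{"name", "height"},
		[][]string{{"Karen", "170"}, {"John", ""}},
		';', "NA",
	)
	fmt.Printf("%q %v\n", out, err) // "name;height\nKaren;170\nJohn;NA\n" <nil>
}
```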

func (*DataFrame) ToCSVDir

func (df *DataFrame) ToCSVDir(options CSVWritingSpec, prefix string) ([]string, error)

ToCSVDir writes the dataframe in CSV format to files with the chosen prefix. The prefix includes the directory. Example of prefix: "/tmp/output/result". This will write /tmp/output/result01.csv, /tmp/output/result02.csv etc. The dataframe is split evenly between the files and each file is written in its own goroutine. It returns an error if one of the files doesn't allow writing or if the output directory does not exist. It also forwards any error raised by golang's builtin CSV writer. Alongside the potential error, ToCSVDir returns the list of files written.

func (*DataFrame) ToCSVFiles

func (df *DataFrame) ToCSVFiles(options CSVWritingSpec, paths ...string) error

ToCSVFiles writes the dataframe in CSV format to the given files. The dataframe is split evenly between the files and each file is written in its own goroutine. It returns an error if one of the files doesn't allow writing. It also forwards any error raised by golang's builtin CSV writer.

func (*DataFrame) ToCSVs

func (df *DataFrame) ToCSVs(writers []io.Writer, options CSVWritingSpec) error

ToCSVs writes the dataframe in CSV format into the writers given as argument. The dataframe is split evenly between the writers and each writer is called in its own goroutine. The row order is not guaranteed. It returns an error if one of the writers doesn't allow writing. It also forwards any error raised by golang's builtin CSV writer.

func (*DataFrame) TopView

func (df *DataFrame) TopView(byColumn string, n int, ascending bool, sorted bool) *DataFrame

TopView returns the n rows with the lowest values if ascending=true. It returns the rows with the highest values if ascending=false. The values that serve as criteria are the values from the column byColumn. The column can either be a float, an int or a bool column. It will panic if the given column is neither of those. If sorted=true, rows will always be sorted according to the desired order. If sorted=false, rows may or may not be sorted. If n is higher than the total number of rows, it will return all the rows. Missing values in integer columns will be treated as '-1'. If called on a bool column, false will be treated as lower than true.
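As a rough illustration of these semantics, the standalone sketch below selects the top n row indices of a float column with a full sort. topIndices is a hypothetical helper; the library may well use a cheaper partial selection when sorted=false, and clamping a negative n to zero is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// topIndices returns the indices of the n lowest values (ascending=true)
// or the n highest values (ascending=false), in sorted order.
func topIndices(values []float64, n int, ascending bool) []int {
	idx := make([]int, len(values))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool {
		if ascending {
			return values[idx[a]] < values[idx[b]]
		}
		return values[idx[a]] > values[idx[b]]
	})
	if n < 0 {
		n = 0 // assumption: the behavior for a negative n is not documented
	}
	if n > len(idx) {
		n = len(idx) // asking for more rows than available returns all rows
	}
	return idx[:n]
}

func main() {
	heights := []float64{170, 180, 165, 175}
	fmt.Println(topIndices(heights, 2, false)) // [1 3]
	fmt.Println(topIndices(heights, 2, true))  // [2 0]
}
```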

func (*DataFrame) View

func (df *DataFrame) View() *DataFrame

View makes the shallowest copy of the dataframe. It is roughly equivalent to:

copy := *df

Use this function when you want to transform an in-place operation into a view operation, e.g.:

view := df.View()
view.AllocateFloats("height")

func (*DataFrame) ZeroMask

func (df *DataFrame) ZeroMask() []bool

ZeroMask returns a possibly pre-allocated mask for the MaskView function. The values of the mask are all initialized to false. Intended use:

m := df.ZeroMask()
for i := 0; i < df.NumRows(); i++ {
  if i % 10 == 0 {
     m[i] = true
  }
}
df = df.MaskView(m)

Do not use this function concurrently unless you call ThreadSafeMasking(true) first.

type DataFrameInternals

type DataFrameInternals struct {
	DF *DataFrame
}

DataFrameInternals is a helper structure that lets you access the internals of a dataframe. Initialize the structure like this: DataFrameInternals{DF: your_df}. Normally not needed, hence the lack of documentation.

func (DataFrameInternals) BoolData

func (dfi DataFrameInternals) BoolData(columnName string) []bool

func (DataFrameInternals) FloatData

func (dfi DataFrameInternals) FloatData(columnName string) []float64

func (DataFrameInternals) GetIndices

func (dfi DataFrameInternals) GetIndices() []int

func (DataFrameInternals) GetMask

func (dfi DataFrameInternals) GetMask() []bool

func (DataFrameInternals) IntData

func (dfi DataFrameInternals) IntData(columnName string) []int

func (DataFrameInternals) ObjectData

func (dfi DataFrameInternals) ObjectData(columnName string) []interface{}

type Dense64Batching

type Dense64Batching struct {
	FloatBatching
	// contains filtered or unexported fields
}

func NewDense64Batching

func NewDense64Batching(columns []string) *Dense64Batching

NewDense64Batching allocates a new Dense64Batching structure. Dense64Batching will copy the columns passed as arguments in the same order as given to this function. Dense64Batching recycles the data between successive calls to DenseMatrix, so try to call NewDense64Batching only once and DenseMatrix as many times as needed.

func (*Dense64Batching) DenseMatrix

func (bat *Dense64Batching) DenseMatrix(df *DataFrame) mat.Matrix

DenseMatrix generates a gonum dense matrix from the given dataframe. The returned data is a copy of the dataframe's data, so changing the matrix doesn't change the dataframe. The returned matrix is stored as a transpose. Any operation on this matrix will be faster if it involves another transpose. Dense64Batching recycles the data between successive calls to DenseMatrix, so try to call NewDense64Batching only once and DenseMatrix as many times as needed.

type Float32Iterator

type Float32Iterator struct {
	RowIterator
	// contains filtered or unexported fields
}

Float32Iterator is a structure to iterate over a dataframe one row at a time. The rows provided to the user will be slices of float32. Float32Iterator cannot iterate through object columns. The delivered rows can be safely changed with no effect on the dataframe.

func NewFloat32Iterator

func NewFloat32Iterator(df *DataFrame, columns []string) *Float32Iterator

NewFloat32Iterator allocates a new row iterator to allow you to iterate over float, bool and int columns as floats. If a given column is not float, bool or int, it will be ignored. Row elements will be delivered in the same order as the columns passed as argument.

func (*Float32Iterator) NextRow

func (ite *Float32Iterator) NextRow() ([]float32, int, int)

NextRow returns a single row, its index in the view and its index in the original data. If there is no more row, it returns nil, the size of the view and the size of the original data. You can safely change the values of the row since they are copies of the original data. However, NextRow recycles the float slice, so you shouldn't store the slice.
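The warning about storing the slice can be demonstrated with a standalone sketch. rowIterator and collect below are hypothetical stand-ins that only mimic the buffer recycling described above; they are not the library's types.

```go
package main

import "fmt"

// rowIterator is a hypothetical iterator that reuses a single buffer.
type rowIterator struct {
	data [][]float32
	pos  int
	buf  []float32
}

// next copies the current row into the shared buffer and returns that buffer,
// or nil when no rows remain.
func (it *rowIterator) next() []float32 {
	if it.pos >= len(it.data) {
		return nil
	}
	copy(it.buf, it.data[it.pos])
	it.pos++
	return it.buf
}

// collect gathers all rows; keepCopy decides whether each row is cloned
// before being stored.
func collect(data [][]float32, keepCopy bool) [][]float32 {
	if len(data) == 0 {
		return nil
	}
	it := &rowIterator{data: data, buf: make([]float32, len(data[0]))}
	var rows [][]float32
	for row := it.next(); row != nil; row = it.next() {
		if keepCopy {
			row = append([]float32(nil), row...)
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	data := [][]float32{{1, 2}, {3, 4}}
	fmt.Println(collect(data, false)) // [[3 4] [3 4]]: both entries alias the buffer
	fmt.Println(collect(data, true))  // [[1 2] [3 4]]
}
```

Changing a returned row never touches data, which matches the guarantee that rows are copies; only retaining the slice across calls is unsafe.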

type FloatAccess

type FloatAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

FloatAccess is a random-access iterator for float columns.

func (FloatAccess) Get

func (access FloatAccess) Get(row int) float64

Get returns the float value at the given index.

func (FloatAccess) Set

func (access FloatAccess) Set(row int, val float64)

Set overwrites the float value at the given index.

func (FloatAccess) VecDense

func (access FloatAccess) VecDense() *mat.VecDense

VecDense creates a gonum's VecDense object from the dataframe's float data. The data is not copied if the underlying dataframe is contiguous, otherwise the data is copied. Use this function if you are not going to change the returned VecDense and want to avoid an unnecessary copy.

func (FloatAccess) VecDenseCopy

func (access FloatAccess) VecDenseCopy() *mat.VecDense

VecDenseCopy creates a gonum VecDense object from the dataframe's float data. It always copies the data, so you can change the returned VecDense without changing the dataframe.

type FloatBatching

type FloatBatching struct {
	// contains filtered or unexported fields
}

type IntAccess

type IntAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

IntAccess is a random-access iterator for integer columns.

func (IntAccess) Get

func (access IntAccess) Get(row int) int

Get returns the integer at the given index.

func (IntAccess) Set

func (access IntAccess) Set(row int, val int)

Set overwrites the integer value at the given index.

type ObjectAccess

type ObjectAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

ObjectAccess is a random-access iterator for object columns, including string columns.

func (ObjectAccess) Get

func (access ObjectAccess) Get(row int) interface{}

Get returns the object at the given index.

func (ObjectAccess) Set

func (access ObjectAccess) Set(row int, val interface{})

Set overwrites the object at the given index.

type ObjectType

type ObjectType uint8

ObjectType allows us to distinguish between the possible types of data contained in the object columns.

const (
	AnyObject ObjectType = iota
	StringObject
)

type RawData

type RawData struct {
	// contains filtered or unexported fields
}

RawData is the structure that holds the data of the dataframes. RawData has no concept of index view, so it always manipulates columns as contiguous blocks of data. When encapsulated in a DataBuilder, RawData allows the user to temporarily create columns of different size, up until it is converted to a DataFrame.

func FromCSV

func FromCSV(r io.Reader, options CSVReadingSpec) (*RawData, error)

FromCSV reads CSV data and returns a RawData structure with automatically inferred column types. It returns any error returned by golang's builtin CSV reader. With the default options, types are inferred this way:

- If the column is 100% made of values that can be parsed as bools (0, 1, true, True, false, False or any other variant), it is stored as a bool column.
- Otherwise, if it is 100% made of integers or missing values, it is stored as an integer column. Integer missing values are replaced with -1.
- Otherwise, if it is 100% made of floats or missing values, it is stored as a float column. Float missing values are replaced with NaN.
- If none of the above match, the column is stored as a string column.
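The inference order above can be sketched with the standard strconv package. inferType is a hypothetical helper, not the library's parser: which spellings count as bools and how missing values are matched are assumptions made for the illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferType guesses a column type, trying bool, then int, then float,
// and falling back to string. missing lists the strings treated as
// missing values.
func inferType(values []string, missing map[string]bool) string {
	allBool, allInt, allFloat := true, true, true
	for _, v := range values {
		if _, err := strconv.ParseBool(v); err != nil {
			allBool = false // bool columns do not support missing values
		}
		if missing[v] {
			continue // missing values are compatible with int and float columns
		}
		if _, err := strconv.Atoi(v); err != nil {
			allInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			allFloat = false
		}
	}
	switch {
	case allBool:
		return "bool"
	case allInt:
		return "int"
	case allFloat:
		return "float"
	default:
		return "string"
	}
}

func main() {
	missing := map[string]bool{"": true, "NA": true}
	fmt.Println(inferType([]string{"0", "1", "true"}, missing)) // bool
	fmt.Println(inferType([]string{"1", "2", ""}, missing))     // int
	fmt.Println(inferType([]string{"1.5", "NA", "2"}, missing)) // float
	fmt.Println(inferType([]string{"Karen", "1"}, missing))     // string
}
```

Note that, under this ordering, a column containing only 0 and 1 is inferred as bool rather than int.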

func FromCSVFile

func FromCSVFile(path string, options CSVReadingSpec) (*RawData, error)

FromCSVFile reads a CSV file and returns a RawData structure with automatically inferred column types. It returns any error returned by golang's builtin CSV reader. It also returns an error if the file cannot be opened. For the type inference, refer to FromCSV's documentation.

func FromCSVFilePattern

func FromCSVFilePattern(glob string, options CSVReadingSpec) (*RawData, error)

FromCSVFilePattern searches for file paths that match the given glob pattern, reads them and returns a single RawData structure containing all the data packed in an unordered fashion. It returns any error returned by golang's builtin CSV reader. It also returns an error if any of the matching files can't be opened. If no file can be found, it returns (nil, nil). For the type inference, refer to FromCSV's documentation.

func MergeRawDataColumns

func MergeRawDataColumns(list []*RawData) *RawData

MergeRawDataColumns transfers data from multiple RawData structures. It basically calls TransferRawDataFrom on each RawData passed as argument and it is subject to the same limitations.

func MergeRawDataRows

func MergeRawDataRows(list []*RawData) *RawData

MergeRawDataRows concatenates multiple RawData together in a row-wise manner. RawData can have different columns (not recommended), but be aware that it will panic if you try right away to upgrade the resulting RawData to a dataframe. The data will be copied and the given structures won't share data, so altering one of the RawData later won't affect the returned RawData. It is the equivalent of numpy.concatenate(list, axis=0).
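A minimal sketch of row-wise concatenation with copied data, using a plain map of float columns as a hypothetical stand-in for RawData (table and mergeRows are not library names). It also shows why mismatched columns lead to columns of different sizes.

```go
package main

import "fmt"

// table is a hypothetical stand-in for RawData: column name -> values.
type table map[string][]float64

// mergeRows concatenates tables row-wise into freshly allocated columns,
// so the result shares no memory with the inputs.
func mergeRows(list []table) table {
	out := table{}
	for _, t := range list {
		for name, col := range t {
			out[name] = append(out[name], col...)
		}
	}
	return out
}

func main() {
	a := table{"height": {170, 180}}
	b := table{"height": {165}, "age": {31}}
	merged := mergeRows([]table{a, b})
	a["height"][0] = 0 // does not affect merged
	fmt.Println(merged["height"], merged["age"]) // [170 180 165] [31]
}
```

Here the merged "age" column has one row while "height" has three, which is the kind of inconsistency that prevents an immediate upgrade to a dataframe.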

func NewRawData

func NewRawData() *RawData

NewRawData allocates a new RawData structure.

func (*RawData) ActualMaxCPU

func (data *RawData) ActualMaxCPU() int

ActualMaxCPU returns the maximum number of CPUs that are allowed to be utilized by the functions operating on the dataframe. If such a maximum number of CPUs was never set or if it was set with a number higher than the number of CPU cores on your machine, it will return the number of CPU cores on your machine.
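The clamping rule can be written as a small pure function. effectiveMaxCPU is a hypothetical helper, and representing "never set" as zero or a negative value is an assumption borrowed from the SetMaxCPU convention.

```go
package main

import (
	"fmt"
	"runtime"
)

// effectiveMaxCPU clamps a requested CPU limit to the number of cores.
// A requested value of 0 or less stands for "never set".
func effectiveMaxCPU(requested, cores int) int {
	if requested <= 0 || requested > cores {
		return cores
	}
	return requested
}

func main() {
	cores := runtime.NumCPU()
	fmt.Println(effectiveMaxCPU(-1, cores)) // number of cores on this machine
	fmt.Println(effectiveMaxCPU(1, cores))  // 1
}
```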

func (*RawData) AllocBools

func (data *RawData) AllocBools(columns ...string)

AllocBools allocates new empty bool columns.

func (*RawData) AllocFloats

func (data *RawData) AllocFloats(columns ...string)

AllocFloats allocates new empty float columns.

func (*RawData) AllocInts

func (data *RawData) AllocInts(columns ...string)

AllocInts allocates new empty integer columns.

func (*RawData) AllocObjects

func (data *RawData) AllocObjects(columns ...string)

AllocObjects allocates new empty object columns.

func (*RawData) AllocStrings

func (data *RawData) AllocStrings(columns ...string)

AllocStrings allocates new empty string columns.

func (*RawData) BoolHeader

func (data *RawData) BoolHeader() ColumnHeader

BoolHeader returns a ColumnHeader with all the boolean column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) BoolToFloats

func (data *RawData) BoolToFloats(columns ...string)

BoolToFloats converts boolean columns into float columns. Note that this runs at a view-free level since it wouldn't make sense to convert only part of a dataframe's column, given that mixed types are not allowed for numerical columns.

func (*RawData) CheckConsistency

func (data *RawData) CheckConsistency(t *testing.T) bool

func (*RawData) CreateColumnQueue

func (data *RawData) CreateColumnQueue(columns []string) utils.StringQ

Only used by other ml-essentials' packages.

func (*RawData) Drop

func (data *RawData) Drop(columns ...string)

Drop removes the given columns.

func (*RawData) FloatHeader

func (data *RawData) FloatHeader() ColumnHeader

FloatHeader returns a ColumnHeader with all the float column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) Header

func (data *RawData) Header() ColumnHeader

Header returns a ColumnHeader with all the column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) IntHeader

func (data *RawData) IntHeader() ColumnHeader

IntHeader returns a ColumnHeader with all the integer column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) IntToFloats

func (data *RawData) IntToFloats(columns ...string)

IntToFloats converts integer columns into float columns. Note that this runs at a view-free level since it wouldn't make sense to convert only part of a dataframe's column, given that mixed types are not allowed for numerical columns.

func (*RawData) NumAllocatedRows

func (data *RawData) NumAllocatedRows() int

NumAllocatedRows returns the total number of allocated rows. If this is called via a pointer on a dataframe, the number of allocated rows can be different than the value returned by NumRows(). This method is not reliable on a RawData under construction, e.g. when its columns are being built via a DataBuilder.

func (*RawData) NumColumns

func (data *RawData) NumColumns() int

NumColumns returns the total number of columns.

func (*RawData) ObjectHeader

func (data *RawData) ObjectHeader() ColumnHeader

ObjectHeader returns a ColumnHeader with all the object column names, including that of string columns. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) PrintUIDs

func (df *RawData) PrintUIDs()

func (*RawData) Rename

func (data *RawData) Rename(oldName string, newName string)

Rename changes the name of a column. The new column will be of the same type and share the same data. For example, if you execute:

df.View().Rename("apples", "oranges").Ints("oranges").Set(0, 42)

It will change df's number of apples to 42 at index=0.

func (*RawData) SetMaxCPU

func (data *RawData) SetMaxCPU(maxCPU int)

SetMaxCPU sets the number of CPUs that are allowed to be utilized by the functions operating on the dataframe and any view on the dataframe. If maxCPU is 0 or negative, Max CPU will be set to the number of CPU cores on your machine.

func (*RawData) StringHeader

func (data *RawData) StringHeader() ColumnHeader

StringHeader returns a ColumnHeader with all the string column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) ToDataFrame

func (data *RawData) ToDataFrame() *DataFrame

ToDataFrame upgrades the RawData structure to a DataFrame. It will panic if the columns are of different size. The data is shared between the original RawData and the returned dataframe, so any change to the RawData will affect the dataframe, and vice versa.

func (*RawData) TransferRawDataFrom

func (data *RawData) TransferRawDataFrom(from *RawData)

TransferRawDataFrom adds the data from another RawData structure. The two structures will share data, so changing one will change the other. Reminder: RawData's functions are ignorant of dataframe indices, so don't expect this function to exclusively transfer viewed data when it's called on a dataframe.

func (*RawData) Unshare

func (data *RawData) Unshare(columns ...string)

Unshare is the in-place, low-level version of DataFrame.DetachView(). For your own sake, please use DataFrame.DetachView() instead.

type RowIterator

type RowIterator struct {
	FloatBatching
	// contains filtered or unexported fields
}

func (*RowIterator) Columns

func (ite *RowIterator) Columns() []string

Columns returns the names of the columns, in the same order as the row elements. If called on Float32Iterator or Float64Iterator, it returns the list of columns passed to NewFloat32Iterator and NewFloat64Iterator respectively.

func (*RowIterator) Reset

func (ite *RowIterator) Reset(df *DataFrame, check bool)

Reset recycles the iterator's pre-allocated data for another dataframe with the same columns. If check is true, it will be verified that the columns are the same.

type StringAccess

type StringAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

StringAccess is a random-access iterator for string columns.

func (StringAccess) Get

func (access StringAccess) Get(row int) string

Get returns the string at the given index.

func (StringAccess) Set

func (access StringAccess) Set(row int, val string)

Set overwrites the string at the given index.
