dataframe

package

v0.1.0 Published: Feb 2, 2021 License: Apache-2.0 Imports: 19 Imported by: 0

Repository

github.com/rom1mouret/ml-essentials

README ¶

This README covers the basics.

Imports

import "github.com/rom1mouret/ml-essentials/dataframe"

DataFrame construction

DataFrames accept 4 types of columns.

type	missing value	comment
float64	NaN
int	-1	meant to store categorical values
bool	not supported
interface{}	nil	called "object" columns

Strings are stored in the interface{} columns. ml-essentials distinguishes between regular object columns and string columns by keeping around the names of the string columns. Some functions are specialized for string columns, e.g. Encode(newEncoding encoding.Encoding).

Storing categorical values is the preferred use of integer columns. That said, you are free to use them to store any kind of integers, including negative integers. Negative integers won't be treated as missing values unless you run IntImputer.

Construction with a DataBuilder

builder := dataframe.DataBuilder{RawData: dataframe.NewRawData()}
builder.AddFloats("height", 170, 180, 165)
builder.AddStrings("name", "Karen", "John", "Sophie")
df := builder.ToDataFrame()
df.PrintSummary().PrintHead(-1, "%.3f")

Construction from a CSV file

spec := dataframe.CSVReadingSpec{
  MaxCPU: -1,
  MissingValues: []string{"", " ", "NA","-"},
  IntAsFloat: true,
  BoolAsFloat: false,
  BinaryAsFloat: true,
}
rawdata, err := dataframe.FromCSVFile("/path/to/csvfile.csv", spec)

or

rawdata, err := dataframe.FromCSVFilePattern("/path/to/csvdir/*.csv", spec)
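
To illustrate what MissingValues means for a float column, here is a standalone sketch that uses only the standard library. It does not show how ml-essentials parses CSV files internally; readColumn, parseFloatColumn and countMissing are hypothetical helpers written for this example, and treating unparsable cells as NaN is an assumption of the sketch.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// readColumn reads CSV text with a header row and returns the cells of one column.
func readColumn(text string, column string) ([]string, error) {
	records, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	col := -1
	for j, name := range records[0] {
		if name == column {
			col = j
		}
	}
	if col < 0 {
		return nil, fmt.Errorf("column %q not found", column)
	}
	cells := make([]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		cells = append(cells, rec[col])
	}
	return cells, nil
}

// parseFloatColumn converts cells to float64. Cells listed in missing, and
// cells that fail to parse (an assumption of this sketch), become NaN.
func parseFloatColumn(cells []string, missing []string) []float64 {
	isMissing := make(map[string]bool, len(missing))
	for _, m := range missing {
		isMissing[m] = true
	}
	out := make([]float64, len(cells))
	for i, cell := range cells {
		v, err := strconv.ParseFloat(cell, 64)
		if isMissing[cell] || err != nil {
			v = math.NaN()
		}
		out[i] = v
	}
	return out
}

// countMissing counts the NaN entries of a float column.
func countMissing(values []float64) int {
	n := 0
	for _, v := range values {
		if math.IsNaN(v) {
			n++
		}
	}
	return n
}

func main() {
	text := "name,height\nKaren,170\nJohn,NA\nSophie,-\n"
	cells, err := readColumn(text, "height")
	if err != nil {
		panic(err)
	}
	heights := parseFloatColumn(cells, []string{"", " ", "NA", "-"})
	fmt.Println(heights, countMissing(heights))
}
```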

Column names

You can manipulate column names via the ColumnHeader structure.

h := df.FloatHeader().And(df.IntHeader()).Except("target", "id").NameList()
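
The And/Except chain above behaves like set union followed by set difference. The sketch below mimics that behaviour with a plain map; nameSet and its methods are hypothetical stand-ins, not the ColumnHeader implementation. Unlike the library functions, which are documented as returning a shallow copy of the header, these helpers build a new set on each call, and the sorted output order is a choice made here for reproducibility.

```go
package main

import (
	"fmt"
	"sort"
)

// nameSet is a hypothetical stand-in for ColumnHeader: a set of column names.
type nameSet map[string]bool

// newNameSet builds a set from a list of column names.
func newNameSet(names ...string) nameSet {
	s := make(nameSet, len(names))
	for _, n := range names {
		s[n] = true
	}
	return s
}

// and returns a new set holding the names of s and of all the other sets.
func (s nameSet) and(others ...nameSet) nameSet {
	out := make(nameSet, len(s))
	for n := range s {
		out[n] = true
	}
	for _, o := range others {
		for n := range o {
			out[n] = true
		}
	}
	return out
}

// except returns a new set holding the names of s minus the given names.
func (s nameSet) except(names ...string) nameSet {
	out := s.and() // copy of s
	for _, n := range names {
		delete(out, n)
	}
	return out
}

// nameList returns the names in sorted order.
func (s nameSet) nameList() []string {
	list := make([]string, 0, len(s))
	for n := range s {
		list = append(list, n)
	}
	sort.Strings(list)
	return list
}

func main() {
	floats := newNameSet("height", "age", "target")
	ints := newNameSet("id", "gender")
	fmt.Println(floats.and(ints).except("target", "id").nameList()) // [age gender height]
}
```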

Iterate over a dataframe

Option 1: ColumnAccess

height := df.Floats("height")
for i := 0; i < height.Size(); i++ {
  height.Set(i, height.Get(i) / 2)
}

Option 2: Gonum Batching

batching := dataframe.NewDense64Batching([]string{"age", "height", "gender"})
for _, batch := range df.SplitView(params.BatchSize) {
  // get a gonum matrix with columns age, height and gender (in that order)
  rows := batching.DenseMatrix(batch)
}

Option 3: Row Iterator

iterator := dataframe.NewFloat32Iterator(df, []string{"age", "height", "gender"})
for row, rowIdx, _ := iterator.NextRow(); row != nil; row, rowIdx, _ = iterator.NextRow() {
  // row is a float32 slice
}

Views

Views are dataframes that share data with other dataframes. There is no separate View type: views and regular dataframes are both of type DataFrame. Quick example:

view := df.ShuffleView()
view.OverwriteInts("level", []int{4, 1, 2, 1})

Here view shares its data with df. This is useful in two ways. First, ShuffleView doesn't copy the data, thus it is fast and memory-efficient. Second, it allows you to overwrite df's data from anywhere in your program. The "side effects" section explains why this is an advantage when it comes to handling indexed data.

If you want to avoid such side effects, you can detach the view from its parent dataframe.

view := df.ShuffleView().DetachedView("level")
view.OverwriteInts("level", []int{4, 1, 2, 1})

Now, OverwriteInts does not alter df because view has its own level data. Other columns of df remain shared.
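
One way to picture the difference between a shared view and a detached view is a backing slice accessed through a list of row indices. The sketch below is a single-column toy model under that assumption; intView, reversed and detached are hypothetical names, and the real DetachedView may copy more or less data than this model does. A reversed order is used instead of a shuffle so that the output is reproducible.

```go
package main

import "fmt"

// intView is a hypothetical single-column stand-in for a dataframe view.
type intView struct {
	data    []int // backing data, possibly shared with other views
	indices []int // indices[i] is the backing row behind the view's row i
}

func (v intView) get(row int) int      { return v.data[v.indices[row]] }
func (v intView) set(row int, val int) { v.data[v.indices[row]] = val }

// reversed returns a view over the same backing data in reverse row order.
// Only the index list is rebuilt; the data itself is not copied.
func (v intView) reversed() intView {
	n := len(v.indices)
	idx := make([]int, n)
	for i := range idx {
		idx[i] = v.indices[n-1-i]
	}
	return intView{data: v.data, indices: idx}
}

// detached returns a view with its own copy of the visible rows,
// so that writes no longer reach the parent.
func (v intView) detached() intView {
	data := make([]int, len(v.indices))
	idx := make([]int, len(v.indices))
	for i, j := range v.indices {
		data[i] = v.data[j]
		idx[i] = i
	}
	return intView{data: data, indices: idx}
}

func main() {
	parent := intView{data: []int{10, 20, 30}, indices: []int{0, 1, 2}}

	shared := parent.reversed()
	shared.set(0, 99)          // writes through to the parent's last row
	fmt.Println(parent.get(2)) // 99

	own := parent.reversed().detached()
	own.set(0, -1)                         // only changes the detached copy
	fmt.Println(parent.get(2), own.get(0)) // 99 -1
}
```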

ml-essentials provides a variety of functions to manage data copies at a fine-grained level.

View < TransferRawDataFrom < ShallowCopy < Unshare < DetachedView < Copy

On one side of the spectrum, View only copies pointers. On the opposite side, Copy copies almost everything. View, DetachedView and Copy cover 99% of the cases.

View is handy if you want to execute an in-place operation without altering the original dataframe, as in this example:

view := df.View()
view.Rename("level", "degree")

Now, view and df still share their data, but their columns are named differently.

Side effects

Side effects are normally considered anti-patterns but they do facilitate manipulating indexed data. For instance, consider this scenario:

1. At the top level, the data is separated into "features" and "metadata". Example of metadata: unique identifier, timestamps.
2. The model makes predictions from the features and predictions with low confidence are thrown away.
3. Back to the top level, we combine "metadata" columns with predictions using the indices of high-confidence rows.

Step 3 is error-prone. With ml-essentials, the idiomatic way is to avoid separating "features" and "metadata" in the first place. Instead, we would rely on views to enforce that the metadata always aligns with the features and predicted values.

Pandas can solve this problem in several ways. One of them is to combine "features" and "metadata" in an index-aware fashion, but this makes pandas.concat error-prone in other scenarios: when ignore_index is left at its default value, it fills dataframes with NaN where indices don't align.

Filtering, masking and indexing

Unlike Pandas and Numpy, there is no syntactic sugar to create masks and index arrays. Sugar aside, this section will look familiar to Pandas and Numpy users.

If you want to keep only the rows where "age" is 18 or over, you can do so with MaskView:

ages := df.Floats("age")
mask := df.EmptyMask()
for i := 0; i < ages.Size(); i++ {
  mask[i] = ages.Get(i) >= 18
}
view := df.MaskView(mask)

Getting a mask from EmptyMask() is advantageous because it recycles []bool slices across dataframes, but it is not mandatory.

Equivalent filtering with IndexView:

ages := df.Floats("age")
indices := make([]int, 0, ages.Size())
for i := 0; i < ages.Size(); i++ {
  if ages.Get(i) >= 18 {
    indices = append(indices, i)
  }
}
view := df.IndexView(indices)

In the future, we may add syntactic sugar for common scenarios, e.g. Condition("age").Higher(18).

Write in a dataframe

You can use the Set function as shown above. Alternatively, you might find it more convenient to write an entire column in one line of code:

df.OverwriteFloats64("height", []float64{170, 180, 165})

This is almost the same as:

height := df.Floats("height")
height.Set(0, 170)
height.Set(1, 180)
height.Set(2, 165)

The only difference is that OverwriteFloats64 will create a new column if it doesn't already exist.

Complete example

This is an example taken from linear_regression.go

import (
  "gonum.org/v1/gonum/mat"
  "github.com/rom1mouret/ml-essentials/dataframe"
)

func Predict(df *dataframe.DataFrame, batchSize int, resultColumn string) *dataframe.DataFrame {
  df = df.ResetIndexView() // makes batching.DenseMatrix faster

  // pre-allocation
  weights := mat.NewVecDense(len(reg.Weights), reg.Weights)
  pred := make([]float64, df.NumRows())

  // prediction
  batching := dataframe.NewDense64Batching(reg.Features)
  for i, batch := range df.SplitView(batchSize) {
    rows := batching.DenseMatrix(batch)
    offset := i * batchSize
    yData := pred[offset:offset+batch.NumRows()]
    yVec := mat.NewVecDense(len(yData), yData)
    yVec.MulVec(rows, weights)
  }

  // write the result in the output dataframe
  result := df.View()
  result.OverwriteFloats64("_target", pred)
  reg.TargetScaler.InverseTransformInplace(result)
  result.Rename("_target", resultColumn)

  return result
}
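
The key trick in Predict is that each batch writes into a sub-slice of the pre-allocated pred slice, so no per-batch result needs to be copied back. The sketch below reproduces that pattern with plain slices instead of gonum and ml-essentials; predictBatched is a hypothetical helper, and the explicit dot product stands in for yVec.MulVec(rows, weights).

```go
package main

import "fmt"

// predictBatched computes the dot product of every row with weights, one
// batch at a time, writing the results into a single pre-allocated slice.
func predictBatched(rows [][]float64, weights []float64, batchSize int) []float64 {
	if batchSize <= 0 {
		batchSize = len(rows) // fall back to a single batch
	}
	pred := make([]float64, len(rows))
	for offset := 0; offset < len(rows); offset += batchSize {
		end := offset + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		out := pred[offset:end] // aliases pred, so writes land in the final result
		for i, row := range rows[offset:end] {
			sum := 0.0
			for j, w := range weights {
				sum += row[j] * w
			}
			out[i] = sum
		}
	}
	return pred
}

func main() {
	rows := [][]float64{{1, 2}, {3, 4}, {5, 6}}
	weights := []float64{0.5, 1}
	fmt.Println(predictBatched(rows, weights, 2)) // [2.5 5.5 8.5]
}
```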

Documentation ¶

Index ¶

func CheckNoColumnOverlap(dfs []*DataFrame) error
type BoolAccess
- func (access BoolAccess) Get(row int) bool
- func (access BoolAccess) Set(row int, val bool)
type CSVReadingSpec
type CSVWritingSpec
type ColumnAccess
- func (access ColumnAccess) SharedIndex(localIndex int) int
- func (access ColumnAccess) Size() int
type ColumnHeader
- func Columns(names ...string) ColumnHeader
- func (h ColumnHeader) And(others ...ColumnHeader) ColumnHeader
- func (h ColumnHeader) Copy() ColumnHeader
- func (h ColumnHeader) Except(columns ...string) ColumnHeader
- func (h ColumnHeader) ExceptHeader(others ...ColumnHeader) ColumnHeader
- func (h ColumnHeader) NameList() []string
- func (h ColumnHeader) NameSet() map[string]bool
- func (h ColumnHeader) Num() int
type DataBuilder
- func (builder DataBuilder) AddBools(col string, values ...bool) DataBuilder
- func (builder DataBuilder) AddFloats(col string, values ...float64) DataBuilder
- func (builder DataBuilder) AddInts(col string, values ...int) DataBuilder
- func (builder DataBuilder) AddObjects(col string, values ...interface{}) DataBuilder
- func (builder DataBuilder) AddStrings(col string, values ...string) DataBuilder
- func (builder DataBuilder) MarkAsString(col string) DataBuilder
- func (builder DataBuilder) SetBools(col string, values []bool) DataBuilder
- func (builder DataBuilder) SetFloats(col string, values []float64) DataBuilder
- func (builder DataBuilder) SetInts(col string, values []int) DataBuilder
- func (builder DataBuilder) SetObjects(col string, values []interface{}) DataBuilder
- func (builder DataBuilder) TextEncoding(encoding encoding.Encoding) DataBuilder
- func (builder DataBuilder) ToDataFrame() *DataFrame
type DataFrame
- func ColumnConcatView(dfs ...*DataFrame) (*DataFrame, error)
- func ColumnCopyConcat(dfs ...*DataFrame) (*DataFrame, error)
- func ColumnSmartConcat(dfs ...*DataFrame) (*DataFrame, error)
- func EmptyDataFrame(nRows int, maxCPU int) *DataFrame
- func RowConcat(dfs ...*DataFrame) (*DataFrame, error)
- func (df *DataFrame) AreIndicesAltered() bool
- func (df *DataFrame) Bools(colName string) BoolAccess
- func (df *DataFrame) ColumnView(columns ...string) *DataFrame
- func (df *DataFrame) Copy() *DataFrame
- func (df *DataFrame) CopyValuesToInterfaces(colName string) []interface{}
- func (df *DataFrame) Debug(enable bool) *DataFrame
- func (df *DataFrame) DetachedView(columns ...string) *DataFrame
- func (df *DataFrame) EmptyMask() []bool
- func (df *DataFrame) Encode(newEncoding encoding.Encoding) error
- func (df *DataFrame) Floats(colName string) FloatAccess
- func (df *DataFrame) GoodShortNames(minLength int) map[string]string
- func (df *DataFrame) HashStringsView(columns ...string) *DataFrame
- func (df *DataFrame) IndexView(indices []int) *DataFrame
- func (df *DataFrame) Ints(colName string) IntAccess
- func (df *DataFrame) LabelToInt(colName string) ([]int, map[string]int)
- func (df *DataFrame) MaskView(mask []bool) *DataFrame
- func (df *DataFrame) NumRows() int
- func (df *DataFrame) Objects(colName string) ObjectAccess
- func (df *DataFrame) OverwriteBools(colName string, values []bool)
- func (df *DataFrame) OverwriteFloats32(colName string, values []float32)
- func (df *DataFrame) OverwriteFloats64(colName string, values []float64)
- func (df *DataFrame) OverwriteInts(colName string, values []int)
- func (df *DataFrame) OverwriteObjects(colName string, values []interface{}, objectType ObjectType)
- func (df *DataFrame) OverwriteStrings(colName string, values []string)
- func (df *DataFrame) PrintHead(n int, floatFormat string) *DataFrame
- func (df *DataFrame) PrintRecords(n int, floatFormat string, shorthands map[string]string) *DataFrame
- func (df *DataFrame) PrintSummary() *DataFrame
- func (df *DataFrame) ResetIndexView() *DataFrame
- func (df *DataFrame) ReverseView() *DataFrame
- func (df *DataFrame) SampleView(n int, replacement bool) *DataFrame
- func (df *DataFrame) ShallowCopy() *DataFrame
- func (df *DataFrame) ShuffleView() *DataFrame
- func (df *DataFrame) SliceView(from int, to int) *DataFrame
- func (df *DataFrame) SortedView(byColumn string) *DataFrame
- func (df *DataFrame) SplitNView(n int) []*DataFrame
- func (df *DataFrame) SplitTrainTestViews(testingRatio float64) (*DataFrame, *DataFrame)
- func (df *DataFrame) SplitView(batchSize int) []*DataFrame
- func (df *DataFrame) Strings(colName string) StringAccess
- func (df *DataFrame) ThreadSafeMasking(enable bool) *DataFrame
- func (df *DataFrame) To1CSV(r io.Writer, options CSVWritingSpec) error
- func (df *DataFrame) ToCSVDir(options CSVWritingSpec, prefix string) ([]string, error)
- func (df *DataFrame) ToCSVFiles(options CSVWritingSpec, paths ...string) error
- func (df *DataFrame) ToCSVs(writers []io.Writer, options CSVWritingSpec) error
- func (df *DataFrame) TopView(byColumn string, n int, ascending bool, sorted bool) *DataFrame
- func (df *DataFrame) View() *DataFrame
- func (df *DataFrame) ZeroMask() []bool
type DataFrameInternals
- func (dfi DataFrameInternals) BoolData(columnName string) []bool
- func (dfi DataFrameInternals) FloatData(columnName string) []float64
- func (dfi DataFrameInternals) GetIndices() []int
- func (dfi DataFrameInternals) GetMask() []bool
- func (dfi DataFrameInternals) IntData(columnName string) []int
- func (dfi DataFrameInternals) ObjectData(columnName string) []interface{}
type Dense64Batching
- func NewDense64Batching(columns []string) *Dense64Batching
- func (bat *Dense64Batching) DenseMatrix(df *DataFrame) mat.Matrix
type Float32Iterator
- func NewFloat32Iterator(df *DataFrame, columns []string) *Float32Iterator
- func (ite *Float32Iterator) NextRow() ([]float32, int, int)
type FloatAccess
- func (access FloatAccess) Get(row int) float64
- func (access FloatAccess) Set(row int, val float64)
- func (access FloatAccess) VecDense() *mat.VecDense
- func (access FloatAccess) VecDenseCopy() *mat.VecDense
type FloatBatching
type IntAccess
- func (access IntAccess) Get(row int) int
- func (access IntAccess) Set(row int, val int)
type ObjectAccess
- func (access ObjectAccess) Get(row int) interface{}
- func (access ObjectAccess) Set(row int, val interface{})
type ObjectType
type RawData
- func FromCSV(r io.Reader, options CSVReadingSpec) (*RawData, error)
- func FromCSVFile(path string, options CSVReadingSpec) (*RawData, error)
- func FromCSVFilePattern(glob string, options CSVReadingSpec) (*RawData, error)
- func MergeRawDataColumns(list []*RawData) *RawData
- func MergeRawDataRows(list []*RawData) *RawData
- func NewRawData() *RawData
- func (data *RawData) ActualMaxCPU() int
- func (data *RawData) AllocBools(columns ...string)
- func (data *RawData) AllocFloats(columns ...string)
- func (data *RawData) AllocInts(columns ...string)
- func (data *RawData) AllocObjects(columns ...string)
- func (data *RawData) AllocStrings(columns ...string)
- func (data *RawData) BoolHeader() ColumnHeader
- func (data *RawData) BoolToFloats(columns ...string)
- func (data *RawData) CheckConsistency(t *testing.T) bool
- func (data *RawData) CreateColumnQueue(columns []string) utils.StringQ
- func (data *RawData) Drop(columns ...string)
- func (data *RawData) FloatHeader() ColumnHeader
- func (data *RawData) Header() ColumnHeader
- func (data *RawData) IntHeader() ColumnHeader
- func (data *RawData) IntToFloats(columns ...string)
- func (data *RawData) NumAllocatedRows() int
- func (data *RawData) NumColumns() int
- func (data *RawData) ObjectHeader() ColumnHeader
- func (df *RawData) PrintUIDs()
- func (data *RawData) Rename(oldName string, newName string)
- func (data *RawData) SetMaxCPU(maxCPU int)
- func (data *RawData) StringHeader() ColumnHeader
- func (data *RawData) ToDataFrame() *DataFrame
- func (data *RawData) TransferRawDataFrom(from *RawData)
- func (data *RawData) Unshare(columns ...string)
type RowIterator
- func (ite *RowIterator) Columns() []string
- func (ite *RowIterator) Reset(df *DataFrame, check bool)
type StringAccess
- func (access StringAccess) Get(row int) string
- func (access StringAccess) Set(row int, val string)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CheckNoColumnOverlap ¶

func CheckNoColumnOverlap(dfs []*DataFrame) error

CheckNoColumnOverlap returns an error if two or more dataframes have one or more columns in common, regardless of their type. It returns nil if there is no overlap.

Types ¶

type BoolAccess ¶

type BoolAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

BoolAccess is a random-access iterator for boolean columns.

func (BoolAccess) Get ¶

func (access BoolAccess) Get(row int) bool

Get returns the boolean value at the given index.

func (BoolAccess) Set ¶

func (access BoolAccess) Set(row int, val bool)

Set overwrites the boolean value at the given index.

type CSVReadingSpec ¶

type CSVReadingSpec struct {
	// This is to multi-thread the type conversions.
	// Zero and negative values mean ALL cpus on your machine.
	// The created RawData will also inherit from this value.
	MaxCPU int

	// Optional header if the CSV has no header.
	Header []string

	// Columns to exclude.
	Exclude []string

	// List of string literals that will be interpreted as missing values.
	MissingValues []string

	// Read integers and/or bool as floats.
	IntAsFloat    bool
	BoolAsFloat   bool // 'true', 'false', '0' and '1' converted to 0.0 and 1.0
	BinaryAsFloat bool // '0' and '1' converted to 0.0 and 1.0

	// How the CSV is encoded.
	// if not provided, it will ignore the encoding and fall back to UTF-8 if a
	// conversion is needed.
	// Note:
	// - the CSV is not decoded at reading.
	// - you can run nearly every function of ml-essentials without ever knowing
	//   the encoding
	Encoding encoding.Encoding

	// Options from https://golang.org/src/encoding/csv/reader.go
	Comma            rune
	Comment          rune
	LazyQuotes       bool
	TrimLeadingSpace bool
}

type CSVWritingSpec ¶

type CSVWritingSpec struct {
	// missing values will be replaced with this string. Default: ""
	StringMissingMarker string
	// Value used by ToCSVDir when splitting the dataframe into multiple files.
	// If there are more rows than MinRowsPerFile, it will be split depending on
	// the MaxCPU attached to the dataframe.
	// Default: 512 rows
	MinRowsPerFile int
	// If maintaining the row order is not important, I encourage you to set this
	// value to false.
	MaintainOrder bool
	// Options from https://golang.org/src/encoding/csv/writer.go
	Comma   rune // Field delimiter (set to ',' by NewWriter)
	UseCRLF bool // True to use \r\n as the line terminator

}

type ColumnAccess ¶

type ColumnAccess struct {
	// contains filtered or unexported fields
}

func (ColumnAccess) SharedIndex ¶

func (access ColumnAccess) SharedIndex(localIndex int) int

SharedIndex returns the index to the backing data, given an index to the column. You probably don't need this function.

func (ColumnAccess) Size ¶

func (access ColumnAccess) Size() int

Size returns the length of the column.

type ColumnHeader ¶

type ColumnHeader struct {
	// contains filtered or unexported fields
}

ColumnHeader helps you manipulate column names

func Columns ¶

func Columns(names ...string) ColumnHeader

Columns creates a ColumnHeader from a list of column names.

func (ColumnHeader) And ¶

func (h ColumnHeader) And(others ...ColumnHeader) ColumnHeader

And adds all the columns from the given other ColumnHeaders. It returns a shallow-copy of itself, not an entirely new ColumnHeader.

func (ColumnHeader) Copy ¶

func (h ColumnHeader) Copy() ColumnHeader

Copy returns a deep-copy of the ColumnHeader

func (ColumnHeader) Except ¶

func (h ColumnHeader) Except(columns ...string) ColumnHeader

Except removes all the columns given as arguments. It returns a shallow-copy of itself, not an entirely new ColumnHeader.

func (ColumnHeader) ExceptHeader ¶

func (h ColumnHeader) ExceptHeader(others ...ColumnHeader) ColumnHeader

ExceptHeader removes all columns from other ColumnHeaders. It returns a shallow-copy of itself, not an entirely new ColumnHeader.

func (ColumnHeader) NameList ¶

func (h ColumnHeader) NameList() []string

NameList returns the list of columns in the header. Altering the returned slice won't alter ColumnHeader.

func (ColumnHeader) NameSet ¶

func (h ColumnHeader) NameSet() map[string]bool

NameSet returns the set of columns in the header for read-only access. This is faster than NameList()

func (ColumnHeader) Num ¶

func (h ColumnHeader) Num() int

Num returns the number of columns in the ColumnHeader

type DataBuilder ¶

type DataBuilder struct {
	RawData *RawData
}

DataBuilder is a helper structure to build dataframes. Use dataframe.DataBuilder{RawData: dataframe.NewRawData()} to initialize it.

func (DataBuilder) AddBools ¶

func (builder DataBuilder) AddBools(col string, values ...bool) DataBuilder

AddBools adds a list of bools to the given boolean column. It returns a shallow copy of itself.

func (DataBuilder) AddFloats ¶

func (builder DataBuilder) AddFloats(col string, values ...float64) DataBuilder

AddFloats adds a list of floats to the given float column. It returns a shallow copy of itself.

func (DataBuilder) AddInts ¶

func (builder DataBuilder) AddInts(col string, values ...int) DataBuilder

AddInts adds a list of ints to the given int column. It returns a shallow copy of itself.

func (DataBuilder) AddObjects ¶

func (builder DataBuilder) AddObjects(col string, values ...interface{}) DataBuilder

AddObjects adds a list of objects to the given object column. It returns a shallow copy of itself. You can use this function to add strings too.

func (DataBuilder) AddStrings ¶

func (builder DataBuilder) AddStrings(col string, values ...string) DataBuilder

AddStrings adds a list of strings to the given object column. It returns a shallow copy of itself. If you need to add nils (= missing value), use AddObjects(col, ...) followed by MarkAsString(col).

func (DataBuilder) MarkAsString ¶

func (builder DataBuilder) MarkAsString(col string) DataBuilder

MarkAsString tags a given object column as a string-only column. This gives access to functionalities that generic object columns don't have.

func (DataBuilder) SetBools ¶

func (builder DataBuilder) SetBools(col string, values []bool) DataBuilder

SetBools adds or replaces the values of the given boolean column. Values are not copied, so if you change them it will change them everywhere. It returns a shallow copy of itself.

func (DataBuilder) SetFloats ¶

func (builder DataBuilder) SetFloats(col string, values []float64) DataBuilder

SetFloats adds or replaces the values of the given float column. Values are not copied, so if you change them it will change them everywhere. It returns a shallow copy of itself.

func (DataBuilder) SetInts ¶

func (builder DataBuilder) SetInts(col string, values []int) DataBuilder

SetInts adds or replaces the values of the given integer column. Values are not copied, so if you change them it will change them everywhere. It returns a shallow copy of itself.

func (DataBuilder) SetObjects ¶

func (builder DataBuilder) SetObjects(col string, values []interface{}) DataBuilder

SetObjects adds or replaces the values of the given object column. Values are not copied, so if you change them it will change them everywhere. It returns a shallow copy of itself. If you want to set a slice of strings, you'll need to convert the slice to a slice of interfaces and call MarkAsString(col).

func (DataBuilder) TextEncoding ¶

func (builder DataBuilder) TextEncoding(encoding encoding.Encoding) DataBuilder

TextEncoding informs ml-essentials that the strings that you have provided are encoded in the given encoding. If this function is never called or if nil is passed as argument, it will be assumed that all the strings are utf8-encoded. Even if the strings are not utf8-encoded, it is not mandatory to call this function since encoding is rarely ever used by ml-essentials. TextEncoding returns a shallow copy of itself.

func (DataBuilder) ToDataFrame ¶

func (builder DataBuilder) ToDataFrame() *DataFrame

ToDataFrame() creates a dataframe out of the RawData object. It will panic if the columns are of different sizes. The returned dataframe shares its data and structure with the encapsulated rawdata.

type DataFrame ¶

type DataFrame struct {
	RawData
	// contains filtered or unexported fields
}

DataFrame is a structure that lets you manipulate both original data and views on other dataframes' data by sharing the underlying data. The data is ordered by column.

func ColumnConcatView ¶

func ColumnConcatView(dfs ...*DataFrame) (*DataFrame, error)

ColumnConcatView merges the columns from multiple dataframes. For example, if columns(df1)=[col1, col2] and columns(df2)=[col3], then columns(ColumnConcatView(df1, df2)) = [col1, col2, col3]. It returns an error if the number of rows or the inner indices don't match. The returned dataframe shares data with the dataframes given as arguments, so changing the input dataframes will also change the returned dataframe. Numpy equivalent: concat(dfs, axis=1)

func ColumnCopyConcat ¶

func ColumnCopyConcat(dfs ...*DataFrame) (*DataFrame, error)

ColumnCopyConcat merges the columns from multiple dataframes. For example, if columns(df1)=[col1, col2] and columns(df2)=[col3], then columns(ColumnCopyConcat(df1, df2)) = [col1, col2, col3]. It returns an error if the number of rows doesn't match. The returned dataframe doesn't share any data with the input dataframes, so the returned dataframe is safe to change. Numpy equivalent: concat(dfs, axis=1)

func ColumnSmartConcat ¶

func ColumnSmartConcat(dfs ...*DataFrame) (*DataFrame, error)

ColumnSmartConcat merges the columns from multiple dataframes. For example, if columns(df1)=[col1, col2] and columns(df2)=[col3], then columns(ColumnSmartConcat(df1, df2)) = [col1, col2, col3]. It returns an error if the number of rows doesn't match. The returned dataframe shares data with the dataframes given as arguments, unless said dataframes' inner indices are not congruent. Use this function if you are not going to change the returned dataframe and want to avoid unnecessary copies when possible. Numpy equivalent: concat(dfs, axis=1)

func EmptyDataFrame ¶

func EmptyDataFrame(nRows int, maxCPU int) *DataFrame

EmptyDataFrame creates a new dataframe with no columns. maxCPU indicates how many CPUs are allowed to be utilized by the functions operating on the dataframe.

func RowConcat ¶

func RowConcat(dfs ...*DataFrame) (*DataFrame, error)

RowConcat concatenates the rows of the given dataframes. All the data is copied, i.e. the returned dataframe does not share any data or structure with the dataframes given as arguments. It returns an error if the dataframes don't have the same columns. Numpy equivalent: concat(dfs, axis=0)

func (*DataFrame) AreIndicesAltered ¶

func (df *DataFrame) AreIndicesAltered() bool

AreIndicesAltered returns true if the internal list of indices is not range(0, df.NumRows())

func (*DataFrame) Bools ¶

func (df *DataFrame) Bools(colName string) BoolAccess

Bools returns an iterator on a given boolean column

func (*DataFrame) ColumnView ¶

func (df *DataFrame) ColumnView(columns ...string) *DataFrame

ColumnView selects a subset of columns.

func (*DataFrame) Copy ¶

func (df *DataFrame) Copy() *DataFrame

Copy returns a deep-copy of everything inside the dataframe, with one exception: the slices backing the object columns are copied, but the objects they hold are not. Call this function if you want to transform a view into a compact dataframe. Compact dataframes are more efficient, but making a copy can be expensive.

func (*DataFrame) CopyValuesToInterfaces ¶

func (df *DataFrame) CopyValuesToInterfaces(colName string) []interface{}

CopyValuesToInterfaces returns a copy of a column's data packed into an interface slice, regardless of the column's type.

func (*DataFrame) Debug ¶

func (df *DataFrame) Debug(enable bool) *DataFrame

Debug enables or disables the debugging mode. The debugging mode will print out some troubleshooting information via golang's builtin logger. It returns the dataframe itself.

func (*DataFrame) DetachedView ¶

func (df *DataFrame) DetachedView(columns ...string) *DataFrame

DetachedView makes sure that the given columns can be altered without altering the original data from some parent dataframe. It will perform a copy only if the data is shared. This is useful when you execute a function that changes the data in-place:

view := df.DetachedView("height")
view.OverwriteFloats64("height", []float64{173, 174, 162, 185})

Caveat: this can be an expensive action if the data that backs the dataframe is large, even if the dataframe at hand doesn't have many rows.

func (*DataFrame) EmptyMask ¶

func (df *DataFrame) EmptyMask() []bool

EmptyMask returns a possibly pre-allocated mask for the MaskView function. The values of the mask are not initialized and can be either true or false. Intended use:

m := df.EmptyMask()
for i := 0; i < df.NumRows(); i++ {
  m[i] = i%10 == 0 // keep one row out of ten
}
df = df.MaskView(m)

Do not concurrently use this function unless you call ThreadSafeMasking(true) first.

func (*DataFrame) Encode ¶

func (df *DataFrame) Encode(newEncoding encoding.Encoding) error

Encode changes the encoding of all the string columns. It returns an error if it cannot be encoded into the desired encoding, or decoded using the current encoding. If encoding is nil, strings will be encoded in UTF-8.

func (*DataFrame) Floats ¶

func (df *DataFrame) Floats(colName string) FloatAccess

Floats returns an iterator on a given float column

func (*DataFrame) GoodShortNames ¶

func (df *DataFrame) GoodShortNames(minLength int) map[string]string

GoodShortNames returns short versions of column names for PrintRecords(). minLength is the minimum length of shortened names. If minLength is zero or negative, it will default to minLength=3. This function is not deterministic.

func (*DataFrame) HashStringsView ¶

func (df *DataFrame) HashStringsView(columns ...string) *DataFrame

HashStringsView hashes the string columns given as argument, thereby transforming string columns into integer columns. The hashing algorithm always returns the same int if given the same string. Missing strings will be converted to -1. This function is primarily meant to be used as a first step before categorical encoding. HashStringsView is multi-threaded.
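
The documentation does not say which hashing algorithm HashStringsView uses. The sketch below uses FNV-1a from the standard library purely as an assumption, to show the contract described above: equal strings map to equal integers and missing values map to -1. hashStrings is a hypothetical helper, it is single-threaded, and it also maps non-string objects to -1, which is a simplification of this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashStrings maps each string to a non-negative int and each nil
// (or non-string) entry to -1.
func hashStrings(values []interface{}) []int {
	out := make([]int, len(values))
	for i, v := range values {
		s, ok := v.(string)
		if !ok {
			out[i] = -1 // missing value
			continue
		}
		h := fnv.New32a()
		h.Write([]byte(s))
		// Drop one bit so the result stays non-negative even where int is 32 bits.
		out[i] = int(h.Sum32() >> 1)
	}
	return out
}

func main() {
	hashed := hashStrings([]interface{}{"Karen", nil, "Karen", "John"})
	fmt.Println(hashed[0] == hashed[2], hashed[1]) // true -1
}
```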

func (*DataFrame) IndexView ¶

func (df *DataFrame) IndexView(indices []int) *DataFrame

IndexView builds a view from a selection of rows. The given slice of indices is typically a subset of range(0, df.NumRows()), but it can also be a different order of range(0, df.NumRows()) or a repetition of some indices, thereby making the view larger than its parent dataframe. It is equivalent to x[indices] where x is a Python numpy array, except that IndexView doesn't do any copy.

func (*DataFrame) Ints ¶

func (df *DataFrame) Ints(colName string) IntAccess

Ints returns an iterator on a given integer column

func (*DataFrame) LabelToInt ¶

func (df *DataFrame) LabelToInt(colName string) ([]int, map[string]int)

LabelToInt maps one column's string values to a range of integers starting from zero, and returns both the converted strings and the mapping. For example: conversion(['a', 'b', 'a', 'c']) -> [0, 1, 0, 2] via the mapping a: 0, b: 1, c: 2. If a string is nil, the string will be converted to -1. Use this function to convert classification labels into integers.
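
A minimal sketch of such a mapping follows. labelToInt is a hypothetical standalone function, not the library's code; it assigns integers in order of first appearance, which matches the example above but is an assumption about how the library orders labels.

```go
package main

import "fmt"

// labelToInt converts labels to integers starting from zero, in order of
// first appearance. nil (or non-string) entries are converted to -1.
func labelToInt(values []interface{}) ([]int, map[string]int) {
	mapping := make(map[string]int)
	ints := make([]int, len(values))
	for i, v := range values {
		label, ok := v.(string)
		if !ok {
			ints[i] = -1 // missing label
			continue
		}
		id, seen := mapping[label]
		if !seen {
			id = len(mapping)
			mapping[label] = id
		}
		ints[i] = id
	}
	return ints, mapping
}

func main() {
	ints, mapping := labelToInt([]interface{}{"a", "b", "a", "c"})
	fmt.Println(ints, mapping) // [0 1 0 2] map[a:0 b:1 c:2]
}
```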

func (*DataFrame) MaskView ¶

func (df *DataFrame) MaskView(mask []bool) *DataFrame

MaskView builds a view by masking some rows of the dataframe. To avoid unnecessary allocations, please get a pre-allocated mask from DataFrame.EmptyMask() or DataFrame.ZeroMask(). MaskView is functionally equivalent to:

indices := make([]int, 0)
for i, b := range mask {
  if b {
    indices = append(indices, i)
  }
}
maskedView := df.IndexView(indices)

func (*DataFrame) NumRows ¶

func (df *DataFrame) NumRows() int

NumRows returns the number of rows in the dataframe.

func (*DataFrame) Objects ¶

func (df *DataFrame) Objects(colName string) ObjectAccess

Objects returns an iterator on a given object column, including string columns.

func (*DataFrame) OverwriteBools ¶

func (df *DataFrame) OverwriteBools(colName string, values []bool)

OverwriteBools (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Bools(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

func (*DataFrame) OverwriteFloats32 ¶

func (df *DataFrame) OverwriteFloats32(colName string, values []float32)

OverwriteFloats32 (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Floats(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, float64(values[i]))
}

func (*DataFrame) OverwriteFloats64 ¶

func (df *DataFrame) OverwriteFloats64(colName string, values []float64)

OverwriteFloats64 (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Floats(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

func (*DataFrame) OverwriteInts ¶

func (df *DataFrame) OverwriteInts(colName string, values []int)

OverwriteInts (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Ints(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

func (*DataFrame) OverwriteObjects ¶

func (df *DataFrame) OverwriteObjects(colName string, values []interface{},
	objectType ObjectType)

OverwriteObjects (over)writes the given column with the given values. The given slice is copied, so it can safely be altered after this call. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Objects(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

The third argument is only used if the column doesn't exist and has to be created. It is the only way to mix strings with nil values and yet benefit from dataframe operations specialized for strings such as HashStringsView.

func (*DataFrame) OverwriteStrings ¶

func (df *DataFrame) OverwriteStrings(colName string, values []string)

OverwriteStrings (over)writes the given column with the given values. The given values are always copied, so the given slice can be safely altered after calling this function. If the column doesn't exist, it will create a new column. Otherwise, it is functionally equivalent to:

access := df.Objects(colName)
for i := 0; i < len(values); i++ {
  access.Set(i, values[i])
}

If you need to overwrite strings with missing values, use OverwriteObjects instead.

func (*DataFrame) PrintHead ¶

func (df *DataFrame) PrintHead(n int, floatFormat string) *DataFrame

PrintHead prints the first n rows of the dataframe. If n is negative or greater than the number of rows, it prints all the rows. floatFormat describes how you want floats to be printed, e.g. %.4f. floatFormat defaults to %.3f. Everything is printed on stdout, nothing on stderr. PrintHead returns the dataframe itself so you can write:

df.PrintSummary().PrintHead(n, "") or df.PrintHead(n, "").PrintSummary()

func (*DataFrame) PrintRecords ¶

func (df *DataFrame) PrintRecords(n int, floatFormat string, shorthands map[string]string) *DataFrame

PrintRecords does the same as PrintHead but prints one line for each row. shorthands maps column names to shorter column names in order to avoid cluttering the output. You can leave it empty, nil, or call GoodShortNames() to get optimally small truncated names.

func (*DataFrame) PrintSummary ¶

func (df *DataFrame) PrintSummary() *DataFrame

PrintSummary prints information about the content of the dataframe, such as the name of the columns and the number of rows. It doesn't print the data. Everything is printed on stdout. Nothing on stderr. PrintSummary returns the dataframe itself so you can write

df.PrintSummary().PrintHead(n, "") or df.PrintHead(n, "").PrintSummary()

func (*DataFrame) ResetIndexView ¶

func (df *DataFrame) ResetIndexView() *DataFrame

ResetIndexView sorts the indices in order to speed up sequential access to the columns, including row iterators and gonum matrices. Speed is the only reason to reset the indices. It's different than pandas' eponymous function because pandas uses indices when concatenating columns, whereas ml-essentials does not. Do not use this function if order matters, as when you rely on the data being shuffled.

func (*DataFrame) ReverseView ¶

func (df *DataFrame) ReverseView() *DataFrame

ReverseView flips the order of the rows.

func (*DataFrame) SampleView ¶

func (df *DataFrame) SampleView(n int, replacement bool) *DataFrame

SampleView randomly samples n rows from the dataframe. Sampling with replacement is not yet supported. Sampling without replacement is functionally equivalent to:

df.ShuffleView().SliceView(0, n)
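
The same shuffle-then-slice idea can be sketched on row indices with the standard library. sampleIndices is a hypothetical helper; the explicit seed parameter and the clamping of n to the number of rows are choices made for this sketch, not documented behavior of SampleView.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleIndices returns n distinct row indices drawn from [0, numRows).
// Clamping n to [0, numRows] is a choice made for this sketch.
func sampleIndices(numRows, n int, seed int64) []int {
	if n > numRows {
		n = numRows
	}
	if n < 0 {
		n = 0
	}
	rng := rand.New(rand.NewSource(seed)) // fixed seed: reproducible sample
	return rng.Perm(numRows)[:n]          // shuffle, then keep the first n
}

func main() {
	fmt.Println(sampleIndices(10, 3, 42))
}
```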

func (*DataFrame) ShallowCopy ¶

func (df *DataFrame) ShallowCopy() *DataFrame

ShallowCopy copies the dataframe's structure but not the data. In 90% of cases, you would rather use View(), which doesn't even copy the structure up until the structure is modified. ShallowCopy returns a view on the dataframe.

func (*DataFrame) ShuffleView ¶

func (df *DataFrame) ShuffleView() *DataFrame

ShuffleView randomizes the dataframe. This is functionally equivalent to this pseudo-code:

indices = range(0, df.NumRows())
shuffle(indices)
shuffledView = df.IndexView(indices)

If you want ShuffleView to behave deterministically, you need to call rand.Seed(seed) somewhere in your program prior to calling ShuffleView.

func (*DataFrame) SliceView ¶

func (df *DataFrame) SliceView(from int, to int) *DataFrame

SliceView builds a view from a slice of the dataframe from index "from" (included) to index "to" (excluded). If "from" or "to" is negative, the index is relative to the end of the dataframe. For example, -1 points to the last index of the dataframe. If "from" is higher than "to", the row order will be reversed.
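
The handling of negative bounds can be sketched as below, assuming Python-like resolution where a negative index i refers to row NumRows()+i. resolveIndex and sliceIndices are hypothetical helpers; the reversed case ("from" higher than "to") is deliberately left out because its exact bounds are not specified here.

```go
package main

import "fmt"

// resolveIndex converts a possibly negative index into an absolute one,
// so that -1 refers to the last row.
func resolveIndex(i, numRows int) int {
	if i < 0 {
		return numRows + i
	}
	return i
}

// sliceIndices lists the rows selected by [from, to) once both bounds are
// resolved. It returns an empty list when from >= to: the reversed case
// is not modeled in this sketch.
func sliceIndices(from, to, numRows int) []int {
	from, to = resolveIndex(from, numRows), resolveIndex(to, numRows)
	indices := make([]int, 0)
	for i := from; i < to; i++ {
		indices = append(indices, i)
	}
	return indices
}

func main() {
	fmt.Println(sliceIndices(1, -1, 5)) // everything but the first and last rows
	fmt.Println(sliceIndices(-2, 5, 5)) // the last two rows
}
```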

func (*DataFrame) SortedView ¶

func (df *DataFrame) SortedView(byColumn string) *DataFrame

SortedView sorts the dataframe by ascending order of the given column. The column can be a float, an int or a bool column. It will panic if the given column is none of those. Missing values in integer columns will be treated as '-1'. If called on a bool column, it will put false values first. To sort in descending order, call SortedView(byColumn).ReverseView().
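
Conceptually, a sorted view is a permutation of row indices. The sketch below computes such a permutation for a float column and reverses it for descending order. sortedIndices and reversed are hypothetical helpers; the stable sort is a choice made here, and NaN handling is not modeled.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedIndices returns the row order that sorts the column in ascending
// order. NaN values are not handled in this sketch.
func sortedIndices(column []float64) []int {
	indices := make([]int, len(column))
	for i := range indices {
		indices[i] = i
	}
	sort.SliceStable(indices, func(a, b int) bool {
		return column[indices[a]] < column[indices[b]]
	})
	return indices
}

// reversed flips a row order, which turns ascending into descending.
func reversed(indices []int) []int {
	out := make([]int, len(indices))
	for i, idx := range indices {
		out[len(indices)-1-i] = idx
	}
	return out
}

func main() {
	height := []float64{170, 180, 165}
	asc := sortedIndices(height)
	fmt.Println(asc, reversed(asc))
}
```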

func (*DataFrame) SplitNView ¶

func (df *DataFrame) SplitNView(n int) []*DataFrame

SplitNView evenly divides the dataframe into n parts. It panics if n is negative and returns nil if n is zero. It always returns *exactly* n dataframes. As a result, some dataframes might be empty.
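
One way to picture an even split into exactly n parts is to compute the part sizes. partSizes is a hypothetical helper; giving the extra rows to the first parts is an assumption of this sketch, since the actual distribution is not specified above.

```go
package main

import "fmt"

// partSizes splits numRows into exactly n sizes that differ by at most one.
// Which parts receive the extra rows is an assumption of this sketch.
func partSizes(numRows, n int) []int {
	if n < 0 {
		panic("n must not be negative")
	}
	if n == 0 {
		return nil
	}
	sizes := make([]int, n)
	for i := range sizes {
		sizes[i] = numRows / n
		if i < numRows%n {
			sizes[i]++ // spread the remainder over the first parts
		}
	}
	return sizes
}

func main() {
	fmt.Println(partSizes(10, 3))
	fmt.Println(partSizes(2, 4)) // more parts than rows: some parts are empty
}
```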

func (*DataFrame) SplitTrainTestViews ¶

func (df *DataFrame) SplitTrainTestViews(testingRatio float64) (*DataFrame, *DataFrame)

SplitTrainTestViews returns a training set and a testing set. testingRatio is a number between 0 and 1 such that testSet.NumRows() is approximately testingRatio * df.NumRows(). It will panic if testingRatio is not between 0 and 1 (inclusive). SplitTrainTestViews does not shuffle the input dataframe. It is the user's responsibility to shuffle the dataframe prior to splitting it.

func (*DataFrame) SplitView ¶

func (df *DataFrame) SplitView(batchSize int) []*DataFrame

SplitView divides the dataframe into dataframes of *exactly* batchSize rows, except the last batch, which will be smaller if NumRows() % batchSize != 0. It will panic if batchSize is zero or negative.
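
The batching rule can be sketched by computing the row range of each batch. batch and batchBounds are hypothetical names introduced for this example.

```go
package main

import "fmt"

// batch is a half-open row range [from, to).
type batch struct{ from, to int }

// batchBounds returns ranges of exactly batchSize rows, except the last
// one, which is shorter when numRows is not a multiple of batchSize.
func batchBounds(numRows, batchSize int) []batch {
	if batchSize <= 0 {
		panic("batchSize must be positive")
	}
	var batches []batch
	for from := 0; from < numRows; from += batchSize {
		to := from + batchSize
		if to > numRows {
			to = numRows // last, smaller batch
		}
		batches = append(batches, batch{from, to})
	}
	return batches
}

func main() {
	fmt.Println(batchBounds(10, 4))
}
```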

func (*DataFrame) Strings ¶

func (df *DataFrame) Strings(colName string) StringAccess

Strings returns an iterator on a given string column.

func (*DataFrame) ThreadSafeMasking ¶

func (df *DataFrame) ThreadSafeMasking(enable bool) *DataFrame

ThreadSafeMasking makes the current dataframe and its views safe for masking. It returns the dataframe itself.

func (*DataFrame) To1CSV ¶

func (df *DataFrame) To1CSV(r io.Writer, options CSVWritingSpec) error

To1CSV writes the dataframe in CSV format into the writer given as argument. It returns an error if the writer doesn't allow writing. It also forwards any error raised by golang's builtin CSV writer. To1CSV flushes the writer before returning. This function is not multi-threaded.

func (*DataFrame) ToCSVDir ¶

func (df *DataFrame) ToCSVDir(options CSVWritingSpec, prefix string) ([]string, error)

ToCSVDir writes the dataframe in CSV format to files with the chosen prefix. The prefix includes the directory. For example, the prefix "/tmp/output/result" will write /tmp/output/result01.csv, /tmp/output/result02.csv, etc. The dataframe is split evenly between the files and each file is written in its own goroutine. It returns an error if one of the files doesn't allow writing or if the output directory does not exist. It also forwards any error raised by golang's builtin CSV writer. Alongside the potential error, ToCSVDir returns the list of files written.

func (*DataFrame) ToCSVFiles ¶

func (df *DataFrame) ToCSVFiles(options CSVWritingSpec, paths ...string) error

ToCSVFiles writes the dataframe in CSV format to the given files. The dataframe is split evenly between the files and each file is written in its own goroutine. It returns an error if one of the files doesn't allow writing. It also forwards any error raised by golang's builtin CSV writer.

func (*DataFrame) ToCSVs ¶

func (df *DataFrame) ToCSVs(writers []io.Writer, options CSVWritingSpec) error

ToCSVs writes the dataframe in CSV format into the writers given as argument. The dataframe is split evenly between the writers and each writer is called in its own goroutine. The row order is not guaranteed. It returns an error if one of the writers doesn't allow writing. It also forwards any error raised by golang's builtin CSV writer.

func (*DataFrame) TopView ¶

func (df *DataFrame) TopView(byColumn string, n int, ascending bool, sorted bool) *DataFrame

TopView returns the n rows with the lowest values if ascending=true. It returns the rows with the highest values if ascending=false. The values that serve as criteria are the values from the column byColumn. The column can be a float, an int or a bool column. It will panic if the given column is none of those. If sorted=true, rows will always be sorted according to the desired order. If sorted=false, rows may or may not be sorted. If n is higher than the total number of rows, it will return all the rows. Missing values in integer columns will be treated as '-1'. If called on a bool column, false will be treated as lower than true.
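
The selection rule can be sketched with a full sort followed by truncation, which corresponds to the sorted=true case. topIndices is a hypothetical helper; treating a negative n as zero is a choice made for this sketch, and a real implementation may well avoid sorting every row.

```go
package main

import (
	"fmt"
	"sort"
)

// topIndices returns the indices of the n lowest values (ascending=true)
// or the n highest values (ascending=false), in sorted order.
func topIndices(column []float64, n int, ascending bool) []int {
	indices := make([]int, len(column))
	for i := range indices {
		indices[i] = i
	}
	sort.SliceStable(indices, func(a, b int) bool {
		if ascending {
			return column[indices[a]] < column[indices[b]]
		}
		return column[indices[a]] > column[indices[b]]
	})
	if n > len(indices) {
		n = len(indices) // asking for more rows than available returns them all
	}
	if n < 0 {
		n = 0 // sketch-only choice
	}
	return indices[:n]
}

func main() {
	height := []float64{170, 180, 165}
	fmt.Println(topIndices(height, 2, true))
	fmt.Println(topIndices(height, 2, false))
}
```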

func (*DataFrame) View ¶

func (df *DataFrame) View() *DataFrame

View makes the shallowest copy of the dataframe. It is roughly equivalent to:

copy := *df

Use this function when you want to transform an in-place operation into a view operation, e.g.:

view := df.View()
view.AllocateFloats("height")

func (*DataFrame) ZeroMask ¶

func (df *DataFrame) ZeroMask() []bool

ZeroMask returns a possibly pre-allocated mask for the MaskView function. The values of the mask are all initialized to false. Intended use:

m := df.ZeroMask()
for i := 0; i < df.NumRows(); i++ {
  if i % 10 == 0 {
     m[i] = true
  }
}
df = df.MaskView(m)

Do not concurrently use this function unless you call ThreadSafeMasking(true) first.

type DataFrameInternals ¶

type DataFrameInternals struct {
	DF *DataFrame
}

DataFrameInternals is a helper structure that lets you access the internals of a dataframe. Initialize the structure like this: DataFrameInternals{DF: your_df}. It is normally not needed, hence the lack of documentation.

func (DataFrameInternals) BoolData ¶

func (dfi DataFrameInternals) BoolData(columnName string) []bool

func (DataFrameInternals) FloatData ¶

func (dfi DataFrameInternals) FloatData(columnName string) []float64

func (DataFrameInternals) GetIndices ¶

func (dfi DataFrameInternals) GetIndices() []int

func (DataFrameInternals) GetMask ¶

func (dfi DataFrameInternals) GetMask() []bool

func (DataFrameInternals) IntData ¶

func (dfi DataFrameInternals) IntData(columnName string) []int

func (DataFrameInternals) ObjectData ¶

func (dfi DataFrameInternals) ObjectData(columnName string) []interface{}

type Dense64Batching ¶

type Dense64Batching struct {
	FloatBatching
	// contains filtered or unexported fields
}

func NewDense64Batching ¶

func NewDense64Batching(columns []string) *Dense64Batching

NewDense64Batching allocates a new Dense64Batching structure. Dense64Batching will copy the columns passed as arguments in the same order as given to this function. Dense64Batching recycles the data between successive calls to DenseMatrix, so try to call NewDense64Batching only once and DenseMatrix as many times as needed.

func (*Dense64Batching) DenseMatrix ¶

func (bat *Dense64Batching) DenseMatrix(df *DataFrame) mat.Matrix

DenseMatrix generates a gonum dense matrix from the given dataframe. The returned data is a copy of the dataframe's data, so changing the matrix doesn't change the dataframe. The returned matrix is stored as a transpose. Any operation on this matrix will be faster if it involves another transpose. Dense64Batching recycles the data between successive calls to DenseMatrix, so try to call NewDense64Batching only once and DenseMatrix as many times as needed.

type Float32Iterator ¶

type Float32Iterator struct {
	RowIterator
	// contains filtered or unexported fields
}

Float32Iterator is a structure to iterate over a dataframe one row at a time. The rows provided to the user will be slices of float32. Float32Iterator cannot iterate through object columns. The delivered rows can be safely changed with no effect on the dataframe.

func NewFloat32Iterator ¶

func NewFloat32Iterator(df *DataFrame, columns []string) *Float32Iterator

NewFloat32Iterator allocates a new row iterator to allow you to iterate over float, bool and int columns as floats. If a given column is not float, bool or int, it will be ignored. Row elements will be delivered in the same order as the columns passed as argument.

func (*Float32Iterator) NextRow ¶

func (ite *Float32Iterator) NextRow() ([]float32, int, int)

NextRow returns a single row, its index in the view and its index in the original data. If there are no more rows, it returns nil, the size of the view and the size of the original data. You can safely change the values of the row since they are copies of the original data. However, NextRow recycles the float slice, so you shouldn't store the slice.
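
The warning about storing the slice is easier to see with a toy iterator that reuses one buffer. rowIterator, nextRow and collectRows below are hypothetical and simplified (nextRow returns only the row, not the two indices); the sketch shows that a caller who wants to keep rows has to copy them.

```go
package main

import "fmt"

// rowIterator is a toy iterator that reuses a single buffer for every row.
type rowIterator struct {
	data   [][]float64 // column-major: data[col][row]
	buffer []float32
	next   int
}

func newRowIterator(data [][]float64) *rowIterator {
	return &rowIterator{data: data, buffer: make([]float32, len(data))}
}

// nextRow returns the next row, or nil once the data is exhausted.
// The returned slice is overwritten by the following call.
func (it *rowIterator) nextRow() []float32 {
	if len(it.data) == 0 || it.next >= len(it.data[0]) {
		return nil
	}
	for c, col := range it.data {
		it.buffer[c] = float32(col[it.next])
	}
	it.next++
	return it.buffer
}

// collectRows copies each row before storing it.
func collectRows(it *rowIterator) [][]float32 {
	var rows [][]float32
	for row := it.nextRow(); row != nil; row = it.nextRow() {
		rows = append(rows, append([]float32(nil), row...))
	}
	return rows
}

func main() {
	it := newRowIterator([][]float64{{170, 180}, {1, 0}})
	fmt.Println(collectRows(it))
}
```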

type FloatAccess ¶

type FloatAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

FloatAccess is a random-access iterator for float columns.

func (FloatAccess) Get ¶

func (access FloatAccess) Get(row int) float64

Get returns the float value at the given index.

func (FloatAccess) Set ¶

func (access FloatAccess) Set(row int, val float64)

Set overwrites the float value at the given index.

func (FloatAccess) VecDense ¶

func (access FloatAccess) VecDense() *mat.VecDense

VecDense creates a gonum's VecDense object from the dataframe's float data. The data is not copied if the underlying dataframe is contiguous, otherwise the data is copied. Use this function if you are not going to change the returned VecDense and want to avoid an unnecessary copy.

func (FloatAccess) VecDenseCopy ¶

func (access FloatAccess) VecDenseCopy() *mat.VecDense

VecDenseCopy creates a gonum VecDense object from the dataframe's float data. It always copies the data, so you can change the returned VecDense without changing the dataframe.

type FloatBatching ¶

type FloatBatching struct {
	// contains filtered or unexported fields
}

type IntAccess ¶

type IntAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

IntAccess is a random-access iterator for integer columns.

func (IntAccess) Get ¶

func (access IntAccess) Get(row int) int

Get returns the integer at the given index.

func (IntAccess) Set ¶

func (access IntAccess) Set(row int, val int)

Set overwrites the integer value at the given index.

type ObjectAccess ¶

type ObjectAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

ObjectAccess is a random-access iterator for object columns, including string columns.

func (ObjectAccess) Get ¶

func (access ObjectAccess) Get(row int) interface{}

Get returns the object at the given index.

func (ObjectAccess) Set ¶

func (access ObjectAccess) Set(row int, val interface{})

Set overwrites the object at the given index.

type ObjectType ¶

type ObjectType uint8

ObjectType allows us to distinguish between the possible types of data contained in the object columns.

const (
	AnyObject ObjectType = iota
	StringObject
)

type RawData ¶

type RawData struct {
	// contains filtered or unexported fields
}

RawData is the structure that holds the data of the dataframes. RawData has no concept of index view, so it always manipulates columns as contiguous blocks of data. When encapsulated in a DataBuilder, RawData allows the user to temporarily create columns of different size, up until it is converted to a DataFrame.

func FromCSV ¶

func FromCSV(r io.Reader, options CSVReadingSpec) (*RawData, error)

FromCSV reads CSV data and returns a RawData structure with automatically inferred column types. It returns any error returned by golang's builtin CSV reader. With the default options, types are inferred this way:

- If the column is 100% made of values that can be parsed as bools (0, 1, true, True, false, False or any other variant), it is stored as a bool column.
- Otherwise, if it is 100% made of integers or missing values, it is stored as an integer column. Integer missing values are replaced with -1.
- Otherwise, if it is 100% made of floats or missing values, it is stored as a float column. Float missing values are replaced with NaN.
- If none of the above match, the column is stored as a string column.
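
The inference order can be sketched with the standard strconv parsers. inferType is a hypothetical helper, not the library's implementation; it assumes that the set of accepted bool spellings is the one recognized by strconv.ParseBool and that a missing value rules out a bool column.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferType mimics the inference order: bool, then int, then float,
// then string. The missing set lists the strings treated as missing values.
func inferType(values []string, missing map[string]bool) string {
	allBool, allInt, allFloat := true, true, true
	for _, v := range values {
		if missing[v] {
			allBool = false // assumption: bool columns cannot hold missing values
			continue
		}
		if _, err := strconv.ParseBool(v); err != nil {
			allBool = false
		}
		if _, err := strconv.Atoi(v); err != nil {
			allInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			allFloat = false
		}
	}
	switch {
	case allBool:
		return "bool"
	case allInt:
		return "int"
	case allFloat:
		return "float"
	default:
		return "string"
	}
}

func main() {
	missing := map[string]bool{"": true, "NA": true}
	fmt.Println(inferType([]string{"0", "1", "1"}, missing))
	fmt.Println(inferType([]string{"1", "2", "NA"}, missing))
	fmt.Println(inferType([]string{"1.5", "NA", "2"}, missing))
	fmt.Println(inferType([]string{"Karen", "1"}, missing))
}
```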

func FromCSVFile ¶

func FromCSVFile(path string, options CSVReadingSpec) (*RawData, error)

FromCSVFile reads a CSV file and returns a RawData structure with automatically inferred column types. It returns any error returned by golang's builtin CSV reader. It also returns an error if the file cannot be opened. For the type inference, refer to FromCSV's documentation.

func FromCSVFilePattern ¶

func FromCSVFilePattern(glob string, options CSVReadingSpec) (*RawData, error)

FromCSVFilePattern searches for file paths that match the given glob pattern, reads them and returns a single RawData structure containing all the data packed in an unordered fashion. It returns any error returned by golang's builtin CSV reader. It also returns an error if any of the matching files can't be opened. If no file can be found, it returns (nil, nil). For the type inference, refer to FromCSV's documentation.

func MergeRawDataColumns ¶

func MergeRawDataColumns(list []*RawData) *RawData

MergeRawDataColumns transfers data from multiple RawData structures. It basically calls TransferRawDataFrom on each RawData passed as argument and it is subject to the same limitations.

func MergeRawDataRows ¶

func MergeRawDataRows(list []*RawData) *RawData

MergeRawDataRows concatenates multiple RawData together in a row-wise manner. The RawData can have different columns (not recommended), but be aware that it will panic if you then try to upgrade the resulting RawData to a dataframe right away. The data will be copied and the given structures won't share data, so altering one of the RawData later won't affect the returned RawData. It is the equivalent of numpy.concatenate(list, axis=0).

func NewRawData ¶

func NewRawData() *RawData

NewRawData allocates a new RawData structure.

func (*RawData) ActualMaxCPU ¶

func (data *RawData) ActualMaxCPU() int

ActualMaxCPU returns the maximum number of CPUs that are allowed to be utilized by the functions operating on the dataframe. If such a maximum number of CPUs was never set or if it was set with a number higher than the number of CPU cores on your machine, it will return the number of CPU cores on your machine.

func (*RawData) AllocBools ¶

func (data *RawData) AllocBools(columns ...string)

AllocBools allocates new empty bool columns.

func (*RawData) AllocFloats ¶

func (data *RawData) AllocFloats(columns ...string)

AllocFloats allocates new empty float columns.

func (*RawData) AllocInts ¶

func (data *RawData) AllocInts(columns ...string)

AllocInts allocates new empty integer columns.

func (*RawData) AllocObjects ¶

func (data *RawData) AllocObjects(columns ...string)

AllocObjects allocates new empty object columns.

func (*RawData) AllocStrings ¶

func (data *RawData) AllocStrings(columns ...string)

AllocStrings allocates new empty string columns.

func (*RawData) BoolHeader ¶

func (data *RawData) BoolHeader() ColumnHeader

BoolHeader returns a ColumnHeader with all the boolean column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) BoolToFloats ¶

func (data *RawData) BoolToFloats(columns ...string)

BoolToFloats converts boolean columns into float columns. Note that this runs at a view-free level since it wouldn't make sense to convert only parts of a dataframe's column, given that mixed types are not allowed for numerical columns.

func (*RawData) CheckConsistency ¶

func (data *RawData) CheckConsistency(t *testing.T) bool

func (*RawData) CreateColumnQueue ¶

func (data *RawData) CreateColumnQueue(columns []string) utils.StringQ

Only used by other ml-essentials' packages.

func (*RawData) Drop ¶

func (data *RawData) Drop(columns ...string)

Drop removes the given columns.

func (*RawData) FloatHeader ¶

func (data *RawData) FloatHeader() ColumnHeader

FloatHeader returns a ColumnHeader with all the float column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) Header ¶

func (data *RawData) Header() ColumnHeader

Header returns a ColumnHeader with all the column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) IntHeader ¶

func (data *RawData) IntHeader() ColumnHeader

IntHeader returns a ColumnHeader with all the integer column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) IntToFloats ¶

func (data *RawData) IntToFloats(columns ...string)

IntToFloats converts integer columns into float columns. Note that this runs at a view-free level since it wouldn't make sense to convert only parts of a dataframe's column, given that mixed types are not allowed for numerical columns.

func (*RawData) NumAllocatedRows ¶

func (data *RawData) NumAllocatedRows() int

NumAllocatedRows returns the total number of allocated rows. If this is called via a pointer on a dataframe, the number of allocated rows can be different than the value returned by NumRows(). This method is not reliable on a RawData under construction, e.g. when its columns are being built via a DataBuilder.

func (*RawData) NumColumns ¶

func (data *RawData) NumColumns() int

NumColumns returns the total number of columns.

func (*RawData) ObjectHeader ¶

func (data *RawData) ObjectHeader() ColumnHeader

ObjectHeader returns a ColumnHeader with all the object column names, including that of string columns. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) PrintUIDs ¶

func (df *RawData) PrintUIDs()

func (*RawData) Rename ¶

func (data *RawData) Rename(oldName string, newName string)

Rename changes the name of a column. The new column will be of the same type and share the same data. For example, if you execute:

view := df.View()
view.Rename("apples", "oranges") // Rename returns nothing, so it cannot be chained
view.Ints("oranges").Set(0, 42)

It will change df's number of apples to 42 at index=0.

func (*RawData) SetMaxCPU ¶

func (data *RawData) SetMaxCPU(maxCPU int)

SetMaxCPU sets the number of CPUs that are allowed to be utilized by the functions operating on the dataframe and any view on the dataframe. If maxCPU is 0 or negative, Max CPU will be set to the number of CPU cores on your machine.
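
Taken together, the rules stated for SetMaxCPU and ActualMaxCPU amount to a simple clamp. effectiveMaxCPU is a hypothetical helper written for this sketch; it takes the number of cores as a parameter so the rule can be checked independently of the machine.

```go
package main

import (
	"fmt"
	"runtime"
)

// effectiveMaxCPU applies the documented rules: a zero or negative request,
// or a request above the number of cores, resolves to the number of cores.
func effectiveMaxCPU(requested, cores int) int {
	if requested <= 0 || requested > cores {
		return cores
	}
	return requested
}

func main() {
	cores := runtime.NumCPU()
	fmt.Println(effectiveMaxCPU(-1, cores), effectiveMaxCPU(1, cores))
}
```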

func (*RawData) StringHeader ¶

func (data *RawData) StringHeader() ColumnHeader

StringHeader returns a ColumnHeader with all the string column names. Altering the returned ColumnHeader has no effect on the underlying RawData.

func (*RawData) ToDataFrame ¶

func (data *RawData) ToDataFrame() *DataFrame

ToDataFrame upgrades the RawData structure to a DataFrame. It will panic if the columns are of different size. The data is shared between the original RawData and the returned dataframe, so any change to the RawData will affect the dataframe, and vice versa.

func (*RawData) TransferRawDataFrom ¶

func (data *RawData) TransferRawDataFrom(from *RawData)

TransferRawDataFrom adds the data from another RawData structure. The two structures will share data, so changing one will change the other. Reminder: RawData's functions are ignorant of dataframe indices, so don't expect this function to exclusively transfer viewed data when it's called on a dataframe.

func (*RawData) Unshare ¶

func (data *RawData) Unshare(columns ...string)

Unshare is the in-place, low-level version of DataFrame.DetachView(). For your own sake, please use DataFrame.DetachView() instead.

type RowIterator ¶

type RowIterator struct {
	FloatBatching
	// contains filtered or unexported fields
}

func (*RowIterator) Columns ¶

func (ite *RowIterator) Columns() []string

Columns returns the name of the columns ordered like the row elements are ordered. If called on Float32Iterator or Float64Iterator, it returns the list of columns passed to NewFloat32Iterator and NewFloat64Iterator respectively.

func (*RowIterator) Reset ¶

func (ite *RowIterator) Reset(df *DataFrame, check bool)

Reset recycles the iterator's pre-allocated data for another dataframe with the same columns. If check is true, it will be verified that the columns are the same.

type StringAccess ¶

type StringAccess struct {
	ColumnAccess
	// contains filtered or unexported fields
}

StringAccess is a random-access iterator for string columns.

func (StringAccess) Get ¶

func (access StringAccess) Get(row int) string

Get returns the string at the given index.

func (StringAccess) Set ¶

func (access StringAccess) Set(row int, val string)

Set overwrites the string at the given index.
