Documentation ¶
Overview ¶
Package dstream provides facilities for manipulating streams of typed, multivariate data. A Dstream is a data container that holds a rectangular array of data in which the columns are variables and the rows are cases or observations. The dstream framework facilitates processing data of this type in a streaming manner, with a focus on feeding the data into statistical modeling tools.
The data held by a Dstream are retrieved as chunks of contiguous rows. Within each chunk, the data are stored column-wise. A Dstream visits its chunks in order. When processing a Dstream, call Next to advance to the next chunk, then call Get or GetPos to retrieve the data for one column.
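The iteration pattern above can be sketched as follows. This is a fragment, not a complete program: it assumes an existing Dstream value `ds` holding a float64 variable named "x" (both names are hypothetical).

```go
// A minimal sketch of the chunk-iteration pattern, assuming ds is a
// Dstream with a float64 variable named "x".
ds.Reset() // start from the first chunk
for ds.Next() {
	x := ds.Get("x").([]float64) // data for "x" in the current chunk
	for _, v := range x {
		_ = v // process each value
	}
}
ds.Close()
```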
Most operations on Dstreams take the form of a transformation d = f(d), where d is a Dstream. Many transformations are defined in the package, and it is easy to add new transformations. Transformations can be composed to produce complex transformations from simple components. Examples of transformations are Mutate (modify a column in-place) and DropCols (drop one or more columns from the Dstream).
Index ¶
- func CheckValid(data Dstream) bool
- func Describe(data Dstream) map[string]Stats
- func Equal(x, y Dstream) bool
- func EqualReport(x, y Dstream, report bool) bool
- func GetCol(da Dstream, na string) interface{}
- func GetColPos(da Dstream, j int) interface{}
- func NewBCols(path string, chunksize int) *bcols
- func Save(ds Dstream, w io.Writer)
- func VarMap(d Dstream) map[string]interface{}
- func VarPos(d Dstream) map[string]int
- func VarTypes(d Dstream) map[string]string
- type BColsWriter
- type CSVReader
- func (cs *CSVReader) ChunkSize(c int) *CSVReader
- func (cs *CSVReader) Close()
- func (cs *CSVReader) Comma(c rune) *CSVReader
- func (cs *CSVReader) Done() Dstream
- func (cs *CSVReader) Get(na string) interface{}
- func (cs *CSVReader) GetPos(j int) interface{}
- func (cs *CSVReader) HasHeader() *CSVReader
- func (cs *CSVReader) LimitChunk(n int) *CSVReader
- func (cs *CSVReader) Names() []string
- func (cs *CSVReader) Next() bool
- func (cs *CSVReader) NumObs() int
- func (cs *CSVReader) NumVar() int
- func (cs *CSVReader) ParseTime(f func(string) time.Time) *CSVReader
- func (cs *CSVReader) Reset()
- func (cs *CSVReader) SetTypes(types []VarType) *CSVReader
- func (cs *CSVReader) SkipErrors() *CSVReader
- type CSVWriter
- type DataFrame
- type Dstream
- func AddCol(da Dstream, newdat []float64, newname string) Dstream
- func Center(source Dstream, names ...string) Dstream
- func ConcatHorizontal(streams ...Dstream) Dstream
- func ConcatVertical(streams ...Dstream) Dstream
- func Convert(da Dstream, vname string, dtype Dtype) Dstream
- func Dechunk(source Dstream) Dstream
- func DiffChunk(data Dstream, order map[string]int) Dstream
- func DropCols(data Dstream, dropvars ...string) Dstream
- func DropNA(data Dstream) Dstream
- func Filter(data Dstream, f FilterFunc) Dstream
- func Generate(data Dstream, name string, fnc GenerateFunc, dtype Dtype) Dstream
- func LagChunk(data Dstream, lags map[string]int) Dstream
- func Linapply(data Dstream, coeffs [][]float64, basename string) Dstream
- func MaxChunkSize(data Dstream, size int) Dstream
- func MemCopy(data Dstream, reset bool) Dstream
- func Mutate(ds Dstream, name string, f MutateFunc) Dstream
- func NewFromArrays(data [][]interface{}, names []string) Dstream
- func NewFromFlat(data []interface{}, names []string) Dstream
- func NewLoad(r io.Reader) Dstream
- func Regroup(ds Dstream, groupvar string, sortchunks bool) Dstream
- func ReplaceColumn(data Dstream, name string, coldata interface{}) Dstream
- func Segment(data Dstream, vars ...string) Dstream
- func SelectCols(data Dstream, keepvars ...string) Dstream
- func Shallow(data Dstream) Dstream
- type Dtype
- type FilterFunc
- type GenerateFunc
- type Join
- type MutateFunc
- type Stats
- type VarType
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CheckValid ¶
CheckValid runs through the chunks and confirms that the lengths of the slices within each chunk are the same. If CheckValid returns false, the dstream is in a corrupted state. On completion, the dstream is in its initial state.
func EqualReport ¶
EqualReport compares two Dstream values. If they are not equal, further information is written to the standard error stream. Equality here implies that the data values, types, order, and chunk boundaries are all identical.
func GetCol ¶
GetCol returns a copy of the data for one variable. The data are returned as a slice. The column is returned starting with the current chunk; call Reset first to ensure that the column is extracted from the first chunk.
func GetColPos ¶
GetColPos returns a copy of the data for one variable, identified by position. The data are returned as a slice, which is a copy of the underlying data.
func NewBCols ¶
NewBCols takes data stored in a column-wise compressed format under the given directory path, and returns it as a Dstream. 'path' is the directory containing the data, and 'chunksize' is the number of values included in each chunk.
The underlying bcols format is simple. Each column of data is stored in its own file, in binary native format, compressed using either gzip or snappy compression. A file called 'dtypes.json' contains a dictionary mapping variable names to data types.
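The layout described above can be sketched with the standard library. This is an illustration only: the file name "x.bin.gz", the use of little-endian byte order, and the exact dtypes.json keys are assumptions made for the sketch, not the package's documented conventions.

```go
package main

import (
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir, _ := os.MkdirTemp("", "bcols")
	defer os.RemoveAll(dir)

	// Write one float64 column in binary format, gzip-compressed,
	// the way the bcols layout stores each variable in its own file.
	vals := []float64{1, 2, 3}
	f, _ := os.Create(filepath.Join(dir, "x.bin.gz"))
	z := gzip.NewWriter(f)
	binary.Write(z, binary.LittleEndian, vals)
	z.Close()
	f.Close()

	// dtypes.json maps variable names to data types.
	os.WriteFile(filepath.Join(dir, "dtypes.json"),
		[]byte(`{"x": "float64"}`), 0644)

	// Read the column back to confirm the round trip.
	g, _ := os.Open(filepath.Join(dir, "x.bin.gz"))
	r, _ := gzip.NewReader(g)
	got := make([]float64, 3)
	binary.Read(r, binary.LittleEndian, got)
	fmt.Println(got)
}
```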
Types ¶
type BColsWriter ¶
type BColsWriter struct {
// contains filtered or unexported fields
}
BColsWriter writes a dstream to disk in bcols format.
func NewBColsWriter ¶
func NewBColsWriter(d Dstream) *BColsWriter
NewBColsWriter creates a new BColsWriter that writes the given dstream.
func (*BColsWriter) Path ¶
func (bw *BColsWriter) Path(p string) *BColsWriter
Path sets the location (a directory path) to which the data are written.
type CSVReader ¶
type CSVReader struct {
// contains filtered or unexported fields
}
CSVReader supports reading a Dstream from an io.Reader.
func FromCSV ¶
FromCSV returns a Dstream that reads from a CSV source. Call SetTypes to define the variables to be retrieved. For further configuration, chain calls to the other configuration methods, and finally call Done to produce the Dstream.
Example ¶
data := `Food,Type,Weight,Price
Banana,Fruit,13,9
Cucumber,Vegetable,15,5
Cheese,Dairy,12,35
Lamb,Meat,40,76
`

// Here we read from an in-memory byte buffer,
// but this can be any io.Reader, e.g. a file.
b := bytes.NewBuffer([]byte(data))

types := []VarType{
	{"Food", String},
	{"Type", String},
	{"Weight", Float64},
}

da := FromCSV(b).SetTypes(types).HasHeader().Done()

da.Next() // Always call Next before first call to Get or GetPos

y := da.Get("Type").([]string)
fmt.Printf("%v\n", y)

x := da.Get("Weight").([]float64)
fmt.Printf("%v\n", x)

Output:

[Fruit Vegetable Dairy Meat]
[13 15 12 40]
func (*CSVReader) ChunkSize ¶
ChunkSize sets the size of chunks for this Dstream; it can only be called before reading begins.
func (*CSVReader) Close ¶
func (cs *CSVReader) Close()
Close does nothing and is implemented to satisfy the Dstream interface. If any io.Reader values passed to FromCSV need closing, they should be closed by the caller.
func (*CSVReader) Comma ¶
Comma sets the rune used as the field delimiter for the CSVReader. The default delimiter is a comma.
func (*CSVReader) Done ¶
Done is called when all configuration is complete. After calling Done, the Dstream can be used.
func (*CSVReader) HasHeader ¶
HasHeader indicates that the first row of the data file contains column names. The default behavior is that there is no header.
func (*CSVReader) LimitChunk ¶
LimitChunk sets the number of chunks to read.
func (*CSVReader) NumObs ¶
NumObs returns the number of observations in the dstream. If the dstream has not been fully read, it returns -1.
func (*CSVReader) Reset ¶
func (cs *CSVReader) Reset()
Reset attempts to reset the Dstream that is reading from an io.Reader. This is only possible if the underlying reader is seekable, so Reset panics if the seek cannot be performed.
func (*CSVReader) SetTypes ¶
SetTypes specifies the types of the variables. If the CSV file has a header, these values may appear in any order, and variables not included in types are omitted. If the CSV file does not have a header, then types must match the columns in the file, in the correct order.
func (*CSVReader) SkipErrors ¶
SkipErrors causes lines with unparseable CSV content to be skipped (the csv.ParseError is printed to standard output).
type CSVWriter ¶
type CSVWriter struct {
// contains filtered or unexported fields
}
CSVWriter supports writing a Dstream to an io.Writer in csv format.
func ToCSV ¶
ToCSV writes a Dstream in CSV format. Call SetWriter or Filename to configure the underlying writer, then call additional methods for customization as desired, and finally call Done to complete the writing.
Example ¶
data := `Food,Type,Weight,Price
Banana,Fruit,13,9
Cucumber,Vegetable,15,5
Cheese,Dairy,12,35
Lamb,Meat,40,76
`

b := bytes.NewBuffer([]byte(data))

types := []VarType{
	{"Food", String},
	{"Type", String},
	{"Weight", Float64},
}

da := FromCSV(b).SetTypes(types).HasHeader().Done()

var buf bytes.Buffer
_ = ToCSV(da).SetWriter(&buf).FloatFmt("%.0f").Done()

fmt.Printf("%s\n", buf.String())

Output:

Food,Type,Weight
Banana,Fruit,13
Cucumber,Vegetable,15
Cheese,Dairy,12
Lamb,Meat,40
func (*CSVWriter) FloatFmt ¶
FloatFmt sets the format string to be used when writing float values. This value is ignored for columns specified in a call to the Formats method.
type DataFrame ¶
type DataFrame struct {
// contains filtered or unexported fields
}
DataFrame is an implementation of Dstream based on sharded arrays.
func (*DataFrame) Next ¶
Next advances to the next chunk and returns true if successful. If there are no more chunks, it returns false.
func (*DataFrame) NumObs ¶
NumObs returns the number of observations in the DataFrame, if known. If the number of observations is not known, it returns -1.
type Dstream ¶
type Dstream interface {
	// Next attempts to advance to the next chunk and returns true
	// if successful.
	Next() bool

	// Names returns the variable names.
	Names() []string

	// Get returns the values for one variable in the current
	// chunk, referring to the variable by name.
	Get(string) interface{}

	// GetPos returns the values for one variable in the current
	// chunk, referring to the variable by position.
	GetPos(int) interface{}

	// NumVar returns the number of variables in the data set.
	NumVar() int

	// NumObs returns the number of rows in the data set; it may
	// return -1 if not known.
	NumObs() int

	// Reset sets the provider so that the data are read from the
	// beginning of the dataset.
	Reset()

	// Close frees any resources such as file handles used by the dstream.
	Close()
}
Dstream streams chunks of data to a consumer.
func AddCol ¶
AddCol appends a new column of data to a Dstream. The new data is provided as a single array.
func Center ¶
Center returns a new Dstream in which the given columns have been mean-centered. Currently only works with float64 type data.
func ConcatHorizontal ¶
ConcatHorizontal concatenates a collection of Dstreams horizontally. The column names of all the Dstreams being combined must be distinct.
func ConcatVertical ¶
ConcatVertical concatenates a collection of Dstreams vertically (appending additional observations). The column names and data types of all the Dstreams being combined must be identical.
func Convert ¶
Convert returns a Dstream in which the named variable is converted to the given type.
func DiffChunk ¶
DiffChunk returns a new Dstream in which specified variables are differenced. The differenced values are only computed within a chunk, not across chunk boundaries, and the first d values of each chunk are omitted, where d is the maximum differencing order.
Example ¶
data := `V1,V2,V3,V4
1,2,3,4
1,0,4,5
2,4,5,6
3,0,6,8
3,1,5,9
`

b := bytes.NewBuffer([]byte(data))

types := []VarType{
	{"V1", Float64},
	{"V2", Float64},
	{"V3", Float64},
	{"V4", Float64},
}

da := FromCSV(b).SetTypes(types).HasHeader().Done()
da = DiffChunk(da, map[string]int{"V2": 1, "V4": 2})

for da.Next() {
	y := da.Get("V2$d1")
	fmt.Printf("%v\n", y)
	y = da.Get("V4$d2")
	fmt.Printf("%v\n", y)
}

Output:

[4 -4 1]
[0 1 -1]
func Filter ¶
func Filter(data Dstream, f FilterFunc) Dstream
Filter selects rows from the dstream. A filtering function determines which rows are selected.
Example ¶
data := `V1,V2,V3,V4
1,2,3,4
2,0,4,5
3,4,5,6
4,0,6,7
`

// A filtering function, selects if not equal to 0.
f := func(v map[string]interface{}, b []bool) {
	x := v["V2"].([]float64)
	for i := range x {
		b[i] = x[i] != 0
	}
}

types := []VarType{
	{"V1", Float64},
	{"V2", Float64},
	{"V3", Float64},
	{"V4", Float64},
}

b := bytes.NewBuffer([]byte(data))
da := FromCSV(b).SetTypes(types).HasHeader().Done()
da = Filter(da, f)

da.Next() // Always call Next before first call to Get or GetPos

y := da.Get("V1")
fmt.Printf("%v\n", y)

Output:

[1 3]
func Generate ¶
func Generate(data Dstream, name string, fnc GenerateFunc, dtype Dtype) Dstream
Generate appends a new variable to a Dstream, obtaining its values by applying the given function to the other variables in the Dstream. The new variable must not already exist in the Dstream.
Example ¶
data := `V1,V2,V3,V4
1,2,3,4
1,0,4,5
2,4,5,6
3,0,6,7
`

f := func(v map[string]interface{}, x interface{}) {
	v1 := v["V1"].([]float64)
	v2 := v["V2"].([]float64)
	y := x.([]float64)
	for i := range v1 {
		y[i] = v1[i] + v2[i]
	}
}

b := bytes.NewBuffer([]byte(data))

types := []VarType{
	{"V1", Float64},
	{"V2", Float64},
	{"V3", Float64},
	{"V4", Float64},
}

da := FromCSV(b).SetTypes(types).HasHeader().Done()
da = Generate(da, "V1p2", f, Float64)

for da.Next() {
	y := da.Get("V1p2")
	fmt.Printf("%v\n", y)
}

Output:

[3 1 6 3]
func LagChunk ¶
LagChunk returns a new Dstream in which specified variables are included with lagged values. Lagged values are only computed within a chunk, not across chunk boundaries, and the first m values of each chunk are omitted, where m is the maximum lag value.
Example ¶
data := `1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
`

b := bytes.NewBuffer([]byte(data))

types := []VarType{
	{"V1", Float64},
	{"V2", Float64},
	{"V3", Float64},
	{"V4", Float64},
}

da := FromCSV(b).SetTypes(types).Done()
da = LagChunk(da, map[string]int{"V2": 2})

da.Next() // Always call Next before first call to Get or GetPos

y := da.Get("V2[0]")
fmt.Printf("%v\n", y)
y = da.Get("V2[-1]")
fmt.Printf("%v\n", y)

Output:

[4 5]
[3 4]
func Linapply ¶
Linapply adds new variables to Dstream by taking linear combinations of the other variables in the Dstream.
func MaxChunkSize ¶
MaxChunkSize splits the chunks of the input Dstream so that no chunk has more than size rows.
func MemCopy ¶
MemCopy returns a Dstream that copies the provided Dstream into in-memory storage. Pass reset as true to reset the source data before copying.
func Mutate ¶
func Mutate(ds Dstream, name string, f MutateFunc) Dstream
Mutate returns a Dstream in which the variable with the given name is transformed using the given function.
Example ¶
data := `V1,V2,V3,V4
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
`

// A mutating function, scales all values by 2.
timesTwo := func(x interface{}) {
	v := x.([]float64)
	for i := range v {
		v[i] *= 2
	}
}

types := []VarType{
	{"V1", Float64},
	{"V2", Float64},
	{"V3", Float64},
	{"V4", Float64},
}

b := bytes.NewBuffer([]byte(data))
da := FromCSV(b).SetTypes(types).HasHeader().Done()
da = Mutate(da, "V2", timesTwo)

da.Next() // Always call Next before first call to Get or GetPos

y := da.Get("V2")
fmt.Printf("%v\n", y)

Output:

[4 6 8 10]
func NewFromArrays ¶
NewFromArrays creates a Dstream from raw data stored as slices; data[i][j] is the data for the i^th variable in the j^th chunk.
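The indexing convention above can be illustrated with a short sketch. This is a fragment in the style of the package's examples; the variable names "x" and "y" are chosen for illustration.

```go
// data[i][j] is variable i, chunk j: here two variables, each
// split into two chunks of two rows.
data := [][]interface{}{
	{[]float64{1, 2}, []float64{3, 4}},       // variable "x"
	{[]string{"a", "b"}, []string{"c", "d"}}, // variable "y"
}
ds := NewFromArrays(data, []string{"x", "y"})

for ds.Next() {
	fmt.Printf("%v\n", ds.Get("x"))
}
```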
func NewFromFlat ¶
NewFromFlat creates a Dstream from raw data stored as contiguous (flat) arrays. data[i] is the data for the i^th variable; it is a slice of fixed-width values, e.g. a []float64.
func NewLoad ¶
NewLoad returns a dstream that loads data from the given file. The file must have been created using the Save function.
func Regroup ¶
Regroup creates a new Dstream with the same rows as the provided Dstream, but with the chunks defined by the values of a provided id variable. The resulting Dstream has one chunk for each distinct level of the id variable, containing all the rows of the input Dstream with that id value. The id variable must have uint64 type.
Example ¶
data := `V1,V2,V3
1,2,3
3,3,4
2,4,5
2,5,6
5,2,3
0,3,4
1,4,5
5,5,6
`

b := bytes.NewBuffer([]byte(data))

types := []VarType{
	{"V1", Float64},
	{"V2", Float64},
	{"V3", Float64},
}

d := FromCSV(b).SetTypes(types).HasHeader().Done()
d = Convert(d, "V1", Uint64)
d = Regroup(d, "V1", true)

for d.Next() {
	fmt.Printf("%v\n", d.GetPos(0))
	fmt.Printf("%v\n", d.GetPos(1))
	fmt.Printf("%v\n\n", d.GetPos(2))
}

Output:

[0]
[3]
[4]

[1 1]
[2 4]
[3 5]

[2 2]
[4 5]
[5 6]

[3]
[3]
[4]

[5 5]
[2 5]
[3 6]
func ReplaceColumn ¶
ReplaceColumn returns a new Dstream in which the column with the given name is replaced with the given data. The coldata value must be a slice of a valid primitive type (e.g. int, float64, string), and must have length equal to the number of rows of data.
func Segment ¶
Segment restructures the chunks of a Dstream so that chunk boundaries are determined by any change in the consecutive values of a specified set of variables.
Example ¶
data := `V1,V2,V3,V4
1,2,3,4
1,0,4,5
2,4,5,6
3,0,6,7
`

types := []VarType{
	{"V1", Float64},
	{"V2", Float64},
	{"V3", Float64},
	{"V4", Float64},
}

b := bytes.NewBuffer([]byte(data))
da := FromCSV(b).SetTypes(types).HasHeader().Done()
da = Segment(da, "V1")

for da.Next() {
	y := da.Get("V2")
	fmt.Printf("%v\n", y)
}

Output:

[2 0]
[4]
[0]
func SelectCols ¶
SelectCols retains only the given variables in a Dstream.
type FilterFunc ¶
FilterFunc is a filtering function for use with Filter. The first argument holds a map from variable names to data slices. The second argument is a boolean slice that is initialized to all 'true'. The FilterFunc should set elements of the boolean slice to false wherever the corresponding dstream record should be excluded.
type GenerateFunc ¶
type GenerateFunc func(map[string]interface{}, interface{})
GenerateFunc is a function that can be used to generate a new variable from existing variables. The first argument is a map from variable names to data (whose concrete types are slices held as empty interfaces). The second argument is a pre-allocated array (a slice provided as an interface{}) into which the new variable's values are to be written. Note that the destination array is not set to zeros before passing to the generate function.
type Join ¶
type Join struct {
	// A sequence of segmented Dstreams to advance in unison.
	Data []Dstream

	// Status[j] means that the id variable for Data value j is
	// equal to the id variable for Data value 0. Status[0] is
	// not used.
	Status []bool
	// contains filtered or unexported fields
}
Join performs a streaming join on several Dstreams that have been segmented by id variables. If join has type Join, then join.Data[i] is the current chunk of the i^th stream. All streams being joined must have been segmented by an id variable whose values are ascending. The id variable must have type uint64.
A call to the Next method always advances the first stream (Data[0]) by one chunk. The other elements of Data are advanced until their id variable is equal to (if possible) or greater than the id variable of Data[0]. If equality is achieved, the corresponding element of join.Status is set to true. join.Status[0] is always false and has no meaning.
The dstream values to be joined must be segmented so that the id variable is constant within chunks, and increases in numeric value with subsequent calls to the Next method.
Example ¶
data1 := `V1,V2,V3,V4
1,2,3,4
1,3,4,5
3,4,5,6
3,5,6,7
`
data2 := `V1,V2,V3
1,2,3
1,3,4
1,4,5
3,5,6
`
data3 := `V1,V2,V3,V4,V5
2,2,3,5,6
2,3,4,7,5
3,4,5,3,4
4,5,6,2,3
`

names := [][]string{
	{"V1", "V2", "V3", "V4"},
	{"V1", "V2", "V3"},
	{"V1", "V2", "V3", "V4", "V5"},
}

var da []Dstream
for j, data := range []string{data1, data2, data3} {
	b := bytes.NewBuffer([]byte(data))
	var types []VarType
	for _, na := range names[j] {
		types = append(types, VarType{na, Float64})
	}
	d := FromCSV(b).SetTypes(types).HasHeader().Done()
	d = Convert(d, "V1", Uint64)
	d = Segment(d, "V1")
	da = append(da, d)
}

join := NewJoin(da, []string{"V1", "V1", "V1"})

for join.Next() {
	fmt.Printf("%v\n", da[0].GetPos(0))
	if join.Status[1] {
		fmt.Printf("%v\n", da[1].GetPos(0))
	}
	if join.Status[2] {
		fmt.Printf("%v\n\n", da[2].GetPos(0))
	}
}

Output:

[1 1]
[1 1 1]
[3 3]
[3]
[3]
func NewJoin ¶
NewJoin creates a Join of the given Dstreams, using the variables in names as ids. The Dstreams in data must be segmented by these id variables before calling NewJoin.
func (*Join) Next ¶
Next advances to the next chunk. The first dstream, which is contained in join.Data[0], always advances to the next sequential value of its id variable. The other dstreams (join.Data[j] for j > 0) advance until their id variables are equal to or greater than the id variable for the current chunk of join.Data[0]. The status field (join.Status) indicates which dstreams in the join are currently on the same id value as the first dstream (join.Data[0]).
type MutateFunc ¶
type MutateFunc func(interface{})
MutateFunc is a function that can be used to change the values of a variable in-place.
type Stats ¶
type Stats struct {
	// The mean value
	Mean float64

	// The minimum value
	Min float64

	// The maximum value
	Max float64

	// The standard deviation of the values
	SD float64

	// The number of non inf/nan values
	N int

	// The number of NaN values
	NaN int

	// The number of Inf values
	Inf int
}
Stats contains summary statistics for a float64 Dstream variable.
Source Files ¶
- addcol.go
- base.go
- bcols.go
- bcols_gen.go
- center.go
- comparisons.go
- comparisons_gen.go
- concathorizontal.go
- concatvertical.go
- convert.go
- convert_gen.go
- dechunk_gen.go
- describe.go
- diffchunk.go
- doc.go
- drop.go
- dropna.go
- filtercol.go
- filtercol_gen.go
- generate.go
- generate_gen.go
- join.go
- lagchunk.go
- linapply.go
- loadsave_gen.go
- memcopy_gen.go
- mutate.go
- regroup.go
- regroup_gen.go
- replace.go
- replace_gen.go
- segment.go
- segment_gen.go
- select_cols.go
- sizechunk.go
- streamcsv.go
- streamcsv_gen.go
- utils.go
- utils_gen.go
- xform.go