imports

package
v1.0.0 Latest
Published: Apr 5, 2021 License: MIT Imports: 19 Imported by: 0

Documentation

Overview

Package imports provides functionality to read data in other formats in order to populate a DataFrame. It provides the inverse functionality of the exports package.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func LoadFromCSV

func LoadFromCSV(ctx context.Context, r io.ReadSeeker, options ...CSVLoadOptions) (*dataframe.DataFrame, error)

LoadFromCSV will load data from a CSV file.

func LoadFromJSON

func LoadFromJSON(ctx context.Context, r io.ReadSeeker, options ...JSONLoadOptions) (*dataframe.DataFrame, error)

LoadFromJSON will load data from a jsonl (JSON Lines) file. The first row determines which fields will be imported for subsequent rows.

func LoadFromParquet

func LoadFromParquet(ctx context.Context, src source.ParquetFile, opts ...ParquetLoadOptions) (*dataframe.DataFrame, error)

LoadFromParquet will load data from a Parquet file.

NOTE: This function is experimental and the implementation is likely to change.

Example (gist):

import (
	"context"

	"github.com/rocketlaunchr/dataframe-go/imports"
	"github.com/xitongsys/parquet-go-source/local"
)

func main() {
	ctx := context.Background()

	fr, err := local.NewLocalFileReader("file.parquet")
	if err != nil {
		panic(err)
	}
	defer fr.Close()

	df, err := imports.LoadFromParquet(ctx, fr)
	if err != nil {
		panic(err)
	}
	_ = df // use df
}

func LoadFromSQL

func LoadFromSQL(ctx context.Context, stmt interface{}, options *SQLLoadOptions, args ...interface{}) (*dataframe.DataFrame, error)

LoadFromSQL will load data from an SQL database. stmt must be a *sql.Stmt or the equivalent from the mysql-go package.

See: https://godoc.org/github.com/rocketlaunchr/mysql-go#Stmt

Types

type CSVLoadOptions

type CSVLoadOptions struct {

	// Comma is the field delimiter.
	// The default value is ',' when CSVLoadOptions is not provided.
	// Comma must be a valid rune and must not be \r, \n,
	// or the Unicode replacement character (0xFFFD).
	Comma rune

	// Comment, if not 0, is the comment character. Lines beginning with the
	// Comment character without preceding whitespace are ignored.
	// With leading whitespace the Comment character becomes part of the
	// field, even if TrimLeadingSpace is true.
	// Comment must be a valid rune and must not be \r, \n,
	// or the Unicode replacement character (0xFFFD).
	// It must also not be equal to Comma.
	Comment rune

	// If TrimLeadingSpace is true, leading white space in a field is ignored.
	// This is done even if the field delimiter, Comma, is white space.
	TrimLeadingSpace bool

	// LargeDataSet should be set to true for large datasets.
	// It will set the capacity of the underlying slices of the Dataframe by performing a basic parse
	// of the full dataset before processing the data fully.
	// Preallocating memory can provide speed improvements. Benchmarks should be performed for your use-case.
	LargeDataSet bool

	// DictateDataType is used to inform LoadFromCSV what the true underlying data type is for a given field name.
	// The key must be the case-sensitive field name.
	// The value for a given key must be of the data type of the data.
	// eg. For a string use "". For an int64 use int64(0). What is relevant is the data type and not the value itself.
	//
	// NOTE: A custom Series must implement the NewSerieser interface and be able to interpret strings to work.
	DictateDataType map[string]interface{}

	// NilValue allows you to set what string value in the CSV file should be interpreted as a nil value for
	// the purposes of insertion.
	//
	// Common values are: NULL, \N, NaN, NA
	NilValue *string

	// InferDataTypes can be set to true if the underlying data type should be automatically detected.
	// Using DictateDataType is the recommended approach (especially for large datasets or memory constrained systems).
	// DictateDataType always takes precedence when determining the type.
	// If the data type could not be detected, NewSeriesString is used.
	InferDataTypes bool

	// Headers must be set if the CSV file does not contain a header row. This must be nil if the CSV file contains a
	// header row.
	Headers []string
}

CSVLoadOptions is likely to change.

type Converter

type Converter struct {
	ConcreteType  interface{}
	ConverterFunc GenericDataConverter
}

Converter is used to convert input data into a generic data type. This is required when importing data for a Generic Series ("dataframe.SeriesGeneric"). As a special case, if ConcreteType is time.Time, then a SeriesTime is used.

Example:

opts := imports.CSVLoadOptions{
   DictateDataType: map[string]interface{}{
      "Date": imports.Converter{
         ConcreteType: time.Time{},
         ConverterFunc: func(in interface{}) (interface{}, error) {
            return time.Parse("2006-01-02", in.(string))
         },
      },
   },
}

type Database

type Database int

Database is used to set the Database. Different databases have different syntax for placeholders etc.

const (
	// PostgreSQL database
	PostgreSQL Database = 0
	// MySQL database
	MySQL Database = 1
)

type GenericDataConverter

type GenericDataConverter func(in interface{}) (interface{}, error)

GenericDataConverter is used to convert input data into a generic data type. This is required when importing data for a Generic Series ("SeriesGeneric").

type JSONLoadOptions

type JSONLoadOptions struct {

	// LargeDataSet should be set to true for large datasets.
	// It will set the capacity of the underlying slices of the Dataframe by performing a basic parse
	// of the full dataset before processing the data fully.
	// Preallocating memory can provide speed improvements. Benchmarks should be performed for your use-case.
	LargeDataSet bool

	// DictateDataType is used to inform LoadFromJSON what the true underlying data type is for a given field name.
	// The key must be the case-sensitive field name.
	// The value for a given key must be of the data type of the data.
	// eg. For a string use "". For an int64 use int64(0). What is relevant is the data type and not the value itself.
	//
	// NOTE: A custom Series must implement the NewSerieser interface and be able to interpret strings to work.
	DictateDataType map[string]interface{}

	// ErrorOnUnknownFields will generate an error if an unknown field is encountered after the first row.
	ErrorOnUnknownFields bool
}

JSONLoadOptions is likely to change.

type ParquetLoadOptions

type ParquetLoadOptions struct {
}

ParquetLoadOptions is likely to change.

type SQLLoadOptions

type SQLLoadOptions struct {

	// KnownRowCount is used to set the capacity of the underlying slices of the Dataframe.
	// The maximum number of rows supported (on a 64-bit machine) is 9,223,372,036,854,775,807 (the maximum value of a signed 64-bit integer).
	// Preallocating memory can provide speed improvements. Benchmarks should be performed for your use-case.
	//
	// WARNING: Some databases may allow tables to contain more rows than the maximum supported.
	KnownRowCount *int

	// DictateDataType is used to inform LoadFromSQL what the true underlying data type is for a given column name.
	// The key must be the case-sensitive column name.
	// The value for a given key must be of the data type of the data.
	// eg. For a string use "". For an int64 use int64(0). What is relevant is the data type and not the value itself.
	//
	// NOTE: A custom Series must implement the NewSerieser interface and be able to interpret strings to work.
	DictateDataType map[string]interface{}

	// Database is used to set the Database.
	Database Database

	// Query can be set to the sql stmt if a *sql.DB, *sql.Tx, *sql.Conn or the equivalent from the mysql-go package is provided.
	//
	// See: https://godoc.org/github.com/rocketlaunchr/mysql-go
	Query string
}

SQLLoadOptions is likely to change.
