datareader

v0.0.0-...-8c617ee
Published: Sep 22, 2018 License: BSD-3-Clause Imports: 11 Imported by: 0

README

datareader : read SAS and Stata files in Go

datareader is a pure Go (Golang) package that can read binary SAS format (SAS7BDAT) and Stata format (dta) data files into native Go data structures. For non-Go users, there are two command line utilities that convert SAS and Stata files into text file formats.

The Stata reader is based on the Stata documentation for the dta file format and supports dta versions 115, 117, and 118.

There is no official documentation for SAS binary format files. The code here is translated from the Python sas7bdat package, which in turn is based on an R package. Also see here for more information about the SAS7BDAT file structure.

This package also provides a simple column-oriented data container called a Series. Both the SAS reader and Stata reader return the data as an array of Series objects, corresponding to the columns of the data file. These can in turn be converted to other formats as needed.

Both the Stata and SAS reader support streaming access to the data (i.e. reading the file by chunks of consecutive records).

SAS

Here is an example of how the SAS reader can be used in a Go program (error handling omitted for brevity):

import (
        "datareader"
        "os"
)

// Create a SAS7BDAT object
f, _ := os.Open("filename.sas7bdat")
sas, _ := datareader.NewSAS7BDATReader(f)

// Read the first 10000 records (rows)
ds, _ := sas.Read(10000)

// If column 0 contains numeric data
// x is a []float64 containing the data
// m is a []bool containing missingness indicators
x, m, _ := ds[0].AsFloat64Slice()

// If column 1 contains text data
// s is a []string containing the data
// sm is a []bool containing missingness indicators
s, sm, _ := ds[1].AsStringSlice()
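
The data slice and the missingness mask are normally consumed together. The following is a minimal sketch of that pattern, not part of the package: meanSkipMissing is a hypothetical helper, the input values are stand-ins shaped like the output of AsFloat64Slice, and it assumes that a true entry in the mask marks the corresponding value as missing.

```go
package main

import "fmt"

// meanSkipMissing is a hypothetical helper (not part of datareader).
// It averages the entries of x whose mask entry is false, assuming
// that m[i] == true marks x[i] as missing. Positions beyond the end
// of m (including the case of a nil mask) are treated as present.
// The second result is the number of values that were averaged.
func meanSkipMissing(x []float64, m []bool) (float64, int) {
	var sum float64
	n := 0
	for i, v := range x {
		if i < len(m) && m[i] {
			continue // skip missing values
		}
		sum += v
		n++
	}
	if n == 0 {
		return 0, 0
	}
	return sum / float64(n), n
}

func main() {
	// Stand-in values; in real use these would come from
	// ds[0].AsFloat64Slice().
	x := []float64{1, 2, 0, 3}
	m := []bool{false, false, true, false}

	mean, n := meanSkipMissing(x, m)
	fmt.Println("mean:", mean, "values used:", n)
}
```

Keeping the mask separate from the data means a missing numeric value never has to be encoded as a sentinel such as 0 or NaN inside x itself.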

Stata

Here is an example of how the Stata reader can be used in a Go program (again with no error handling):

import (
        "datareader"
        "os"
)

// Create a StataReader object
f, _ := os.Open("filename.dta")
stata, _ := datareader.NewStataReader(f)

// Read the first 10000 records (rows)
ds, _ := stata.Read(10000)

CSV

The package includes a CSV reader with type inference for the column data types.

import (
        "datareader"
        "os"
)

f, _ := os.Open("filename.csv")
rt := datareader.NewCSVReader(f)
rt.HasHeader = true
dt, _ := rt.Read(-1)
// obtain data from dt as in the SAS example above

Command line utilities

We provide two command-line utilities allowing conversion of SAS and Stata datasets to other formats without using Go. Executables for several operating systems and architectures are contained in the bin directory. The script used to cross-compile these binaries is build.sh. To build and install the commands for your local architecture only, run the Makefile (the executables will be copied into your GOBIN directory).

The stattocsv command converts a SAS7BDAT or Stata dta file to a csv file. It can be used as follows:

> stattocsv file.sas7bdat > file.csv
> stattocsv file.dta > file.csv

The columnize command takes the data from either a SAS7BDAT or a Stata dta file, and writes the data from each column into a separate file. Numeric data can be stored in either binary (native 8 byte floats) or text format (binary is considerably faster).

> columnize -in=file.sas7bdat -out=cols -mode=binary
> columnize -in=file.dta -out=cols -mode=text
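
A column written in binary mode can be read back with the standard library. The sketch below is only an illustration under stated assumptions: it assumes the file holds nothing but consecutive 8-byte IEEE 754 floats with no header, and that the byte order is little-endian (the text above says "native", so the real order depends on the machine that wrote the file). readFloat64Column and roundTrip are hypothetical helpers, and an in-memory buffer stands in for a real column file.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readFloat64Column decodes consecutive 8-byte floats from r until
// the input is exhausted. The byte order is a parameter because it
// is an assumption of this sketch, not something documented above.
// A trailing partial value is reported as io.ErrUnexpectedEOF.
func readFloat64Column(r io.Reader, order binary.ByteOrder) ([]float64, error) {
	var out []float64
	for {
		var v float64
		err := binary.Read(r, order, &v)
		if err == io.EOF {
			return out, nil // clean end of input
		}
		if err != nil {
			return out, err
		}
		out = append(out, v)
	}
}

// roundTrip writes vals to an in-memory buffer in the assumed layout
// and decodes them again, standing in for a real column file.
func roundTrip(vals []float64) ([]float64, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, vals); err != nil {
		return nil, err
	}
	return readFloat64Column(&buf, binary.LittleEndian)
}

func main() {
	got, err := roundTrip([]float64{1.5, -2, 3.25})
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(got)
}
```

To read an actual file, pass the result of os.Open to readFloat64Column in place of the buffer, after checking which byte order the producing machine uses.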

Testing

Automated testing is implemented against the Stata files used to test the pandas Stata reader (for versions 115+):

https://github.com/pydata/pandas/tree/master/pandas/io/tests/data

A CSV data file for testing is generated by the gendat.go script. There are scripts make.sas and make.stata in the test directory that generate SAS and Stata files for testing. SAS and Stata software are required to run these scripts. The generated files are provided in the test_files/data directory, so go test can be run without having access to SAS or Stata.

The columnize_test.go and stattocsv_test.go scripts test the commands against stored output.

Feedback

Please file an issue if you encounter a file that is not properly handled. If possible, share the file that causes the problem.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CSVReader

type CSVReader struct {

	// Skip this number of rows before reading the header.
	SkipRows int

	// If true, there is a header to read, otherwise default column names are used
	HasHeader bool

	// The column names, in the order that they appear in the
	// file.  Can be set by caller.
	ColumnNames []string

	// User-specified data types (maps column name to type name).
	TypeHintsName map[string]string

	// User-specified data types (indexed by column number).
	TypeHintsPos []string

	// The data type for each column.
	DataTypes []string
	// contains filtered or unexported fields
}

A CSVReader specifies how a data set in CSV format can be read from a text file.

func NewCSVReader

func NewCSVReader(r io.Reader) *CSVReader

NewCSVReader returns a CSVReader that reads CSV data from r with type inference and chunking.

func (*CSVReader) Read

func (rdr *CSVReader) Read(lines int) ([]*Series, error)

Read reads up to lines rows of data and returns the results as an array of Series objects. If lines is negative the whole file is read. Data types of the Series objects are inferred from the file. Use type hints in the CSVReader struct to control the types directly.
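
To make the idea of column type inference concrete, here is a minimal sketch built on the standard library only. It is not the package's algorithm: the rule used (a column is numeric when every non-empty cell parses as a float, otherwise it is text), the type names "float64" and "string", and the helpers inferColumnTypes and inferFromString are all assumptions made for illustration. It also assumes the first row is a header.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// inferColumnTypes is a hypothetical helper. It reads all of r as
// CSV, skips the first row as a header, and labels each column
// "float64" if every non-empty cell parses as a float, else "string".
func inferColumnTypes(r io.Reader) ([]string, error) {
	rows, err := csv.NewReader(r).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) < 2 {
		return nil, errors.New("need a header row and at least one data row")
	}
	types := make([]string, len(rows[0]))
	for j := range types {
		types[j] = "float64" // optimistic default, demoted below
	}
	for _, row := range rows[1:] {
		for j, cell := range row {
			if cell == "" {
				continue // empty cells do not affect the decision
			}
			if _, err := strconv.ParseFloat(cell, 64); err != nil {
				types[j] = "string"
			}
		}
	}
	return types, nil
}

// inferFromString is a convenience wrapper for in-memory input.
func inferFromString(s string) ([]string, error) {
	return inferColumnTypes(strings.NewReader(s))
}

func main() {
	types, err := inferFromString("id,name,score\n1,ann,2.5\n2,bob,\n")
	if err != nil {
		fmt.Println("inference failed:", err)
		return
	}
	fmt.Println(types)
}
```

A rule like this can misclassify columns such as zero-padded identifiers, which is the kind of case the type hint fields are there to override.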

type Column

type Column struct {
	// contains filtered or unexported fields
}

type SAS7BDAT

type SAS7BDAT struct {

	// Formats for the columns
	ColumnFormats []string

	// If true, trim whitespace from right of each string variable
	// (SAS7BDAT strings are fixed width)
	TrimStrings bool

	// If true, converts some date formats to Go date values (does
	// not work for all SAS date formats)
	ConvertDates bool

	// The creation date of the file
	DateCreated time.Time

	// The modification date of the file
	DateModified time.Time

	// The name of the data set
	Name string

	// The platform used to create the file
	Platform string

	// The SAS release used to create the file
	SASRelease string

	// The server type used to create the file
	ServerType string

	// The operating system type used to create the file
	OSType string

	// The operating system name used to create the file
	OSName string

	// The SAS file type
	FileType string

	// The encoding name
	FileEncoding string

	// True if the file was created on a 64 bit architecture
	U64 bool

	// The byte order of the file
	ByteOrder binary.ByteOrder

	// The compression mode of the file
	Compression string

	// A decoder for decoding text to unicode
	TextDecoder *xencoding.Decoder
	// contains filtered or unexported fields
}

SAS7BDAT represents a SAS data file in SAS7BDAT format.

func NewSAS7BDATReader

func NewSAS7BDATReader(r io.ReadSeeker) (*SAS7BDAT, error)

NewSAS7BDATReader returns a new reader object for SAS7BDAT files. Call the Read method to obtain the data.

func (*SAS7BDAT) ColumnNames

func (sas *SAS7BDAT) ColumnNames() []string

ColumnNames returns the names of the columns.

func (*SAS7BDAT) ColumnTypes

func (sas *SAS7BDAT) ColumnTypes() []int

ColumnTypes returns integer codes for the column data types.

func (*SAS7BDAT) Read

func (sas *SAS7BDAT) Read(num_rows int) ([]*Series, error)

Read returns up to num_rows rows of data from the SAS7BDAT file, as an array of Series objects. The Series data types are either float64 or string. If num_rows is negative, the remainder of the file is read. Returns (nil, io.EOF) when no rows remain.

SAS string variables have a fixed width. By default, right whitespace is trimmed from each string, but this can be turned off by setting the TrimStrings field in the SAS7BDAT struct.

func (*SAS7BDAT) RowCount

func (sas *SAS7BDAT) RowCount() int

RowCount returns the number of rows in the data set.

type Series

type Series struct {

	// A name describing what is in this series.
	Name string
	// contains filtered or unexported fields
}

A Series is a homogeneously-typed one-dimensional sequence of data values, with an optional mask for missing values.

func NewSeries

func NewSeries(name string, data interface{}, missing []bool) (*Series, error)

NewSeries returns a new Series object with the given name and data contents. The data parameter must be an array of floats, ints, or strings. The underlying data is not copied, so changes to data will impact the series.

func (*Series) AllClose

func (ser *Series) AllClose(other *Series, tol float64) (bool, int)

AllClose returns true, 0 if the Series is within tol of the other series. If the Series have different lengths, AllClose returns false, -1. If the Series have different types, AllClose returns false, -2. If the Series have the same type and the same length but are not equal, AllClose returns false, j, where j is the index of the first position where the two series differ.
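
The return codes can be illustrated with a simplified stand-in that works on plain float64 slices. This is a sketch, not the package's implementation: allCloseFloat64 is a hypothetical function, it assumes "within tol" means an absolute difference of at most tol, it ignores missing-value masks, and it cannot produce the -2 code because both arguments have the same type by construction.

```go
package main

import (
	"fmt"
	"math"
)

// allCloseFloat64 is a hypothetical, simplified analogue of AllClose.
// It returns (true, 0) when every pair of values differs by at most
// tol, (false, -1) when the lengths differ, and (false, j) where j is
// the first index at which the absolute difference exceeds tol.
func allCloseFloat64(a, b []float64, tol float64) (bool, int) {
	if len(a) != len(b) {
		return false, -1
	}
	for j := range a {
		if math.Abs(a[j]-b[j]) > tol {
			return false, j
		}
	}
	return true, 0
}

func main() {
	a := []float64{1, 2, 3}

	ok, pos := allCloseFloat64(a, []float64{1, 2.0005, 3}, 1e-3)
	fmt.Println(ok, pos) // within tolerance everywhere

	ok, pos = allCloseFloat64(a, []float64{1, 2.5, 3}, 1e-3)
	fmt.Println(ok, pos) // first difference is at index 1

	ok, pos = allCloseFloat64(a, []float64{1, 2}, 1e-3)
	fmt.Println(ok, pos) // lengths differ
}
```

Note that a successful comparison and a mismatch at index 0 both report position 0, so the boolean result must be checked before the index is interpreted.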

func (*Series) AllEqual

func (ser *Series) AllEqual(other *Series) (bool, int)

AllEqual is equivalent to AllClose with tol=0.

func (*Series) AsFloat64Slice

func (ser *Series) AsFloat64Slice() ([]float64, []bool, error)

func (*Series) AsStringSlice

func (ser *Series) AsStringSlice() ([]string, []bool, error)

func (*Series) CountMissing

func (ser *Series) CountMissing() int

func (*Series) Data

func (ser *Series) Data() interface{}

Data returns the data component of the Series.

func (*Series) Date_from_duration

func (ser *Series) Date_from_duration(base time.Time, units string) (*Series, error)

func (*Series) ForceNumeric

func (ser *Series) ForceNumeric() *Series

ForceNumeric converts string values to float64 values, creating missing values where the conversion is not possible. If the data is not string type, it is unaffected.
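
The conversion described here can be sketched on plain slices. The code below is an illustration under assumptions, not the method's actual implementation: forceNumeric is a hypothetical function, it uses strconv.ParseFloat as the parsing rule, and it trims surrounding whitespace before parsing, which the real method may or may not do.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// forceNumeric is a hypothetical stand-in for the behaviour described
// above. Each string is parsed as a float64; where parsing fails the
// value is left at zero and the matching mask entry is set to true.
// Trimming whitespace first is an assumption of this sketch.
func forceNumeric(vals []string) ([]float64, []bool) {
	x := make([]float64, len(vals))
	missing := make([]bool, len(vals))
	for i, s := range vals {
		v, err := strconv.ParseFloat(strings.TrimSpace(s), 64)
		if err != nil {
			missing[i] = true
			continue
		}
		x[i] = v
	}
	return x, missing
}

func main() {
	x, missing := forceNumeric([]string{"1.5", "abc", " 7 ", ""})
	for i := range x {
		fmt.Println(i, x[i], missing[i])
	}
}
```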

func (*Series) Length

func (ser *Series) Length() int

Length returns the number of elements in a Series.

func (*Series) Missing

func (ser *Series) Missing() []bool

Missing returns the array of missing value indicators.

func (*Series) NullStringMissing

func (ser *Series) NullStringMissing() *Series

func (*Series) Print

func (ser *Series) Print()

Print prints the entire Series to the standard output.

func (*Series) PrintRange

func (ser *Series) PrintRange(first, last int)

PrintRange prints a slice of the Series to the standard output.

func (*Series) StringFunc

func (ser *Series) StringFunc(f func(string) string) *Series

func (*Series) ToString

func (ser *Series) ToString() *Series

func (*Series) UpcastNumeric

func (ser *Series) UpcastNumeric() *Series

UpcastNumeric converts in-place all numeric type variables to float64 values. Non-numeric data is not affected.

func (*Series) Write

func (ser *Series) Write(w io.Writer)

Write writes the entire Series to the given writer.

func (*Series) WriteRange

func (ser *Series) WriteRange(w io.Writer, first, last int)

WriteRange writes the given subinterval of the Series to the given writer.

type SeriesArray

type SeriesArray []*Series

SeriesArray is an array of pointers to Series objects. It can represent a dataset consisting of several variables.

func (SeriesArray) AllClose

func (ser SeriesArray) AllClose(other []*Series, tol float64) (bool, int, int)

AllClose returns (true, 0, 0) if all numeric values in corresponding columns of the two arrays of Series objects are within the given tolerance. If any corresponding columns are not identically equal, returns (false, j, i), where j is the index of a column and i is the index of a row where the two Series are not identical. If the two SeriesArray objects have different numbers of columns, returns (false, -1, -1). If column j of the two SeriesArray objects have different lengths, returns (false, j, -1). If column j of the two SeriesArray objects have different types, returns (false, j, -2).

func (SeriesArray) AllEqual

func (ser SeriesArray) AllEqual(other []*Series) (bool, int, int)

AllEqual is equivalent to AllClose with tol = 0.

type StataReader

type StataReader struct {

	// If true, the strl numerical codes are replaced with their
	// string values when available.
	InsertStrls bool

	// If true, the categorical numerical codes are replaced with
	// their string labels when available.
	InsertCategoryLabels bool

	// If true, dates are converted to Go date format.
	ConvertDates bool

	// A short text label for the data set.
	DatasetLabel string

	// The time stamp for the data set
	TimeStamp string

	// Number of variables
	Nvar int

	// An additional text entry describing each variable
	ColumnNamesLong []string

	// String labels for categorical variables
	ValueLabels     map[string]map[int32]string
	ValueLabelNames []string

	// Format codes for each variable
	Formats []string

	// Maps from strl keys to values
	Strls      map[uint64]string
	StrlsBytes map[uint64][]byte

	// The format version of the dta file
	FormatVersion int

	// The endian-ness of the file
	ByteOrder binary.ByteOrder
	// contains filtered or unexported fields
}

A StataReader reads Stata dta data files. Currently dta format versions 115, 117, and 118 can be read.

The Read method reads and returns the data. Several fields of the StataReader struct may also be of interest.

Technical information about the file format can be found here: http://www.stata.com/help.cgi?dta

func NewStataReader

func NewStataReader(r io.ReadSeeker) (*StataReader, error)

NewStataReader returns a StataReader for reading from the given io.ReadSeeker.

func (*StataReader) ColumnNames

func (rdr *StataReader) ColumnNames() []string

ColumnNames returns the names of the columns in the data file.

func (*StataReader) ColumnTypes

func (rdr *StataReader) ColumnTypes() []int

ColumnTypes returns integer codes corresponding to the data types in the Stata file. See the Stata dta documentation for more information.

func (*StataReader) Read

func (rdr *StataReader) Read(rows int) ([]*Series, error)

Read returns the given number of rows of data from the Stata data file. The data are returned as an array of Series objects. If rows is negative, the remainder of the file is read.

func (*StataReader) RowCount

func (rdr *StataReader) RowCount() int

RowCount returns the number of rows in the data set.

type Statfilereader

type Statfilereader interface {
	ColumnNames() []string
	ColumnTypes() []int
	RowCount() int
	Read(int) ([]*Series, error)
}

Statfilereader is an interface that can be used to work interchangeably with StataReader and SAS7BDAT objects.
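
Because both readers satisfy this interface, chunked (streaming) processing can be written once. The sketch below is self-contained and therefore uses stand-ins: Series here is a cut-down local type, the interface is copied from the listing above, and fakeReader is a hypothetical in-memory reader that exists only to make the example runnable. The loop stops on io.EOF, as documented for the SAS reader; since end-of-data signalling for the Stata reader is not described here, it also stops on an empty result, which is an assumption.

```go
package main

import (
	"fmt"
	"io"
)

// Series is a local stand-in for datareader.Series with just enough
// behaviour for this sketch.
type Series struct{ n int }

// Length returns the number of elements in the stand-in Series.
func (s *Series) Length() int { return s.n }

// Statfilereader mirrors the interface shown above.
type Statfilereader interface {
	ColumnNames() []string
	ColumnTypes() []int
	RowCount() int
	Read(int) ([]*Series, error)
}

// fakeReader is a hypothetical one-column reader holding total rows.
type fakeReader struct{ total, pos int }

func (f *fakeReader) ColumnNames() []string { return []string{"x"} }
func (f *fakeReader) ColumnTypes() []int    { return []int{0} }
func (f *fakeReader) RowCount() int         { return f.total }

// Read returns up to rows rows, or (nil, io.EOF) when none remain.
// A negative rows value reads everything that is left.
func (f *fakeReader) Read(rows int) ([]*Series, error) {
	remaining := f.total - f.pos
	if remaining == 0 {
		return nil, io.EOF
	}
	if rows < 0 || rows > remaining {
		rows = remaining
	}
	f.pos += rows
	return []*Series{{n: rows}}, nil
}

// countRowsInChunks drains rdr in chunks of chunkSize rows and
// returns the number of chunks and the total number of rows seen.
func countRowsInChunks(rdr Statfilereader, chunkSize int) (int, int, error) {
	chunks, rows := 0, 0
	for {
		ds, err := rdr.Read(chunkSize)
		if err == io.EOF {
			return chunks, rows, nil
		}
		if err != nil {
			return chunks, rows, err
		}
		if len(ds) == 0 || ds[0].Length() == 0 {
			return chunks, rows, nil // assumed alternative end signal
		}
		chunks++
		rows += ds[0].Length()
	}
}

func main() {
	chunks, rows, err := countRowsInChunks(&fakeReader{total: 25}, 10)
	fmt.Println(chunks, rows, err)
}
```

In real use, the value returned by NewSAS7BDATReader or NewStataReader would be passed in place of the fake reader, and the per-chunk work would replace the row counting.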

Directories

Path Synopsis
cmd
