formats

package
v0.0.0-...-457a45f Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 5, 2020 License: GPL-3.0 Imports: 11 Imported by: 0

README

Databio - formats

Long-term goals for databio's format support:

  1. Pluggable formats enable any data interchange format to be supported for both Read and Write operations.

    • Break everything into Records and Fields.
    • Methods to display "location" information to allow debugging data (e.g. line 42, column 9 of CSV versus Cell C42 in Excel)
  2. Generic formats e.g. CSV have support for heuristic header detection.

    • Example heuristic #1: Check the values of the first row to see if the entropy of values is different from others. (e.g. "GeneID" has a wholly different set of characters than "1234")
    • Example heuristic #2: scan N rows and compare entropy of columns to look for a breakpoint. (e.g. multiple-line headers)
  3. Database of headers to allow known, popular format types to be quickly detected without heuristics.

Documentation

Overview

Package formats describes the supported data input/output formats. The goal of this package is to support iteration of records (containing fields) within any data set:

+----------------------------+
| Data Set                   |
| +------------------------+ |
| | Record 1               | |
| | Field 1 | Field 2| ... | |
| +------------------------+ |
| +------------------------+ |
| | Record 2               | |
| | Field 1 | Field 2| ... | |
| +------------------------+ |
| +------------------------+ |
| | Record 3               | |
| | Field 1 | Field 2| ... | |
| +------------------------+ |
+----------------------------+

Data sets will have multiple records, records can have multiple fields, and fields can have multiple values.

Index

Constants

This section is empty.

Variables

View Source
var (
	// ErrUnsupportedFormat indicates that the file format is not supported.
	ErrUnsupportedFormat = errors.New("databio/formats: unsupported format")

	// ErrWriterNotSupported is returned when Writer is not implemented for a Format.
	ErrWriterNotSupported = errors.New("databio/formats: Writer not supported for this format")
)

Functions

func Register

func Register(f *Format) int

Register a Format for inclusion in any subsequent data import/export tasks. Returns the number of formats currently registered, thus it can be used as a global initializer by ignoring the result:

var _ = formats.Register(formats.Format{...})

Types

type CSV

type CSV struct {
	// contains filtered or unexported fields
}

CSV supports reading tabular records from an csv file.

func OpenCSV

func OpenCSV(in io.ReadSeeker) (*CSV, error)

OpenCSV opens a csv document and returns a formats.Reader. An io.ReadSeeker is required due to header detection readahead.

func (*CSV) Err

func (x *CSV) Err() error

Err returns the last error that occured.

func (*CSV) Next

func (x *CSV) Next() (Record, error)

Next returns the next Record in the document. (Implements the formats.Reader interface)

type Format

type Format struct {
	// Name of the Format used for locating the Reader/Writer to use.
	Name string

	// Description of the Format used for selection lists.
	Description string

	// Extensions lists the file extensions that typicall denote this Format.
	// Note each extension MUST begin with a "." dot prefix.
	Extensions []string

	// MediaTypes lists the IANA Media/MIME types supported by the Format.
	MediaTypes []string

	// Detect if the given (possibly incomplete) data is supported.
	//    Supported = true if this Format will work for the data.
	//    More = true if more data may help detection.
	// Note that if Supported=false, More=true and you have provided
	// the entire contents then the data format is either truncated
	// or not supported.
	Detect func(data []byte, incomplete bool) (supported bool, more bool)

	// NewReader returns a new format Reader for the given stream.
	NewReader func(r io.Reader) (Reader, error)

	// NewWriter returns a new format Writer applied to the given stream.
	NewWriter func(w io.Writer) (Writer, error)
}

Format describes a supported data interchange protocol.

type Reader

type Reader interface {
	// Next returns the next Record in the document.
	Next() (Record, error)

	// Err returns the last error that occured.
	Err() error
}

Reader returns Records from a supported Format.

func Open

func Open(in *os.File) (Reader, error)

Open returns a Reader for the input file if it detects that it is in the supported Format. Returns ErrUnsupportedFormat is the Format is not detected.

type Record

type Record interface {
	// Each iterates over every field/value pair in the Record.
	Each(func(field, value string) error) error

	// Fields returns the a list of Fields present in this Record.
	Fields() []string

	// Values returns the values associated with a named Field.
	Values(field string) []string

	// Set the values for a named Field in the Record.
	Set(field string, values []string)
}

Record represents a single record sourced from the Format.

type TSV

type TSV struct {
	// contains filtered or unexported fields
}

TSV supports reading tabular records from a TSV file.

func OpenTSV

func OpenTSV(in io.ReadSeeker) (*TSV, error)

OpenTSV opens a TSV document and returns a formats.Reader. An io.ReadSeeker is required due to header detection readahead.

func (*TSV) Err

func (x *TSV) Err() error

Err returns the last error that occured.

func (*TSV) Next

func (x *TSV) Next() (Record, error)

Next returns the next Record in the document. (Implements the formats.Reader interface)

type Writer

type Writer interface {
	// Write serializes the Record.
	Write(Record) error

	// Err returns the last error that occured.
	Err() error
}

Writer serializes records to a supported Format.

type XLSX

type XLSX struct {
	// contains filtered or unexported fields
}

XLSX supports reading tabular records from an excel file.

func OpenXLSX

func OpenXLSX(in io.Reader) (*XLSX, error)

OpenXLSX opens an excel document and returns a formats.Reader.

func (*XLSX) Err

func (x *XLSX) Err() error

Err returns the last error that occured.

func (*XLSX) Next

func (x *XLSX) Next() (Record, error)

Next returns the next Record in the document. (Implements the formats.Reader interface)

func (*XLSX) NextSheet

func (x *XLSX) NextSheet() error

NextSheet moves to the next Sheet in an excel document.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL