xsv

package

v0.0.0-...-86e9f11 Latest Latest Go to latest Published: Jan 7, 2024 License: Apache-2.0 Imports: 13 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/SnellerInc/sneller

Links

Open Source Insights

Documentation ¶

Overview ¶

Package xsv implements parsing/converting CSV (RFC 4180) and TSV (tab separated values) files to binary ION format.

Index ¶

Constants
Variables
func Convert(r io.Reader, dst *ion.Chunker, ch RowChopper, hint *Hint, cons []ion.Field) error
type CsvChopper
- func (c *CsvChopper) GetNext(r io.Reader) ([]string, error)
type Delim
- func (d *Delim) UnmarshalJSON(b []byte) error
type FieldHint
- func (fh *FieldHint) UnmarshalJSON(data []byte) error
type Hint
- func ParseHint(hint []byte) (*Hint, error)
type RowChopper
type TsvChopper
- func (c *TsvChopper) GetNext(r io.Reader) ([]string, error)

Constants ¶

View Source

const (
	TypeIgnore   = "ignore"
	TypeString   = "string" // default
	TypeNumber   = "number" // also floating point
	TypeInt      = "int"    // integer only
	TypeBool     = "bool"
	TypeDateTime = "datetime"
)

View Source

const (
	FormatDateTime             = "datetime" // default
	FormatDateTimeUnixSec      = "unix_seconds"
	FormatDateTimeUnixMilliSec = "unix_milli_seconds"
	FormatDateTimeUnixMicroSec = "unix_micro_seconds"
	FormatDateTimeUnixNanoSec  = "unix_nano_seconds"
)

Variables ¶

View Source

var (
	ErrIngestEmptyOnlyValidForStrings = errors.New("only strings can be empty")
	ErrFormatOnlyValidForDateTime     = errors.New("format only valid for datetime type")
	ErrBoolValuesOnlyValidForBool     = errors.New("custom true/false values only valid for bool type")
	ErrRequireBothTrueAndFalseValues  = errors.New("require both true and false values")
	ErrTrueAndFalseValuesOverlap      = errors.New("true and values values overlap")
)

View Source

var (
	ErrNoHints = errors.New("hints are mandatory")
)

Functions ¶

func Convert ¶

func Convert(r io.Reader, dst *ion.Chunker, ch RowChopper, hint *Hint, cons []ion.Field) error

Convert reads all records from the reader using the specified chopper/hints to determine the individual fields and writes it to the ION chunker

Types ¶

type CsvChopper ¶

type CsvChopper struct {
	// SkipRecords allows skipping the first
	// N records (useful when headers are used)
	SkipRecords int
	// Separator allows specifying a custom
	// separator (defaults to comma)
	Separator Delim
	// contains filtered or unexported fields
}

CsvChopper reads a CSV formatted file (RFC 4180) and splits each line in the individual fields.

func (*CsvChopper) GetNext ¶

func (c *CsvChopper) GetNext(r io.Reader) ([]string, error)

GetNext fetches one CSV record and returns the individual columns. Due to quoting a CSV record may span multiple lines of text.

type Delim ¶

type Delim rune

Delim is a rune that unmarshals from a string.

func (*Delim) UnmarshalJSON ¶

func (d *Delim) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler.

type FieldHint ¶

type FieldHint struct {
	// Field-name (use dots to make it a subfield)
	Name string `json:"name,omitempty"`
	// Type of field (or ignore)
	Type string `json:"type,omitempty"`
	// Default value if the column is an empty string
	Default string `json:"default,omitempty"`
	// Ingestion format (i.e. different data formats)
	Format string `json:"format,omitempty"`
	// Allow empty values (only valid for strings) to
	// be ingested. If flag is set to false, then the
	// field won't be written for the record instead.
	AllowEmpty bool `json:"allow_empty,omitempty"`
	// Don't use sparse-indexing for this value.
	// (only valid for date-time type)
	NoIndex bool `json:"no_index,omitempty"`
	// Optional list of values that represent TRUE
	// (only valid for bool type)
	TrueValues []string `json:"true_values,omitempty"`
	// Optional list of values that represent FALSE
	// (only valid for bool type)
	FalseValues []string `json:"false_values,omitempty"`
	// Optional list of values that represent a
	// missing value
	MissingValues []string `json:"missing_values,omitempty"`
	// contains filtered or unexported fields
}

FieldHint defines if and how a field should be imported

func (*FieldHint) UnmarshalJSON ¶

func (fh *FieldHint) UnmarshalJSON(data []byte) error

type Hint ¶

type Hint struct {
	// SkipRecords allows skipping the first
	// N records (useful when headers are used)
	SkipRecords int `json:"skip_records,omitempty"`
	// Separator allows specifying a custom
	// separator (only applicable for CSV)
	Separator Delim `json:"separator,omitempty"`
	// MissingValues is an optional list of
	// strings which represent missing values.
	// Entries in Fields may override this on a
	// per-field basis.
	MissingValues []string `json:"missing_values,omitempty"`
	// Fields specifies the hint for each field
	Fields []FieldHint `json:"fields"`
}

Hint specifies the options and mandatory fields for parsing CSV/TSV files.

func ParseHint ¶

func ParseHint(hint []byte) (*Hint, error)

ParseHint parses a json byte array into a Hint structure which can later be used to pass type-hints and/or other flags to the TSV parser.

The input must contain a valid JSON object, like:

{
  "fields": [
    {"name":"field", "type": "<type>"},
    {"name":"field.a", "type": "<type>", "default:" "empty"},
    {"name":"field.b", "type": "datetime", "format": "epoch", "no_index": true},
    {"name":"anotherField", "type": "bool", "true_values": ["Y"], "false_values": ["N"]},
    ...
  ]
}

With TSV each line represents a single record. The tab character is used to split the line into multiple fields. The 'fields' part in the hints is an order list that specify the name and type of each field.

Each field will be given the specified 'name'. If no 'type' is specified then 'string' is assumed. When there are more fields in the data, then in the 'fields', then these are skipped.

If a field doesn't need to be ingested, then you can insert an empty record (or set the 'type' to "ignore" explicitly).

When there is no text between both tabs, the structure won't contain the field, unless a 'default' is specified (can be an empty string). Note that the default value should match the type.

Note that the 'name' can contain multiple levels, so nested objects can be created. This can be useful to group information in the ingested data.

Some values may be included in the sparse index. Set the 'no_index' field to `true` to prevent this behavior for the field.

Supported types:

string -> set 'allow_empty' if you want empty strings to be ingested
number -> either float or int
int
bool -> can support custom true/false values
datetime -> formats: text (default), epoch, epoch_ms, epoch_us, epoch_ns

type RowChopper ¶

type RowChopper interface {
	// GetNext return the next record and
	// splits fields in individual columns
	GetNext(r io.Reader) ([]string, error)
}

RowChopper implements fetching records row-by-row and chopping the records into individual fields until the reader is exhausted

type TsvChopper ¶

type TsvChopper struct {
	// SkipRecords allows skipping the first
	// N records (useful when headers are used)
	SkipRecords int
	// contains filtered or unexported fields
}

TsvChopper reads a TSV formatted file and splits each line in the individual fields. TSV format differs from CSV, because it doesn't support quoting to allow non-standard characters, but uses escape sequences (i.e. \t, \r or \n)

func (*TsvChopper) GetNext ¶

func (c *TsvChopper) GetNext(r io.Reader) ([]string, error)

GetNext fetches one TSV line and returns the individual columns. Each TSV record is always exactly one line.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL