xsv

package
v0.0.0-...-86e9f11 Latest
Warning

This package is not in the latest version of its module.

Published: Jan 7, 2024 License: Apache-2.0 Imports: 13 Imported by: 0

Documentation

Overview

Package xsv implements parsing of CSV (RFC 4180) and TSV (tab-separated values) files and converting them to the binary ION format.

Index

Constants

const (
	TypeIgnore   = "ignore"
	TypeString   = "string" // default
	TypeNumber   = "number" // also floating point
	TypeInt      = "int"    // integer only
	TypeBool     = "bool"
	TypeDateTime = "datetime"
)
const (
	FormatDateTime             = "datetime" // default
	FormatDateTimeUnixSec      = "unix_seconds"
	FormatDateTimeUnixMilliSec = "unix_milli_seconds"
	FormatDateTimeUnixMicroSec = "unix_micro_seconds"
	FormatDateTimeUnixNanoSec  = "unix_nano_seconds"
)

Variables

var (
	ErrIngestEmptyOnlyValidForStrings = errors.New("only strings can be empty")
	ErrFormatOnlyValidForDateTime     = errors.New("format only valid for datetime type")
	ErrBoolValuesOnlyValidForBool     = errors.New("custom true/false values only valid for bool type")
	ErrRequireBothTrueAndFalseValues  = errors.New("require both true and false values")
	ErrTrueAndFalseValuesOverlap      = errors.New("true and values values overlap")
)
var (
	ErrNoHints = errors.New("hints are mandatory")
)
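
The field-related errors above suggest a set of consistency rules for a field hint. The sketch below is one plausible reading of those rules, inferred only from the error names and messages; `fieldHint` and `validate` are hypothetical stand-ins, and the package may check things in a different order or with additional conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// Error values copied verbatim from the variables above.
var (
	ErrIngestEmptyOnlyValidForStrings = errors.New("only strings can be empty")
	ErrFormatOnlyValidForDateTime     = errors.New("format only valid for datetime type")
	ErrBoolValuesOnlyValidForBool     = errors.New("custom true/false values only valid for bool type")
	ErrRequireBothTrueAndFalseValues  = errors.New("require both true and false values")
	ErrTrueAndFalseValuesOverlap      = errors.New("true and values values overlap")
)

// fieldHint is a local, trimmed-down stand-in for xsv.FieldHint.
type fieldHint struct {
	Type        string
	Format      string
	AllowEmpty  bool
	TrueValues  []string
	FalseValues []string
}

// validate is a hypothetical check that maps each error to one rule.
func validate(fh fieldHint) error {
	typ := fh.Type
	if typ == "" {
		typ = "string" // documented default type
	}
	if fh.AllowEmpty && typ != "string" {
		return ErrIngestEmptyOnlyValidForStrings
	}
	if fh.Format != "" && typ != "datetime" {
		return ErrFormatOnlyValidForDateTime
	}
	hasTrue, hasFalse := len(fh.TrueValues) > 0, len(fh.FalseValues) > 0
	if (hasTrue || hasFalse) && typ != "bool" {
		return ErrBoolValuesOnlyValidForBool
	}
	if hasTrue != hasFalse {
		return ErrRequireBothTrueAndFalseValues
	}
	seen := make(map[string]bool, len(fh.TrueValues))
	for _, v := range fh.TrueValues {
		seen[v] = true
	}
	for _, v := range fh.FalseValues {
		if seen[v] {
			return ErrTrueAndFalseValuesOverlap
		}
	}
	return nil
}

func main() {
	fmt.Println(validate(fieldHint{Type: "int", Format: "unix_seconds"}))
	fmt.Println(validate(fieldHint{Type: "bool", TrueValues: []string{"Y"}, FalseValues: []string{"N"}}))
}
```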

Functions

func Convert

func Convert(r io.Reader, dst *ion.Chunker, ch RowChopper, hint *Hint, cons []ion.Field) error

Convert reads all records from the reader, uses the specified chopper and hints to determine the individual fields, and writes the records to the ION chunker.

Types

type CsvChopper

type CsvChopper struct {
	// SkipRecords allows skipping the first
	// N records (useful when headers are used)
	SkipRecords int
	// Separator allows specifying a custom
	// separator (defaults to comma)
	Separator Delim
	// contains filtered or unexported fields
}

CsvChopper reads a CSV formatted file (RFC 4180) and splits each record into its individual fields.

func (*CsvChopper) GetNext

func (c *CsvChopper) GetNext(r io.Reader) ([]string, error)

GetNext fetches one CSV record and returns the individual columns. Because of quoting, a CSV record may span multiple lines of text.
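
To illustrate how quoting lets one record cover several lines, here is a small sketch that uses the standard library's encoding/csv reader rather than CsvChopper itself. The `readAll` helper and the sample input are made up for the illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readAll parses the whole input with the standard library CSV reader,
// which follows the same RFC 4180 quoting rules.
func readAll(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	return r.ReadAll()
}

func main() {
	// Four lines of text, but only three records: the quoted field in
	// the second record contains a line break.
	input := "id,comment\n1,\"first line\nsecond line\"\n2,plain\n"
	records, err := readAll(input)
	if err != nil {
		panic(err)
	}
	for _, rec := range records {
		fmt.Printf("%q\n", rec)
	}
}
```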

type Delim

type Delim rune

Delim is a rune that unmarshals from a string.

func (*Delim) UnmarshalJSON

func (d *Delim) UnmarshalJSON(b []byte) error

UnmarshalJSON implements json.Unmarshaler.

type FieldHint

type FieldHint struct {
	// Field-name (use dots to make it a subfield)
	Name string `json:"name,omitempty"`
	// Type of field (or ignore)
	Type string `json:"type,omitempty"`
	// Default value if the column is an empty string
	Default string `json:"default,omitempty"`
	// Ingestion format (i.e. different data formats)
	Format string `json:"format,omitempty"`
	// Allow empty values (only valid for strings) to
	// be ingested. If flag is set to false, then the
	// field won't be written for the record instead.
	AllowEmpty bool `json:"allow_empty,omitempty"`
	// Don't use sparse-indexing for this value.
	// (only valid for date-time type)
	NoIndex bool `json:"no_index,omitempty"`
	// Optional list of values that represent TRUE
	// (only valid for bool type)
	TrueValues []string `json:"true_values,omitempty"`
	// Optional list of values that represent FALSE
	// (only valid for bool type)
	FalseValues []string `json:"false_values,omitempty"`
	// Optional list of values that represent a
	// missing value
	MissingValues []string `json:"missing_values,omitempty"`
	// contains filtered or unexported fields
}

FieldHint defines whether and how a field should be imported.

func (*FieldHint) UnmarshalJSON

func (fh *FieldHint) UnmarshalJSON(data []byte) error

type Hint

type Hint struct {
	// SkipRecords allows skipping the first
	// N records (useful when headers are used)
	SkipRecords int `json:"skip_records,omitempty"`
	// Separator allows specifying a custom
	// separator (only applicable for CSV)
	Separator Delim `json:"separator,omitempty"`
	// MissingValues is an optional list of
	// strings which represent missing values.
	// Entries in Fields may override this on a
	// per-field basis.
	MissingValues []string `json:"missing_values,omitempty"`
	// Fields specifies the hint for each field
	Fields []FieldHint `json:"fields"`
}

Hint specifies the options and mandatory fields for parsing CSV/TSV files.

func ParseHint

func ParseHint(hint []byte) (*Hint, error)

ParseHint parses a JSON byte array into a Hint structure, which can later be used to pass type hints and/or other flags to the CSV/TSV parser.

The input must contain a valid JSON object, like:

{
  "fields": [
    {"name":"field", "type": "<type>"},
    {"name":"field.a", "type": "<type>", "default": "empty"},
    {"name":"field.b", "type": "datetime", "format": "epoch", "no_index": true},
    {"name":"anotherField", "type": "bool", "true_values": ["Y"], "false_values": ["N"]},
    ...
  ]
}

With TSV, each line represents a single record. The tab character is used to split the line into multiple fields. The 'fields' part of the hints is an ordered list that specifies the name and type of each field.

Each field will be given the specified 'name'. If no 'type' is specified, then 'string' is assumed. When the data contains more fields than are listed in 'fields', the extra fields are skipped.

If a field doesn't need to be ingested, then you can insert an empty record (or set the 'type' to "ignore" explicitly).

When there is no text between two tabs, the structure won't contain the field, unless a 'default' is specified (can be an empty string). Note that the default value should match the type.

Note that the 'name' can contain multiple levels, so nested objects can be created. This can be useful to group information in the ingested data.

Some values may be included in the sparse index. Set the 'no_index' field to `true` to prevent this behavior for the field.

Supported types:

  • string -> set 'allow_empty' if you want empty strings to be ingested
  • number -> either float or int
  • int
  • bool -> can support custom true/false values
  • datetime -> formats: text (default), epoch, epoch_ms, epoch_us, epoch_ns
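
As a rough illustration of the hint layout, the sketch below decodes a hint document with encoding/json into local stand-in structs and applies the documented defaults. It does not call ParseHint: `hint`, `fieldHint`, `decodeHint`, and `effectiveType` are hypothetical names, the structs mirror only a subset of the JSON tags shown earlier, and the real FieldHint has its own UnmarshalJSON that may behave differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldHint and hint are local stand-ins, not the xsv types.
type fieldHint struct {
	Name        string   `json:"name,omitempty"`
	Type        string   `json:"type,omitempty"`
	Default     string   `json:"default,omitempty"`
	TrueValues  []string `json:"true_values,omitempty"`
	FalseValues []string `json:"false_values,omitempty"`
}

type hint struct {
	SkipRecords int         `json:"skip_records,omitempty"`
	Fields      []fieldHint `json:"fields"`
}

// effectiveType applies the documented defaults: an empty record is
// ignored, and a named field without a type is treated as a string.
func effectiveType(fh fieldHint) string {
	switch {
	case fh.Type != "":
		return fh.Type
	case fh.Name == "":
		return "ignore"
	default:
		return "string"
	}
}

// decodeHint unmarshals the JSON document into the stand-in struct.
func decodeHint(data []byte) (*hint, error) {
	var h hint
	if err := json.Unmarshal(data, &h); err != nil {
		return nil, err
	}
	return &h, nil
}

func main() {
	data := []byte(`{
  "skip_records": 1,
  "fields": [
    {"name": "id", "type": "int"},
    {},
    {"name": "user.name"},
    {"name": "active", "type": "bool", "true_values": ["Y"], "false_values": ["N"]}
  ]
}`)
	h, err := decodeHint(data)
	if err != nil {
		panic(err)
	}
	for i, f := range h.Fields {
		fmt.Printf("column %d: name=%q type=%s\n", i, f.Name, effectiveType(f))
	}
}
```

Because 'fields' is ordered, the position of each entry decides which column it describes; the empty record in the second position is how a column is skipped in this model.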

type RowChopper

type RowChopper interface {
	// GetNext returns the next record and
	// splits it into individual columns
	GetNext(r io.Reader) ([]string, error)
}

RowChopper is implemented by types that fetch records row by row and chop each record into individual fields until the reader is exhausted.

type TsvChopper

type TsvChopper struct {
	// SkipRecords allows skipping the first
	// N records (useful when headers are used)
	SkipRecords int
	// contains filtered or unexported fields
}

TsvChopper reads a TSV formatted file and splits each line into its individual fields. The TSV format differs from CSV in that it doesn't support quoting to allow special characters; it uses escape sequences instead (i.e. \t, \r or \n).

func (*TsvChopper) GetNext

func (c *TsvChopper) GetNext(r io.Reader) ([]string, error)

GetNext fetches one TSV line and returns the individual columns. Each TSV record is always exactly one line.
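
Since a TSV record never spans lines, chopping reduces to splitting on tabs and then decoding escape sequences in each field. The sketch below shows that order of operations. `chopTSV` and `unescape` are hypothetical helpers, and treating `\\` as an escaped backslash (and passing unknown escapes through unchanged) is an assumption that the text above does not confirm.

```go
package main

import (
	"fmt"
	"strings"
)

// unescape decodes \t, \n and \r inside a single field. Handling of
// "\\" and of unknown escapes is an assumption of this sketch.
func unescape(field string) string {
	if !strings.Contains(field, `\`) {
		return field
	}
	var b strings.Builder
	for i := 0; i < len(field); i++ {
		c := field[i]
		if c != '\\' || i+1 >= len(field) {
			b.WriteByte(c)
			continue
		}
		i++
		switch field[i] {
		case 't':
			b.WriteByte('\t')
		case 'n':
			b.WriteByte('\n')
		case 'r':
			b.WriteByte('\r')
		case '\\':
			b.WriteByte('\\')
		default:
			// Unknown escape: keep it as written.
			b.WriteByte('\\')
			b.WriteByte(field[i])
		}
	}
	return b.String()
}

// chopTSV splits one line on literal tabs first, then unescapes each
// field, so an escaped tab never acts as a separator.
func chopTSV(line string) []string {
	fields := strings.Split(line, "\t")
	for i := range fields {
		fields[i] = unescape(fields[i])
	}
	return fields
}

func main() {
	line := "id-1\tfirst\\nsecond\tcol\\twith tab"
	for _, f := range chopTSV(line) {
		fmt.Printf("%q\n", f)
	}
}
```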
