permissivecsv

PermissiveCSV is a CSV reader that reads non-standards-compliant CSVs. It tolerates inconsistencies in files in exchange for the caller taking on responsibility for potential misreads.

Most CSV readers work from the assumption that the inbound CSV is standards-compliant. As such, typical CSV readers will return errors any time they are unable to parse a record or field due to things like terminator or delimiter inconsistency and field count mismatches.

However, in some use cases, a more permissive reader is desired. PermissiveCSV allows for certain inconsistencies in files by deferring responsibility for data validation to the caller. Instead of trying to enforce standards compliance, PermissiveCSV makes its best judgement about what is happening in a file and returns the most consistent results possible given those assumptions. Rather than returning errors, PermissiveCSV adjusts its output as best it can for consistency, and, once scanning is complete, reports any alterations that were made via a Summary.

Features

Sloppy-Terminator Support

PermissiveCSV will detect and read CSVs with either unix (\n), DOS (\r\n), inverted DOS (\n\r), or carriage return (\r) record terminators. Furthermore, the terminator is permitted to be inconsistent from record to record.

When scanning a search space for a terminator, PermissiveCSV will select the first non-quoted terminator it encounters using the following order:

  1. DOS (\r\n)
  2. Inverted DOS (\n\r)
  3. unix (\n)
  4. Carriage Return (\r) - a bare carriage return will only be selected if no other possible terminator exists within the current search space (even if the carriage return is found earlier in the space than other terminators).

Terminator evaluation order

  • PermissiveCSV doesn't make any a priori assumptions about a file-author's intent.
    • Terminators are evaluated solely on the context of the current search space within a file.
    • To accomplish detection, terminators are evaluated first by length, then by priority within a length.
    • Terminators can vary in byte-length, and terminators can be composites of each other (for instance, a DOS terminator is a composite of a unix terminator and a carriage return).
    • The search algorithm gives priority to longer terminators, to ensure that it does not mistakenly select a terminator which is actually a sub-element of a larger composite terminator.
      • Example: Selecting \n as the terminator when the terminator was actually \r\n.
  • Within each terminator length, a priority order is utilized.
    • Example: Between DOS and Inverted DOS, both of which have a length of two, DOS has priority.
    • Similarly, between unix and Carriage Return, both of which have a length of 1, unix has priority. (A sketch of this length-then-priority ordering appears below.)
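The following is a minimal sketch of the length-then-priority ordering described above. The names and structure are hypothetical, and quote-awareness and search-space expansion are omitted; this is an illustration, not the library's actual implementation.

package main

import (
	"fmt"
	"strings"
)

// candidates are ordered by length (longest first), then by priority within a
// length: DOS, then inverted DOS, then unix.
var candidates = []string{"\r\n", "\n\r", "\n"}

// detectTerminator returns the highest-priority terminator present in the
// search space. A bare carriage return is only selected when nothing else is
// present, regardless of where it appears in the space.
func detectTerminator(space string) string {
	for _, t := range candidates {
		if strings.Contains(space, t) {
			return t
		}
	}
	if strings.Contains(space, "\r") {
		return "\r"
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", detectTerminator("a,b\r\nc,d")) // "\r\n" wins over "\n"
	fmt.Printf("%q\n", detectTerminator("a,b\nc,d"))   // "\n"
	fmt.Printf("%q\n", detectTerminator("a,b\rc,d"))   // "\r" as a last resort
}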

Ignoring Terminators

  • Terminators that fall anywhere inside a pair of double quotes are ignored.
  • Outside of double quotes, potential terminators are ignored only if a more likely terminator has been selected for the current record.
    • Example: If a potential record contains a carriage return and a newline separated by one or more other characters, the newline will be used as the terminator, and the carriage return will be ignored (even though it may not be quoted).
  • Leading terminators are ignored.
    • Leading terminators are one or more terminator tokens at the top of a file with no data present between tokens.
  • Dangling terminators are ignored.
    • Dangling terminators are one or more terminator tokens at the end of a file with no apparent data between tokens.
  • Stuttering terminators are ignored.
    • Stuttering terminators are two or more successive terminators with no intermediate data. (The example below shows all three cases being skipped.)
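For illustration, here is a short runnable example (input inlined for brevity) in which a leading terminator, a stuttering pair, and dangling terminators are all skipped; the expected records are noted in the comment.

package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	// A leading terminator, a stuttering pair between the two records, and
	// dangling terminators at the end of the input.
	data := strings.NewReader("\na,b,c\n\nd,e,f\n\n")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		fmt.Println(s.CurrentRecord())
	}
	// expected output:
	// [a b c]
	// [d e f]
}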

Inconsistent-Record-Length Handling

PermissiveCSV presumes that the number of fields in the first record of the file is the intended field count for the entire file. For all subsequent records (the sketch following this list demonstrates both adjustments):

  • If the number of fields is less than expected, blank fields are appended to the record.
  • If the number of fields is greater than expected, the right-hand side of the record is truncated, such that the number of fields matches the expected field count.
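A short runnable sketch of both adjustments (input inlined for brevity); the expected results are noted in the comments.

package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	// The first record establishes an expected field count of 3.
	data := strings.NewReader("a,b,c\nd,e\nf,g,h,i")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		fmt.Printf("%q\n", s.CurrentRecord())
	}
	// expected output:
	// ["a" "b" "c"]
	// ["d" "e" ""]   <- padded with a blank field
	// ["f" "g" "h"]  <- truncated to the expected field count
}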

Botched-Quote Handling

PermissiveCSV handles two common forms of malformed quotes.

  • Bare quotes: "grib"flar,foo
  • Extraneous quotes: grib,"flar,foo

Bare and extraneous quotes are handled similarly. In either of these conditions, PermissiveCSV will return a record that contains empty fields. See Inconsistent-Record-Length Handling for information about how the number of fields is deduced.

PermissiveCSV differs from the standard library's csv.Reader in how it handles quote errors. If lazy quotes are enabled, csv.Reader pushes all of the data for the botched record into a single field. By contrast, PermissiveCSV returns a set of empty fields. This behavior ensures that the data returned for records with malformed quotes is as consistent as possible across all records that share the same issue. When PermissiveCSV encounters a malformed quote, that encounter, along with the original data, is made immediately available via the Summary method. This reinforces the Summary method as the central source for identifying and acting upon assumptions that PermissiveCSV makes while scanning a file.
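As an illustration, the following sketch (input inlined for brevity) feeds a record containing a bare quote to the Scanner and then inspects the Summary; the expected behavior is noted in the comments.

package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	// The second record contains a bare quote ("grib"flar).
	data := strings.NewReader("a,b\n\"grib\"flar,foo")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		fmt.Printf("%q\n", s.CurrentRecord())
	}
	// The botched record is expected to come back as empty fields, and the
	// Summary is expected to report a "bare quote" alteration along with the
	// original data.
	for _, alt := range s.Summary().Alterations {
		fmt.Println(alt.AlterationDescription, "->", alt.OriginalData)
	}
}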

Header Detection

PermissiveCSV contains three header detection modes.

  1. Assume there is a header.
  2. Assume there is no header.
  3. Custom detection.

  // Example 1: Setting up a Scanner that assumes there is always a header.
  f, _ := os.Open("somefile.csv")
  s := permissivecsv.NewScanner(f, permissivecsv.HeaderCheckAssumeHeaderExists)
  s.Scan()
  fmt.Print(s.RecordIsHeader())
  //output: true

  // Example 2: Setting up a Scanner that assumes there is no header.
  f, _ := os.Open("somefile.csv")
  s := permissivecsv.NewScanner(f, permissivecsv.HeaderCheckAssumeNoHeader)
  s.Scan()
  fmt.Print(s.RecordIsHeader())
  //output: false

  // Example 3: Custom detection logic: if the first field is "address", the
  // record is a header. This is a trivial example. See the docs for more
  // information about how the HeaderCheck callback operates.
  headerCheck := func(firstRecord []string) bool {
    return firstRecord != nil && firstRecord[0] == "address"
  }
  f, _ := os.Open("somefile.csv")
  s := permissivecsv.NewScanner(f, headerCheck)
  s.Scan()
  fmt.Print(s.RecordIsHeader())
  //output: true

Partitioning Support

PermissiveCSV provides a Partition method, which takes a desired partition size and returns a slice of byte offsets representing the beginning of each partition. Partitioning is guaranteed to work properly even if the file contains a mixture of record terminators.

"Errorless" Behavior

PermissiveCSV tries hard to avoid returning errors. Because it is permissive, it will do everything it can to return data in a consistent format.

In lieu of returning errors, PermissiveCSV has a Summary() method, which can be called after each call to Scan. Summary() returns an object with statistics about any actions that PermissiveCSV needed to take while reading the file in order to get it into a consistent shape.

For instance, any time a record is appended to or truncated as the result of being an unexpected length, the altered record number and operation type (append or truncate) are noted, and reported via the Summary() method after the Scan is complete.

PermissiveCSV has no control over the reader that has been supplied by the caller. If the underlying reader returns an error, that error will be made available via the Summary().Err value. Outside of that, PermissiveCSV will not return any errors so long as the supplied reader continues to supply data.
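For example, a caller might watch for reader errors like this (a minimal sketch; "somefile.csv" is a hypothetical input file):

package main

import (
	"fmt"
	"os"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	f, err := os.Open("somefile.csv") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	s := permissivecsv.NewScanner(f, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		_ = s.CurrentRecord() // process each record
	}
	// Scan never returns an error; any error from the underlying reader
	// surfaces here instead.
	if summary := s.Summary(); summary != nil && summary.Err != nil {
		fmt.Println("reader error:", summary.Err)
	}
}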

Documentation

Overview

Package permissivecsv provides facilities for permissively reading non-standards-compliant CSV files.

Constants

const (
	// AltBareQuote is the description for bare-quote record alterations.
	AltBareQuote = "bare quote"

	// AltExtraneousQuote is the description for extraneous-quote record alterations.
	AltExtraneousQuote = "extraneous quote"

	// AltTruncatedRecord is the description for truncated record alterations.
	AltTruncatedRecord = "truncated record"

	// AltPaddedRecord is the description for padded record alterations.
	AltPaddedRecord = "padded record"
)

Variables

var (
	// ErrReaderIsNil is returned in the Summary if Scan is called but the
	// reader that the Scanner was initialized with is nil.
	ErrReaderIsNil = fmt.Errorf("reader is nil")
)

Functions

This section is empty.

Types

type Alteration

type Alteration struct {
	RecordOrdinal         int
	OriginalData          string
	ResultingRecord       []string
	AlterationDescription string
}

Alteration describes a change that the Scanner made to a record because the record was in an unexpected format.
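A sketch of inspecting alterations after a scan, switching on the alteration-description constants defined above (input inlined for brevity):

package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	// The second record is short by one field and is expected to be padded.
	data := strings.NewReader("a,b,c\nd,e")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		continue
	}
	for _, alt := range s.Summary().Alterations {
		switch alt.AlterationDescription {
		case permissivecsv.AltPaddedRecord, permissivecsv.AltTruncatedRecord:
			fmt.Printf("record %d (%s): %q -> %q\n", alt.RecordOrdinal,
				alt.AlterationDescription, alt.OriginalData, alt.ResultingRecord)
		}
	}
}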

type HeaderCheck

type HeaderCheck func(firstRecord []string) bool

HeaderCheck is a function that evaluates whether or not firstRecord is a header. HeaderCheck is called by the RecordIsHeader method, and is supplied values according to the current state of the Scanner.

firstRecord is the first record of the file. firstRecord will be nil in the following conditions:

  • Scan has not been called.
  • The file is empty.
  • The Scanner has advanced beyond the first record.

var HeaderCheckAssumeHeaderExists HeaderCheck = func(firstRecord []string) bool {
	return firstRecord != nil
}

HeaderCheckAssumeHeaderExists returns true unless firstRecord is nil.

var HeaderCheckAssumeNoHeader HeaderCheck = func(firstRecord []string) bool {
	return false
}

HeaderCheckAssumeNoHeader is a HeaderCheck that instructs the RecordIsHeader method to report that no header exists for the file being scanned.

type ScanSummary

type ScanSummary struct {
	RecordCount     int
	AlterationCount int
	Alterations     []*Alteration
	EOF             bool
	Err             error
}

ScanSummary contains information about assumptions or alterations that have been made via any calls to Scan.

func (*ScanSummary) String

func (s *ScanSummary) String() string

String returns a prettified representation of the summary.

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Scanner provides methods for permissively reading CSV input. Successive calls to the Scan method will step through the records of a file.

Terminators (line endings) can be any (or a mix) of DOS (\r\n), inverted DOS (\n\r), unix (\n), or carriage return (\r) tokens. When scanning, the scanner looks for the next occurrence of any known token within a search space.

Any tokens that fall within a pair of double quotes are ignored.

If no tokens are found within the current search space, the space is expanded until either a token or EOF is reached.

If only one token is found in the current space, that token is presumed to be the terminator for the current record.

If more than one potential token is identified in the current space, the Scanner will select the first non-quoted, highest-priority token. The Scanner first gives priority to token length: longer tokens have higher priority than shorter tokens. This priority avoids lexicographical confusion between shorter tokens and longer tokens that are actually composites of the shorter tokens. Thus, DOS and inverted DOS terminators have the highest priority, as they are longer than unix or carriage return terminators. Between two or more tokens of the same length, the Scanner gives priority to tokens that are more common. Thus DOS has higher priority than inverted DOS, because inverted DOS is a non-standard terminator. Similarly, between unix and carriage return, unix has priority, as bare carriage returns are a non-standard terminator. Finally, since carriage returns are quite rare as terminators, a carriage return will only be selected if there are no other possible terminators present in the current search space.

The preceding terminator detection process is repeated for each record that is scanned.

Once a record is identified, it is split into fields using standard CSV encoding rules. A mixture of quoted and unquoted field values is permitted, and fields are presumed to be separated by commas. The first record scanned is always presumed to have the correct number of fields. For each subsequent record, if the record has fewer fields than expected, the scanner will pad the record with blank fields to accommodate the missing data. If the record has more fields than expected, the scanner will truncate the record so its length matches the desired length. Information about padded or truncated records is made available via the Summary method once scanning is complete.

When parsing the fields of a record, the Scanner might encounter ambiguous double quotes. Two common quote ambiguities are handled by the Scanner. 1) Bare-Quotes, where a field contains two quotes, but also appears to have data outside of the quotes. 2) Extraneous-Quotes, where a record appears to have an odd number of quotes, making it impossible to determine if a quote was left unclosed, or if the extraneous quote was supposed to be escaped. If the Scanner encounters quotes that are ambiguous, it will return empty fields in place of any data that might have been present, as the Scanner is unable to make any assumptions about the author's intentions. When such replacements are made, the type of replacement, record number, and original data are all immediately available via the Summary method.

func NewScanner

func NewScanner(r io.Reader, headerCheck HeaderCheck) *Scanner

NewScanner returns a new Scanner to read from r.

func (*Scanner) CurrentRecord

func (s *Scanner) CurrentRecord() []string

CurrentRecord returns the most recent record generated by a call to Scan.

func (*Scanner) Partition

func (s *Scanner) Partition(n int, excludeHeader bool) []*Segment

Partition reads the full file and divides it into a series of partitions. Every partition is guaranteed to contain exactly n non-empty records, except for the final partition, which may contain fewer.

Each partition is represented by a Segment, which contains an Ordinal (an integer value representing the segment's placement relative to other segments), the lower byte offset where the partition starts, and the segment length, which is the partition's size in bytes. If the file being read is empty (0 bytes), Partition will return an empty slice of segments.

If excludeHeader is true, Partition will check if a header exists. If a header is detected, the first Segment will ignore the header, and the LowerOffset value will be the first byte position after the header record.

If excludeHeader is false, the LowerOffset of the first segment will always be 0 (regardless of whether the first record is a header or not).

Partition is designed to be used in conjunction with byte-offset seekers such as os.File.Seek or bufio.Reader.Discard in situations where files need to be accessed in a concurrent manner.

Before processing, Partition explicitly resets the underlying reader to the top of the file. Thus, using Partition in conjunction with Scan could have undesired results.

Example

Note that, in this example, we are assuming the header exists, and are also instructing Partition to exclude the header from the segments. This is why segment 1 starts at offset 6, just after the header record.

package main

import (
	"encoding/json"
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,e,f\ng,h,i\nj,k,l\n")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeHeaderExists)
	recordsPerPartition := 2
	excludeHeader := true
	partitions := s.Partition(recordsPerPartition, excludeHeader)

	// serializing to JSON just to prettify the output.
	segmentJSON, _ := json.MarshalIndent(partitions, "", "  ")
	fmt.Println(string(segmentJSON))
}
Output:

[
  {
    "Ordinal": 1,
    "LowerOffset": 6,
    "Length": 12
  },
  {
    "Ordinal": 2,
    "LowerOffset": 18,
    "Length": 6
  }
]
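
Building on the example above, the following sketch pairs Partition with os.File.Seek to process segments concurrently. The file name, partition size, and processSegment worker are hypothetical; each goroutine opens its own file handle so the seeks do not interfere with one another.

package main

import (
	"io"
	"os"
	"sync"

	"github.com/eltorocorp/permissivecsv"
)

// processSegment is a hypothetical worker. It seeks to the segment's lower
// offset and scans only the bytes that belong to that segment.
func processSegment(path string, seg *permissivecsv.Segment, wg *sync.WaitGroup) {
	defer wg.Done()
	f, err := os.Open(path)
	if err != nil {
		return
	}
	defer f.Close()
	if _, err := f.Seek(seg.LowerOffset, io.SeekStart); err != nil {
		return
	}
	s := permissivecsv.NewScanner(io.LimitReader(f, seg.Length), permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		_ = s.CurrentRecord() // process each record
	}
}

func main() {
	f, err := os.Open("somefile.csv") // hypothetical input file
	if err != nil {
		panic(err)
	}
	segments := permissivecsv.NewScanner(f, permissivecsv.HeaderCheckAssumeHeaderExists).Partition(1000, true)
	f.Close()

	var wg sync.WaitGroup
	for _, seg := range segments {
		wg.Add(1)
		go processSegment("somefile.csv", seg, &wg)
	}
	wg.Wait()
}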

func (*Scanner) RecordIsHeader

func (s *Scanner) RecordIsHeader() bool

RecordIsHeader returns true if the current record has been identified as a header. RecordIsHeader determines if the current record is a header by calling the HeaderCheck callback which was supplied to NewScanner when the Scanner was instantiated.

Example (AssumeHeaderExists)
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,e,f")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeHeaderExists)
	for s.Scan() {
		fmt.Println(s.RecordIsHeader())
	}
}
Output:

true
false
Example (AssumeNoHeader)
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,e,f")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		fmt.Println(s.RecordIsHeader())
	}
}
Output:

false
false
Example (CustomDetection)

This example demonstrates implementing custom header detection logic. The example shows how to properly check for nil conditions, and how the first record of a file can be evaluated when determining whether the first record is a header. This is a fairly trivial example of header detection. Review the HeaderCheck docs for a full list of implementation considerations.

package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	headerCheck := func(firstRecord []string) bool {
		// firstRecord will be nil if Scan has not been called, if the file is
		// empty, or the Scanner has advanced beyond the first record.
		if firstRecord == nil {
			return false
		}

		return firstRecord[0] == "a"
	}

	data := strings.NewReader("a,b,c\nd,e,f")
	s := permissivecsv.NewScanner(data, headerCheck)
	for s.Scan() {
		fmt.Println(s.RecordIsHeader())
	}
}
Output:

true
false

func (*Scanner) Reset

func (s *Scanner) Reset()

Reset resets the Scanner and clears any summary data that previous calls to Scan may have generated. Note that since the Scanner wraps a Reader, it is the consumer's responsibility to verify the position in the byte stream from which the Scanner will read.
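A minimal sketch of a rescan: strings.Reader implements io.Seeker, so the caller can rewind the stream to the top of the input before calling Reset.

package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,e,f")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		continue
	}

	// Rewind the underlying reader, then clear the Scanner's state.
	data.Seek(0, io.SeekStart)
	s.Reset()
	for s.Scan() {
		fmt.Println(s.CurrentRecord())
	}
}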

func (*Scanner) Scan

func (s *Scanner) Scan() bool

Scan advances the scanner to the next non-empty record, which is then available via the CurrentRecord method. Scan returns false when it reaches the end of the file. Once scanning is complete, subsequent scans will continue to return false until the Reset method is called.

Scan skips what it considers "empty records". An empty record occurs any time one or more terminators are present with no surrounding data.

If the underlying Reader is nil, Scan will return false on the first call. In all other cases, Scan will return true on the first call. This is done to allow the caller to explicitly inspect the resulting record (even if said record is empty).

Example
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c/nd,e,f")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeNoHeader)
	for s.Scan() {
		fmt.Println(s.CurrentRecord())
	}
}
Output:

[a b c]
[d e f]

func (*Scanner) Summary

func (s *Scanner) Summary() *ScanSummary

Summary returns a summary of information about the assumptions or alterations that were made during the most recent Scan. If the Scan method has not been called, or Reset was called after the last call to Scan, Summary will return nil. Summary will continue to collect data each time Scan is called, and will only reset after the Reset method has been called.

Example
package main

import (
	"fmt"
	"strings"

	"github.com/eltorocorp/permissivecsv"
)

func main() {
	data := strings.NewReader("a,b,c\nd,ef\ng,h,i")
	s := permissivecsv.NewScanner(data, permissivecsv.HeaderCheckAssumeHeaderExists)
	for s.Scan() {
		continue
	}
	summary := s.Summary()
	fmt.Println(summary.String())
}
Output:

Scan Summary
---------------------------------------
  Records Scanned:    3
  Alterations Made:   1
  EOF:                true
  Err:                none
  Alterations:
    Record Number:    2
    Alteration:       padded record
    Original Data:    d,ef
    Resulting Record: ["d","ef",""]

type Segment

type Segment struct {
	Ordinal     int64
	LowerOffset int64
	Length      int64
}

Segment represents a byte range within a file that contains a subset of records.

