seesv


README

SeeSV

SeeSV is a library that provides fast, memory-efficient access to data in very large delimited data files (CSV, TSV, pipe-delimited, etc.).

This library grew out of the frustration of reading and analyzing very large (multi-gigabyte, 5+ million row) CSV files and other delimited formats. There are few tools for quickly finding problems in such files, generating a filtered subset, or locating specific data, and there was a constant need for both a command line tool that can be used on a server over an SSH connection and a desktop tool that can quickly churn through huge files. Many tools exist, but most fall flat when handling very large files: they either take forever to open the file, end up crashing by running out of memory, or both.

The goal of SeeSV is to provide a reusable library that encapsulates handling of delimited files and enables cross-platform command line and desktop GUI applications that give the user super fast viewing of data files.

SeeSV Features
  • Blazing fast loading of files, so the user gets to a productive state with the data within seconds.
  • Minimal memory consumption, even for extremely large files.
  • The ability to jump to any area of a file in O(1) time, then provide a bounded or unbounded stream of parsed records from that point.
  • Handling of files with or without headers, and of files with extra header lines (such as file summary metadata) that should be ignored.
  • A simple, intuitive API.
Design

When a file is opened, SeeSV performs a number of discovery tasks:

  1. Skips past unwanted rows at the top of the file.
  2. Extracts the column headers, if the file contains any. The headers are stored in a list accessible as a property of the DelimitedFile object.
  3. Scans the file to build an internal index of the byte position at which every line (row) starts, excluding the headers. This index is the only aspect of SeeSV that may use a significant, though still modest, amount of memory: in testing, the scan of a 5 million row (2.3 GB) file took around 4 to 6 seconds and produced an index using 18 MB of memory. The index enables a constant-time seek to any part of the file by row number (a minimal sketch of this indexing idea follows this list).
  4. As a consequence of step 3, the row count is obtained and made available as a property of the DelimitedFile object.
  5. The file size is also made available through a property.
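
The sketch below illustrates the indexing idea from step 3. It is not SeeSV's actual implementation: buildIndex and readRowAt are hypothetical names, the snippet indexes every line (including any header line), and it assumes Unix-style '\n' line endings.

package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "os"
)

// buildIndex scans the file once and records the byte offset at which each
// line starts, so a later jump to line N is a slice lookup plus one Seek.
func buildIndex(f *os.File) ([]int64, error) {
    if _, err := f.Seek(0, 0); err != nil {
        return nil, err
    }
    var offsets []int64
    var pos int64
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        offsets = append(offsets, pos)
        // Scanner strips the trailing '\n' from Bytes(), so add 1 back.
        pos += int64(len(scanner.Bytes())) + 1
    }
    return offsets, scanner.Err()
}

// readRowAt seeks directly to a previously indexed offset and reads one line.
func readRowAt(f *os.File, offset int64) (string, error) {
    if _, err := f.Seek(offset, 0); err != nil {
        return "", err
    }
    line, err := bufio.NewReader(f).ReadString('\n')
    if err == io.EOF && len(line) > 0 {
        err = nil // the last line of the file may not end with a newline
    }
    return line, err
}

func main() {
    f, err := os.Open("testdata/test.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    index, err := buildIndex(f)
    if err != nil {
        log.Fatal(err)
    }

    // Jump straight to line 3 (zero-based) without re-reading lines 0-2.
    row, err := readRowAt(f, index[3])
    if err != nil {
        log.Fatal(err)
    }
    fmt.Print(row)
}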

Examples

Example 1: Open a normal file that has a header line:

package main

import (
    "fmt"
    "log"

    "davealexis/seesv"
)

func main() {
    var csvFile seesv.DelimitedFile
    err := csvFile.Open("testdata/test.csv", 0, true)
    if err != nil {
        log.Fatal("Failed to open file")
    }
    defer csvFile.File.Close()

    fmt.Println("The file has", csvFile.RowCount, "rows. Size is", csvFile.Size, "bytes.")

    fmt.Println(csvFile.Headers)

    for row := range csvFile.Rows(0, -1) {
        // Do something with row
    }
}

Example 2: Similar to Example 1, except the file contains two extra metadata lines at the top before the column headers that we want to skip:

package main

import (
    "fmt"
    "log"

    "davealexis/seesv"
)

func main() {
    var csvFile seesv.DelimitedFile
    err := csvFile.Open("testdata/test.csv", 2, true)
    if err != nil {
        log.Fatal("Failed to open file")
    }
    defer csvFile.File.Close()

    fmt.Println(csvFile.Headers)

    // Get 1,000 rows starting from row 25,000
    for row := range csvFile.Rows(25_000, 1_000) {
        // Do something with row
    }
}

Example 3: The file does not contain a header row, so we just want access to the data:

package main

import (
    "log"

    "davealexis/seesv"
)

func main() {
    var csvFile seesv.DelimitedFile
    err := csvFile.Open("testdata/test.csv", 0, false)
    if err != nil {
        log.Fatal("Failed to open file")
    }
    defer csvFile.File.Close()

    for row := range csvFile.Rows(0, -1) {
        // Do something with row
    }
}

Example 4: We just want to get a single row from the file:

package main

import (
    "fmt"
    "log"

    "davealexis/seesv"
)

func main() {
    var csvFile seesv.DelimitedFile
    err := csvFile.Open("testdata/test.csv", 0, false)
    if err != nil {
        log.Fatal("Failed to open file")
    }
    defer csvFile.File.Close()

    // Get row 10
    row := csvFile.Row(10)
    fmt.Println(row)
}

Roadmap

The following are some of the features planned for future releases:

  • Automatically detect column data types
  • Allow the user to supply a column schema
  • Support for file formats other than CSV:
    • Tab-separated
    • Pipe-delimited
    • JSON (?)
    • Compressed files (e.g. myfile.csv.gz)
  • Auto-detect which line contains headers (e.g. ignore any metadata rows at the top of the file)
  • Filters
  • Projections - get specified columns instead of all columns
TODO:
  • Column map - csvFile.Columns["NAME"].ID, .Type, etc.
  • Progress and error reporter channels
  • Pipe delimited files
  • Tab delimited files
  • Fixed length files
  • Named access to column data, like Python's DictReader (see the sketch below)
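
Until named access lands, a caller can layer it on the existing Headers and Row API. A minimal sketch; asRecord is a hypothetical helper, not part of SeeSV:

// asRecord pairs each header name with the value in the same column position.
func asRecord(headers, row []string) map[string]string {
    record := make(map[string]string, len(headers))
    for i, name := range headers {
        if i < len(row) {
            record[name] = row[i]
        }
    }
    return record
}

// Usage:
//     record := asRecord(csvFile.Headers, csvFile.Row(10))
//     fmt.Println(record["NAME"])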

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DelimitedFile

type DelimitedFile struct {
	File     *os.File
	RowCount int64
	RowIndex []int64
	Headers  []string
	Size     int64
}

DelimitedFile holds metadata about the file being worked with, and exposes methods to work with the file.

func (*DelimitedFile) Close

func (df *DelimitedFile) Close()


func (*DelimitedFile) Open

func (df *DelimitedFile) Open(filePath string, linesToSkip int, hasHeader bool) error

Open initializes access to a delimited file (e.g. a CSV file). This includes gathering stats (file size, number of rows), optionally skipping irrelevant lines at the top of the file, parsing the column headers, and creating an index of row positions to enable O(1) access to any part of the file.

Example:

var csvFile seesv.DelimitedFile
err := csvFile.Open("/path/to/file.csv", 1, true)

This opens the specified CSV file, indicating that the first line of the file should be skipped and that the file contains a column header line.

for row := range csvFile.Rows(csvFile.RowCount-10, -1) {
        ...
}

This returns the last 10 rows of the file.

func (DelimitedFile) Row

func (df DelimitedFile) Row(rowNumber int64) []string

Row returns the row specified by rowNumber. Row numbers are zero-based, so .Row(0) returns the first data row of the file.

Row data is returned as a slice of strings.

Nil is returned if the specified row number is greater than the number of rows in the file or there is an error parsing the data.
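
A minimal usage sketch based on the behavior described above, assuming csvFile is an opened DelimitedFile as in the README examples:

row := csvFile.Row(10)
if row == nil {
    log.Fatal("row 10 is out of range or could not be parsed")
}
fmt.Println(row)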

func (DelimitedFile) Rows

func (df DelimitedFile) Rows(rowNumber int64, rowCount int64) <-chan []string

Rows returns a stream of data rows starting from row `rowNumber` in the file.

If `rowCount` is specified, that number of rows will be returned, unless the end of the file is reached.

If -1 is specified for `rowCount` then the stream will start at `rowNumber` and continue until the end of the file.
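
A short usage sketch of both forms, again assuming csvFile is an opened DelimitedFile:

// Stream 100 rows starting at row 500 (fewer if the end of the file is reached).
for row := range csvFile.Rows(500, 100) {
    fmt.Println(row)
}

// Stream from row 600 to the end of the file.
for row := range csvFile.Rows(600, -1) {
    fmt.Println(row)
}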

Directories

Path Synopsis
examples
