
multicorecsv

A multicore CSV library in Go that is ~3x faster than plain encoding/csv.

No newline support on multicorecsv.Reader!

  • multicorecsv does not support reading CSV files with properly quoted/escaped newlines! If you have \n in your source data fields, multicorecsv's Read() will not work for you.

API Changes from encoding/csv

  • multicorecsv is an almost drop-in replacement for encoding/csv. There is only one new requirement: you must call the Close() method. Best practice is to defer (reader/writer).Close().
package main

import (
	"fmt"
	"io"
	"log"
	"strings"

	"github.com/mzimmerman/multicorecsv" // import path assumed; not shown on this page
)

func main() {
	in := `first_name,last_name,username
"Rob","Pike",rob
Ken,Thompson,ken
"Robert","Griesemer","gri"
`
	// NewReader takes a chunk size (see Types below); 50 is the documented default.
	r := multicorecsv.NewReader(strings.NewReader(in), 50)
	defer r.Close() // the underlying strings.Reader cannot be closed,
	// but that doesn't matter; multicorecsv still needs to clean up
	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(record)
	}
}

Performance

  • With Reader, multicorecsv splits the input up by line, hands those lines out to different cores to parse, then puts the parsed records back in proper line order for the reader
  • With Writer, multicorecsv sends batches of lines off to be encoded, then writes the results out in order

Performance Tweaks

  • Before calling Read (or at any time with Write), you can set ChunkSize, the number of lines handed to each goroutine at a time; see the sketch after this list
  • ChunkSize defaults to 50 - for shorter lines of data, give it a higher value; for larger lines, give it a lower one
  • 50 is a general sweet spot for the data generated in the benchmarks
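
As a minimal sketch of this tuning, a writer emitting many short rows might raise ChunkSize above the default (the import path github.com/mzimmerman/multicorecsv is an assumption; it isn't shown on this page):

package main

import (
	"log"
	"os"

	"github.com/mzimmerman/multicorecsv" // import path assumed
)

func main() {
	w := multicorecsv.NewWriter(os.Stdout)
	defer w.Close()
	// These rows are short (two tiny fields), so a chunk size above the
	// default of 50 amortizes the per-chunk handoff to each goroutine.
	w.ChunkSize = 200
	for i := 0; i < 1000; i++ {
		if err := w.Write([]string{"a", "b"}); err != nil {
			log.Fatal(err)
		}
	}
	w.Flush()
	if err := w.Error(); err != nil {
		log.Fatal(err)
	}
}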

Metrics (finally!)

  • tests run on Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
  • multicorecsv Read() beats encoding/csv by ~3x with 8 CPUs and is about equal on single-core tasks
  • multicorecsv Write() beats encoding/csv by ~3x with 8 CPUs and is about equal on single-core tasks
  • Benchmarks on other hardware are appreciated!
BenchmarkRead1                     50000             32796 ns/op
BenchmarkRead1-2                  100000             22398 ns/op
BenchmarkRead1-4                  100000             17680 ns/op
BenchmarkRead1-8                  100000             16575 ns/op
BenchmarkRead1-16                 100000             16022 ns/op
BenchmarkRead10                    50000             32064 ns/op
BenchmarkRead10-2                 100000             19812 ns/op
BenchmarkRead10-4                 100000             14199 ns/op
BenchmarkRead10-8                 100000             10931 ns/op
BenchmarkRead10-16                200000             10726 ns/op
BenchmarkRead50                    50000             34506 ns/op
BenchmarkRead50-2                 100000             19202 ns/op
BenchmarkRead50-4                 100000             13262 ns/op
BenchmarkRead50-8                 200000             10555 ns/op
BenchmarkRead50-16                200000             10781 ns/op
BenchmarkRead100                   50000             35907 ns/op
BenchmarkRead100-2                100000             18461 ns/op
BenchmarkRead100-4                100000             13138 ns/op
BenchmarkRead100-8                200000             10364 ns/op
BenchmarkRead100-16               200000             10513 ns/op
BenchmarkRead1000                  50000             34773 ns/op
BenchmarkRead1000-2               100000             18581 ns/op
BenchmarkRead1000-4               100000             11184 ns/op
BenchmarkRead1000-8               200000             11484 ns/op
BenchmarkRead1000-16              100000             10061 ns/op
BenchmarkEncodingCSVRead               50000             27706 ns/op
BenchmarkEncodingCSVRead-2             50000             27765 ns/op
BenchmarkEncodingCSVRead-4             50000             28126 ns/op
BenchmarkEncodingCSVRead-8             50000             28090 ns/op
BenchmarkEncodingCSVRead-16            50000             28457 ns/op
BenchmarkWrite1                       50          25826817 ns/op
BenchmarkWrite1-2                    100          19699325 ns/op
BenchmarkWrite1-4                    100          15331869 ns/op
BenchmarkWrite1-8                    100          13768925 ns/op
BenchmarkWrite1-16                   100          13574201 ns/op
BenchmarkWrite10                      50          23285415 ns/op
BenchmarkWrite10-2                   100          12505588 ns/op
BenchmarkWrite10-4                   200           6953782 ns/op
BenchmarkWrite10-8                   200           6870859 ns/op
BenchmarkWrite10-16                  200           7186447 ns/op
BenchmarkWrite50                      50          23483132 ns/op
BenchmarkWrite50-2                   100          12319305 ns/op
BenchmarkWrite50-4                   200           6581955 ns/op
BenchmarkWrite50-8                   200           6711311 ns/op
BenchmarkWrite50-16                  200           6912013 ns/op
BenchmarkWrite100                     50          23413136 ns/op
BenchmarkWrite100-2                  100          12094800 ns/op
BenchmarkWrite100-4                  200           8499881 ns/op
BenchmarkWrite100-8                  200           7291376 ns/op
BenchmarkWrite100-16                 200           7288081 ns/op
BenchmarkWrite1000                    50          23722019 ns/op
BenchmarkWrite1000-2                  50          24078729 ns/op
BenchmarkWrite1000-4                  50          23667001 ns/op
BenchmarkWrite1000-8                  50          23790221 ns/op
BenchmarkWrite1000-16                 50          23639162 ns/op
BenchmarkEncodingCSVWrite             50          26048187 ns/op
BenchmarkEncodingCSVWrite-2           50          22749519 ns/op
BenchmarkEncodingCSVWrite-4           50          23521142 ns/op
BenchmarkEncodingCSVWrite-8           50          23634894 ns/op
BenchmarkEncodingCSVWrite-16          50          23854300 ns/op

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type OldReader

type OldReader struct {

	// the following are from encoding/csv package and are copied into the underlying csv.Reader
	Comma            rune
	Comment          rune
	FieldsPerRecord  int // we can't implement this without more overhead/synchronization
	LazyQuotes       bool
	TrailingComma    bool
	TrimLeadingSpace bool

	ChunkSize int // the # of lines to hand to each goroutine -- default 50
	// contains filtered or unexported fields
}

OldReader contains all the internals required. Use OldNewReader(io.Reader).

func OldNewReader

func OldNewReader(r io.Reader) *OldReader

OldNewReader returns a new OldReader that reads from r.

func OldNewReaderSized

func OldNewReaderSized(r io.Reader, chunkSize int) *OldReader

OldNewReaderSized returns a new OldReader that reads from r with the given chunk size.

func (*OldReader) Close

func (mcr *OldReader) Close() error

Close will clean up any goroutines that aren't finished. It will also close the underlying Reader if it implements io.ReadCloser.

func (*OldReader) Read

func (mcr *OldReader) Read() ([]string, error)

Read reads one record from r. The record is a slice of strings, each string representing one field. In the background, the internal io.Reader is read ahead of the caller, so rows are already parsed by the time the caller uses Read() to pull them.

func (*OldReader) ReadAll

func (mcr *OldReader) ReadAll() ([][]string, error)

ReadAll reads all the remaining records from r. Each record is a slice of fields. A successful call returns err == nil, not err == EOF. Because ReadAll is defined to read until EOF, it does not treat end of file as an error to be reported.
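
For instance, slurping an entire file with the old API might look like the sketch below (the file name is hypothetical and the import path is assumed):

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/mzimmerman/multicorecsv" // import path assumed
)

func main() {
	f, err := os.Open("data.csv") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	r := multicorecsv.OldNewReader(f)
	defer r.Close() // also closes f, since *os.File is an io.ReadCloser
	records, err := r.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows read:", len(records))
}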

func (*OldReader) Stream

func (mcr *OldReader) Stream() (chan []string, chan error)

Stream returns a chan of []string, each []string representing a row in the CSV file. Rows are sent on the channel in the order they appear in the source file. The caller must receive all rows and then the error from the error chan; otherwise, the caller must call Close to clean up any remaining goroutines.
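
A sketch of consuming Stream might look like this; it assumes the row channel is closed once the source is exhausted and that a final (possibly nil) error is then available on the error chan:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/mzimmerman/multicorecsv" // import path assumed
)

func main() {
	r := multicorecsv.OldNewReader(strings.NewReader("a,b\nc,d\n"))
	rows, errs := r.Stream()
	for row := range rows { // assumes the row channel is closed at end of input
		fmt.Println(row)
	}
	// assumes a final (possibly nil) error is sent after the rows
	if err := <-errs; err != nil {
		log.Fatal(err)
	}
}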

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

func NewReader

func NewReader(rdr io.Reader, size int) *Reader

func (*Reader) Close

func (reader *Reader) Close()

Close cleans up the resources created to read the file multicore style.

func (*Reader) Read

func (reader *Reader) Read() ([]string, error)

Read returns only valid CSV data as read from the source, removing multilines and stutter. If there's an error, all subsequent calls to Read will fail with the same error.

type Writer

type Writer struct {
	Comma     rune // Field delimiter (set to ',' by NewWriter)
	UseCRLF   bool // True to use \r\n as the line terminator
	ChunkSize int  // the # of lines to hand to each goroutine -- default 50
	// contains filtered or unexported fields
}

A Writer writes records to a CSV encoded file.

As returned by NewWriter, a Writer writes records terminated by a newline and uses ',' as the field delimiter. The exported fields can be changed to customize the details before the first call to Write or WriteAll.

Comma is the field delimiter.

If UseCRLF is true, the Writer ends each record with \r\n instead of \n.
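
For instance, a sketch of customizing the delimiter and line terminator before the first Write (the import path is assumed):

package main

import (
	"log"
	"os"

	"github.com/mzimmerman/multicorecsv" // import path assumed
)

func main() {
	w := multicorecsv.NewWriter(os.Stdout)
	defer w.Close()
	w.Comma = '\t'   // switch to tab-separated output
	w.UseCRLF = true // terminate each record with \r\n
	if err := w.Write([]string{"first_name", "last_name"}); err != nil {
		log.Fatal(err)
	}
	w.Flush()
	if err := w.Error(); err != nil {
		log.Fatal(err)
	}
}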

func NewWriter

func NewWriter(iow io.Writer) *Writer

NewWriter returns a new Writer that writes to iow. Close must be called when done.

func NewWriterSized

func NewWriterSized(iow io.Writer, chunkSize int) *Writer

NewWriterSized returns a new Writer that writes to iow with the given chunkSize. Close must be called when done.

func (*Writer) Close

func (mcw *Writer) Close() error

Close cleans up all goroutines and closes the underlying io.Writer if it is also an io.Closer.

func (*Writer) Error

func (mcw *Writer) Error() error

Error reports any error that has occurred during a previous Write or Flush.

func (*Writer) Flush

func (mcw *Writer) Flush()

Flush writes any buffered data to the underlying io.Writer. To check if an error occurred during the Flush, call Error.

func (*Writer) Write

func (mcw *Writer) Write(record []string) (err error)

Write writes a single CSV record to w along with any necessary quoting. A record is a slice of strings, each string being one field.

func (*Writer) WriteAll

func (mcw *Writer) WriteAll(records [][]string) (err error)

WriteAll writes multiple CSV records to w using Write and then calls Flush. Close must still be called after WriteAll to clean up the underlying goroutines.
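
A sketch of WriteAll followed by the required Close (the import path is assumed):

package main

import (
	"log"
	"os"

	"github.com/mzimmerman/multicorecsv" // import path assumed
)

func main() {
	records := [][]string{
		{"first_name", "last_name", "username"},
		{"Rob", "Pike", "rob"},
		{"Ken", "Thompson", "ken"},
	}
	w := multicorecsv.NewWriter(os.Stdout)
	if err := w.WriteAll(records); err != nil { // WriteAll calls Flush itself
		log.Fatal(err)
	}
	if err := w.Close(); err != nil { // still required to clean up the goroutines
		log.Fatal(err)
	}
}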
