utils

package
v8.0.1
Warning: This package is not in the latest version of its module.
Published: Jul 14, 2022 | License: Apache-2.0, BSD-2-Clause, BSD-3-Clause, + 8 more | Imports: 18 | Imported by: 0

Documentation

Overview

Package utils contains internal utilities for the parquet library that are not intended to be exposed to external consumers, such as interfaces, bitmap readers and writers, and the RLE encoder and decoder.

Constants

const (
	MaxIndexType = math.MaxInt32
	MinIndexType = math.MinInt32
)

Max and Min constants for the IndexType

const (
	MaxValuesPerLiteralRun = (1 << 6) * 8
)

Variables

This section is empty.

Functions

func BytesToBools

func BytesToBools(in []byte, out []bool)

BytesToBools efficiently populates a slice of booleans from an input bitmap
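
As an illustration of what expanding a bitmap into booleans involves, here is a minimal sketch that stays within the standard library. The name bitmapToBools is hypothetical, and the sketch assumes least-significant-bit-first ordering within each byte; it is a plain loop, not the optimized routine this package provides.

```go
package main

import "fmt"

// bitmapToBools is a hypothetical stand-in for BytesToBools. It expands a
// bitmap into out, assuming bit i lives in byte i/8 at position i%8 (LSB first).
// It stops early if the bitmap runs out before out is full.
func bitmapToBools(in []byte, out []bool) {
	for i := range out {
		byteIdx := i / 8
		if byteIdx >= len(in) {
			return
		}
		out[i] = in[byteIdx]&(1<<uint(i%8)) != 0
	}
}

func main() {
	out := make([]bool, 10)
	bitmapToBools([]byte{0b00000101, 0b00000010}, out)
	fmt.Println(out) // bits 0, 2 and 9 are set
}
```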

func MaxBufferSize

func MaxBufferSize(width, numValues int) int

func MinBufferSize

func MinBufferSize(bitWidth int) int

func VisitBitBlocks

func VisitBitBlocks(bitmap []byte, offset, length int64, visitValid func(pos int64), visitInvalid func())

VisitBitBlocks is a utility for easily iterating through the blocks of bits in a bitmap, calling the appropriate visitValid/visitInvalid function as we iterate through the bits. visitValid is called with the bit offset of the valid bit. Avoid it inside tight, performance-sensitive loops; construct the loop manually in that scenario.
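
The callback contract can be sketched with a bit-at-a-time loop. The name visitBits is hypothetical, the sketch assumes LSB-first bit order, and it assumes pos is counted from the start of the visited range (the description above does not say whether the position is relative or absolute). It does no block counting, so it shows the shape of the API rather than its performance characteristics.

```go
package main

import "fmt"

// visitBits is a hypothetical, unoptimized illustration of the
// visitValid/visitInvalid contract. Assumption: pos is relative to offset.
func visitBits(bitmap []byte, offset, length int64, visitValid func(pos int64), visitInvalid func()) {
	for pos := int64(0); pos < length; pos++ {
		bit := offset + pos
		if bitmap[bit/8]&(1<<uint(bit%8)) != 0 {
			visitValid(pos)
		} else {
			visitInvalid()
		}
	}
}

func main() {
	var valid []int64
	nulls := 0
	// Bits 2, 4, 5 and 7 of the single byte are set; visit bits 2 through 7.
	visitBits([]byte{0b10110100}, 2, 6,
		func(pos int64) { valid = append(valid, pos) },
		func() { nulls++ })
	fmt.Println(valid, nulls)
}
```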

Types

type BitBlockCount

type BitBlockCount struct {
	Len    int16
	Popcnt int16
}

BitBlockCount is returned by the various bit block counter utilities in order to return a length of bits and the population count of that slice of bits.

func (BitBlockCount) AllSet

func (b BitBlockCount) AllSet() bool

AllSet returns true if all the bits in this block were 1, i.e. Popcnt == Len.

func (BitBlockCount) NoneSet

func (b BitBlockCount) NoneSet() bool

NoneSet returns true if all the bits in this block were 0, i.e. Popcnt == 0.

type BitBlockCounter

type BitBlockCounter struct {
	// contains filtered or unexported fields
}

BitBlockCounter is a utility for grabbing chunks of a bitmap at a time and efficiently counting the number of bits which are 1.

func NewBitBlockCounter

func NewBitBlockCounter(bitmap []byte, startOffset, nbits int64) *BitBlockCounter

NewBitBlockCounter returns a BitBlockCounter for the passed bitmap starting at startOffset of length nbits.

func (*BitBlockCounter) NextFourWords

func (b *BitBlockCounter) NextFourWords() BitBlockCount

NextFourWords returns the next run of available bits, usually 256. The returned pair contains the size of the run and the number of true values. The last block will have a length less than 256 if the bitmap length is not a multiple of 256; subsequent invocations return 0-length blocks.

func (*BitBlockCounter) NextWord

func (b *BitBlockCounter) NextWord() BitBlockCount

NextWord returns the next run of available bits, usually 64. The returned pair contains the size of the run and the number of true values. The last block will have a length less than 64 if the bitmap length is not a multiple of 64; subsequent invocations return 0-length blocks.
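
The word-at-a-time counting described for NextWord can be sketched with math/bits. The types blockCount and wordCounter are hypothetical, and the sketch assumes the bitmap starts at bit 0, is LSB-first, and holds at least (nbits+7)/8 bytes; the real counter also handles arbitrary start offsets.

```go
package main

import (
	"fmt"
	"math/bits"
)

// blockCount mirrors the Len/Popcnt pair of BitBlockCount (hypothetical name).
type blockCount struct {
	Len    int16
	Popcnt int16
}

// wordCounter is a hypothetical, simplified counter that starts at bit 0.
type wordCounter struct {
	bitmap []byte
	pos    int64 // next unread bit; a multiple of 64 until the end is reached
	nbits  int64
}

// nextWord returns the length and population count of the next run of up to
// 64 bits, and zero-length blocks once the bitmap is exhausted.
func (c *wordCounter) nextWord() blockCount {
	n := c.nbits - c.pos
	if n <= 0 {
		return blockCount{}
	}
	if n > 64 {
		n = 64
	}
	// Assemble the bytes covering this run into one little-endian word.
	var word uint64
	start := c.pos / 8
	for i := int64(0); i < (n+7)/8; i++ {
		word |= uint64(c.bitmap[start+i]) << uint(8*i)
	}
	// Mask off any bits past the logical end of the bitmap.
	if n < 64 {
		word &= uint64(1)<<uint(n) - 1
	}
	c.pos += n
	return blockCount{Len: int16(n), Popcnt: int16(bits.OnesCount64(word))}
}

func main() {
	bitmap := []byte{0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0b00000101}
	c := &wordCounter{bitmap: bitmap, nbits: 70}
	fmt.Println(c.nextWord(), c.nextWord(), c.nextWord())
}
```

With 70 bits, the first call covers a full 64-bit word, the second covers the remaining 6 bits, and later calls return empty blocks.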

type BitReader

type BitReader struct {
	// contains filtered or unexported fields
}

BitReader implements functionality for reading bits or bytes, buffering up to a uint64 at a time from the reader to improve efficiency. It also provides methods to read multiple bytes in one read, such as encoded ints/values.

This BitReader is the basis for the other utilities, such as the RLE decoder, providing the functions they need to interpret the values.

func NewBitReader

func NewBitReader(r reader) *BitReader

NewBitReader takes in a reader that implements io.Reader, io.ReaderAt and io.Seeker interfaces and returns a BitReader for use with various bit level manipulations.

func (*BitReader) CurOffset

func (b *BitReader) CurOffset() int64

CurOffset returns the current byte offset into the data that the reader is at.

func (*BitReader) GetAligned

func (b *BitReader) GetAligned(nbytes int, v interface{}) bool

GetAligned reads nbytes from the underlying stream into the passed interface value. It returns false if there aren't enough bytes remaining in the stream or if an invalid type is passed. The bytes are read aligned to byte boundaries.

v must be a pointer to a byte or sized uint type (*byte, *uint16, *uint32, *uint64). Encoded values are assumed to be little endian.

func (*BitReader) GetBatch

func (b *BitReader) GetBatch(bits uint, out []uint64) (int, error)

GetBatch fills out by repeatedly decoding values from the stream, each encoded using bits as the number of bits per value. The values are expected to be bit-packed and are unpacked to populate out.

func (*BitReader) GetBatchBools

func (b *BitReader) GetBatchBools(out []bool) (int, error)

GetBatchBools is like GetBatch but optimized for reading bits as boolean values

func (*BitReader) GetBatchIndex

func (b *BitReader) GetBatchIndex(bits uint, out []IndexType) (i int, err error)

GetBatchIndex is like GetBatch but for IndexType (used for dictionary decoding)

func (*BitReader) GetValue

func (b *BitReader) GetValue(width int) (uint64, bool)

GetValue returns a single value that is bit packed using width as the number of bits and returns false if there weren't enough bits remaining.
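
To make "bit packed using width as the number of bits" concrete, here is a minimal sketch of reading fixed-width values from an in-memory buffer. The type bitCursor is hypothetical and the sketch assumes values are packed LSB-first; it reads one bit at a time and does none of the uint64 buffering that BitReader is described as doing.

```go
package main

import "fmt"

// bitCursor is a hypothetical reader over an in-memory, bit-packed buffer.
type bitCursor struct {
	buf    []byte
	bitPos int // index of the next unread bit
}

// getValue returns the next width-bit value, assuming LSB-first packing.
// It returns false if fewer than width bits remain.
func (c *bitCursor) getValue(width int) (uint64, bool) {
	if width < 0 || width > 64 || c.bitPos+width > len(c.buf)*8 {
		return 0, false
	}
	var v uint64
	for i := 0; i < width; i++ {
		p := c.bitPos + i
		bit := uint64(c.buf[p/8]>>uint(p%8)) & 1
		v |= bit << uint(i)
	}
	c.bitPos += width
	return v, true
}

func main() {
	// 0xD1, 0x58 hold the values 1, 2, 3, 4, 5 packed at 3 bits each.
	c := &bitCursor{buf: []byte{0xD1, 0x58}}
	for {
		v, ok := c.getValue(3)
		if !ok {
			break
		}
		fmt.Println(v)
	}
}
```

Two bytes hold 16 bits, so five 3-bit values fit and the sixth read reports false.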

func (*BitReader) GetVlqInt

func (b *BitReader) GetVlqInt() (uint64, bool)

GetVlqInt reads a VLQ-encoded int from the stream. The encoded value must start at the beginning of a byte, and this returns false if there weren't enough bytes in the buffer or reader. This calls `ReadByte`, which in turn retrieves byte-aligned values from the reader.
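
For readers unfamiliar with the encoding, the following sketch round-trips a value through the unsigned varint routines in encoding/binary. It assumes the VLQ format meant here is the same little-endian base-128 varint that encoding/binary implements; the helper names encodeVlq and decodeVlq are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeVlq returns v as a base-128 varint (7 payload bits per byte, with the
// high bit marking continuation). Hypothetical helper.
func encodeVlq(v uint64) []byte {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, v)
	return buf[:n]
}

// decodeVlq decodes a varint from the start of b and returns false if b is
// truncated or the value overflows 64 bits. Hypothetical helper.
func decodeVlq(b []byte) (uint64, bool) {
	v, n := binary.Uvarint(b)
	return v, n > 0
}

func main() {
	enc := encodeVlq(300)
	fmt.Printf("% x\n", enc) // ac 02
	v, ok := decodeVlq(enc)
	fmt.Println(v, ok) // 300 true
}
```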

func (*BitReader) GetZigZagVlqInt

func (b *BitReader) GetZigZagVlqInt() (int64, bool)

GetZigZagVlqInt reads a zigzag encoded integer, returning false if there weren't enough bytes remaining.
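
Zigzag encoding maps signed integers to unsigned ones so that values of small magnitude, positive or negative, produce short varints. A minimal sketch of the mapping follows; the function names are hypothetical and are not part of this package.

```go
package main

import "fmt"

// zigZagEncode interleaves negative and non-negative values:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, and so on.
func zigZagEncode(v int64) uint64 {
	return uint64(v<<1) ^ uint64(v>>63)
}

// zigZagDecode inverts zigZagEncode.
func zigZagDecode(u uint64) int64 {
	return int64(u>>1) ^ -int64(u&1)
}

func main() {
	for _, v := range []int64{0, -1, 1, -2, 2} {
		u := zigZagEncode(v)
		fmt.Println(v, u, zigZagDecode(u))
	}
}
```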

func (*BitReader) ReadByte

func (b *BitReader) ReadByte() (byte, error)

ReadByte reads a single aligned byte from the underlying stream, or returns an error if there aren't enough bytes left.

func (*BitReader) Reset

func (b *BitReader) Reset(r reader)

Reset allows reusing a BitReader by setting a new reader and resetting the internal state back to zeros.

type BitWriter

type BitWriter struct {
	// contains filtered or unexported fields
}

BitWriter is a utility for writing values of specific bit widths to a stream using a uint64 as a buffer to build up between flushing for efficiency.

func NewBitWriter

func NewBitWriter(w io.WriterAt) *BitWriter

NewBitWriter initializes a new bit writer to write to the passed in interface using WriteAt to write the appropriate offsets and values.

func (*BitWriter) Clear

func (b *BitWriter) Clear()

Clear resets the writer so that subsequent writes will start from offset 0, allowing reuse of the underlying buffer and writer.

func (*BitWriter) Flush

func (b *BitWriter) Flush(align bool)

Flush writes any buffered data to the underlying writer. Pass true if the next write should be byte-aligned after this flush.

func (*BitWriter) ReserveBytes

func (b *BitWriter) ReserveBytes(nbytes int) int

ReserveBytes reserves the next aligned nbytes, skipping them and returning the offset to use with WriteAt to write to those reserved bytes. Used for RLE encoding to fill in the indicators after encoding.

func (*BitWriter) WriteAligned

func (b *BitWriter) WriteAligned(val uint64, nbytes int) bool

WriteAligned writes the value val as a little endian value in exactly nbytes byte-aligned to the underlying writer, flushing via Flush(true) before writing nbytes without buffering.

func (*BitWriter) WriteAt

func (b *BitWriter) WriteAt(val []byte, off int64) (int, error)

WriteAt fulfills the io.WriterAt interface to write len(val) bytes from val to the underlying byte slice starting at offset off. It returns the number of bytes written from val (0 <= n <= len(val)) and any error encountered. This allows writing full bytes directly to the underlying writer.

func (*BitWriter) WriteValue

func (b *BitWriter) WriteValue(v uint64, nbits uint) error

WriteValue writes the value v using nbits to pack it, returning an error if it fails for some reason.
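
The packing that WriteValue performs can be pictured with a small sketch that packs a whole slice at once. The name packBits is hypothetical, the sketch assumes LSB-first packing, and rejecting values that do not fit in width bits is a choice made here for illustration, not documented behavior of this package.

```go
package main

import "fmt"

// packBits packs each value into width bits, LSB-first, with no padding
// between values. Hypothetical helper; it does not buffer through a uint64.
func packBits(values []uint64, width uint) ([]byte, error) {
	if width > 64 {
		return nil, fmt.Errorf("width %d exceeds 64 bits", width)
	}
	out := make([]byte, (len(values)*int(width)+7)/8)
	bitPos := 0
	for _, v := range values {
		if width < 64 && v>>width != 0 {
			return nil, fmt.Errorf("value %d does not fit in %d bits", v, width)
		}
		for i := uint(0); i < width; i++ {
			if (v>>i)&1 == 1 {
				out[bitPos/8] |= 1 << uint(bitPos%8)
			}
			bitPos++
		}
	}
	return out, nil
}

func main() {
	packed, err := packBits([]uint64{1, 2, 3, 4, 5}, 3)
	fmt.Printf("% x %v\n", packed, err) // d1 58 <nil>
}
```

Five 3-bit values occupy 15 bits, so the result is two bytes with the final bit left as zero padding.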

func (*BitWriter) WriteVlqInt

func (b *BitWriter) WriteVlqInt(v uint64) bool

WriteVlqInt writes v as a vlq encoded integer byte-aligned to the underlying writer without buffering.

func (*BitWriter) WriteZigZagVlqInt

func (b *BitWriter) WriteZigZagVlqInt(v int64) bool

WriteZigZagVlqInt writes a zigzag encoded integer byte-aligned to the underlying writer without buffering.

func (*BitWriter) Written

func (b *BitWriter) Written() int

Written returns the number of bytes that have been written to the BitWriter, not how many bytes have been flushed. Use Flush to ensure that all data is flushed to the underlying writer.

type BitmapWriter

type BitmapWriter interface {
	// Set sets the current bit that will be written
	Set()
	// Clear clears the current bit that will be written
	Clear()
	// Next advances to the next bit for the writer
	Next()
	// Finish flushes the current byte out to the bitmap slice
	Finish()
	// AppendWord takes nbits from word which should be an LSB bitmap and appends them to the bitmap.
	AppendWord(word uint64, nbits int64)
	// AppendBools appends the bit representation of the bools slice, returning the number
	// of bools that were able to fit in the remaining length of the bitmapwriter.
	AppendBools(in []bool) int
	// Pos is the current position that will be written next
	Pos() int
	// Reset allows reusing the bitmapwriter by resetting Pos to start with length as
	// the number of bits that the writer can write.
	Reset(start, length int)
}

BitmapWriter is an interface for bitmap writers so that we can use multiple implementations or swap if necessary.

func NewBitmapWriter

func NewBitmapWriter(bitmap []byte, start, length int) BitmapWriter

func NewFirstTimeBitmapWriter

func NewFirstTimeBitmapWriter(buf []byte, start, length int64) BitmapWriter

NewFirstTimeBitmapWriter creates a bitmap writer that may clobber any bit values following the bits written to the bitmap. Because it does not preserve those bits, it is faster than the bitmap writer created with NewBitmapWriter.

type DictionaryConverter

type DictionaryConverter interface {
	// Copy takes an interface{} which must be a slice of the appropriate type, and will be populated
	// by the dictionary values at the indexes from the IndexType slice
	Copy(interface{}, []IndexType) error
	// Fill fills interface{} which must be a slice of the appropriate type, with the value
	// specified by the dictionary index passed in.
	Fill(interface{}, IndexType) error
	// FillZero fills interface{}, which must be a slice of the appropriate type, with the zero value
	// for the given type.
	FillZero(interface{})
	// IsValid validates that all of the indexes passed in are valid indexes for the dictionary
	IsValid(...IndexType) bool
}

DictionaryConverter is an interface used for dealing with RLE decoding and encoding when working with dictionaries to get values from indexes.

type IndexType

type IndexType = int32

IndexType is the type we're going to use for Dictionary indexes, currently an alias to int32

type OptionalBitBlockCounter

type OptionalBitBlockCounter struct {
	// contains filtered or unexported fields
}

OptionalBitBlockCounter is a useful counter to iterate through a possibly non-existent validity bitmap to allow us to write one code path for both the with-nulls and no-nulls cases without giving up a lot of performance.

func NewOptionalBitBlockCounter

func NewOptionalBitBlockCounter(bitmap []byte, offset, length int64) *OptionalBitBlockCounter

NewOptionalBitBlockCounter constructs and returns a new bit block counter that can properly handle the case when the bitmap is nil. If the bitmap is guaranteed not to be nil, prefer NewBitBlockCounter.

func (*OptionalBitBlockCounter) NextBlock

func (obc *OptionalBitBlockCounter) NextBlock() BitBlockCount

NextBlock returns the block count for the next word when the bitmap is available. When there is no validity bitmap (so all the referenced values are non-null), it returns a block with length up to math.MaxInt16.

func (*OptionalBitBlockCounter) NextWord

func (obc *OptionalBitBlockCounter) NextWord() BitBlockCount

NextWord is like NextBlock, but returns a word-sized block even when there is no validity bitmap

type RleDecoder

type RleDecoder struct {
	// contains filtered or unexported fields
}

func NewRleDecoder

func NewRleDecoder(data *bytes.Reader, width int) *RleDecoder

func (*RleDecoder) GetBatch

func (r *RleDecoder) GetBatch(values []uint64) int

func (*RleDecoder) GetBatchSpaced

func (r *RleDecoder) GetBatchSpaced(vals []uint64, nullcount int, validBits []byte, validBitsOffset int64) (int, error)

func (*RleDecoder) GetBatchWithDict

func (r *RleDecoder) GetBatchWithDict(dc DictionaryConverter, vals interface{}) (int, error)

func (*RleDecoder) GetBatchWithDictByteArray

func (r *RleDecoder) GetBatchWithDictByteArray(dc DictionaryConverter, vals []parquet.ByteArray) (int, error)

func (*RleDecoder) GetBatchWithDictFixedLenByteArray

func (r *RleDecoder) GetBatchWithDictFixedLenByteArray(dc DictionaryConverter, vals []parquet.FixedLenByteArray) (int, error)

func (*RleDecoder) GetBatchWithDictFloat32

func (r *RleDecoder) GetBatchWithDictFloat32(dc DictionaryConverter, vals []float32) (int, error)

func (*RleDecoder) GetBatchWithDictFloat64

func (r *RleDecoder) GetBatchWithDictFloat64(dc DictionaryConverter, vals []float64) (int, error)

func (*RleDecoder) GetBatchWithDictInt32

func (r *RleDecoder) GetBatchWithDictInt32(dc DictionaryConverter, vals []int32) (int, error)

func (*RleDecoder) GetBatchWithDictInt64

func (r *RleDecoder) GetBatchWithDictInt64(dc DictionaryConverter, vals []int64) (int, error)

func (*RleDecoder) GetBatchWithDictInt96

func (r *RleDecoder) GetBatchWithDictInt96(dc DictionaryConverter, vals []parquet.Int96) (int, error)

func (*RleDecoder) GetBatchWithDictSpaced

func (r *RleDecoder) GetBatchWithDictSpaced(dc DictionaryConverter, vals interface{}, nullCount int, validBits []byte, validBitsOffset int64) (int, error)

func (*RleDecoder) GetBatchWithDictSpacedByteArray

func (r *RleDecoder) GetBatchWithDictSpacedByteArray(dc DictionaryConverter, vals []parquet.ByteArray, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFixedLenByteArray

func (r *RleDecoder) GetBatchWithDictSpacedFixedLenByteArray(dc DictionaryConverter, vals []parquet.FixedLenByteArray, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFloat32

func (r *RleDecoder) GetBatchWithDictSpacedFloat32(dc DictionaryConverter, vals []float32, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedFloat64

func (r *RleDecoder) GetBatchWithDictSpacedFloat64(dc DictionaryConverter, vals []float64, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt32

func (r *RleDecoder) GetBatchWithDictSpacedInt32(dc DictionaryConverter, vals []int32, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt64

func (r *RleDecoder) GetBatchWithDictSpacedInt64(dc DictionaryConverter, vals []int64, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetBatchWithDictSpacedInt96

func (r *RleDecoder) GetBatchWithDictSpacedInt96(dc DictionaryConverter, vals []parquet.Int96, nullCount int, validBits []byte, validBitsOffset int64) (totalProcessed int, err error)

func (*RleDecoder) GetValue

func (r *RleDecoder) GetValue() (uint64, bool)

func (*RleDecoder) Next

func (r *RleDecoder) Next() bool

func (*RleDecoder) Reset

func (r *RleDecoder) Reset(data *bytes.Reader, width int)

type RleEncoder

type RleEncoder struct {
	BitWidth int
	// contains filtered or unexported fields
}

func NewRleEncoder

func NewRleEncoder(w io.WriterAt, width int) *RleEncoder

func (*RleEncoder) Clear

func (r *RleEncoder) Clear()

func (*RleEncoder) Flush

func (r *RleEncoder) Flush() int

func (*RleEncoder) Put

func (r *RleEncoder) Put(value uint64) error

Put buffers input values 8 at a time. After seeing all 8 values, it decides whether they should be encoded as a literal or repeated run.
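
The group-of-8 decision can be illustrated with a deliberately simplified planner. Everything here is hypothetical (the run type, planRuns, and the rule that a group becomes a repeated run only when all 8 values are equal); the real encoder tracks more state, such as run-length limits and partial trailing groups, which this sketch ignores.

```go
package main

import "fmt"

// run is a hypothetical description of one planned run.
type run struct {
	Repeated bool
	Value    uint64   // set for repeated runs
	Count    int      // number of values covered by the run
	Literals []uint64 // set for literal runs
}

// planRuns walks values in groups of 8 and classifies each group as a
// repeated or literal run, merging adjacent runs of the same kind.
// Assumption: a trailing group of fewer than 8 values is ignored.
func planRuns(values []uint64) []run {
	var runs []run
	for i := 0; i+8 <= len(values); i += 8 {
		group := values[i : i+8]
		same := true
		for _, v := range group[1:] {
			if v != group[0] {
				same = false
				break
			}
		}
		last := len(runs) - 1
		switch {
		case same && last >= 0 && runs[last].Repeated && runs[last].Value == group[0]:
			runs[last].Count += 8
		case same:
			runs = append(runs, run{Repeated: true, Value: group[0], Count: 8})
		case last >= 0 && !runs[last].Repeated:
			runs[last].Literals = append(runs[last].Literals, group...)
			runs[last].Count += 8
		default:
			runs = append(runs, run{Literals: append([]uint64(nil), group...), Count: 8})
		}
	}
	return runs
}

func main() {
	values := make([]uint64, 0, 24)
	for i := 0; i < 16; i++ {
		values = append(values, 7)
	}
	for i := uint64(1); i <= 8; i++ {
		values = append(values, i)
	}
	for _, r := range planRuns(values) {
		fmt.Printf("repeated=%v count=%d\n", r.Repeated, r.Count)
	}
}
```

Sixteen identical values followed by eight distinct ones yield one repeated run of 16 and one literal run of 8.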

type TellWrapper

type TellWrapper struct {
	io.Writer
	// contains filtered or unexported fields
}

TellWrapper wraps any io.Writer to add a Tell function that tracks the position based on calls to Write. It does not take into account any calls to Seek or any Writes that don't go through the TellWrapper
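
The position tracking described above amounts to counting the bytes that pass through Write. Below is a minimal sketch of that idea; tellWriter and tellAfter are hypothetical names, and the sketch is not the package's implementation.

```go
package main

import (
	"fmt"
	"io"
)

// tellWriter is a hypothetical counterpart to TellWrapper.
type tellWriter struct {
	io.Writer
	pos int64
}

// Write forwards to the wrapped writer and advances the tracked position by
// the number of bytes actually written.
func (t *tellWriter) Write(p []byte) (int, error) {
	n, err := t.Writer.Write(p)
	t.pos += int64(n)
	return n, err
}

// Tell reports how many bytes have gone through this wrapper.
func (t *tellWriter) Tell() int64 { return t.pos }

// Close closes the wrapped writer only if it implements io.Closer.
func (t *tellWriter) Close() error {
	if c, ok := t.Writer.(io.Closer); ok {
		return c.Close()
	}
	return nil
}

// tellAfter writes each part to io.Discard through a tellWriter and returns
// the position reported afterwards.
func tellAfter(parts ...string) int64 {
	tw := &tellWriter{Writer: io.Discard}
	for _, p := range parts {
		if _, err := io.WriteString(tw, p); err != nil {
			break
		}
	}
	return tw.Tell()
}

func main() {
	fmt.Println(tellAfter("hello ", "world")) // 11
}
```

Because the count only advances inside Write, bytes sent directly to the wrapped writer are invisible to Tell, which matches the caveat above.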

func (*TellWrapper) Close

func (w *TellWrapper) Close() error

Close makes TellWrapper an io.Closer so that calling Close will also call Close on the wrapped writer if it has a Close function.

func (*TellWrapper) Tell

func (w *TellWrapper) Tell() int64

func (*TellWrapper) Write

func (w *TellWrapper) Write(p []byte) (n int, err error)

type WriteCloserTell

type WriteCloserTell interface {
	io.WriteCloser
	Tell() int64
}

WriteCloserTell is an interface adding a Tell function to a WriteCloser so if the underlying writer has a Close function, it is exposed and not hidden.

type WriterAtBuffer

type WriterAtBuffer struct {
	// contains filtered or unexported fields
}

WriterAtBuffer is a convenience struct for providing a WriteAt function to a byte slice for use with things that want an io.WriterAt

func (*WriterAtBuffer) Len

func (w *WriterAtBuffer) Len() int

Len returns the length of the underlying byte slice.

func (*WriterAtBuffer) WriteAt

func (w *WriterAtBuffer) WriteAt(p []byte, off int64) (n int, err error)

WriteAt fulfills the io.WriterAt interface to write len(p) bytes from p to the underlying byte slice starting at offset off. It returns the number of bytes written from p (0 <= n <= len(p)) and any error encountered.
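
A byte slice can satisfy io.WriterAt with very little code. The sketch below uses a hypothetical sliceWriterAt type and assumes a fixed-size buffer that never grows, returning io.ErrShortWrite when the data does not fit; whether the real WriterAtBuffer behaves this way at the end of the slice is not stated here.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// sliceWriterAt is a hypothetical fixed-size io.WriterAt over a byte slice.
type sliceWriterAt struct{ buf []byte }

// WriteAt copies p into the slice at off. It never grows the slice, so a
// write that runs past the end is truncated and reported as a short write.
func (s *sliceWriterAt) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off > int64(len(s.buf)) {
		return 0, errors.New("sliceWriterAt: offset out of range")
	}
	n := copy(s.buf[off:], p)
	if n < len(p) {
		return n, io.ErrShortWrite
	}
	return n, nil
}

// Len returns the length of the underlying slice.
func (s *sliceWriterAt) Len() int { return len(s.buf) }

// Compile-time check that the sketch satisfies io.WriterAt.
var _ io.WriterAt = (*sliceWriterAt)(nil)

func main() {
	w := &sliceWriterAt{buf: make([]byte, 4)}
	n, err := w.WriteAt([]byte{1, 2}, 1)
	fmt.Println(n, err, w.buf) // 2 <nil> [0 1 2 0]
}
```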

type WriterAtWithLen

type WriterAtWithLen interface {
	io.WriterAt
	Len() int
}

WriterAtWithLen is an interface for an io.WriterAt with a Len function

func NewWriterAtBuffer

func NewWriterAtBuffer(buf []byte) WriterAtWithLen

NewWriterAtBuffer returns an object which fulfills the io.WriterAt interface by taking ownership of the passed in slice.

type WriterTell

type WriterTell interface {
	io.Writer
	Tell() int64
}

WriterTell is an interface that adds a Tell function to an io.Writer
