parquet

package
v0.0.0-...-7924348 Latest
Warning

This package is not in the latest version of its module.

Published: Sep 4, 2020 License: MIT Imports: 14 Imported by: 1

Documentation

Constants

This section is empty.

Variables

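var (
	// EndOfChunk is returned by ColumnChunkReader.Read and
	// ColumnChunkReader.SkipPage when the column chunk has no more data.
	EndOfChunk = errors.New("EndOfChunk")
)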

Functions

func ReadFileMetaData

func ReadFileMetaData(r io.ReadSeeker) (*parquetformat.FileMetaData, error)

ReadFileMetaData reads a parquetformat.FileMetaData object from r, which provides read access to data in the parquet format.

Parquet format is described here: https://github.com/apache/parquet-format/blob/master/README.md
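To illustrate where the metadata lives, here is a minimal sketch that locates the FileMetaData block using only the standard library. It relies on the file layout from the format README: a file ends with the Thrift-encoded metadata, a 4-byte little-endian metadata length, and the magic bytes "PAR1". The names footerRange and footerRangeOfBytes are hypothetical helpers, not part of this package, and this is not necessarily how ReadFileMetaData is implemented.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"io"
)

// magic is the 4-byte marker at the start and end of a parquet file.
var magic = []byte("PAR1")

// footerRange (hypothetical helper) returns the offset and length of the
// Thrift-encoded FileMetaData block. The file tail is laid out as:
// metadata, 4-byte little-endian metadata length, magic.
func footerRange(r io.ReadSeeker) (offset, length int64, err error) {
	// Seek to 8 bytes before the end: 4 length bytes + 4 magic bytes.
	end, err := r.Seek(-8, io.SeekEnd)
	if err != nil {
		return 0, 0, err
	}
	var tail [8]byte
	if _, err := io.ReadFull(r, tail[:]); err != nil {
		return 0, 0, err
	}
	if !bytes.Equal(tail[4:], magic) {
		return 0, 0, errors.New("not a parquet file: bad trailing magic")
	}
	length = int64(binary.LittleEndian.Uint32(tail[:4]))
	offset = end - length
	// The metadata cannot overlap the leading magic bytes.
	if offset < int64(len(magic)) {
		return 0, 0, errors.New("corrupt parquet file: footer length too large")
	}
	return offset, length, nil
}

// footerRangeOfBytes (hypothetical helper) applies footerRange to
// in-memory data.
func footerRangeOfBytes(data []byte) (int64, int64, error) {
	return footerRange(bytes.NewReader(data))
}
```

The bytes in the returned range would then be decoded with a Thrift compact protocol reader, which is the part ReadFileMetaData takes care of.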

Types

type Column

type Column struct {
	// contains filtered or unexported fields
}

Column contains information about a single column in a parquet file.

func (Column) Index

func (col Column) Index() int

Index returns the 0-based index of col in its schema.

Column chunks in a row group have the same order as columns in the schema.

func (Column) MaxD

func (col Column) MaxD() uint16

MaxD returns the maximum definition level for col.

A read value is not null when its definition level equals the maximum definition level.
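The null check described above can be sketched as follows. The function countNonNull is a hypothetical helper, not part of this package; maxD stands for the value returned by col.MaxD().

```go
package main

// countNonNull (hypothetical helper) reports how many of the given
// definition levels denote a non-null value, that is, how many equal
// the column's maximum definition level.
func countNonNull(dLevels []uint16, maxD uint16) int {
	n := 0
	for _, d := range dLevels {
		if d == maxD {
			n++
		}
	}
	return n
}
```

For a flat optional column maxD is 1, so the levels {1, 0, 1, 1, 0} describe three values and two nulls.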

func (Column) MaxR

func (col Column) MaxR() uint16

MaxR returns the maximum repetition level for col.
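As background on how repetition levels are typically used: in the Dremel encoding that parquet follows, a repetition level of 0 marks the first value of a new record, and higher levels continue a repeated field within the current record. A minimal sketch, with countRecords as a hypothetical helper that is not part of this package:

```go
package main

// countRecords (hypothetical helper) returns the number of records that
// start within rLevels. A repetition level of 0 marks the first value
// of a new record.
func countRecords(rLevels []uint16) int {
	n := 0
	for _, r := range rLevels {
		if r == 0 {
			n++
		}
	}
	return n
}
```

For a column with MaxR() == 0 every value starts a new record, so the count equals the number of levels.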

func (Column) String

func (col Column) String() string

func (Column) Type

func (col Column) Type() parquetformat.Type

Type returns the type of col's values.

type ColumnChunkReader

type ColumnChunkReader struct {
	// contains filtered or unexported fields
}

ColumnChunkReader reads data from a single column chunk of a parquet file.

func (*ColumnChunkReader) DictionaryPageHeader

func (cr *ColumnChunkReader) DictionaryPageHeader() *parquetformat.PageHeader

DictionaryPageHeader returns the DICTIONARY_PAGE page header if the column chunk has one, or nil otherwise.

func (*ColumnChunkReader) PageHeader

func (cr *ColumnChunkReader) PageHeader() *parquetformat.PageHeader

PageHeader returns the PageHeader of the page that is about to be read or is currently being read.

If there was an error reading the last page (including EndOfChunk), PageHeader returns nil.

func (*ColumnChunkReader) Read

func (cr *ColumnChunkReader) Read(values interface{}, dLevels []uint16, rLevels []uint16) (n int, err error)

Read reads up to len(dLevels) values into values and the corresponding definition and repetition levels into dLevels and rLevels respectively. It panics if values, dLevels and rLevels do not all have the same length. It returns the number of values read (including nulls) and any error encountered.

Note that after Read returns, the values slice contains only non-null values. The number of these values can be less than n.
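A sketch of turning such a result back into one slot per value, nulls included. It assumes that the non-null values are packed at the front of the values slice in read order, which the note above suggests but does not state explicitly. The function expandInt32 is a hypothetical helper, not part of this package.

```go
package main

// expandInt32 (hypothetical helper) spreads packed non-null values over
// n slots using the definition levels. The result has one entry per
// level; nil marks a null.
//
// Assumption: the non-null values occupy values[0:k] in read order.
func expandInt32(values []int32, dLevels []uint16, n int, maxD uint16) []*int32 {
	out := make([]*int32, n)
	next := 0 // index of the next unconsumed non-null value
	for i := 0; i < n; i++ {
		if dLevels[i] == maxD {
			v := values[next]
			out[i] = &v
			next++
		}
	}
	return out
}
```

Here n is the count returned by Read and maxD is the value of col.MaxD().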

values must be a []interface{} or a slice of the type that corresponds to the column type (such as []int32 for an INT32 column or [][]byte for a BYTE_ARRAY column).

When there are not enough values in the current page to fill dLevels, Read does not advance to the next page; it returns the number of values read. If this page was the last page in its column chunk and there is no more data to read, it returns the EndOfChunk error.
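The paging behaviour above suggests a read loop along these lines. To keep the sketch self-contained, chunkReader is a stand-in interface that mirrors the documented Read signature, errEndOfChunk stands in for EndOfChunk, and fakeChunk is a toy in-memory reader used only to exercise the loop; none of these names exist in this package. The loop also assumes that non-null values are packed at the front of the values slice and that a later call to Read continues with the next page.

```go
package main

import "errors"

// errEndOfChunk stands in for the package's EndOfChunk variable.
var errEndOfChunk = errors.New("EndOfChunk")

// chunkReader mirrors the documented ColumnChunkReader.Read method.
type chunkReader interface {
	Read(values interface{}, dLevels []uint16, rLevels []uint16) (n int, err error)
}

// readAllInt32 drains an INT32 column chunk in batches and returns its
// non-null values.
func readAllInt32(cr chunkReader, maxD uint16, batch int) ([]int32, error) {
	values := make([]int32, batch)
	dLevels := make([]uint16, batch)
	rLevels := make([]uint16, batch)
	var out []int32
	for {
		n, err := cr.Read(values, dLevels, rLevels)
		// Consume whatever was read before inspecting err.
		next := 0
		for i := 0; i < n; i++ {
			if dLevels[i] == maxD {
				out = append(out, values[next])
				next++
			}
		}
		if errors.Is(err, errEndOfChunk) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
	}
}

// fakeChunk is a toy reader: each page is a list of definition levels
// (maximum level 1), and non-null values count up from 1.
type fakeChunk struct {
	pages [][]uint16
	next  int32
}

// Read never crosses a page boundary, like the documented behaviour.
func (f *fakeChunk) Read(values interface{}, dLevels, rLevels []uint16) (int, error) {
	if len(f.pages) == 0 {
		return 0, errEndOfChunk
	}
	vals := values.([]int32)
	page := f.pages[0]
	n := copy(dLevels, page)
	k := 0
	for _, d := range page[:n] {
		if d == 1 {
			f.next++
			vals[k] = f.next
			k++
		}
	}
	if n == len(page) {
		f.pages = f.pages[1:]
	} else {
		f.pages[0] = page[n:]
	}
	return n, nil
}
```

With the real package, the reader would come from File.NewReader and the sentinel comparison would be against EndOfChunk.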

func (*ColumnChunkReader) SkipPage

func (cr *ColumnChunkReader) SkipPage() error

SkipPage positions cr at the beginning of the next page, skipping all values in the current page.

It returns EndOfChunk if no more data is available.

type File

type File struct {
	MetaData *parquetformat.FileMetaData
	Schema   Schema
	// contains filtered or unexported fields
}

func FileFromReader

func FileFromReader(r io.ReadSeeker) (*File, error)

FileFromReader creates a parquet.File from an io.ReadSeeker.

func OpenFile

func OpenFile(path string) (*File, error)

OpenFile opens a parquet file for reading.

func (*File) Close

func (f *File) Close() error

Close frees up all resources held by f.

func (File) NewReader

func (f File) NewReader(col Column, rg int) (*ColumnChunkReader, error)

NewReader creates a ColumnChunkReader for reading the column chunk of column col in row group rg.

type Int96

type Int96 [12]byte
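The package does not document how an Int96 value should be interpreted. One common convention among parquet writers such as Impala and Hive is to store a timestamp as 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian. The sketch below decodes a value under that assumption; it redeclares Int96 locally to stay self-contained, and int96ToTime is a hypothetical helper that is not part of this package.

```go
package main

import (
	"encoding/binary"
	"time"
)

// Int96 mirrors the package's 12-byte type.
type Int96 [12]byte

// julianUnixEpoch is the Julian day number of 1970-01-01.
const julianUnixEpoch = 2440588

// int96ToTime (hypothetical helper) decodes v as nanoseconds of day
// (8 bytes, little-endian) followed by a Julian day number (4 bytes,
// little-endian). This layout is an assumption about the writer.
func int96ToTime(v Int96) time.Time {
	nanos := int64(binary.LittleEndian.Uint64(v[:8]))
	day := int64(binary.LittleEndian.Uint32(v[8:]))
	secs := (day - julianUnixEpoch) * 86400
	return time.Unix(secs, nanos).UTC()
}
```

Values written by tools that use INT96 for something other than timestamps would need a different decoding.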

type Schema

type Schema struct {
	// contains filtered or unexported fields
}

Schema describes the structure of the data stored in a parquet file.

A Schema can be created from a parquetformat.FileMetaData. The information stored in the RowGroups part of FileMetaData is not needed to create the schema.

TODO(ksh): provide a way to read FileMetaData without RowGroups.

FileMetaData is usually read from the same file as the data. When the data is split across multiple parquet files, the metadata can be stored in a separate file, which is usually called "_common_metadata".

func MakeSchema

func MakeSchema(meta *parquetformat.FileMetaData) (Schema, error)

MakeSchema creates a Schema from meta.

func (Schema) ColumnByName

func (s Schema) ColumnByName(name string) (col Column, found bool)

ColumnByName returns the Column with the given name (individual path elements are separated by ".").

func (Schema) ColumnByPath

func (s Schema) ColumnByPath(path []string) (col Column, found bool)

ColumnByPath returns the Column for the given path.
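The two lookups presumably relate through the "." separator: a dotted name corresponds to a path with one element per segment. A minimal sketch of that conversion, with pathFromName as a hypothetical helper that is not part of this package:

```go
package main

import "strings"

// pathFromName (hypothetical helper) splits a dotted column name into
// the path elements that ColumnByPath expects.
func pathFromName(name string) []string {
	return strings.Split(name, ".")
}
```

Under this assumption, s.ColumnByName("a.b.c") and s.ColumnByPath([]string{"a", "b", "c"}) would refer to the same column. Building the path directly is the safer choice if a field name could itself contain a ".".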

func (Schema) Columns

func (s Schema) Columns() []Column

Columns returns all columns defined in s.

func (Schema) DisplayString

func (s Schema) DisplayString() string

DisplayString returns a string representation of s in a textual format similar to the one described in the Dremel paper and used by the parquet-mr project.
