parquetschema

package
v0.12.0
Published: Aug 18, 2022 License: Apache-2.0 Imports: 11 Imported by: 17

Documentation

Overview

Package parquetschema contains functions and data types to manage schema definitions for the parquet-go package. Most importantly, it provides a schema definition parser that turns a textual representation of a parquet schema into a SchemaDefinition object.

To let users define parquet schemas in other ways, this package also exposes the necessary data types. Users can assemble their own SchemaDefinition object programmatically.

To construct a schema definition, start with a SchemaDefinition object and set its RootColumn field to a ColumnDefinition. This "root column" describes the whole message. The root column has no type of its own, so its SchemaElement can be left unset. Inside the root column definition, populate Children. For each child, set the SchemaElement, and then either SchemaElement.Type or the child's own Children, but not both. The reason is as follows: if no type is set, the column is a group consisting of its children, and a group without children is nonsensical. If a type is set, the field is of that particular type and therefore cannot have any children.
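The group-versus-field rule above can be sketched with a few lines of Go. This is a minimal illustration only: schemaElement and columnDefinition are hypothetical stand-ins that mirror the documented shape of the real types (the real parquet.SchemaElement carries more fields), and checkRoot, checkColumn, leaf and group are made-up helpers, not part of this package.

```go
package main

import (
	"errors"
	"fmt"
)

// schemaElement is a hypothetical stand-in for parquet.SchemaElement,
// reduced to the two fields this sketch needs.
type schemaElement struct {
	Name string
	Type *string // nil means "no type set", so the column is a group
}

// columnDefinition mirrors the documented shape of ColumnDefinition.
type columnDefinition struct {
	Children      []*columnDefinition
	SchemaElement *schemaElement
}

// checkColumn applies the rule from the text to one non-root column and,
// recursively, to its children: a column has either a type or children.
func checkColumn(c *columnDefinition) error {
	if c.SchemaElement == nil {
		return errors.New("non-root column has no schema element")
	}
	hasType := c.SchemaElement.Type != nil
	hasChildren := len(c.Children) > 0
	switch {
	case hasType && hasChildren:
		return fmt.Errorf("column %q: a typed field cannot have children", c.SchemaElement.Name)
	case !hasType && !hasChildren:
		return fmt.Errorf("column %q: a group needs children", c.SchemaElement.Name)
	}
	for _, child := range c.Children {
		if err := checkColumn(child); err != nil {
			return err
		}
	}
	return nil
}

// checkRoot checks only the children, because the root column may leave
// its SchemaElement unset.
func checkRoot(root *columnDefinition) error {
	for _, child := range root.Children {
		if err := checkColumn(child); err != nil {
			return err
		}
	}
	return nil
}

// leaf builds a typed field without children.
func leaf(name, typ string) *columnDefinition {
	return &columnDefinition{SchemaElement: &schemaElement{Name: name, Type: &typ}}
}

// group builds an untyped column that consists of its children.
func group(name string, children ...*columnDefinition) *columnDefinition {
	return &columnDefinition{SchemaElement: &schemaElement{Name: name}, Children: children}
}

func main() {
	good := &columnDefinition{Children: []*columnDefinition{
		leaf("id", "int64"),
		group("address", leaf("city", "binary")),
	}}
	fmt.Println("good:", checkRoot(good))

	bad := &columnDefinition{Children: []*columnDefinition{group("empty")}}
	fmt.Println("bad:", checkRoot(bad))
}
```

With the real types, the same tree would be built from ColumnDefinition and parquet.SchemaElement values and checked with the package's own Validate method rather than a hand-written check.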

To ensure that schema definitions constructed by means other than the schema parser are sound and complete, call the Validate() method on the SchemaDefinition. It checks the general soundness of the data types that are set, the overall structure (types vs. groups), and, where logical types or converted types are used, whether the elements using them adhere to the conventions laid out in the parquet documentation. You can find this documentation here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Types

type ColumnDefinition

type ColumnDefinition struct {
	Children      []*ColumnDefinition
	SchemaElement *parquet.SchemaElement
}

ColumnDefinition represents the schema definition of a column and optionally its children.

type SchemaDefinition

type SchemaDefinition struct {
	RootColumn *ColumnDefinition
}

SchemaDefinition represents a valid textual schema definition.

func ParseSchemaDefinition

func ParseSchemaDefinition(schemaText string) (*SchemaDefinition, error)

ParseSchemaDefinition parses a textual schema definition and returns a SchemaDefinition object, or an error if parsing has failed. The textual schema definition needs to adhere to the following grammar:

message ::= 'message' <identifier> '{' <message-body> '}'
message-body ::= <column-definition>*
column-definition ::= <repetition-type> <column-type-definition>
repetition-type ::= 'required' | 'repeated' | 'optional'
column-type-definition ::= <group-definition> | <field-definition>
group-definition ::= 'group' <identifier> <converted-type-annotation>? '{' <message-body> '}'
field-definition ::= <type> <identifier> <logical-type-annotation>? <field-id-definition>? ';'
type ::= 'binary'
	| 'float'
	| 'double'
	| 'boolean'
	| 'int32'
	| 'int64'
	| 'int96'
	| 'fixed_len_byte_array' '(' <number> ')'
converted-type-annotation ::= '(' <converted-type> ')'
converted-type ::= 'UTF8'
	| 'MAP'
	| 'MAP_KEY_VALUE'
	| 'LIST'
	| 'ENUM'
	| 'DECIMAL'
	| 'DATE'
	| 'TIME_MILLIS'
	| 'TIME_MICROS'
	| 'TIMESTAMP_MILLIS'
	| 'TIMESTAMP_MICROS'
	| 'UINT_8'
	| 'UINT_16'
	| 'UINT_32'
	| 'UINT_64'
	| 'INT_8'
	| 'INT_16'
	| 'INT_32'
	| 'INT_64'
	| 'JSON'
	| 'BSON'
	| 'INTERVAL'
logical-type-annotation ::= '(' <logical-type> ')'
logical-type ::= 'STRING'
	| 'DATE'
	| 'TIMESTAMP' '(' <time-unit> ',' <boolean> ')'
	| 'UUID'
	| 'ENUM'
	| 'JSON'
	| 'BSON'
	| 'INT' '(' <bit-width> ',' <boolean> ')'
	| 'DECIMAL' '(' <precision> ',' <scale> ')'
field-id-definition ::= '=' <number>
number ::= <digit>+
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
time-unit ::= 'MILLIS' | 'MICROS' | 'NANOS'
boolean ::= 'false' | 'true'
identifier ::= <all-characters> - ' ' - ';' - '{' - '}' - '(' - ')' - '=' - ','
bit-width ::= '8' | '16' | '32' | '64'
precision ::= <number>
scale ::= <number>
all-characters ::= ? all visible characters ?

For examples of textual schema definitions, please take a look at schema-files/*.schema.
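As a small illustration of the grammar, the sketch below holds a hand-written schema text in a Go constant and implements the identifier production as a predicate. The schema text was written against the grammar above and is not taken from the schema-files directory. The names exampleSchema and isIdentifier are hypothetical, and reading the identifier production as "one or more allowed characters" is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// exampleSchema is a hand-written schema text that follows the grammar:
// a field with a field ID, a field with a logical type annotation, and a
// nested group with a converted type annotation.
const exampleSchema = `message example {
  required int64 id = 1;
  optional binary name (STRING);
  optional group tags (LIST) {
    repeated group list {
      required binary element (STRING);
    }
  }
}`

// isIdentifier reports whether s consists of one or more visible characters,
// none of which is a space or one of ; { } ( ) = ,
// Treating an identifier as a non-empty sequence is an assumption.
func isIdentifier(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !unicode.IsGraphic(r) || unicode.IsSpace(r) || strings.ContainsRune(";{}()=,", r) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(exampleSchema)
	for _, candidate := range []string{"element", "my_field", "a b", "x=1"} {
		fmt.Printf("%q identifier: %v\n", candidate, isIdentifier(candidate))
	}
}
```

A text such as exampleSchema is the kind of input ParseSchemaDefinition is documented to accept. The predicate is only an aid for reading the grammar and is not how the parser itself tokenizes its input.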

func SchemaDefinitionFromColumnDefinition

func SchemaDefinitionFromColumnDefinition(c *ColumnDefinition) *SchemaDefinition

SchemaDefinitionFromColumnDefinition creates a new schema definition from the provided root column definition.

func (*SchemaDefinition) Clone added in v0.7.0

func (sd *SchemaDefinition) Clone() *SchemaDefinition

Clone returns a deep copy of the schema definition.

func (*SchemaDefinition) SchemaElement

func (sd *SchemaDefinition) SchemaElement() *parquet.SchemaElement

SchemaElement returns the schema element associated with the current schema definition. If no schema element is present, then nil is returned.

func (*SchemaDefinition) String

func (sd *SchemaDefinition) String() string

String returns a textual representation of the schema definition. This textual representation adheres to the format accepted by the ParseSchemaDefinition function. Repeatedly parsing a textual schema definition with ParseSchemaDefinition and turning it back into a string with this method always yields the same result, save for differences in the emitted whitespace.

func (*SchemaDefinition) SubSchema

func (sd *SchemaDefinition) SubSchema(name string) *SchemaDefinition

SubSchema returns the direct child of the current schema definition that matches the provided name. If no such child exists, nil is returned.
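The "direct child" behavior can be pictured with the following sketch. It uses hypothetical stand-in types (schemaElement, columnDefinition, schemaDefinition) and a made-up subSchema method that follow the documented description. It is not the package's implementation, and treating a nested grandchild as "not found" is an inference from the phrase "direct child".

```go
package main

import "fmt"

// Hypothetical stand-ins that mirror the documented shapes of the real types.
type schemaElement struct{ Name string }

type columnDefinition struct {
	Children      []*columnDefinition
	SchemaElement *schemaElement
}

type schemaDefinition struct{ RootColumn *columnDefinition }

// subSchema looks only at the direct children of the root column and returns
// a schema definition rooted at the first child whose name matches, or nil.
func (sd *schemaDefinition) subSchema(name string) *schemaDefinition {
	if sd == nil || sd.RootColumn == nil {
		return nil
	}
	for _, child := range sd.RootColumn.Children {
		if child.SchemaElement != nil && child.SchemaElement.Name == name {
			return &schemaDefinition{RootColumn: child}
		}
	}
	return nil
}

// exampleTree builds a root with one group "address" that contains "city".
func exampleTree() *schemaDefinition {
	city := &columnDefinition{SchemaElement: &schemaElement{Name: "city"}}
	address := &columnDefinition{
		SchemaElement: &schemaElement{Name: "address"},
		Children:      []*columnDefinition{city},
	}
	return &schemaDefinition{RootColumn: &columnDefinition{Children: []*columnDefinition{address}}}
}

func main() {
	sd := exampleTree()
	fmt.Println("address found:", sd.subSchema("address") != nil)
	fmt.Println("city found at top level:", sd.subSchema("city") != nil)
	fmt.Println("city found via address:", sd.subSchema("address").subSchema("city") != nil)
}
```

Under this reading, nested columns are reached by chaining lookups one level at a time.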

func (*SchemaDefinition) Validate

func (sd *SchemaDefinition) Validate() error

Validate conducts a validation of the schema definition. This is useful when the schema definition has been constructed programmatically by other means than the schema parser to ensure that it is still valid.

func (*SchemaDefinition) ValidateStrict added in v0.2.0

func (sd *SchemaDefinition) ValidateStrict() error

ValidateStrict conducts a stricter validation of the schema definition. This includes the validation as done by Validate, but prohibits backwards-compatible definitions of LIST and MAP.
