processor-split-table

v1.3.2-0...-7f908ad | Published: Jan 4, 2024 | License: MIT

Repository

github.com/keboola/processor-split-table

README

Split Table CLI / Processor

Motivation

- Keboola components usually generate one uncompressed CSV file.
- Database backends support parallel import of multiple CSV slices.
  - Importing one large CSV is unnecessarily slow.
- Staging storage may not support large files.
  - For example, the maximum file size on Google Cloud Storage is 4GB.
- The standard gzip tool only works in one thread and is slow.

This utility addresses these issues and provides fast slicing and compression for CSV files.
It can be run as a Keboola component/processor or as a separate CLI binary.

CLI

The slicer CLI slices and optionally compresses the input table.

The input table can be a single CSV file or a sliced CSV table.
The input table may or may not be compressed.

Download

You can download the CLI from the GitHub Releases.

Logging

debug and info messages are logged to stdout.
warn and error messages are logged to stderr.

Slicing Progress

The slicer logs progress at exponentially increasing intervals.

Use the following flags to modify the logging intervals:

--log-interval-initial duration
- Initial log interval. (default 10s)
--log-interval-maximum duration
- Maximum log interval. (default 15m0s)
--log-interval-multiplier float
- Log interval multiplier. (default 1.5)

Skipping small input tables

Slicing and compressing small tables may not be worthwhile.

The job can be slower than if slicing were skipped completely.
Therefore, it is possible to set a size threshold.
At least one slice must exceed the threshold for slicing to start.
If the input table is not sliced, the size of the entire table must exceed the threshold.
Otherwise, the table is copied to the output without modification.

The threshold can be configured by the following flags:

--input-size-low-exit-code int
- If specified, skipped tables are not copied; instead, the program exits with this exit code.
--input-size-threshold string
- At least one slice must exceed the threshold, otherwise the table is copied without modification. (default "50MB")
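
The threshold rule can be sketched as a small check. This is an illustration only: the function name shouldSlice is hypothetical, and interpreting "MB" as decimal megabytes is an assumption, not something this README states.

```go
package main

import "fmt"

// shouldSlice is a hypothetical helper: slicing starts only if at least one
// input slice exceeds the threshold. An unsliced table is passed as a single
// size, so the whole table must exceed the threshold.
func shouldSlice(sliceSizes []int64, threshold int64) bool {
	for _, size := range sliceSizes {
		if size > threshold {
			return true
		}
	}
	return false
}

func main() {
	const mb = 1000 * 1000 // assumption: decimal megabytes
	threshold := int64(50 * mb)

	// A sliced table where one slice is larger than the threshold.
	fmt.Println(shouldSlice([]int64{10 * mb, 80 * mb}, threshold))
	// An unsliced table smaller than the threshold: it would only be copied.
	fmt.Println(shouldSlice([]int64{20 * mb}, threshold))
}
```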

Modes

The input table can be sliced in one of the following modes.
You can configure the mode using the --mode flag.
The default mode is bytes.

`bytes`

A new slice is created when the --bytes-per-slice limit is reached.
The size in bytes is measured before output compression, if any.

`rows`

A new slice is created when the --rows-per-slice limit is reached.

`slices`

The table is split into a fixed number of slices, set by --number-of-slices.
Each slice except the last must contain at least --min-bytes-per-slice; this minimum takes precedence over the number of slices.
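
One way to read the precedence rule in slices mode is sketched below. This is an assumption about the arithmetic, not the slicer's actual implementation: the target slice size is the table size divided by the number of slices, rounded up, and raised to the minimum if it falls below it. The helper name targetBytesPerSlice is hypothetical.

```go
package main

import "fmt"

// targetBytesPerSlice is a hypothetical helper that derives a per-slice size
// for "slices" mode. The minimum slice size takes precedence, so a small
// table may end up with fewer slices than requested.
func targetBytesPerSlice(totalBytes, numberOfSlices, minBytesPerSlice int64) int64 {
	if numberOfSlices < 1 {
		numberOfSlices = 1
	}
	// Ceiling division, so that numberOfSlices slices cover the whole table.
	size := (totalBytes + numberOfSlices - 1) / numberOfSlices
	if size < minBytesPerSlice {
		size = minBytesPerSlice
	}
	return size
}

func main() {
	const mb = 1000 * 1000 // assumption: decimal megabytes

	// 100MB table with the defaults (60 slices, 4MB minimum):
	// the minimum wins, so the table yields about 25 slices instead of 60.
	size := targetBytesPerSlice(100*mb, 60, 4*mb)
	fmt.Println(size, (100*mb+size-1)/size)
}
```

Because the real slicer works on whole CSV rows, actual slice sizes would only approximate such a target.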

Input and output table

--table-name required
- Table name for logging purposes.
--table-input-path required
- Path to the input table, either a file or a directory with slices.
--table-input-manifest-path
- Path to the manifest of the input table.
- It is used to get delimiter and enclosure fields, if any.
- It can be omitted only if the table does not have a manifest.
--table-output-path required
- Directory where the slices of the output table will be written.
- If it does not exist, it will be created, but the parent directory must exist.
--table-output-manifest-path required
- Path where the output manifest will be written.
- The parent directory must exist.
- The output manifest is a copy of the input manifest.
- The columns field is set from the CSV header, if it is missing.
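
The manifest handling described above can be sketched like this. It is a simplified illustration: only the columns and delimiter fields named in this README are used, the function name outputManifest is hypothetical, and Go's encoding/csv cannot change the enclosure character, so that part is left out.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"strings"
)

// outputManifest is a hypothetical helper: it copies the input manifest and,
// if the "columns" field is missing, fills it from the CSV header.
func outputManifest(inputManifest []byte, csvData string) ([]byte, error) {
	m := map[string]any{}
	if len(inputManifest) > 0 {
		if err := json.Unmarshal(inputManifest, &m); err != nil {
			return nil, err
		}
	}
	if _, ok := m["columns"]; !ok {
		r := csv.NewReader(strings.NewReader(csvData))
		// Use the delimiter from the manifest, if any.
		if d, ok := m["delimiter"].(string); ok && d != "" {
			r.Comma = []rune(d)[0]
		}
		header, err := r.Read()
		if err != nil {
			return nil, err
		}
		m["columns"] = header
	}
	// Keys of a map are marshaled in sorted order.
	return json.Marshal(m)
}

func main() {
	out, err := outputManifest([]byte(`{"delimiter":","}`), "id,name\n1,foo\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```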

Environment Variables

Each flag can also be specified via an environment variable with the SLICER_ prefix.
For example, the --bytes-per-slice flag can be specified via the SLICER_BYTES_PER_SLICE variable.
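
The naming rule can be expressed as a short function. This sketch is inferred from the example above and from the flag list; the helper name envName is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// envName is a hypothetical helper that converts a flag name to the
// corresponding environment variable: leading dashes are dropped, remaining
// dashes become underscores, the name is upper-cased and prefixed with SLICER_.
func envName(flag string) string {
	name := strings.TrimLeft(flag, "-")
	name = strings.ReplaceAll(name, "-", "_")
	return "SLICER_" + strings.ToUpper(name)
}

func main() {
	fmt.Println(envName("--bytes-per-slice"))
	fmt.Println(envName("--gzip"))
}
```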

CPU and Memory Usage

CPU usage and speed can be influenced by the --gzip-concurrency flag.
Memory usage can be influenced by the following flags:
- --buffer-size
- --gzip-concurrency
- --gzip-block-size
- --memory-limit
Examples:
- Example 1: Unsliced and uncompressed CSV table on the input:
  - Speed: ~ 200MB/s
  - Memory usage: < 100MB
- Example 2: Sliced and compressed CSV table on the input:
  - Speed: ~ 100MB/s
  - Memory usage: < 100MB

All Flags

--ahead-block-size int
- Or SLICER_AHEAD_BLOCK_SIZE env.
- Size of one read-ahead input block. (default "1MB")
--ahead-blocks int
- Or SLICER_AHEAD_BLOCKS env.
- Number of blocks read ahead from an input slice. (default 16)
- 0 disables read-ahead.
--ahead-slices int
- Or SLICER_AHEAD_SLICES env.
- Number of input slices opened ahead. (default 1)
--buffer-size string
- Or SLICER_BUFFER_SIZE env.
- Output buffer size when gzip compression is disabled. (default "20MB")
--bytes-per-slice string
- Or SLICER_BYTES_PER_SLICE env.
- Maximum size of a slice, for "bytes" mode. (default "500MB")
--cpuprofile string
- Or SLICER_CPUPROFILE env.
- Write the CPU profile to the specified file.
--dump-config
- Or SLICER_DUMP_CONFIG env.
- Print all parameters to stdout.
--gzip
- Or SLICER_GZIP env.
- Enable gzip compression for slices. (default true)
--gzip-block-size string
- Or SLICER_GZIP_BLOCK_SIZE env.
- Size of one gzip block; allocated memory = concurrency * block size. (default "2MB")
--gzip-concurrency int
- Or SLICER_GZIP_CONCURRENCY env.
- Number of gzip blocks processed in parallel; 0 means the number of CPU threads.
--gzip-level int
- Or SLICER_GZIP_LEVEL env.
- GZIP compression level, range: 1 best speed - 9 best compression. (default 2)
--help
- Or SLICER_HELP env.
- Print help.
--input-size-low-exit-code int
- Or SLICER_INPUT_SIZE_LOW_EXIT_CODE env.
- If specified, skipped tables are not copied; instead, the program exits with this exit code.
--input-size-threshold string
- Or SLICER_INPUT_SIZE_THRESHOLD env.
- Minimum size of an input slice to start slicing, otherwise the table is only copied. (default "50MB")
--memory-limit string
- Or SLICER_MEMORY_LIMIT env.
- Soft memory limit, GOMEMLIMIT. (default "256MB")
--min-bytes-per-slice string
- Or SLICER_MIN_BYTES_PER_SLICE env.
- Minimum size of a slice, for "slices" mode. (default "4MB")
--log-interval-initial duration
- Or SLICER_LOG_INTERVAL_INITIAL env.
- Initial log interval. (default 10s)
--log-interval-maximum duration
- Or SLICER_LOG_INTERVAL_MAXIMUM env.
- Maximum log interval. (default 15m0s)
--log-interval-multiplier float
- Or SLICER_LOG_INTERVAL_MULTIPLIER env.
- Log interval multiplier. (default 1.5)
--mode string
- Or SLICER_MODE env.
- bytes, rows, or slices (default "bytes")
--number-of-slices int
- Or SLICER_NUMBER_OF_SLICES env.
- Number of slices, for "slices" mode. (default 60)
--rows-per-slice int
- Or SLICER_ROWS_PER_SLICE env.
- Maximum number of rows per slice, for "rows" mode. (default 1000000)
--table-input-manifest-path string
- Or SLICER_TABLE_INPUT_MANIFEST_PATH env.
- Path to the manifest describing the input table, if any.
--table-input-path string
- Or SLICER_TABLE_INPUT_PATH env.
- Path to the input table, either a file or a directory with slices.
--table-name string
- Or SLICER_TABLE_NAME env.
- Table name for logging purposes.
--table-output-manifest-path string
- Or SLICER_TABLE_OUTPUT_MANIFEST_PATH env.
- Path where the output manifest will be written.
--table-output-path string
- Or SLICER_TABLE_OUTPUT_PATH env.
- Directory where the slices of the output table will be written.

Split Table Processor

Takes all CSV files in /data/in/tables and converts them to sliced tables.
The default slice size is 500MB; the number of rows per slice can also be configured.
The approximate speed in Keboola Connection is 200 MB/s (if gzip is disabled).
A manifest is created if needed. The original manifest's keys are preserved.
The header from the CSV table is moved to the manifest's columns key if the input table is not headless.
The CSV delimiter and enclosure are loaded from the manifest, if set.
Files and already sliced tables are copied without change.

Usage

It supports optional parameters:

mode - enum (bytes, rows, slices), default bytes
bytesPerSlice (string/int) - for mode = bytes, maximum size of one slice in bytes before compression, default 500MB
rowsPerSlice (int) - for mode = rows, maximum number of rows in one slice, default 1 000 000
numberOfSlices (int) - for mode = slices, fixed number of slices, default 60
minBytesPerSlice (string/int) - for mode = slices, minimum size of one slice in bytes before compression, default 4MB
gzip (bool) - enable gzip compression, default true
gzipLevel (int) - compression level, min 1 (the best speed), max 9 (the best compression), default 2

Sample configurations

Default parameters (500 MB per slice, gzip enabled):

{
  "definition": {
    "component": "keboola.processor-split-table"
  }
}

Bytes mode:

{
  "definition": {
    "component": "keboola.processor-split-table"
  },
  "parameters": {
    "mode": "bytes",
    "bytesPerSlice": "100MB"
  }
}

Rows mode:

{
  "definition": {
    "component": "keboola.processor-split-table"
  },
  "parameters": {
    "mode": "rows",
    "rowsPerSlice": 5000000
  }
}

Slices mode:

{
  "definition": {
    "component": "keboola.processor-split-table"
  },
  "parameters": {
    "mode": "slices",
    "numberOfSlices": 30,
    "minBytesPerSlice": "10MB"
  }
}
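
For readers integrating the processor from Go, the sketch below shows how such a configuration could be decoded with the documented defaults applied. It is illustrative only and does not reproduce the processor's own config package: the type and function names are hypothetical, and bytesPerSlice and minBytesPerSlice are omitted because they accept both strings and integers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parameters is a hypothetical type covering a subset of the documented keys.
type parameters struct {
	Mode           string `json:"mode"`
	RowsPerSlice   int    `json:"rowsPerSlice"`
	NumberOfSlices int    `json:"numberOfSlices"`
	Gzip           bool   `json:"gzip"`
	GzipLevel      int    `json:"gzipLevel"`
}

type config struct {
	Parameters parameters `json:"parameters"`
}

// parseParameters decodes the configuration. Defaults are set first, so keys
// missing from the JSON keep their default values.
func parseParameters(data []byte) (parameters, error) {
	cfg := config{Parameters: parameters{
		Mode:           "bytes",
		RowsPerSlice:   1000000,
		NumberOfSlices: 60,
		Gzip:           true,
		GzipLevel:      2,
	}}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return parameters{}, err
	}
	switch cfg.Parameters.Mode {
	case "bytes", "rows", "slices":
	default:
		return parameters{}, fmt.Errorf("unexpected mode %q", cfg.Parameters.Mode)
	}
	return cfg.Parameters, nil
}

func main() {
	p, err := parseParameters([]byte(`{"parameters": {"mode": "rows", "rowsPerSlice": 5000000}}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```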

Development

Clone this repository and initialize the workspace with the following commands:

git clone https://github.com/keboola/processor-split-table
cd processor-split-table
docker-compose build

Run the test suite and download the dependencies using this command:

docker-compose run --rm -u "$UID:$GID" dev make ci

Run bash in the container:

docker-compose run --rm -u "$UID:$GID" dev bash

Integration

For information about deployment and integration with KBC, please refer to the deployment section of the developers documentation.

License

MIT licensed, see LICENSE file.

Directories

Path	Synopsis
cmd
cli
processor
internal
pkg/cli
pkg/cli/config
pkg/kbc	Package kbc provides components Common Interface implementation https://developers.keboola.com/extend/common-interface/
pkg/log	Package log provides logger compatible with components Component Interface.
pkg/manifest	Package manifest provides table manifest reading and writing.
pkg/pool	Package pool provides reusing of buffered and GZIP readers and writers to optimize memory usage.
pkg/processor	Package processor provides split processor implementation.
pkg/processor/config	Package config provides processor configuration.
pkg/processor/finder	Package finder provides detection of files to process by the processor.
pkg/slicer	Package slicer provides slicing of an input table to an output table according to the configuration.
pkg/slicer/closer	Package closer provides utility to close readers or writers chain in LIFO order.
pkg/slicer/columnsparser
pkg/slicer/config	Package config provides slicing configuration.
pkg/slicer/rowsreader
pkg/slicer/rowsreader/progress	Package progress provides slicing progress logger.
pkg/slicer/slicedwriter
pkg/utils
test
