Package bulkingest defines a workload that is intended to stress some edge cases in our bulk-ingestion infrastructure.
In both IMPORT and indexing, many readers scan through the source data (i.e. CSV files or PK rows, respectively) and produce KVs to be ingested. However, a given range of that source data could produce KVs anywhere in the keyspace -- i.e. in some schemas or workloads the produced KVs follow the same ordering as the source data, while in others they are effectively random and uniformly distributed across the keyspace. Additionally, both processes often run concurrent producers, each scanning its own input files or ranges of a table, and depending on the distribution, concurrent producers may all produce different keys, or all produce similar keys at the same time, etc.
This workload is intended to produce test data that emphasizes these cases. The multi-column PK makes it easy to independently control the prefix of keys. Adding an index on the same columns, reordered, can then control the flow of keys between prefixes, stressing any buffering, sorting, or other steps in the middle. This can be particularly interesting when concurrent producers are a factor, as the distribution (or lack thereof) of their output prefixes at a given moment can cause hotspots.
The workload's schema is a table with columns a, b, and c plus a padding payload string, with the primary key being (a,b,c).
Creating indexes on different columns of this schema can then trigger different distributions of produced index KVs -- e.g. an index on (b, c) would see each range of PK data produce tightly grouped output that overlaps with the output of the ranges holding each of the other values of a.
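To make the overlap concrete, the following is a minimal sketch (not the workload's actual code) that generates a tiny table in PK order and re-sorts it the way an index on (b, c) would store it, with the PK columns appended to keep keys unique. Rows that were far apart in PK order -- one per distinct value of a -- end up adjacent under the same (b, c) prefix.

```go
package main

import (
	"fmt"
	"sort"
)

type key struct{ a, b, c int }

// pkRows returns all rows of a hypothetical instance of the table in
// primary-key order, i.e. sorted by (a, b, c).
func pkRows(numA, numB, numC int) []key {
	var rows []key
	for a := 1; a <= numA; a++ {
		for b := 1; b <= numB; b++ {
			for c := 1; c <= numC; c++ {
				rows = append(rows, key{a, b, c})
			}
		}
	}
	return rows
}

// indexOrder re-sorts rows the way an index on (b, c) would store them:
// by (b, c, a), since the remaining PK column makes the index key unique.
func indexOrder(rows []key) []key {
	out := append([]key(nil), rows...)
	sort.Slice(out, func(i, j int) bool {
		x, y := out[i], out[j]
		if x.b != y.b {
			return x.b < y.b
		}
		if x.c != y.c {
			return x.c < y.c
		}
		return x.a < y.a
	})
	return out
}

func main() {
	idx := indexOrder(pkRows(3, 3, 3))
	// The first 3 index keys all share the prefix b=1/c=1 but come from
	// rows with a=1, a=2, and a=3 -- one from each disjoint PK range.
	for _, k := range idx[:3] {
		fmt.Printf("/%d/%d -> a=%d\n", k.b, k.c, k.a)
	}
}
```

Running this prints the same (b, c) prefix three times, sourced from three different a values, which is exactly the interleaving that stresses any buffering or sorting between producers and ingestion.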
The workload's main parameters are the number of distinct values of a, b, and c. Initial data batches each correspond to one a/b pair containing c rows. By default, batches are ordered by a then b (a=1/b=1, a=1/b=2, a=1/b=3, ...), though this can optionally be inverted (a=1/b=1, a=2/b=1, a=3/b=1, ...).