imagine

Published: Nov 19, 2019 License: BSD-3-Clause Imports: 20 Imported by: 0


Imagine that you had a database with...

This tool intends to provide a way to populate a Pilosa database with predictable contents in a reasonably efficient fashion, without needing enormous static data files. Indexes and fields within them can be specified in a TOML file.

Invocation

The imagine utility takes command line options, followed by one or more spec files, which are TOML files containing specs.

What imagine does with the spec files is controlled by the following behavior options:

  • --describe describe the specs
  • --verify string index structure validation: create/error/purge/update/none
  • --generate generate data as specified by workloads
  • --delete delete specified fields

Invoked without behavior options, or with only --describe, imagine will describe the indexes and workloads from its spec files, and terminate. If one or more of verify, generate, or delete is provided, it will do those in order.
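For example (the spec filename is illustrative):

```shell
# Describe the indexes and workloads in a spec, without touching the cluster:
imagine --describe indexes.toml

# Create any missing indexes and fields, then run the workloads:
imagine --verify update --generate indexes.toml
```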

The following verification options exist:

  • create: Attempts to create all specified indexes and fields; errors out if any already exist.
  • error: Verify that indexes and fields exist, error out if they don't.
  • purge: Delete all existing indexes and fields, then try to create them. Error out if either part of this fails.
  • update: Try to create any missing indexes or fields. Error out if this fails.
  • none: Do no verification. (Workloads will still check for index/field existence.)

The default for --verify is determined by other parameters; if --delete is present, and --generate is not, the default verification is "none" (there's no point in verifying that things exist right before deleting them), otherwise the default verification is "error".

The following options change how imagine goes about its work:

  • --column-scale int scale number of columns provided by specs
  • --cpu-profile string record CPU profile to file
  • --dry-run dry-run; describe what would be done
  • --hosts string comma separated list of "host:port" pairs of the Pilosa cluster (default "localhost:10101")
  • --mem-profile string record allocation profile to file
  • --prefix string prefix to use on index names
  • --row-scale int scale number of rows provided by specs
  • --thread-count int number of threads to use for import, overrides value in config file (default 1)
  • --time report on time elapsed for operations

Spec files

The following global settings exist for each spec:

  • densityscale: A density scale factor used to determine the precision used for density computations. Density scale should be a power of two. Higher density scales will take longer to compute and process. (The operation is O(log2(N)).)
  • prefix: A preferred prefix to use for index names. If absent, "imaginary" is used. The --prefix command line option overrides this.
  • version: The string "1.0". The intent is that future versions of the tool will attempt to ensure that a given spec produces identical results. If a later version would change the results from a spec, it should do so only when a different string is specified here. However, this guarantee is not yet in force. The software is still in an immature state, and may change output significantly during development.
  • seed: A default PRNG seed, used for indexes/fields that don't specify their own.
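A spec's global section using these settings might look like this (all values illustrative):

```toml
version = "1.0"
prefix = "imaginary"
seed = 3
densityscale = 1048576  # a power of two
```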

A spec can specify two other kinds of things, indexes and workloads. Indexes describe the data that will go in a Pilosa index, such as the index's name, size (in columns), and number of fields. Workloads describe specific patterns of creating and inserting data in fields.

When multiple specs are provided, they are combined. Indexes and fields are merged; any conflicts between them are an error, and imagine will report such errors and then stop. Workloads are concatenated, with specs processed in command-line order.

Indexes

Indexes are defined in a top-level map, using the index name as the key. Each index would typically be written as [indexes.indexname]. Each index has settings, plus field entries under the index. Fields are a mapping of names to field specifications.

  • name: The index's name. (This will be prefixed later.)
  • description: A longer description of the index's purpose within a set.
  • columns: The number of columns.
  • seed: A default PRNG seed to use for fields that don't specify their own.
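For example, an index definition might look like this (name and sizes illustrative):

```toml
[indexes.users]
description = "synthetic user data"
columns = 1000000
seed = 7
```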
Fields

Fields can be one of several types, specified as "type". Defined types:

  • set: The default "set" field type, where rows correspond to specific values.
  • mutex: The "mutex" field type, which is like a set, only it enforces that only one row is set per column.
  • int: The binary-representation field type, usable for range queries.
  • time: The "time" field type, which is a set with additional optional timestamp information.

All fields share some common parameters:

  • zipfV, zipfS: the V/S values used for a Zipf distribution of values.
  • min, max: Minimum and maximum values. For int fields, this is the value range; for set/mutex fields, it's the range of rows that will be potentially generated.
  • sourceIndex: An index to use for values; the value range will be the source index's column range. If the source index has 100,000 columns, this is equivalent to "min: 0, max: 99999".
  • density: The field's base density of bits set. For a set, this density applies to each row independently; for a mutex or int field, it determines how many columns should have a value set.
  • valueRule: "linear" or "zipf". Exact interpretation varies by field type, but "linear" indicates that all rows should have the same density of values, while "zipf" indicates that they should follow a Zipf distribution.
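Putting the common parameters together, a set field with a Zipf value distribution might be specified like this (names and values illustrative, assuming fields nest under the index as a mapping of names to field specifications):

```toml
[indexes.users.fields.segment]
type = "set"
min = 0
max = 99
density = 0.1
valueRule = "zipf"
zipfV = 2.0
zipfS = 2.0
```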
Set/Mutex Fields

Set and mutex fields can also configure a cache type:

  • cache: Cache type, one of "lru" or "none".
Set/Time Fields

Set (and time) fields can be generated either in row-major order (generate one row at a time for all columns) or column-major order (generate all rows for each column).

  • dimensionOrder: string, one of "row" or "column". Default is "row".
  • quantum: string, one of "Y", "YM", "YMD", or "YMDH". Valid only for time fields. Default is "YMDH".

Zipf parameters: The first row will have bits set based on the base density provided. Following rows will follow the Zipf distribution's probabilities. For instance, with v=2, s=2, the k=0 probability is proportional to (2+0)**(-2) (1/4), and the k=1 probability is proportional to (2+1)**(-2) (1/9). Thus, the probability of a bit being set in the k=1 row is 4/9 the base density.

The final set of bits does not depend on whether values were computed in row-major or column-major order. (This guarantee is slightly weaker than other guarantees.)

Mutex Fields

Zipf parameters: This just follows the behavior of the Zipf generator in math/rand. A single value is determined for each column, determining which bit is set.

Int Fields

By default, every member of an int field is set to a random value within the range.

Zipf parameters: This follows the behavior of the Zipf generator in math/rand, with an offset of the minimum value. For instance, a field with min/max of 10/20 behaves exactly like a field with a min/max of 0/10, with 10 added to each value.

Workloads

A workload describes a named series of steps, which apply to indexes and fields previously described. Workloads don't have to be in the same spec files as the indexes and fields they refer to. Workloads are defined in a top-level array, usually using [[workloads]] to refer to them. Workloads are sequential. They have the following attributes:

  • name: The name of the workload.
  • description: A description of the workload.
  • threadCount: Number of importer threads to use in imports.
  • batchSize: The default size of import batches (number of records before the client transmits records to the server).

Each workload also has an array of tasks, which are all executed in parallel.

Tasks

Each task outlines a specific set of data to populate in a given field.

  • index, field: the index and field names to identify the field to be populated. The index name should match the name in the spec, not including any prefixes.
  • seed: the random number seed to use when populating this field. Defaults to the seed for the field's parent index.
  • columns: the number of columns to populate. Default: populate the entire field, using the index's columns.
  • columnOffset: column to start with. The special value "append" means to create new columns starting immediately after the highest column previously created.
  • columnOrder: "linear", "stride", "zipf", or "permute" (default linear). Indicates order in which to generate column values.
  • stride: The stride to use with a columnOrder of "stride".
  • rowOrder: "linear" or "permute" (default linear). Determines the order in which row values are computed, for set fields, or whether to permute generated values, for mutex or int fields.
  • batchSize: Size of import batches (overrides, but defaults to, the workload's batchSize).
  • stamp: Controls timestamp behavior. One of "none", "random", "increasing".
  • stampRange: A duration over which to spread timestamps when generating them.
  • stampStart: A specific time to start timestamps at. Defaults to current time minus stamp range.
  • zipfV, zipfS: V and S values for a zipf distribution of columns.
  • zipfRange: The range to use for the zipf distribution (defaults to columns).
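A workload containing a single task might look like this (all names and values illustrative, assuming tasks nest as an array under the workload):

```toml
[[workloads]]
name = "initial-load"
description = "populate one field"
threadCount = 2
batchSize = 100000

[[workloads.tasks]]
index = "users"
field = "segment"
seed = 11
columnOrder = "permute"
```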

As a special case, when columnOffset is "append" and columnOrder is "zipf", values are randomly generated using a zipf distribution over [0,zipfRange). Each value is subtracted from the next column number (the lowest column number not currently known to imagine) to yield a column that may be an update to an existing column, or may be a new column. A value is generated for each column. Note that this picks values the same way mutex or int fields do, rather than generating all the values for each column, so the same column may be generated more than once. This behavior attempts to simulate likely behavior for event streams.

The "zipf" columnOrder is not supported except with columnOffset of "append", and the Zipf parameters are not defined for any other column order.

Data Generation

Reproducible data generation means being able to generate the same bits every time. To this end, we use a seekable PRNG: you can specify an offset into its stream and get the same bits every time. See the related package apophenia for details.

Set values are computed using apophenia.Weighted, with seed equal to the row number, and id equal to the column number.

Mutex/Int: Mutex and int fields both generate a single value in their range. Linear values are computed using row 0, iter 0, and are computed as min + U % (max - min). (For a mutex, the minimum value is always 0.) Zipf values are computed using iterated values for row 0 as inputs to another algorithm which treats them as [0,1) range values. If RowOrder is set to permute, the permutation is computed using permutation row 2.
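The linear computation is a plain modular reduction into the range (here a fixed constant stands in for the PRNG output U; apophenia supplies the real sequence values):

```go
package main

import "fmt"

func main() {
	u := uint64(11400714819323198485) // stand-in for a PRNG output U
	min, max := int64(10), int64(20)
	val := min + int64(u%uint64(max-min)) // always lands in [min, max)
	fmt.Println(val)                      // prints 15
}
```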

Permuted column values are generated by requesting a permutation generator for row 0 with the given seed. Permuted row values for sets are generated using a permutation generator for row 1.

Documentation


Functions

func ReadSpec

func ReadSpec(path string) (*tomlSpec, error)

Types

type Config

type Config struct {
	Hosts    []string `help:"comma separated list of \"host:port\" pairs of the Pilosa cluster"`
	NoImport bool     `help:"do not import the generated bits"`
	PrintOut bool     `help:"print out the generated data in ROW_ID,COLUMN_ID format"`
	Verify   string   `help:"index structure validation: purge/error/update/create"`

	Generate bool `help:"generate data as specified by workloads"`
	Delete   bool `help:"delete specified indexes"`
	Describe bool `help:"describe the data sets and workloads"`

	Prefix      string `help:"prefix to use on index names"`
	CPUProfile  string `help:"record CPU profile to file"`
	MemProfile  string `help:"record allocation profile to file"`
	Time        bool   `help:"report on time elapsed for operations"`
	Status      bool   `help:"show status updates while processing"`
	ColumnScale int64  `help:"scale number of columns provided by specs"`
	RowScale    int64  `help:"scale number of rows provided by specs"`
	LogImports  string `help:"file name to log all imports to (so they can be replayed later)"`
	ThreadCount int    `help:"number of threads to use for each import, overrides value set in config file"`
	// contains filtered or unexported fields
}

Config describes the overall configuration of the tool.

func NewConfig

func NewConfig() *Config

NewConfig initializes a config struct with default/initial values, which can be overridden by command line options.

func (*Config) ApplyNamedWorkload

func (conf *Config) ApplyNamedWorkload(client *pilosa.Client, nwl namedWorkload) (err error)

ApplyNamedWorkload attempts to process each workload in a named workload.

func (*Config) ApplyTasks

func (conf *Config) ApplyTasks(client *pilosa.Client, allTasks []*taskSpec) (err error)

ApplyTasks attempts to process the given tasks.

func (*Config) ApplyWorkload

func (conf *Config) ApplyWorkload(client *pilosa.Client, wl *workloadSpec) (err error)

ApplyWorkload attempts to process a workload.

func (*Config) ApplyWorkloads

func (conf *Config) ApplyWorkloads(client *pilosa.Client) error

ApplyWorkloads attempts to process the configured workloads.

func (*Config) CompareFields

func (conf *Config) CompareFields(client *pilosa.Client, dbIndex *pilosa.Index, spec *indexSpec, mayCreate, mustCreate bool) (changed bool, errs []error)

CompareFields checks the individual fields in an index, following the same logic as CompareIndexes. mayCreate indicates that it's acceptable to create a field if it's missing. mustCreate indicates that it's mandatory to create a field, so it must not have previously existed.

func (*Config) CompareIndexes

func (conf *Config) CompareIndexes(client *pilosa.Client, mayCreate, mustCreate bool) error

CompareIndexes is the general form of comparing client and server index specs. mayCreate indicates that it's acceptable to create an index if it's missing. mustCreate indicates that it's mandatory to create an index, so it must not have previously existed.

func (*Config) CreateIndexes

func (conf *Config) CreateIndexes(client *pilosa.Client) error

CreateIndexes attempts to create indexes, erroring out if any exist.

func (*Config) DeleteIndexes

func (conf *Config) DeleteIndexes(client *pilosa.Client) error

DeleteIndexes attempts to delete all specified indexes.

func (*Config) Execute

func (conf *Config) Execute()

Execute executes the imagine command.

func (*Config) NewSpecsFiles

func (conf *Config) NewSpecsFiles(files []string)

NewSpecsFiles copies files to config.specFiles.

func (*Config) ReadSpecs

func (conf *Config) ReadSpecs() error

ReadSpecs reads the files in conf.specFiles and populates fields.

func (*Config) Run

func (conf *Config) Run() error

Run does validation on the configuration data. Used by commandeer.

func (*Config) UpdateIndexes

func (conf *Config) UpdateIndexes(client *pilosa.Client) error

UpdateIndexes attempts to create indexes or fields, accepting existing things as long as they match.

func (*Config) VerifyIndexes

func (conf *Config) VerifyIndexes(client *pilosa.Client) error

VerifyIndexes verifies that the indexes and fields specified already exist and match.

type CountingIterator

type CountingIterator interface {
	pilosa.RecordIterator
	Values() (int64, int64)
}

CountingIterator represents a pilosa.RecordIterator which additionally reports back how many values it's generated, useful for reporting on what was done and seeing how number of bits (as opposed to number of columns/rows) is affecting performance.

func NewGenerator

func NewGenerator(ts *taskSpec, updateChan chan taskUpdate, updateID string) (CountingIterator, []pilosa.ImportOption, error)

NewGenerator makes a generator which will generate the values for the given task.
