package storage

import "github.com/cockroachdb/cockroach/pkg/storage"

Package storage provides low-level storage. It interacts with storage backends (e.g. LevelDB, RocksDB) via the Engine interface. At one level higher, MVCC provides multi-version concurrency control capability on top of an Engine instance.

The Engine interface provides an API for key-value stores. InMem implements an in-memory engine using a sorted map. RocksDB implements an engine for data stored to local disk using RocksDB, a variant of LevelDB.

MVCC provides a multi-version concurrency control system on top of an engine. MVCC is the basis for Cockroach's support for distributed transactions. It is intended for direct use from storage.Range objects.

Notes on MVCC architecture

Each MVCC value contains a metadata key/value pair and one or more version key/value pairs. The MVCC metadata key is the actual key for the value, using the util/encoding.EncodeBytes scheme. The MVCC metadata value is of type MVCCMetadata and contains the most recent version timestamp and an optional roachpb.Transaction message. If set, the most recent version of the MVCC value is a transactional "intent". It also contains some information on the size of the most recent version's key and value for efficient stat counter computations. Note that it is not necessary to explicitly store the MVCC metadata as its contents can be reconstructed from the most recent versioned value as long as an intent is not present. The implementation takes advantage of this and deletes the MVCC metadata when possible.

Each MVCC version key/value pair has a key which is also binary-encoded, but is suffixed with a decreasing, big-endian encoding of the timestamp: eight bytes for the nanosecond wall time, followed by four bytes for the logical time. Meta key/value pairs are the exception; their timestamp is implicit. The MVCC version value is a message of type roachpb.Value. A deletion is indicated by an empty value. Note that an empty roachpb.Value will encode to a non-empty byte slice. The decreasing encoding on the timestamp sorts the most recent version directly after the metadata key, which is treated specially by the RocksDB comparator (by making the zero timestamp sort first). This increases the likelihood that an Engine.Get() of the MVCC metadata will get the same block containing the most recent version, even if there are many versions. We rely on getting the MVCC metadata key/value and then using it to directly get the MVCC version using the metadata's most recent version timestamp. This avoids using an expensive merge iterator to scan the most recent version. It also allows us to leverage RocksDB's bloom filters.
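
As a rough illustration of the 12-byte timestamp suffix, the sketch below writes the wall time and logical counter big-endian and inverts the bits so that plain byte comparison puts newer timestamps first. The names encodeTimestampDesc, decodeTimestampDesc and newerSortsFirst are hypothetical, unsigned integers are used for simplicity, and the bit inversion is only one assumed way to obtain a decreasing encoding; this is not the package's actual encoder.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// timestampSize mirrors MVCCVersionTimestampSize: 8 bytes of wall time
// followed by 4 bytes of logical time.
const timestampSize = 12

// encodeTimestampDesc (hypothetical) appends a 12-byte suffix to dst. The
// bits are inverted so that larger timestamps yield smaller byte strings.
func encodeTimestampDesc(dst []byte, wallNanos uint64, logical uint32) []byte {
	var buf [timestampSize]byte
	binary.BigEndian.PutUint64(buf[0:8], ^wallNanos)
	binary.BigEndian.PutUint32(buf[8:12], ^logical)
	return append(dst, buf[:]...)
}

// decodeTimestampDesc reverses encodeTimestampDesc.
func decodeTimestampDesc(b []byte) (wallNanos uint64, logical uint32, ok bool) {
	if len(b) != timestampSize {
		return 0, 0, false
	}
	return ^binary.BigEndian.Uint64(b[0:8]), ^binary.BigEndian.Uint32(b[8:12]), true
}

// newerSortsFirst reports whether timestamp a encodes to a byte string that
// compares lower than the encoding of timestamp b.
func newerSortsFirst(aWall uint64, aLogical uint32, bWall uint64, bLogical uint32) bool {
	a := encodeTimestampDesc(nil, aWall, aLogical)
	b := encodeTimestampDesc(nil, bWall, bLogical)
	return bytes.Compare(a, b) < 0
}

func main() {
	enc := encodeTimestampDesc(nil, 1000, 2)
	wall, logical, ok := decodeTimestampDesc(enc)
	fmt.Println(len(enc), wall, logical, ok)
	fmt.Println(newerSortsFirst(2000, 0, 1000, 5))
}
```

Because the suffix has a fixed width, stats code can account for it with a constant, which is the role MVCCVersionTimestampSize plays in this package.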

The following is an example of the sort order for MVCC key/value pairs:

...
keyA: MVCCMetadata of keyA
keyA_Timestamp_n: value of version_n
keyA_Timestamp_n-1: value of version_n-1
...
keyA_Timestamp_0: value of version_0
keyB: MVCCMetadata of keyB
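
The ordering above can be sketched with a small comparator. The versionedKey type and the less and sortKeys functions are hypothetical; a timestamp of 0 stands in for the metadata key, and integer timestamps replace the real wall/logical pair.

```go
package main

import (
	"fmt"
	"sort"
)

// versionedKey is a hypothetical stand-in for an MVCC key: a user key plus a
// timestamp, where ts == 0 denotes the metadata key.
type versionedKey struct {
	key string
	ts  int64
}

// less orders by key ascending, then metadata (ts == 0) first, then
// timestamps descending so that newer versions come first.
func less(a, b versionedKey) bool {
	if a.key != b.key {
		return a.key < b.key
	}
	if a.ts == 0 || b.ts == 0 {
		return a.ts == 0 && b.ts != 0
	}
	return a.ts > b.ts
}

// sortKeys sorts keys in place using less.
func sortKeys(keys []versionedKey) {
	sort.Slice(keys, func(i, j int) bool { return less(keys[i], keys[j]) })
}

func main() {
	keys := []versionedKey{{"keyB", 0}, {"keyA", 1}, {"keyA", 3}, {"keyA", 0}, {"keyA", 2}}
	sortKeys(keys)
	for _, k := range keys {
		fmt.Printf("%s@%d\n", k.key, k.ts)
	}
}
```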

The binary encoding used on the MVCC keys allows arbitrary keys to be stored in the map (no restrictions on intermediate nul bytes, for example), while still sorting lexicographically and guaranteeing that all timestamp-suffixed MVCC version keys sort consecutively with the metadata key. We use an escape-based encoding which transforms all nul ("\x00") characters in the key and is terminated with the sequence "\x00\x01", which is guaranteed to not occur elsewhere in the encoded value. See util/encoding/encoding.go for more details.
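
A minimal sketch of such an escape-based encoding follows. The function names encodeBytes and decodeBytes are hypothetical, and the choice of "\x00\xff" as the escaped form of a nul byte is an assumption; only the "\x00\x01" terminator is taken from the text above.

```go
package main

import (
	"bytes"
	"fmt"
)

// encodeBytes (hypothetical) escapes every 0x00 as 0x00 0xff and appends the
// terminator 0x00 0x01. The 0xff escape byte is an assumption.
func encodeBytes(key []byte) []byte {
	out := make([]byte, 0, len(key)+2)
	for _, c := range key {
		out = append(out, c)
		if c == 0x00 {
			out = append(out, 0xff)
		}
	}
	return append(out, 0x00, 0x01)
}

// decodeBytes reverses encodeBytes and also returns whatever follows the
// terminator, for example a timestamp suffix.
func decodeBytes(enc []byte) (key, rest []byte, ok bool) {
	for i := 0; i < len(enc); i++ {
		if enc[i] != 0x00 {
			key = append(key, enc[i])
			continue
		}
		if i+1 >= len(enc) {
			return nil, nil, false
		}
		switch enc[i+1] {
		case 0x01: // terminator
			return key, enc[i+2:], true
		case 0xff: // escaped nul
			key = append(key, 0x00)
			i++
		default:
			return nil, nil, false
		}
	}
	return nil, nil, false
}

func main() {
	a := encodeBytes([]byte("a"))
	b := encodeBytes([]byte("a\x00b"))
	fmt.Printf("%q %q\n", a, b)
	fmt.Println(bytes.Compare(a, b) < 0)
}
```

Since the terminator cannot appear inside an encoded key, no encoded key is a prefix of another, so every suffixed version of one key sorts before the first entry of the next key.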

We considered inlining the most recent MVCC version in the MVCCMetadata. This would reduce the storage overhead of storing the same key twice (which is small due to block compression), and the runtime overhead of two separate DB lookups. On the other hand, all writes that create a new version of an existing key would incur a double write as the previous value is moved out of the MVCCMetadata into its versioned key. Preliminary benchmarks have not shown enough performance improvement to justify this change, although we may revisit this decision if it turns out that multiple versions of the same key are rare in practice.

However, we do allow inlining in order to use the MVCC interface to store non-versioned values. It turns out that not everything which Cockroach needs to store would be efficient or possible using MVCC. Examples include transaction records, abort span entries, stats counters, time series data, and system-local config values. However, supporting a mix of encodings is problematic in terms of resulting complexity. So Cockroach treats an MVCC timestamp of zero to mean an inlined, non-versioned value. These values are replaced if they exist on a Put operation and are cleared from the engine on a delete. Importantly, zero-timestamped MVCC values may be merged, as is necessary for stats counters and time series data.

Index

Package Files

array_64bit.go batch.go disk_map.go doc.go engine.go engine_key.go error.go file_util.go in_mem.go intent_interleaving_iter.go intent_reader_writer.go multi_iterator.go mvcc.go mvcc_incremental_iterator.go mvcc_logical_ops.go pebble.go pebble_batch.go pebble_file_registry.go pebble_iterator.go pebble_merge.go pebble_mvcc_scanner.go row_counter.go slice.go slice_go1.9.go sst_info.go sst_iterator.go sst_writer.go stacks.go temp_dir.go temp_engine.go version.go

Constants

const (
    // MVCCVersionTimestampSize is the size of the timestamp portion of MVCC
    // version keys (used to update stats).
    MVCCVersionTimestampSize int64 = 12
    // RecommendedMaxOpenFiles is the recommended value for RocksDB's
    // max_open_files option.
    RecommendedMaxOpenFiles = 10000
    // MinimumMaxOpenFiles is the minimum value that RocksDB's max_open_files
    // option can be set to. While this should be set as high as possible, the
    // minimum total for a single store node must be under 2048 for Windows
    // compatibility. See:
    // https://wpdev.uservoice.com/forums/266908-command-prompt-console-bash-on-ubuntu-on-windo/suggestions/17310124-add-ability-to-change-max-number-of-open-files-for
    MinimumMaxOpenFiles = 1700
)
const DisallowSeparatedIntents = true

DisallowSeparatedIntents is true when separated intents have never been allowed.

const (
    // MaxArrayLen is a safe maximum length for slices on this architecture.
    MaxArrayLen = 1<<50 - 1
)

Variables

var (
    // MVCCKeyMax is a maximum mvcc-encoded key value which sorts after
    // all other keys.
    MVCCKeyMax = MakeMVCCMetadataKey(roachpb.KeyMax)
    // NilKey is the nil MVCCKey.
    NilKey = MVCCKey{}
)
var DefaultStorageEngine enginepb.EngineType

DefaultStorageEngine represents the default storage engine to use.

var EngineComparer = &pebble.Comparer{
    Compare: EngineKeyCompare,

    AbbreviatedKey: func(k []byte) uint64 {
        key, ok := GetKeyPartFromEngineKey(k)
        if !ok {
            return 0
        }
        return pebble.DefaultComparer.AbbreviatedKey(key)
    },

    FormatKey: func(k []byte) fmt.Formatter {
        decoded, ok := DecodeEngineKey(k)
        if !ok {
            return mvccKeyFormatter{err: errors.Errorf("invalid encoded engine key: %x", k)}
        }
        if decoded.IsMVCCKey() {
            mvccKey, err := decoded.ToMVCCKey()
            if err != nil {
                return mvccKeyFormatter{err: err}
            }
            return mvccKeyFormatter{key: mvccKey}
        }
        return EngineKeyFormatter{key: decoded}
    },

    Separator: func(dst, a, b []byte) []byte {
        aKey, ok := GetKeyPartFromEngineKey(a)
        if !ok {
            return append(dst, a...)
        }
        bKey, ok := GetKeyPartFromEngineKey(b)
        if !ok {
            return append(dst, a...)
        }

        if bytes.Equal(aKey, bKey) {
            return append(dst, a...)
        }
        n := len(dst)

        dst = pebble.DefaultComparer.Separator(dst, aKey, bKey)

        buf := dst[n:]
        if bytes.Equal(aKey, buf) {
            return append(dst[:n], a...)
        }

        return append(dst, 0)
    },

    Successor: func(dst, a []byte) []byte {
        aKey, ok := GetKeyPartFromEngineKey(a)
        if !ok {
            return append(dst, a...)
        }
        n := len(dst)

        dst = pebble.DefaultComparer.Successor(dst, aKey)

        buf := dst[n:]
        if bytes.Equal(aKey, buf) {
            return append(dst[:n], a...)
        }

        return append(dst, 0)
    },

    Split: func(k []byte) int {
        key, ok := GetKeyPartFromEngineKey(k)
        if !ok {
            return len(k)
        }

        return len(key) + 1
    },

    Name: "cockroach_comparator",
}

EngineComparer is a pebble.Comparer object that implements MVCC-specific comparator settings for use with Pebble.

var MVCCMerger = &pebble.Merger{
    Name: "cockroach_merge_operator",
    Merge: func(_, value []byte) (pebble.ValueMerger, error) {
        res := &MVCCValueMerger{}
        err := res.MergeNewer(value)
        if err != nil {
            return nil, err
        }
        return res, nil
    },
}

MVCCMerger is a pebble.Merger object that implements the merge operator used by Cockroach.

var MaxSyncDuration = settings.RegisterDurationSetting(
    "storage.max_sync_duration",
    "maximum duration for disk operations; any operations that take longer"+
        " than this setting trigger a warning log entry or process crash",
    maxSyncDurationDefault,
)

MaxSyncDuration is the threshold above which an observed engine sync duration triggers either a warning or a fatal error.

var MaxSyncDurationFatalOnExceeded = settings.RegisterBoolSetting(
    "storage.max_sync_duration.fatal.enabled",
    "if true, fatal the process when a disk operation exceeds storage.max_sync_duration",
    maxSyncDurationFatalOnExceededDefault,
)

MaxSyncDurationFatalOnExceeded governs whether disk stalls longer than MaxSyncDuration fatal the Cockroach process. Defaults to true.

var NewEncryptedEnvFunc func(fs vfs.FS, fr *PebbleFileRegistry, dbDir string, readOnly bool, optionBytes []byte) (vfs.FS, EncryptionStatsHandler, error)

NewEncryptedEnvFunc creates an encrypted environment and returns the vfs.FS to use for reading and writing data. This should be initialized by calling engineccl.Init() before calling NewPebble(). The optionBytes is a binary serialized baseccl.EncryptionOptions, so that non-CCL code does not depend on CCL code.

var PebbleTablePropertyCollectors = []func() pebble.TablePropertyCollector{
    func() pebble.TablePropertyCollector { return &pebbleTimeBoundPropCollector{} },
    func() pebble.TablePropertyCollector { return &pebbleDeleteRangeCollector{} },
}

PebbleTablePropertyCollectors is the list of Pebble TablePropertyCollectors.

func CleanupTempDirs

func CleanupTempDirs(recordPath string) error

CleanupTempDirs removes all directories listed in the record file specified by recordPath. It should be invoked before creating any new temporary directories to clean up abandoned temporary directories. It should also be invoked when a newly created temporary directory is no longer needed and needs to be removed from the record file.

func ClearRangeWithHeuristic

func ClearRangeWithHeuristic(reader Reader, writer Writer, start, end roachpb.Key) error

ClearRangeWithHeuristic clears the keys from start (inclusive) to end (exclusive). Depending on the number of keys, it will either use ClearRawRange or clear individual keys. It works with EngineKeys, so don't expect it to find and clear separated intents if [start, end) refers to MVCC key space.
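
The choice between one range clear and many point clears can be sketched as follows. The planClear function and the clearThreshold constant are hypothetical, as is the threshold value; the sketch works on a sorted in-memory slice of string keys rather than on a Reader and Writer.

```go
package main

import (
	"fmt"
	"sort"
)

// clearThreshold is a hypothetical cutoff; the real value is not stated in
// this documentation.
const clearThreshold = 3

// planClear returns the operations a heuristic clear might issue for the
// sorted keys in [start, end): a single range clear when more than
// clearThreshold keys are present, otherwise one point clear per key.
func planClear(sortedKeys []string, start, end string) []string {
	lo := sort.SearchStrings(sortedKeys, start) // first key >= start
	hi := sort.SearchStrings(sortedKeys, end)   // first key >= end
	if hi < lo {
		hi = lo // empty or inverted span
	}
	if hi-lo > clearThreshold {
		return []string{fmt.Sprintf("ClearRange[%s,%s)", start, end)}
	}
	ops := make([]string, 0, hi-lo)
	for _, k := range sortedKeys[lo:hi] {
		ops = append(ops, "Clear("+k+")")
	}
	return ops
}

func main() {
	keys := []string{"a", "b", "c", "d", "e"}
	fmt.Println(planClear(keys, "a", "c"))
	fmt.Println(planClear(keys, "a", "z"))
}
```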

func ComputeStatsForRange

func ComputeStatsForRange(
    iter SimpleMVCCIterator,
    start, end roachpb.Key,
    nowNanos int64,
    callbacks ...func(MVCCKey, []byte) error,
) (enginepb.MVCCStats, error)

ComputeStatsForRange scans the underlying engine from start to end keys and computes stats counters based on the values. This method is used after a range is split to recompute stats for each subrange. The start key is always adjusted to avoid counting local keys in the event stats are being recomputed for the first range (i.e. the one with start key == KeyMin). The nowNanos arg specifies the wall time in nanoseconds since the epoch and is used to compute the total age of all intents.

When optional callbacks are specified, they are invoked for each physical key-value pair (i.e. not for implicit meta records), and iteration is aborted on the first error returned from any of them.

Callbacks must copy any data they intend to hold on to.

func CreateTempDir

func CreateTempDir(parentDir, prefix string, stopper *stop.Stopper) (string, error)

CreateTempDir creates a temporary directory with a prefix under the given parentDir and returns the absolute path of the temporary directory. It is advised to invoke CleanupTempDirs before creating new temporary directories in cases where the disk is completely full.

func DefaultPebbleOptions

func DefaultPebbleOptions() *pebble.Options

DefaultPebbleOptions returns the default pebble options.

func EncodeKey

func EncodeKey(key MVCCKey) []byte

EncodeKey encodes an engine.MVCC key into the RocksDB representation.

func EncodeKeyToBuf

func EncodeKeyToBuf(buf []byte, key MVCCKey) []byte

EncodeKeyToBuf encodes an engine.MVCC key into the RocksDB representation.

func EngineKeyCompare

func EngineKeyCompare(a, b []byte) int

EngineKeyCompare compares cockroach keys, including the version (which could be MVCC timestamps).

func GetKeyPartFromEngineKey

func GetKeyPartFromEngineKey(engineKey []byte) (key []byte, ok bool)

GetKeyPartFromEngineKey is a specialization of DecodeEngineKey which avoids constructing a slice for the version part of the key, since the caller does not need it.

func InitPebbleLogger

func InitPebbleLogger(ctx context.Context) *log.SecondaryLogger

InitPebbleLogger initializes the logger to use for Pebble log messages. If not called, WARNING, ERROR, and FATAL logs will be output to the normal CockroachDB log. The caller is responsible for ensuring the Close() method is eventually called on the new logger.

func IsValidSplitKey

func IsValidSplitKey(key roachpb.Key) bool

IsValidSplitKey returns whether the key is a valid split key. Adapter for the method above, for use from other packages.

func MVCCBlindConditionalPut

func MVCCBlindConditionalPut(
    ctx context.Context,
    writer Writer,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    expVal []byte,
    allowIfDoesNotExist CPutMissingBehavior,
    txn *roachpb.Transaction,
) error

MVCCBlindConditionalPut is a fast-path of MVCCConditionalPut. See the MVCCConditionalPut comments for details of the semantics. MVCCBlindConditionalPut skips retrieving the existing metadata for the key, requiring the caller to guarantee no versions for the key currently exist.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCBlindInitPut

func MVCCBlindInitPut(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    failOnTombstones bool,
    txn *roachpb.Transaction,
) error

MVCCBlindInitPut is a fast-path of MVCCInitPut. See the MVCCInitPut comments for details of the semantics. MVCCBlindInitPut skips retrieving the existing metadata for the key, requiring the caller to guarantee no versions for the key currently exist.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCBlindPut

func MVCCBlindPut(
    ctx context.Context,
    writer Writer,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    txn *roachpb.Transaction,
) error

MVCCBlindPut is a fast-path of MVCCPut. See the MVCCPut comments for details of the semantics. MVCCBlindPut skips retrieving the existing metadata for the key, requiring the caller to guarantee no versions for the key currently exist in order for stats to be updated properly. If a previous version of the key does exist, it is up to the caller to properly account for its existence in updating the stats.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCBlindPutProto

func MVCCBlindPutProto(
    ctx context.Context,
    writer Writer,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    msg protoutil.Message,
    txn *roachpb.Transaction,
) error

MVCCBlindPutProto sets the given key to the protobuf-serialized byte string of msg and the provided timestamp. See MVCCBlindPut for a discussion on this fast-path and when it is appropriate to use.

func MVCCClearTimeRange

func MVCCClearTimeRange(
    _ context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key, endKey roachpb.Key,
    startTime, endTime hlc.Timestamp,
    maxBatchSize int64,
) (*roachpb.Span, error)

MVCCClearTimeRange clears all MVCC versions within the span [key, endKey) which have timestamps in the span (startTime, endTime]. This can have the apparent effect of "reverting" the range to startTime if all of the older revisions of cleared keys are still available (i.e. have not been GC'ed).

Long runs of keys that all qualify for clearing will be cleared via a single clear-range operation. Once maxBatchSize Clear and ClearRange operations are hit during iteration, the next matching key is instead returned in the resumeSpan. It is possible to exceed maxBatchSize by up to the size of the buffer of keys selected for deletion but not yet flushed (as done to detect long runs for cleaning in a single ClearRange).

This function does not handle the stats computations to determine the correct incremental deltas of clearing these keys (and correctly determining whether it does or does not change the live and gc keys), so the caller is responsible for recomputing stats over the resulting span if needed.
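
The half-open bounds are easy to get wrong, so here is a small in-memory sketch of the selection rule: keys in [key, endKey) and timestamps in (startTime, endTime]. The version type and clearTimeRange function are hypothetical, timestamps are plain integers, and batching, resume spans and stats are left out.

```go
package main

import "fmt"

// version is a hypothetical in-memory MVCC version.
type version struct {
	key   string
	ts    int64
	value string
}

// clearTimeRange drops versions whose key lies in [key, endKey) and whose
// timestamp lies in (startTime, endTime]. Survivors keep their order.
func clearTimeRange(vs []version, key, endKey string, startTime, endTime int64) []version {
	kept := make([]version, 0, len(vs))
	for _, v := range vs {
		inSpan := v.key >= key && v.key < endKey
		inTime := v.ts > startTime && v.ts <= endTime
		if inSpan && inTime {
			continue // cleared
		}
		kept = append(kept, v)
	}
	return kept
}

func main() {
	vs := []version{
		{"a", 30, "a3"}, {"a", 20, "a2"}, {"a", 10, "a1"},
		{"b", 25, "b1"}, {"z", 25, "z1"},
	}
	for _, v := range clearTimeRange(vs, "a", "c", 10, 30) {
		fmt.Println(v.key, v.ts, v.value)
	}
}
```

In this example key "a" keeps only its version at timestamp 10, which is the "reverting to startTime" effect described above, while "z" is untouched because it lies outside the key span.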

func MVCCConditionalPut

func MVCCConditionalPut(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    expVal []byte,
    allowIfDoesNotExist CPutMissingBehavior,
    txn *roachpb.Transaction,
) error

MVCCConditionalPut sets the value for a specified key only if the expected value matches. If not, it returns a ConditionFailedError containing the actual value.

The condition check reads a value from the key using the same operational timestamp as we use to write a value.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

An empty expVal means that the key is expected to not exist. If not empty, expVal needs to correspond to a Value.TagAndDataBytes(), i.e. a key's value without the checksum (as the checksum includes the key too).
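
The compare-then-write rule can be sketched on a plain map. Everything here is hypothetical: the conditionalPut function, the conditionFailedError type, and the reading of allowIfDoesNotExist as a boolean that lets a write through when the key is missing even though expVal is non-empty. Timestamps, transactions and stats are left out.

```go
package main

import (
	"bytes"
	"fmt"
)

// conditionFailedError is a hypothetical stand-in for ConditionFailedError.
// actual is nil when the key does not exist.
type conditionFailedError struct {
	actual []byte
}

func (e *conditionFailedError) Error() string {
	return fmt.Sprintf("unexpected value: %q", e.actual)
}

// conditionalPut writes value only if the stored value matches expVal. An
// empty expVal means the key is expected to be absent.
func conditionalPut(
	store map[string][]byte, key string, value, expVal []byte, allowIfDoesNotExist bool,
) error {
	actual, exists := store[key]
	switch {
	case !exists:
		if len(expVal) != 0 && !allowIfDoesNotExist {
			return &conditionFailedError{}
		}
	case len(expVal) == 0 || !bytes.Equal(actual, expVal):
		return &conditionFailedError{actual: actual}
	}
	store[key] = value
	return nil
}

func main() {
	s := map[string][]byte{}
	fmt.Println(conditionalPut(s, "k", []byte("v1"), nil, false))
	fmt.Println(conditionalPut(s, "k", []byte("v2"), []byte("stale"), false))
	fmt.Println(conditionalPut(s, "k", []byte("v2"), []byte("v1"), false))
}
```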

func MVCCDelete

func MVCCDelete(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    txn *roachpb.Transaction,
) error

MVCCDelete marks the key deleted so that it will not be returned in future get responses.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCDeleteRange

func MVCCDeleteRange(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key, endKey roachpb.Key,
    max int64,
    timestamp hlc.Timestamp,
    txn *roachpb.Transaction,
    returnKeys bool,
) ([]roachpb.Key, *roachpb.Span, int64, error)

MVCCDeleteRange deletes the range of key/value pairs specified by start and end keys. It returns the range of keys deleted when returnKeys is set, the next span to resume from, and the number of keys deleted. The returned resume span is nil if max keys aren't processed. The choice max=0 disables the limit.

func MVCCFindSplitKey

func MVCCFindSplitKey(
    _ context.Context, reader Reader, key, endKey roachpb.RKey, targetSize int64,
) (roachpb.Key, error)

MVCCFindSplitKey finds a key from the given span such that the left side of the split is roughly targetSize bytes. The returned key will never be chosen from the key ranges listed in keys.NoSplitSpans.

func MVCCGarbageCollect

func MVCCGarbageCollect(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    keys []roachpb.GCRequest_GCKey,
    timestamp hlc.Timestamp,
) error

MVCCGarbageCollect creates an iterator on the ReadWriter. In parallel it iterates through the keys listed for garbage collection by the keys slice. The iterator is seeked in turn to each listed key, clearing all values with timestamps less than or equal to the expiration. The timestamp parameter is used to compute the intent age on GC.

Note that this method will be sorting the keys.

func MVCCGet

func MVCCGet(
    ctx context.Context, reader Reader, key roachpb.Key, timestamp hlc.Timestamp, opts MVCCGetOptions,
) (*roachpb.Value, *roachpb.Intent, error)

MVCCGet returns the most recent value for the specified key whose timestamp is less than or equal to the supplied timestamp. If no such value exists, nil is returned instead.

In tombstones mode, if the most recent value is a deletion tombstone, the result will be a non-nil roachpb.Value whose RawBytes field is nil. Otherwise, a deletion tombstone results in a nil roachpb.Value.

In inconsistent mode, if an intent is encountered, it will be placed in the dedicated return parameter. By contrast, in consistent mode, an intent will generate a WriteIntentError with the intent embedded within, and the intent result parameter will be nil.

Note that transactional gets must be consistent. Put another way, only non-transactional gets may be inconsistent.

If the timestamp is specified as hlc.Timestamp{}, the value is expected to be "inlined". See MVCCPut().

When reading in "fail on more recent" mode, a WriteTooOldError will be returned if the read observes a version with a timestamp above the read timestamp. Similarly, a WriteIntentError will be returned if the read observes another transaction's intent, even if it has a timestamp above the read timestamp.
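
The version selection and tombstone rules can be sketched for a single key as follows. The version type and get function are hypothetical, timestamps are plain integers, and intents, inconsistent reads and the "fail on more recent" mode are not modeled.

```go
package main

import "fmt"

// version is a hypothetical MVCC version of one key. deleted marks a
// deletion tombstone.
type version struct {
	ts      int64
	value   string
	deleted bool
}

// get returns the newest version with ts <= readTS from a slice sorted
// newest first. With tombstones == false a tombstone is reported as "not
// found"; with tombstones == true the tombstone itself is returned.
func get(versions []version, readTS int64, tombstones bool) (version, bool) {
	for _, v := range versions {
		if v.ts > readTS {
			continue // too new for this read
		}
		if v.deleted && !tombstones {
			return version{}, false
		}
		return v, true
	}
	return version{}, false
}

func main() {
	versions := []version{{30, "", true}, {20, "v2", false}, {10, "v1", false}}
	fmt.Println(get(versions, 25, false))
	fmt.Println(get(versions, 35, false))
	fmt.Println(get(versions, 35, true))
}
```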

func MVCCGetAsTxn

func MVCCGetAsTxn(
    ctx context.Context,
    reader Reader,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    txnMeta enginepb.TxnMeta,
) (*roachpb.Value, *roachpb.Intent, error)

MVCCGetAsTxn constructs a temporary transaction from the given transaction metadata and calls MVCCGet as that transaction. This method is required only for reading intents of a transaction when only its metadata is known and should rarely be used.

The read is carried out without the chance of uncertainty restarts.

func MVCCGetProto

func MVCCGetProto(
    ctx context.Context,
    reader Reader,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    msg protoutil.Message,
    opts MVCCGetOptions,
) (bool, error)

MVCCGetProto fetches the value at the specified key and unmarshals it into msg if msg is non-nil. Returns true on success or false if the key was not found.

See the documentation for MVCCGet for the semantics of the MVCCGetOptions.

func MVCCIncrement

func MVCCIncrement(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    txn *roachpb.Transaction,
    inc int64,
) (int64, error)

MVCCIncrement fetches the value for key, and assuming the value is an "integer" type, increments it by inc and stores the new value. The newly incremented value is returned.

An initial value is read from the key using the same operational timestamp as we use to write a value.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCInitPut

func MVCCInitPut(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    failOnTombstones bool,
    txn *roachpb.Transaction,
) error

MVCCInitPut sets the value for a specified key if the key doesn't exist. It returns a ConditionFailedError when the write fails or if the key exists with an existing value that is different from the supplied value. If failOnTombstones is set to true, tombstones count as mismatched values and will cause a ConditionFailedError.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCIterate

func MVCCIterate(
    ctx context.Context,
    reader Reader,
    key, endKey roachpb.Key,
    timestamp hlc.Timestamp,
    opts MVCCScanOptions,
    f func(roachpb.KeyValue) error,
) ([]roachpb.Intent, error)

MVCCIterate iterates over the key range [key, endKey). At each step of the iteration, f() is invoked with the current key/value pair. If f returns an error, the iteration stops and the error is propagated. If the reverse flag is set, the iterator will be moved in reverse order. If the scan options specify an inconsistent scan, all "ignored" intents will be returned. In consistent mode, intents are only ever returned as part of a WriteIntentError.

func MVCCMerge

func MVCCMerge(
    _ context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
) error

MVCCMerge implements a merge operation. Merge adds integer values, concatenates undifferentiated byte slice values, and efficiently combines time series observations if the roachpb.Value tag value indicates the value byte slice is of type TIMESERIES.
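
A rough sketch of the first two merge rules (adding integers and concatenating byte slices) is shown below. The value type, the merge function and the errTypeMismatch error are hypothetical, rejecting mixed types is an assumption, and time series merging is not covered.

```go
package main

import (
	"errors"
	"fmt"
)

// value is a hypothetical tagged value: either an integer or raw bytes.
type value struct {
	isInt bool
	n     int64
	raw   []byte
}

// errTypeMismatch is an assumed error for merging values of different kinds.
var errTypeMismatch = errors.New("cannot merge values of different types")

// merge adds integers and concatenates byte slices, leaving its inputs
// unmodified.
func merge(existing, update value) (value, error) {
	if existing.isInt != update.isInt {
		return value{}, errTypeMismatch
	}
	if existing.isInt {
		return value{isInt: true, n: existing.n + update.n}, nil
	}
	raw := append(append([]byte{}, existing.raw...), update.raw...)
	return value{raw: raw}, nil
}

func main() {
	sum, _ := merge(value{isInt: true, n: 3}, value{isInt: true, n: 4})
	cat, _ := merge(value{raw: []byte("ab")}, value{raw: []byte("cd")})
	fmt.Println(sum.n, string(cat.raw))
}
```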

func MVCCPut

func MVCCPut(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    txn *roachpb.Transaction,
) error

MVCCPut sets the value for a specified key. It will save the value with different versions according to its timestamp and update the key metadata. The timestamp must be passed as a parameter; using the Timestamp field on the value results in an error.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

If the timestamp is specified as hlc.Timestamp{}, the value is inlined instead of being written as a timestamp-versioned value. A zero timestamp write to a key precludes a subsequent write using a non-zero timestamp and vice versa. Inlined values require only a single row and never accumulate more than a single value. Successive zero timestamp writes to a key replace the value and deletes clear the value. In addition, zero timestamp values may be merged.
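
The inline-versus-versioned rule can be sketched with a toy store in which a timestamp of 0 selects the inline path. The entry and store types, the put and deleteInline methods and the errMixed error are all hypothetical; the real functions take an hlc.Timestamp and report their own errors.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical per-key record: either one inline value or a set
// of timestamped versions, never both.
type entry struct {
	inline      bool
	inlineValue string
	versions    map[int64]string
}

type store map[string]*entry

var errMixed = errors.New("inline and versioned writes cannot be mixed on one key")

// put treats ts == 0 as an inline write that replaces any earlier inline
// value; any other ts adds a version. Mixing the two on a key is rejected.
func (s store) put(key string, ts int64, value string) error {
	e, ok := s[key]
	if !ok {
		e = &entry{inline: ts == 0, versions: map[int64]string{}}
		s[key] = e
	}
	if e.inline != (ts == 0) {
		return errMixed
	}
	if e.inline {
		e.inlineValue = value
		return nil
	}
	e.versions[ts] = value
	return nil
}

// deleteInline clears an inline value outright; no tombstone is kept.
func (s store) deleteInline(key string) {
	if e, ok := s[key]; ok && e.inline {
		delete(s, key)
	}
}

func main() {
	s := store{}
	fmt.Println(s.put("counter", 0, "1"), s.put("counter", 0, "2"))
	fmt.Println(s["counter"].inlineValue)
	fmt.Println(s.put("counter", 5, "3"))
}
```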

func MVCCPutProto

func MVCCPutProto(
    ctx context.Context,
    rw ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    txn *roachpb.Transaction,
    msg protoutil.Message,
) error

MVCCPutProto sets the given key to the protobuf-serialized byte string of msg and the provided timestamp.

func MVCCResolveWriteIntent

func MVCCResolveWriteIntent(
    ctx context.Context, rw ReadWriter, ms *enginepb.MVCCStats, intent roachpb.LockUpdate,
) (bool, error)

MVCCResolveWriteIntent either commits or aborts (rolls back) an extant write intent for a given txn according to commit parameter. ResolveWriteIntent will skip write intents of other txns. It returns whether or not an intent was found to resolve.

Transaction epochs deserve a bit of explanation. The epoch for a transaction is incremented on transaction retries. A transaction retry is different from an abort. Retries can occur in SSI transactions when the commit timestamp is not equal to the proposed transaction timestamp. On a retry, the epoch is incremented instead of creating an entirely new transaction. This allows the intents that were written on previous runs to serve as locks which prevent concurrent reads from further incrementing the timestamp cache, making further transaction retries less likely.

Because successive retries of a transaction may end up writing to different keys, the epochs serve to classify which intents get committed in the event the transaction succeeds (all those with epoch matching the commit epoch), and which intents get aborted, even if the transaction succeeds.
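
The epoch rule for a committed transaction can be sketched as a small classifier. The intent type and resolve function are hypothetical and model only the three outcomes named above: intents of other transactions are skipped, intents at the commit epoch are committed, and intents from other epochs of the same transaction are aborted.

```go
package main

import "fmt"

// intent is a hypothetical write intent tagged with its txn and epoch.
type intent struct {
	key   string
	txnID string
	epoch int
}

// Possible outcomes of resolving one intent for a committed txn.
const (
	skipped   = "skipped"   // belongs to another txn
	committed = "committed" // epoch matches the commit epoch
	aborted   = "aborted"   // written by a different epoch of the same txn
)

// resolve classifies a single intent against the committing txn.
func resolve(in intent, txnID string, commitEpoch int) string {
	switch {
	case in.txnID != txnID:
		return skipped
	case in.epoch == commitEpoch:
		return committed
	default:
		return aborted
	}
}

func main() {
	intents := []intent{{"a", "txn1", 0}, {"b", "txn1", 1}, {"c", "txn2", 1}}
	for _, in := range intents {
		fmt.Println(in.key, resolve(in, "txn1", 1))
	}
}
```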

TODO(tschottdorf): encountered a bug in which a Txn committed with its original timestamp after laying down intents at higher timestamps. Doesn't look like this code here caught that. Shouldn't resolve intents when they're not at the timestamp the Txn mandates them to be.

func MVCCResolveWriteIntentRange

func MVCCResolveWriteIntentRange(
    ctx context.Context, rw ReadWriter, ms *enginepb.MVCCStats, intent roachpb.LockUpdate, max int64,
) (int64, *roachpb.Span, error)

MVCCResolveWriteIntentRange commits or aborts (rolls back) the range of write intents specified by start and end keys for a given txn. ResolveWriteIntentRange will skip write intents of other txns. Returns the number of intents resolved and a resume span if the max keys limit was exceeded.

func MVCCResolveWriteIntentRangeUsingIter

func MVCCResolveWriteIntentRangeUsingIter(
    ctx context.Context,
    rw ReadWriter,
    iterAndBuf IterAndBuf,
    ms *enginepb.MVCCStats,
    intent roachpb.LockUpdate,
    max int64,
) (int64, *roachpb.Span, error)

MVCCResolveWriteIntentRangeUsingIter commits or aborts (rolls back) the range of write intents specified by start and end keys for a given txn. ResolveWriteIntentRange will skip write intents of other txns.

Returns the number of intents resolved and a resume span if the max keys limit was exceeded. A max of zero means unbounded. A max of -1 means resolve nothing and return the entire intent span as the resume span.
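
The three cases for max (zero, negative, positive) can be sketched as follows. The span type and resolveRange function are hypothetical; the sketch only counts sorted intent keys in the span, and starting the resume span at the first unresolved intent key is an assumption.

```go
package main

import "fmt"

// span is a hypothetical half-open key span [key, endKey).
type span struct {
	key, endKey string
}

// resolveRange counts the intents it would resolve among sorted intentKeys
// inside sp. max == 0 means unbounded; max < 0 resolves nothing and returns
// the whole span; otherwise at most max intents are resolved and the resume
// span starts at the first unresolved intent key.
func resolveRange(intentKeys []string, sp span, max int64) (int64, *span) {
	if max < 0 {
		return 0, &span{sp.key, sp.endKey}
	}
	var n int64
	for _, k := range intentKeys {
		if k < sp.key || k >= sp.endKey {
			continue // outside the span
		}
		if max > 0 && n == max {
			return n, &span{k, sp.endKey}
		}
		n++
	}
	return n, nil
}

func main() {
	keys := []string{"a", "b", "c", "d"}
	sp := span{"a", "z"}
	for _, max := range []int64{0, 2, -1} {
		n, resume := resolveRange(keys, sp, max)
		fmt.Println(max, n, resume)
	}
}
```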

func MVCCResolveWriteIntentUsingIter

func MVCCResolveWriteIntentUsingIter(
    ctx context.Context,
    rw ReadWriter,
    iterAndBuf IterAndBuf,
    ms *enginepb.MVCCStats,
    intent roachpb.LockUpdate,
) (bool, error)

MVCCResolveWriteIntentUsingIter is a variant of MVCCResolveWriteIntent that uses iterator and buffer passed as parameters (e.g. when used in a loop).

func MVCCScanDecodeKeyValues

func MVCCScanDecodeKeyValues(repr [][]byte, fn func(key MVCCKey, rawBytes []byte) error) error

MVCCScanDecodeKeyValues decodes all key/value pairs returned in one or more MVCCScan "batches" (this is not the RocksDB batch repr format). The provided function is called for each key/value pair.

func MakeValue

func MakeValue(meta enginepb.MVCCMetadata) roachpb.Value

MakeValue returns the inline value.

func MergeInternalTimeSeriesData

func MergeInternalTimeSeriesData(
    usePartialMerge bool, sources ...roachpb.InternalTimeSeriesData,
) (roachpb.InternalTimeSeriesData, error)

MergeInternalTimeSeriesData exports the engine's MVCC merge logic for InternalTimeSeriesData to higher level packages. This is intended primarily for consumption by high level testing of time series functionality. If usePartialMerge is true, the operands are merged together using a partial merge operation first, and are then merged in to the initial state.

func NewPebbleTempEngine Uses

func NewPebbleTempEngine(
    ctx context.Context, tempStorage base.TempStorageConfig, storeSpec base.StoreSpec,
) (diskmap.Factory, fs.FS, error)

NewPebbleTempEngine creates a new Pebble engine for DistSQL processors to use when the working set is larger than can be stored in memory.

func NewTempEngine Uses

func NewTempEngine(
    ctx context.Context, tempStorage base.TempStorageConfig, storeSpec base.StoreSpec,
) (diskmap.Factory, fs.FS, error)

NewTempEngine creates a new engine for DistSQL processors to use when the working set is larger than can be stored in memory.

func PutProto Uses

func PutProto(
    writer Writer, key roachpb.Key, msg protoutil.Message,
) (keyBytes, valBytes int64, err error)

PutProto sets the given key to the protobuf-serialized byte string of msg. Returns the lengths in bytes of the key and the value.

Deprecated: use MVCCPutProto instead.

func RecordTempDir Uses

func RecordTempDir(recordPath, tempPath string) error

RecordTempDir records tempPath to the record file specified by recordPath to facilitate cleanup of the temporary directory on subsequent startups.

func ResolveEncryptedEnvOptions Uses

func ResolveEncryptedEnvOptions(
    cfg *PebbleConfig,
) (*PebbleFileRegistry, EncryptionStatsHandler, error)

ResolveEncryptedEnvOptions fills in cfg.Opts.FS with an encrypted vfs if this store has encryption-at-rest enabled. Also returns the associated file registry and EncryptionStatsHandler.

func RocksDBBatchCount Uses

func RocksDBBatchCount(repr []byte) (int, error)

RocksDBBatchCount provides an efficient way to get the count of mutations in a RocksDB Batch representation.

func SafeWriteToFile Uses

func SafeWriteToFile(fs vfs.FS, dir string, filename string, b []byte) error

SafeWriteToFile writes the byte slice to the filename, contained in dir, using the given fs. It returns after both the file and the containing directory are synced.

func ThreadStacks Uses

func ThreadStacks() string

ThreadStacks returns the stacks for all threads. The stacks are raw addresses, and do not contain symbols. Use addr2line (or atos on Darwin) to symbolize.

func WriteSyncNoop Uses

func WriteSyncNoop(ctx context.Context, eng Engine) error

WriteSyncNoop carries out a synchronous no-op write to the engine.

type Batch Uses

type Batch interface {
    ReadWriter
    // Commit atomically applies any batched updates to the underlying
    // engine. This is a noop unless the batch was created via NewBatch(). If
    // sync is true, the batch is synchronously committed to disk.
    Commit(sync bool) error
    // Distinct returns a view of the existing batch which only sees writes that
    // were performed before the Distinct batch was created. That is, the
    // returned batch will not read its own writes, but it will read writes to
    // the parent batch performed before the call to Distinct(), except if the
    // parent batch is a WriteOnlyBatch, in which case the Distinct() batch will
    // read from the underlying engine.
    //
    // The returned
    // batch needs to be closed before using the parent batch again. This is used
    // as an optimization to avoid flushing mutations buffered by the batch in
    // situations where we know all of the batched operations are for distinct
    // keys.
    //
    // TODO(tbg): it seems insane that you cannot read from a WriteOnlyBatch but
    // you can read from a Distinct on top of a WriteOnlyBatch but randomly don't
    // see the batch at all. I was personally just bitten by this.
    //
    // TODO(itsbilal): Improve comments around how/why distinct batches are an
    // optimization in the rocksdb write path.
    Distinct() ReadWriter
    // Empty returns whether the batch has been written to or not.
    Empty() bool
    // Len returns the size of the underlying representation of the batch.
    // Because of the batch header, the size of the batch is never 0 and should
    // not be used interchangeably with Empty. The method avoids the memory copy
    // that Repr imposes, but it still may require flushing the batch's mutations.
    Len() int
    // Repr returns the underlying representation of the batch and can be used to
    // reconstitute the batch on a remote node using Writer.ApplyBatchRepr().
    Repr() []byte
}

Batch is the interface for batch specific operations.

type BatchType Uses

type BatchType byte

BatchType represents the type of an entry in an encoded RocksDB batch.

const (
    BatchTypeDeletion BatchType = 0x0
    BatchTypeValue    BatchType = 0x1
    BatchTypeMerge    BatchType = 0x2
    BatchTypeLogData  BatchType = 0x3
    // BatchTypeColumnFamilyDeletion       BatchType = 0x4
    // BatchTypeColumnFamilyValue          BatchType = 0x5
    // BatchTypeColumnFamilyMerge          BatchType = 0x6
    BatchTypeSingleDeletion BatchType = 0x7
    // BatchTypeColumnFamilySingleDeletion BatchType = 0x8
    // BatchTypeBeginPrepareXID            BatchType = 0x9
    // BatchTypeEndPrepareXID              BatchType = 0xA
    // BatchTypeCommitXID                  BatchType = 0xB
    // BatchTypeRollbackXID                BatchType = 0xC
    // BatchTypeNoop                       BatchType = 0xD
    // BatchTypeColumnFamilyRangeDeletion  BatchType = 0xE
    BatchTypeRangeDeletion BatchType = 0xF
)

These constants come from rocksdb/db/dbformat.h.

type CPutMissingBehavior Uses

type CPutMissingBehavior bool

CPutMissingBehavior describes the handling of a non-existing expected value.

const (
    // CPutAllowIfMissing is used to indicate a CPut can also succeed when the
    // expected entry does not exist.
    CPutAllowIfMissing CPutMissingBehavior = true
    // CPutFailIfMissing is used to indicate the existing value must match the
    // expected value exactly i.e. if a value is expected, it must exist.
    CPutFailIfMissing CPutMissingBehavior = false
)
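To make the two behaviors concrete, here is a small sketch of a conditional put over an in-memory map. The `cput` function, the map-backed store, and the lower-case type and constant names are hypothetical and exist only for illustration; the real conditional put operates on MVCC values.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

type cputMissingBehavior bool

const (
	allowIfMissing cputMissingBehavior = true
	failIfMissing  cputMissingBehavior = false
)

var errConditionFailed = errors.New("condition failed")

// cput writes value under key only if the existing entry matches expected.
// A nil expected means "the key must not exist".
func cput(store map[string][]byte, key string, value, expected []byte, b cputMissingBehavior) error {
	existing, ok := store[key]
	switch {
	case !ok && expected == nil:
		// Nothing expected and nothing present.
	case !ok && bool(b):
		// A value was expected, but a missing entry is tolerated.
	case ok && expected != nil && bytes.Equal(existing, expected):
		// Existing value matches the expectation.
	default:
		return errConditionFailed
	}
	store[key] = value
	return nil
}

func main() {
	store := map[string][]byte{}
	fmt.Println(cput(store, "k", []byte("v1"), []byte("old"), failIfMissing))
	fmt.Println(cput(store, "k", []byte("v1"), []byte("old"), allowIfMissing))
}
```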

type EncryptionRegistries Uses

type EncryptionRegistries struct {
    // FileRegistry is the list of files with encryption status.
    // serialized storage/engine/enginepb/file_registry.proto::FileRegistry
    FileRegistry []byte
    // KeyRegistry is the list of keys, scrubbed of actual key data.
    // serialized ccl/storageccl/engineccl/enginepbccl/key_registry.proto::DataKeysRegistry
    KeyRegistry []byte
}

EncryptionRegistries contains the encryption-related registries. Both are serialized protobufs.

type EncryptionStatsHandler Uses

type EncryptionStatsHandler interface {
    // Returns a serialized enginepbccl.EncryptionStatus.
    GetEncryptionStatus() ([]byte, error)
    // Returns a serialized enginepbccl.DataKeysRegistry, scrubbed of key contents.
    GetDataKeysRegistry() ([]byte, error)
    // Returns the ID of the active data key, or "plain" if none.
    GetActiveDataKeyID() (string, error)
    // Returns the enum value of the encryption type.
    GetActiveStoreKeyType() int32
    // Returns the KeyID embedded in the serialized EncryptionSettings.
    GetKeyIDFromSettings(settings []byte) (string, error)
}

EncryptionStatsHandler provides encryption related stats.

type Engine Uses

type Engine interface {
    ReadWriter
    // Attrs returns the engine/store attributes.
    Attrs() roachpb.Attributes
    // Capacity returns capacity details for the engine's available storage.
    Capacity() (roachpb.StoreCapacity, error)
    // Compact forces compaction over the entire database.
    Compact() error
    // Flush causes the engine to write all in-memory data to disk
    // immediately.
    Flush() error
    // GetCompactionStats returns the internal RocksDB compaction stats. See
    // https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#rocksdb-statistics.
    GetCompactionStats() string
    // GetMetrics retrieves metrics from the engine.
    GetMetrics() (*Metrics, error)
    // GetEncryptionRegistries returns the file and key registries when encryption is enabled
    // on the store.
    GetEncryptionRegistries() (*EncryptionRegistries, error)
    // GetEnvStats retrieves stats about the engine's environment.
    // For RocksDB, this includes details of at-rest encryption.
    GetEnvStats() (*EnvStats, error)
    // GetAuxiliaryDir returns a path under which files can be stored
    // persistently, and from which data can be ingested by the engine.
    //
    // Not thread safe.
    GetAuxiliaryDir() string
    // NewBatch returns a new instance of a batched engine which wraps
    // this engine. Batched engines accumulate all mutations and apply
    // them atomically on a call to Commit().
    NewBatch() Batch
    // NewReadOnly returns a new instance of a ReadWriter that wraps this
    // engine. This wrapper panics when unexpected operations (e.g., write
    // operations) are executed on it and caches iterators to avoid the overhead
    // of creating multiple iterators for batched reads.
    //
    // All iterators created from a read-only engine with the same "Prefix"
    // option are guaranteed to provide a consistent snapshot of the underlying
    // engine. For instance, two prefix iterators created from a read-only
    // engine will provide a consistent snapshot. Similarly, two non-prefix
    // iterators created from a read-only engine will provide a consistent
    // snapshot. However, a prefix iterator and a non-prefix iterator created
    // from a read-only engine are not guaranteed to provide a consistent view
    // of the underlying engine.
    //
    // TODO(nvanbenschoten): remove this complexity when we're fully on Pebble
    // and can guarantee that all iterators created from a read-only engine are
    // consistent. To do this, we will want to add an MVCCIterator.Clone method.
    NewReadOnly() ReadWriter
    // NewWriteOnlyBatch returns a new instance of a batched engine which wraps
    // this engine. A write-only batch accumulates all mutations and applies them
    // atomically on a call to Commit(). Read operations return an error.
    //
    // Note that a distinct write-only batch allows reads. Distinct batches are a
    // means of indicating that the user does not need to read its own writes.
    //
    // TODO(peter): This should return a WriteBatch interface, but there are mild
    // complications in both defining that interface and implementing it. In
    // particular, Batch.Close would no longer come from Reader and we'd need to
    // refactor a bunch of code in rocksDBBatch.
    NewWriteOnlyBatch() Batch
    // NewSnapshot returns a new instance of a read-only snapshot
    // engine. Snapshots are instantaneous and, as long as they're
    // released relatively quickly, inexpensive. Snapshots are released
    // by invoking Close(). Note that snapshots must not be used after the
    // original engine has been stopped.
    NewSnapshot() Reader
    // Type returns engine type.
    Type() enginepb.EngineType
    // IngestExternalFiles atomically links a slice of files into the RocksDB
    // log-structured merge-tree.
    IngestExternalFiles(ctx context.Context, paths []string) error
    // PreIngestDelay offers an engine the chance to backpressure ingestions.
    // When called, it may choose to block if the engine determines that it is in
    // or approaching a state where further ingestions may risk its health.
    PreIngestDelay(ctx context.Context)
    // ApproximateDiskBytes returns an approximation of the on-disk size for the given key span.
    ApproximateDiskBytes(from, to roachpb.Key) (uint64, error)
    // CompactRange ensures that the specified range of key value pairs is
    // optimized for space efficiency. The forceBottommost parameter ensures
    // that the key range is compacted all the way to the bottommost level of
    // SSTables, which is necessary to pick up changes to bloom filters.
    CompactRange(start, end roachpb.Key, forceBottommost bool) error
    // InMem returns true if the receiver is an in-memory engine and false
    // otherwise.
    //
    // TODO(peter): This is a bit of a wart in the interface. It is used by
    // addSSTablePreApply to select alternate code paths, but really there should
    // be a unified code path there.
    InMem() bool

    // Filesystem functionality.
    fs.FS
    // ReadFile reads the content from the file with the given filename in this RocksDB's env.
    ReadFile(filename string) ([]byte, error)
    // WriteFile writes data to a file in this RocksDB's env.
    WriteFile(filename string, data []byte) error
    // CreateCheckpoint creates a checkpoint of the engine in the given directory,
    // which must not exist. The directory should be on the same file system so
    // that hard links can be used.
    CreateCheckpoint(dir string) error
}

Engine is the interface that wraps the core operations of a key/value store.

func NewDefaultEngine Uses

func NewDefaultEngine(cacheSize int64, storageConfig base.StorageConfig) (Engine, error)

NewDefaultEngine allocates and returns a new, opened engine with the default configuration. The caller must call the engine's Close method when the engine is no longer needed.

func NewDefaultInMem Uses

func NewDefaultInMem() Engine

NewDefaultInMem allocates and returns a new, opened in-memory engine with the default configuration. The caller must call the engine's Close method when the engine is no longer needed.

func NewEngine Uses

func NewEngine(cacheSize int64, storageConfig base.StorageConfig) (Engine, error)

NewEngine creates a new storage engine.

func NewInMem Uses

func NewInMem(ctx context.Context, attrs roachpb.Attributes, cacheSize int64) Engine

NewInMem allocates and returns a new, opened in-memory engine. The caller must call the engine's Close method when the engine is no longer needed.

FIXME(tschottdorf): make the signature similar to NewPebble (require a cfg).

type EngineIterator Uses

type EngineIterator interface {
    // Close frees up resources held by the iterator.
    Close()
    // SeekEngineKeyGE advances the iterator to the first key in the engine
    // which is >= the provided key.
    SeekEngineKeyGE(key EngineKey) (valid bool, err error)
    // SeekEngineKeyLT advances the iterator to the first key in the engine
    // which is < the provided key.
    SeekEngineKeyLT(key EngineKey) (valid bool, err error)
    // NextEngineKey advances the iterator to the next key/value in the
    // iteration. After this call, valid will be true if the iterator was not
    // originally positioned at the last key. Note that unlike
    // MVCCIterator.NextKey, this method does not skip other versions with the
    // same EngineKey.Key.
    // TODO(sumeer): change MVCCIterator.Next() to match the
    // return values, change all its callers, and rename this
    // to Next().
    NextEngineKey() (valid bool, err error)
    // PrevEngineKey moves the iterator backward to the previous key/value in
    // the iteration. After this call, valid will be true if the iterator was
    // not originally positioned at the first key.
    PrevEngineKey() (valid bool, err error)
    // UnsafeEngineKey returns the same value as EngineKey, but the memory is
    // invalidated on the next call to {Next,NextKey,Prev,SeekGE,SeekLT,Close}.
    // REQUIRES: latest positioning function returned valid=true.
    UnsafeEngineKey() (EngineKey, error)
    // EngineKey returns the current key.
    // REQUIRES: latest positioning function returned valid=true.
    EngineKey() (EngineKey, error)
    // UnsafeRawEngineKey returns the current raw (encoded) key corresponding to
    // EngineKey. This is a low-level method and callers should avoid using
    // it. This is currently only used by intentInterleavingIter to implement
    // UnsafeRawKey.
    UnsafeRawEngineKey() []byte
    // UnsafeValue returns the same value as Value, but the memory is
    // invalidated on the next call to {Next,NextKey,Prev,SeekGE,SeekLT,Close}.
    // REQUIRES: latest positioning function returned valid=true.
    UnsafeValue() []byte
    // Value returns the current value as a byte slice.
    // REQUIRES: latest positioning function returned valid=true.
    Value() []byte
    // SetUpperBound installs a new upper bound for this iterator.
    SetUpperBound(roachpb.Key)
}

EngineIterator is an iterator over key-value pairs where the key is an EngineKey.

type EngineKey Uses

type EngineKey struct {
    Key     roachpb.Key
    Version []byte
}

EngineKey is the general key type that is stored in the engine. It consists of a roachpb.Key followed by an optional "version". The term "version" is a loose one: often the version is a real version represented as an hlc.Timestamp, but it can also be the suffix of a lock table key containing the lock strength and txn UUID. These special cases have their own types, MVCCKey and LockTableKey. For key kinds that will never have a version, the code has historically used MVCCKey, though future code may be better served by using EngineKey (and we should consider changing all the legacy code).

The version can have the following lengths, in addition to length 0:

- Timestamp of MVCC keys: 8 or 12 bytes.
- Lock table key: 17 bytes.
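Because the version lengths are disjoint, the kind of key can be inferred from the length of the version alone. The sketch below illustrates that idea with a hypothetical local `engineKey` type and `kind` method; it is not the package's IsMVCCKey or IsLockTableKey implementation.

```go
package main

import "fmt"

// engineKey is a hypothetical local mirror of the EngineKey shape.
type engineKey struct {
	Key     []byte
	Version []byte
}

// kind classifies a key purely by the length of its version.
func (k engineKey) kind() string {
	switch len(k.Version) {
	case 0:
		return "no version"
	case 8, 12:
		return "MVCC timestamp"
	case 17:
		return "lock table"
	default:
		return "unknown"
	}
}

func main() {
	for _, n := range []int{0, 8, 12, 17, 5} {
		k := engineKey{Key: []byte("k"), Version: make([]byte, n)}
		fmt.Println(n, k.kind())
	}
}
```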

func DecodeEngineKey Uses

func DecodeEngineKey(b []byte) (key EngineKey, ok bool)

DecodeEngineKey decodes the given bytes as an EngineKey. This function is similar to enginepb.SplitMVCCKey. TODO(sumeer): consider removing SplitMVCCKey.

func (EngineKey) Copy Uses

func (k EngineKey) Copy() EngineKey

Copy makes a copy of the key.

func (EngineKey) Encode Uses

func (k EngineKey) Encode() []byte

Encode encodes the key.

func (EngineKey) EncodeToBuf Uses

func (k EngineKey) EncodeToBuf(buf []byte) []byte

EncodeToBuf attempts to reuse buf for encoding the key, and if undersized, allocates a new buffer.
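The buffer-reuse pattern can be sketched as follows. This only shows the "reuse if large enough, otherwise allocate" logic; the byte layout (plain concatenation) is a placeholder and not the real EngineKey encoding, and `encodeToBuf` is a hypothetical name.

```go
package main

import "fmt"

// encodeToBuf writes key followed by version into buf when buf has enough
// capacity, and into a freshly allocated slice otherwise.
func encodeToBuf(buf, key, version []byte) []byte {
	n := len(key) + len(version)
	if cap(buf) < n {
		buf = make([]byte, 0, n) // undersized: allocate a new buffer
	}
	buf = buf[:0] // overwrite rather than append to existing contents
	buf = append(buf, key...)
	return append(buf, version...)
}

func main() {
	scratch := make([]byte, 0, 64)
	out := encodeToBuf(scratch, []byte("key"), []byte{1, 2})
	fmt.Println(len(out), cap(out))
}
```

Callers that encode many keys in a loop can pass the returned slice back in as buf to avoid an allocation per key.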

func (EngineKey) EncodedLen Uses

func (k EngineKey) EncodedLen() int

EncodedLen returns the encoded length of k.

func (EngineKey) Format Uses

func (k EngineKey) Format(f fmt.State, c rune)

Format implements the fmt.Formatter interface.

func (EngineKey) IsLockTableKey Uses

func (k EngineKey) IsLockTableKey() bool

IsLockTableKey returns true if the key can be decoded as a LockTableKey.

func (EngineKey) IsMVCCKey Uses

func (k EngineKey) IsMVCCKey() bool

IsMVCCKey returns true if the key can be decoded as an MVCCKey. This includes the case of an empty timestamp.

func (EngineKey) ToLockTableKey Uses

func (k EngineKey) ToLockTableKey() (LockTableKey, error)

ToLockTableKey constructs a LockTableKey from the EngineKey.

func (EngineKey) ToMVCCKey Uses

func (k EngineKey) ToMVCCKey() (MVCCKey, error)

ToMVCCKey constructs a MVCCKey from the EngineKey.

type EngineKeyFormatter Uses

type EngineKeyFormatter struct {
    // contains filtered or unexported fields
}

EngineKeyFormatter is a fmt.Formatter for EngineKeys.

func (EngineKeyFormatter) Format Uses

func (m EngineKeyFormatter) Format(f fmt.State, c rune)

Format implements the fmt.Formatter interface.

type EnvStats Uses

type EnvStats struct {
    // TotalFiles is the total number of files reported by rocksdb.
    TotalFiles uint64
    // TotalBytes is the total size of files reported by rocksdb.
    TotalBytes uint64
    // ActiveKeyFiles is the number of files using the active data key.
    ActiveKeyFiles uint64
    // ActiveKeyBytes is the size of files using the active data key.
    ActiveKeyBytes uint64
    // EncryptionType is an enum describing the active encryption algorithm.
    // See: ccl/storageccl/engineccl/enginepbccl/key_registry.proto
    EncryptionType int32
    // EncryptionStatus is a serialized enginepbccl/stats.proto::EncryptionStatus protobuf.
    EncryptionStatus []byte
}

EnvStats is a set of RocksDB env stats, including encryption status.

type Error Uses

type Error struct {
    // contains filtered or unexported fields
}

An Error wraps an error returned from a RocksDB operation.

func (*Error) Error Uses

func (err *Error) Error() string

Error implements the error interface.

type IterAndBuf Uses

type IterAndBuf struct {
    // contains filtered or unexported fields
}

IterAndBuf is used to pass iterators and buffers between MVCC* calls, allowing reuse without the callers needing to know the particulars.

func GetBufUsingIter Uses

func GetBufUsingIter(iter MVCCIterator) IterAndBuf

GetBufUsingIter returns an IterAndBuf using the supplied iterator.

func GetIterAndBuf Uses

func GetIterAndBuf(reader Reader, opts IterOptions) IterAndBuf

GetIterAndBuf returns an IterAndBuf for passing into various MVCC* methods that need to see intents.

func (IterAndBuf) Cleanup Uses

func (b IterAndBuf) Cleanup()

Cleanup must be called to release the resources when done.

type IterOptions Uses

type IterOptions struct {
    // If Prefix is true, Seek will use the user-key prefix of the supplied
    // {MVCC,Engine}Key (the Key field) to restrict which sstables are searched,
    // but iteration (using Next) over keys without the same user-key prefix
    // will not work correctly (keys may be skipped).
    Prefix bool
    // LowerBound gives this iterator an inclusive lower bound. Attempts to
    // SeekReverse or Prev to a key that is strictly less than the bound will
    // invalidate the iterator.
    LowerBound roachpb.Key
    // UpperBound gives this iterator an exclusive upper bound. Attempts to Seek
    // or Next to a key that is greater than or equal to the bound will invalidate
    // the iterator. UpperBound must be provided unless Prefix is true, in which
    // case the end of the prefix will be used as the upper bound.
    UpperBound roachpb.Key
    // If WithStats is true, the iterator accumulates performance
    // counters over its lifetime which can be queried via `Stats()`.
    WithStats bool
    // MinTimestampHint and MaxTimestampHint, if set, indicate that keys outside
    // of the time range formed by [MinTimestampHint, MaxTimestampHint] do not
    // need to be presented by the iterator. The underlying iterator may be able
    // to efficiently skip over keys outside of the hinted time range, e.g., when
    // an SST indicates that it contains no keys within the time range.
    //
    // Note that time bound hints are strictly a performance optimization, and
    // iterators with time bounds hints will frequently return keys outside of the
    // [start, end] time range. If you must guarantee that you never see a key
    // outside of the time bounds, perform your own filtering.
    //
    // These fields are only relevant for MVCCIterators.
    MinTimestampHint, MaxTimestampHint hlc.Timestamp
}

IterOptions contains options used to create an {MVCC,Engine}Iterator.

For performance, every {MVCC,Engine}Iterator must specify either Prefix or UpperBound.
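The bound semantics (inclusive LowerBound, exclusive UpperBound) can be illustrated over a sorted slice of string keys. This is a hypothetical sketch, not how the engine iterates; `scan` is an invented helper.

```go
package main

import (
	"fmt"
	"sort"
)

// scan returns the keys in [lower, upper) from a sorted slice.
func scan(keys []string, lower, upper string) []string {
	if lower >= upper {
		return nil // empty or inverted range
	}
	i := sort.SearchStrings(keys, lower) // first key >= lower (inclusive)
	j := sort.SearchStrings(keys, upper) // first key >= upper (excluded)
	return keys[i:j]
}

func main() {
	keys := []string{"a", "b", "c", "d"}
	fmt.Println(scan(keys, "b", "d"))
}
```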

type IteratorStats Uses

type IteratorStats struct {
    InternalDeleteSkippedCount int
    TimeBoundNumSSTs           int
}

IteratorStats is returned from (MVCCIterator).Stats.

type LockTableKey Uses

type LockTableKey struct {
    Key      roachpb.Key
    Strength lock.Strength
    // Slice is of length uuid.Size. We use a slice instead of a byte array, to
    // avoid copying a slice when decoding.
    TxnUUID []byte
}

LockTableKey is a key representing a lock in the lock table.

func (LockTableKey) ToEngineKey Uses

func (lk LockTableKey) ToEngineKey(buf []byte) (EngineKey, []byte)

ToEngineKey converts a lock table key to an EngineKey. buf is used as scratch-space to avoid allocations -- its contents will be overwritten and not appended to.

type MVCCGetOptions Uses

type MVCCGetOptions struct {
    // See the documentation for MVCCGet for information on these parameters.
    Inconsistent     bool
    Tombstones       bool
    FailOnMoreRecent bool
    Txn              *roachpb.Transaction
}

MVCCGetOptions bundles options for the MVCCGet family of functions.

type MVCCIncrementalIterOptions Uses

type MVCCIncrementalIterOptions struct {
    IterOptions IterOptions
    // Keys visible by the MVCCIncrementalIterator must be within (StartTime,
    // EndTime]. Note that if {Min,Max}TimestampHints are specified in
    // IterOptions, the timestamp hints interval should include the start and end
    // time.
    StartTime hlc.Timestamp
    EndTime   hlc.Timestamp
}

MVCCIncrementalIterOptions bundles options for NewMVCCIncrementalIterator.

type MVCCIncrementalIterator Uses

type MVCCIncrementalIterator struct {
    // contains filtered or unexported fields
}

MVCCIncrementalIterator iterates over the diff of the key range [startKey,endKey) and time range (startTime,endTime]. If a key was added or modified between startTime and endTime, the iterator will position at the most recent version (before or at endTime) of that key. If the key was most recently deleted, this is signaled with an empty value.

MVCCIncrementalIterator will return an error if either of the following are encountered:

1. An inline value (non-user data)
2. An intent whose timestamp lies within the time bounds

Note: The endTime is inclusive to be consistent with the non-incremental iterator, where reads at a given timestamp return writes at that timestamp. The startTime is then made exclusive so that iterating time 1 to 2 and then 2 to 3 will only return values with time 2 once. An exclusive start time would normally make it difficult to scan timestamp 0, but CockroachDB uses that as a sentinel for key metadata anyway.
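The effect of the half-open interval (startTime, endTime] can be checked with a small sketch. Plain int64 values stand in for hlc.Timestamp, and `version`, `visible`, and `scanWindow` are hypothetical names; the sketch does not model intents or the positioning at the most recent version.

```go
package main

import "fmt"

// version is a hypothetical (key, timestamp) pair.
type version struct {
	Key string
	TS  int64
}

// visible reports whether ts lies in (startTime, endTime].
func visible(ts, startTime, endTime int64) bool {
	return startTime < ts && ts <= endTime
}

// scanWindow returns the versions whose timestamps fall in the window.
func scanWindow(versions []version, startTime, endTime int64) []version {
	var out []version
	for _, v := range versions {
		if visible(v.TS, startTime, endTime) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	versions := []version{{"a", 1}, {"a", 2}, {"b", 3}}
	// Adjacent windows share the boundary 2, but only the first includes it.
	fmt.Println(scanWindow(versions, 1, 2))
	fmt.Println(scanWindow(versions, 2, 3))
}
```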

Expected usage:

iter := NewMVCCIncrementalIterator(e, MVCCIncrementalIterOptions{
    IterOptions: IterOptions{UpperBound: endKey},
    StartTime:   startTime,
    EndTime:     endTime,
})
defer iter.Close()
for iter.SeekGE(startKey); ; iter.Next() {
    ok, err := iter.Valid()
    if err != nil { ... }
    if !ok { break }
    [code using iter.Key() and iter.Value()]
}

Note regarding the correctness of the time-bound iterator optimization:

When using (t_s, t_e], say there is a version (committed or provisional) k@t where t is in that interval, that is visible to iter. All sstables containing k@t will be included in timeBoundIter. Note that there may be multiple sequence numbers for the key k@t at the storage layer, say k@t#n1, k@t#n2, where n1 > n2, some of which may be deleted, but the latest sequence number will be visible using iter (since not being visible would be a contradiction of the initial assumption that k@t is visible to iter). Since there is no delete across all sstables that deletes k@t#n1, there is no delete in the subset of sstables used by timeBoundIter that deletes k@t#n1, so the timeBoundIter will see k@t.

NOTE: This is not used by CockroachDB and has been preserved to serve as an oracle to prove the correctness of the new export logic.

func NewMVCCIncrementalIterator Uses

func NewMVCCIncrementalIterator(
    reader Reader, opts MVCCIncrementalIterOptions,
) *MVCCIncrementalIterator

NewMVCCIncrementalIterator creates an MVCCIncrementalIterator with the specified reader and options. The timestamp hint range should not be more restrictive than the start and end time range. TODO(pbardea): Add validation here and in C++ implementation that the timestamp hints are not more restrictive than incremental iterator's (startTime, endTime] interval.

func (*MVCCIncrementalIterator) Close Uses

func (i *MVCCIncrementalIterator) Close()

Close frees up resources held by the iterator.

func (*MVCCIncrementalIterator) Key Uses

func (i *MVCCIncrementalIterator) Key() MVCCKey

Key returns the current key.

func (*MVCCIncrementalIterator) Next Uses

func (i *MVCCIncrementalIterator) Next()

Next advances the iterator to the next key/value in the iteration. After this call, Valid() will be true if the iterator was not positioned at the last key.

func (*MVCCIncrementalIterator) NextKey Uses

func (i *MVCCIncrementalIterator) NextKey()

NextKey advances the iterator to the next key. This operation is distinct from Next which advances to the next version of the current key or the next key if the iterator is currently located at the last version for a key.

func (*MVCCIncrementalIterator) SeekGE Uses

func (i *MVCCIncrementalIterator) SeekGE(startKey MVCCKey)

SeekGE advances the iterator to the first key in the engine which is >= the provided key. startKey should be a metadata key to ensure that the iterator has a chance to observe any intents on the key if they are there.

func (*MVCCIncrementalIterator) UnsafeKey Uses

func (i *MVCCIncrementalIterator) UnsafeKey() MVCCKey

UnsafeKey returns the same key as Key, but the memory is invalidated on the next call to {Next,Reset,Close}.

func (*MVCCIncrementalIterator) UnsafeValue Uses

func (i *MVCCIncrementalIterator) UnsafeValue() []byte

UnsafeValue returns the same value as Value, but the memory is invalidated on the next call to {Next,Reset,Close}.

func (*MVCCIncrementalIterator) Valid Uses

func (i *MVCCIncrementalIterator) Valid() (bool, error)

Valid must be called after any call to Reset(), Next(), or similar methods. It returns (true, nil) if the iterator points to a valid key (it is undefined to call Key(), Value(), or similar methods unless Valid() has returned (true, nil)). It returns (false, nil) if the iterator has moved past the end of the valid range, or (false, err) if an error has occurred. Valid() will never return true with a non-nil error.

func (*MVCCIncrementalIterator) Value Uses

func (i *MVCCIncrementalIterator) Value() []byte

Value returns the current value as a byte slice.

type MVCCIterKind Uses

type MVCCIterKind int

MVCCIterKind is used to inform Reader about the kind of iteration desired by the caller.

const (
    // MVCCKeyAndIntentsIterKind specifies that intents must be seen, and appear
    // interleaved with keys, even if they are in a separated lock table.
    MVCCKeyAndIntentsIterKind MVCCIterKind = iota
    // MVCCKeyIterKind specifies that the caller does not need to see intents.
    // Any interleaved intents may be seen, but no correctness properties are
    // derivable from such partial knowledge of intents. NB: this is a performance
    // optimization when iterating over (a) MVCC keys where the caller does
    // not need to see intents, (b) a key space that is known to not have multiple
    // versions (and therefore will never have intents), like the raft log.
    MVCCKeyIterKind
)

"Intent" refers to non-inline meta, which can be interleaved or separated.

type MVCCIterator Uses

type MVCCIterator interface {
    SimpleMVCCIterator

    // SeekLT advances the iterator to the first key in the engine which
    // is < the provided key.
    SeekLT(key MVCCKey)
    // Prev moves the iterator backward to the previous key/value
    // in the iteration. After this call, Valid() will be true if the
    // iterator was not positioned at the first key.
    Prev()
    // Key returns the current key.
    Key() MVCCKey
    // UnsafeRawKey returns the current raw key which could be an encoded
    // MVCCKey, or the more general EngineKey (for a lock table key).
    // This is a low-level and dangerous method since it will expose the
    // raw key of the lock table, i.e., the intentInterleavingIter will not
    // hide the difference between interleaved and separated intents.
    // Callers should be very careful when using this. This is currently
    // only used by callers who are iterating and deleting all data in a
    // range.
    UnsafeRawKey() []byte
    // UnsafeRawMVCCKey returns a serialized MVCCKey. The memory is invalidated
    // on the next call to {Next,NextKey,Prev,SeekGE,SeekLT,Close}. If the
    // iterator is currently positioned at a separated intent (when
    // intentInterleavingIter is used), it makes that intent look like an
    // interleaved intent key, i.e., an MVCCKey with an empty timestamp. This is
    // currently used by callers who pass around key information as a []byte --
    // this seems avoidable, and we should consider cleaning up the callers.
    UnsafeRawMVCCKey() []byte
    // Value returns the current value as a byte slice.
    Value() []byte
    // ValueProto unmarshals the value the iterator is currently
    // pointing to using a protobuf decoder.
    ValueProto(msg protoutil.Message) error
    // When Key() is positioned on an intent, returns true iff this intent
    // (represented by MVCCMetadata) is a separated lock/intent. This is a
    // low-level method that should not be called from outside the storage
    // package. It is part of the exported interface because there are structs
    // outside the package that wrap and implement Iterator.
    IsCurIntentSeparated() bool
    // ComputeStats scans the underlying engine from start to end keys and
    // computes stats counters based on the values. This method is used after a
    // range is split to recompute stats for each subrange. The start key is
    // always adjusted to avoid counting local keys in the event stats are being
    // recomputed for the first range (i.e. the one with start key == KeyMin).
    // The nowNanos arg specifies the wall time in nanoseconds since the
    // epoch and is used to compute the total age of all intents.
    ComputeStats(start, end roachpb.Key, nowNanos int64) (enginepb.MVCCStats, error)
    // FindSplitKey finds a key from the given span such that the left side of
    // the split is roughly targetSize bytes. The returned key will never be
    // chosen from the key ranges listed in keys.NoSplitSpans and will always
    // sort equal to or after minSplitKey.
    //
    // DO NOT CALL directly (except in wrapper MVCCIterator implementations). Use the
    // package-level MVCCFindSplitKey instead. For correct operation, the caller
    // must set the upper bound on the iterator before calling this method.
    FindSplitKey(start, end, minSplitKey roachpb.Key, targetSize int64) (MVCCKey, error)
    // CheckForKeyCollisions checks whether any keys collide between the iterator
    // and the encoded SST data specified, within the provided key range. Returns
    // stats on skipped KVs, or an error if a collision is found.
    CheckForKeyCollisions(sstData []byte, start, end roachpb.Key) (enginepb.MVCCStats, error)
    // SetUpperBound installs a new upper bound for this iterator. The caller can modify
    // the parameter after this function returns.
    SetUpperBound(roachpb.Key)
    // Stats returns statistics about the iterator.
    Stats() IteratorStats
    // SupportsPrev returns true if MVCCIterator implementation supports reverse
    // iteration with Prev() or SeekLT().
    SupportsPrev() bool
}

MVCCIterator is an interface for iterating over key/value pairs in an engine. It is used for iterating over the key space that can have multiple versions, and is often also used (for historical reasons) for iterating over the key space that never has multiple versions (i.e., MVCCKey.Timestamp.IsEmpty()).

MVCCIterator implementations are thread safe unless otherwise noted.
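The reverse-iteration methods can be illustrated with a toy model. The sketch below is not the package's implementation: toyIter and reverseScan are hypothetical names, keys are plain strings rather than MVCCKeys, and SeekLT is read as "position at the largest key strictly below the given key".

```go
package main

import (
	"fmt"
	"sort"
)

// toyIter is a hypothetical stand-in for an MVCCIterator over sorted string
// keys. It only models the SeekLT, Prev, Valid, and Key behavior.
type toyIter struct {
	keys []string // sorted ascending
	pos  int      // index of the current key; out of range means invalid
}

// SeekLT positions the iterator at the largest key that is < key.
func (it *toyIter) SeekLT(key string) {
	// SearchStrings returns the first index whose key is >= key, so the
	// element just before it is the largest key below the seek key.
	it.pos = sort.SearchStrings(it.keys, key) - 1
}

// Prev moves the iterator backward by one key.
func (it *toyIter) Prev() { it.pos-- }

// Valid reports whether the iterator is positioned on a key.
func (it *toyIter) Valid() bool { return it.pos >= 0 && it.pos < len(it.keys) }

// Key returns the current key. It must only be called while Valid() is true.
func (it *toyIter) Key() string { return it.keys[it.pos] }

// reverseScan collects every key strictly below end, in descending order.
func reverseScan(it *toyIter, end string) []string {
	var out []string
	for it.SeekLT(end); it.Valid(); it.Prev() {
		out = append(out, it.Key())
	}
	return out
}

func main() {
	it := &toyIter{keys: []string{"a", "b", "c", "d"}}
	fmt.Println(reverseScan(it, "c")) // prints [b a]
}
```

A real caller would presumably check SupportsPrev() before relying on SeekLT or Prev, since not every implementation supports reverse iteration.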

type MVCCKey Uses

type MVCCKey struct {
    Key       roachpb.Key
    Timestamp hlc.Timestamp
}

MVCCKey is a versioned key, distinguished from roachpb.Key with the addition of a timestamp.

func DecodeMVCCKey Uses

func DecodeMVCCKey(encodedKey []byte) (MVCCKey, error)

DecodeMVCCKey decodes an MVCCKey from its serialized representation.

func MVCCScanDecodeKeyValue Uses

func MVCCScanDecodeKeyValue(repr []byte) (key MVCCKey, value []byte, orepr []byte, err error)

MVCCScanDecodeKeyValue decodes a key/value pair returned in an MVCCScan "batch" (this is not the RocksDB batch repr format), returning both the key/value and the suffix of data remaining in the batch.

func MakeMVCCMetadataKey Uses

func MakeMVCCMetadataKey(key roachpb.Key) MVCCKey

MakeMVCCMetadataKey creates an MVCCKey from a roachpb.Key.

func (MVCCKey) EncodedSize Uses

func (k MVCCKey) EncodedSize() int

EncodedSize returns the size of the MVCCKey when encoded.

func (MVCCKey) Equal Uses

func (k MVCCKey) Equal(l MVCCKey) bool

Equal returns whether two keys are identical.

func (MVCCKey) Format Uses

func (k MVCCKey) Format(f fmt.State, c rune)

Format implements the fmt.Formatter interface.

func (MVCCKey) IsValue Uses

func (k MVCCKey) IsValue() bool

IsValue returns true iff the timestamp is non-zero.

func (MVCCKey) Len Uses

func (k MVCCKey) Len() int

Len returns the size of the MVCCKey when encoded. Implements the pebble.Encodeable interface.

TODO(itsbilal): Reconcile this with EncodedSize. Would require updating MVCC stats tests to reflect the more accurate lengths provided by this function.

func (MVCCKey) Less Uses

func (k MVCCKey) Less(l MVCCKey) bool

Less compares two keys.

func (MVCCKey) Next Uses

func (k MVCCKey) Next() MVCCKey

Next returns the next key.

func (MVCCKey) String Uses

func (k MVCCKey) String() string

String returns a string-formatted version of the key.

type MVCCKeyValue Uses

type MVCCKeyValue struct {
    Key   MVCCKey
    Value []byte
}

MVCCKeyValue contains the raw bytes of the value for a key.

func Scan Uses

func Scan(reader Reader, start, end roachpb.Key, max int64) ([]MVCCKeyValue, error)

Scan returns up to max key/value objects starting from start (inclusive) and ending at end (non-inclusive). Specify max=0 for unbounded scans.

type MVCCLogicalOpDetails Uses

type MVCCLogicalOpDetails struct {
    Txn       enginepb.TxnMeta
    Key       roachpb.Key
    Timestamp hlc.Timestamp

    // Safe indicates that the values in this struct will never be invalidated
    // at a later point. If the details object cannot promise that its values
    // will never be invalidated, an OpLoggerBatch will make a copy of all
    // references before adding it to the log. TestMVCCOpLogWriter fails without
    // this.
    Safe bool
}

MVCCLogicalOpDetails contains details about the occurrence of an MVCC logical operation.

type MVCCLogicalOpType Uses

type MVCCLogicalOpType int

MVCCLogicalOpType is an enum with values corresponding to each of the enginepb.MVCCLogicalOp variants.

LogLogicalOp takes an MVCCLogicalOpType and a corresponding MVCCLogicalOpDetails instead of an enginepb.MVCCLogicalOp variant for two reasons. First, it serves as a form of abstraction so that callers of the method don't need to construct protos themselves. More importantly, it also avoids allocations in the common case where Writer.LogLogicalOp is a no-op. This makes LogLogicalOp essentially free for cases where logical op logging is disabled.

const (
    // MVCCWriteValueOpType corresponds to the MVCCWriteValueOp variant.
    MVCCWriteValueOpType MVCCLogicalOpType = iota
    // MVCCWriteIntentOpType corresponds to the MVCCWriteIntentOp variant.
    MVCCWriteIntentOpType
    // MVCCUpdateIntentOpType corresponds to the MVCCUpdateIntentOp variant.
    MVCCUpdateIntentOpType
    // MVCCCommitIntentOpType corresponds to the MVCCCommitIntentOp variant.
    MVCCCommitIntentOpType
    // MVCCAbortIntentOpType corresponds to the MVCCAbortIntentOp variant.
    MVCCAbortIntentOpType
)

type MVCCScanOptions Uses

type MVCCScanOptions struct {
    // See the documentation for MVCCScan for information on these parameters.
    Inconsistent     bool
    Tombstones       bool
    Reverse          bool
    FailOnMoreRecent bool
    Txn              *roachpb.Transaction
    // MaxKeys is the maximum number of kv pairs returned from this operation.
    // The zero value represents an unbounded scan. If the limit stops the scan,
    // a corresponding ResumeSpan is returned. As a special case, the value -1
    // returns no keys in the result (returning the first key via the
    // ResumeSpan).
    MaxKeys int64
    // TargetBytes is a byte threshold to limit the amount of data pulled into
    // memory during a Scan operation. Once the target is satisfied (i.e. met or
    // exceeded) by the emitted KV pairs, iteration stops (with a
    // ResumeSpan as appropriate). In particular, at least one kv pair is
    // returned (when one exists).
    //
    // The number of bytes a particular kv pair accrues depends on internal data
    // structures, but it is guaranteed to exceed that of the bytes stored in
    // the key and value itself.
    //
    // The zero value indicates no limit.
    TargetBytes int64
}

MVCCScanOptions bundles options for the MVCCScan family of functions.
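The TargetBytes rule (stop once the target is met or exceeded, always emit at least one pair, zero means no limit) can be sketched as below. toyKV and takeUntilTarget are hypothetical names, and the per-pair cost is assumed to be len(key)+len(value); the real accounting depends on internal data structures and is larger.

```go
package main

import "fmt"

// toyKV is a hypothetical key/value pair used only for this sketch.
type toyKV struct{ Key, Value string }

// takeUntilTarget emits pairs in order and stops once the accrued size
// meets or exceeds targetBytes. A targetBytes of zero means no limit.
// The pair that crosses the threshold is still emitted, so at least one
// pair is returned whenever the input is non-empty.
func takeUntilTarget(kvs []toyKV, targetBytes int64) (out []toyKV, numBytes int64) {
	for _, kv := range kvs {
		out = append(out, kv)
		// Assumption: cost is the raw key and value length only.
		numBytes += int64(len(kv.Key) + len(kv.Value))
		if targetBytes > 0 && numBytes >= targetBytes {
			break
		}
	}
	return out, numBytes
}

func main() {
	kvs := []toyKV{{"a", "1111"}, {"b", "2222"}, {"c", "3333"}}
	out, n := takeUntilTarget(kvs, 6)
	fmt.Println(len(out), n) // prints 2 10
}
```

Because the check happens after a pair is appended, the result can overshoot the target by up to one pair, which is consistent with "met or exceeded" in the field comment.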

type MVCCScanResult Uses

type MVCCScanResult struct {
    KVData  [][]byte
    KVs     []roachpb.KeyValue
    NumKeys int64
    // NumBytes is the number of bytes this scan result accrued in terms of the
    // MVCCScanOptions.TargetBytes parameter. This roughly measures the bytes
    // used for encoding the uncompressed kv pairs contained in the result.
    NumBytes int64

    ResumeSpan *roachpb.Span
    Intents    []roachpb.Intent
}

MVCCScanResult groups the values returned from an MVCCScan operation. Depending on the operation invoked, KVData or KVs is populated, but never both.

func MVCCScan Uses

func MVCCScan(
    ctx context.Context,
    reader Reader,
    key, endKey roachpb.Key,
    timestamp hlc.Timestamp,
    opts MVCCScanOptions,
) (MVCCScanResult, error)

MVCCScan scans the key range [key, endKey) in the provided reader in ascending order, returning at most opts.MaxKeys results. If the scan hits that limit, it returns a "resume span" to be used in the next call to this function: the sub-span of [key, endKey) that has not been scanned. If the limit is not hit, the resume span will be nil.

For an unbounded scan, set opts.MaxKeys to zero.

Only keys with a timestamp less than or equal to the supplied timestamp will be included in the scan results. If a transaction is provided and the scan encounters a value with a timestamp between the supplied timestamp and the transaction's max timestamp, an uncertainty error will be returned.

In tombstones mode, if the most recent value for a key is a deletion tombstone, the scan result will contain a roachpb.KeyValue for that key whose RawBytes field is nil. Otherwise, the key-value pair will be omitted from the result entirely.

When scanning inconsistently, any encountered intents will be placed in the dedicated result parameter. By contrast, when scanning consistently, any encountered intents will cause the scan to return a WriteIntentError with the intents embedded within.

Note that transactional scans must be consistent. Put another way, only non-transactional scans may be inconsistent.

When scanning in "fail on more recent" mode, a WriteTooOldError will be returned if the scan observes a version with a timestamp at or above the read timestamp. If the scan observes multiple versions with timestamp at or above the read timestamp, the maximum will be returned in the WriteTooOldError. Similarly, a WriteIntentError will be returned if the scan observes another transaction's intent, even if it has a timestamp above the read timestamp.
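A simplified model of the scan rules described here (timestamp visibility, tombstone handling, and the resume span) might look like the sketch below. All names are hypothetical, keys and values are strings, and intents, uncertainty checks, and the "fail on more recent" mode are deliberately left out.

```go
package main

import (
	"fmt"
	"sort"
)

// toyVersion is one hypothetical version of a key.
type toyVersion struct {
	TS      int64
	Value   string
	Deleted bool // true models a deletion tombstone
}

type toySpan struct{ Key, EndKey string }

type toyResult struct {
	Keys       []string
	Values     []string
	ResumeSpan *toySpan
}

// newestAtOrBelow returns the most recent version with TS <= ts, if any.
func newestAtOrBelow(vs []toyVersion, ts int64) (toyVersion, bool) {
	var best toyVersion
	found := false
	for _, v := range vs {
		if v.TS <= ts && (!found || v.TS > best.TS) {
			best, found = v, true
		}
	}
	return best, found
}

// toyScan scans [start, end) at read timestamp ts. maxKeys of zero means
// unbounded. Tombstones are skipped unless tombstones is true.
func toyScan(data map[string][]toyVersion, start, end string, ts int64, maxKeys int, tombstones bool) toyResult {
	keys := make([]string, 0, len(data))
	for k := range data {
		if k >= start && k < end {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)

	var res toyResult
	for _, k := range keys {
		v, ok := newestAtOrBelow(data[k], ts)
		if !ok || (v.Deleted && !tombstones) {
			continue
		}
		if maxKeys > 0 && len(res.Keys) == maxKeys {
			// Limit reached: everything from k onward is unscanned.
			res.ResumeSpan = &toySpan{Key: k, EndKey: end}
			break
		}
		res.Keys = append(res.Keys, k)
		res.Values = append(res.Values, v.Value)
	}
	return res
}

func main() {
	data := map[string][]toyVersion{
		"a": {{TS: 1, Value: "a1"}, {TS: 5, Value: "a5"}},
		"b": {{TS: 2, Deleted: true}},
		"c": {{TS: 3, Value: "c3"}},
		"d": {{TS: 9, Value: "d9"}},
	}
	res := toyScan(data, "a", "z", 4, 1, false)
	fmt.Println(res.Keys, res.Values, *res.ResumeSpan)
}
```

Passing the returned resume span back in as the next [key, endKey) continues the scan where it stopped, which is how the limit is meant to be used for pagination.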

func MVCCScanAsTxn Uses

func MVCCScanAsTxn(
    ctx context.Context,
    reader Reader,
    key, endKey roachpb.Key,
    timestamp hlc.Timestamp,
    txnMeta enginepb.TxnMeta,
) (MVCCScanResult, error)

MVCCScanAsTxn constructs a temporary transaction from the given transaction metadata and calls MVCCScan as that transaction. This method is required only for reading intents of a transaction when only its metadata is known and should rarely be used.

The read is carried out without the chance of uncertainty restarts.

func MVCCScanToBytes Uses

func MVCCScanToBytes(
    ctx context.Context,
    reader Reader,
    key, endKey roachpb.Key,
    timestamp hlc.Timestamp,
    opts MVCCScanOptions,
) (MVCCScanResult, error)

MVCCScanToBytes is like MVCCScan, but it returns the results in a byte array.

type MVCCValueMerger Uses

type MVCCValueMerger struct {
    // contains filtered or unexported fields
}

MVCCValueMerger implements the `ValueMerger` interface. It buffers deserialized values in a slice in order specified by `oldToNew`. It determines the order of incoming operands by whether they were added with `MergeNewer()` or `MergeOlder()`, reversing the slice as necessary to ensure operands are always appended. It merges these deserialized operands when `Finish()` is called.

It supports merging either all `roachpb.InternalTimeSeriesData` values or all non-timeseries values. Attempting to merge a mixture of timeseries and non-timeseries values will result in an error.

func (*MVCCValueMerger) Finish Uses

func (t *MVCCValueMerger) Finish(includesBase bool) ([]byte, io.Closer, error)

Finish combines the buffered values from all `Merge*()` calls and marshals the result. In case of non-timeseries the values are simply concatenated from old to new. In case of timeseries the values are sorted, deduplicated, and potentially migrated to columnar format. When deduplicating, only the latest sample for a given offset is retained.

func (*MVCCValueMerger) MergeNewer Uses

func (t *MVCCValueMerger) MergeNewer(value []byte) error

MergeNewer deserializes the value and appends it to the slice corresponding to its type (timeseries or non-timeseries). The slice will be reversed if needed such that it is in old-to-new order.

func (*MVCCValueMerger) MergeOlder Uses

func (t *MVCCValueMerger) MergeOlder(value []byte) error

MergeOlder deserializes the value and appends it to the slice corresponding to its type (timeseries or non-timeseries). The slice will be reversed if needed such that it is in new-to-old order.
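The buffering strategy (always append, reversing the slice when the direction of incoming operands changes) can be sketched as follows. toyMerger is a hypothetical type that treats operands as opaque bytes and covers only the non-timeseries case, where Finish concatenates from old to new.

```go
package main

import "fmt"

// toyMerger is a hypothetical model of the operand buffer. oldToNew records
// the current order of the operands slice.
type toyMerger struct {
	operands [][]byte
	oldToNew bool
}

// ensureOrder reverses the buffered operands if they are not already in the
// requested order, so that the next operand can simply be appended.
func (m *toyMerger) ensureOrder(oldToNew bool) {
	if m.oldToNew == oldToNew {
		return
	}
	for i, j := 0, len(m.operands)-1; i < j; i, j = i+1, j-1 {
		m.operands[i], m.operands[j] = m.operands[j], m.operands[i]
	}
	m.oldToNew = oldToNew
}

// MergeNewer appends an operand that is newer than everything buffered.
func (m *toyMerger) MergeNewer(value []byte) {
	m.ensureOrder(true)
	m.operands = append(m.operands, append([]byte(nil), value...))
}

// MergeOlder appends an operand that is older than everything buffered.
func (m *toyMerger) MergeOlder(value []byte) {
	m.ensureOrder(false)
	m.operands = append(m.operands, append([]byte(nil), value...))
}

// Finish concatenates the operands from oldest to newest.
func (m *toyMerger) Finish() []byte {
	m.ensureOrder(true)
	var out []byte
	for _, op := range m.operands {
		out = append(out, op...)
	}
	return out
}

func main() {
	var m toyMerger
	m.MergeNewer([]byte("b"))
	m.MergeOlder([]byte("a"))
	m.MergeNewer([]byte("c"))
	fmt.Println(string(m.Finish())) // prints abc
}
```

Reversing only on a direction change keeps the common case (a run of operands all arriving in the same direction) at one append per operand.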

type MemFile Uses

type MemFile struct {
    bytes.Buffer
}

MemFile is a file-like struct that buffers all data written to it in memory. Implements the writeCloseSyncer interface and is intended for use with SSTWriter.

func (*MemFile) Close Uses

func (*MemFile) Close() error

Close implements the writeCloseSyncer interface.

func (*MemFile) Data Uses

func (f *MemFile) Data() []byte

Data returns the in-memory buffer behind this MemFile.

func (*MemFile) Sync Uses

func (*MemFile) Sync() error

Sync implements the writeCloseSyncer interface.

type Metrics Uses

type Metrics struct {
    BlockCacheHits                 int64
    BlockCacheMisses               int64
    BlockCacheUsage                int64
    BlockCachePinnedUsage          int64
    BloomFilterPrefixChecked       int64
    BloomFilterPrefixUseful        int64
    DiskSlowCount                  int64
    DiskStallCount                 int64
    MemtableTotalSize              int64
    Flushes                        int64
    FlushedBytes                   int64
    Compactions                    int64
    IngestedBytes                  int64 // Pebble only
    CompactedBytesRead             int64
    CompactedBytesWritten          int64
    TableReadersMemEstimate        int64
    PendingCompactionBytesEstimate int64
    L0FileCount                    int64
    L0SublevelCount                int64
    ReadAmplification              int64
    NumSSTables                    int64
}

Metrics is a set of Engine metrics. Most are described in RocksDB. Some metrics (e.g., `IngestedBytes`) are only exposed by Pebble.

Currently, we collect stats from the following sources:

1. RocksDB's internal "tickers" (i.e. counters). They're defined in rocksdb/statistics.h.
2. DBEventListener, which implements RocksDB's EventListener interface.
3. rocksdb::DB::GetProperty().

This is a good resource describing RocksDB's memory-related stats: https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB

TODO(jackson): Refactor to mirror or even expose pebble.Metrics when RocksDB is removed.

type OpLoggerBatch Uses

type OpLoggerBatch struct {
    Batch
    // contains filtered or unexported fields
}

OpLoggerBatch records a log of logical MVCC operations.

func NewOpLoggerBatch Uses

func NewOpLoggerBatch(b Batch) *OpLoggerBatch

NewOpLoggerBatch creates a new batch that logs logical mvcc operations and wraps the provided batch.

func (*OpLoggerBatch) Distinct Uses

func (ol *OpLoggerBatch) Distinct() ReadWriter

Distinct implements the Batch interface.

func (*OpLoggerBatch) LogLogicalOp Uses

func (ol *OpLoggerBatch) LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails)

LogLogicalOp implements the Writer interface.

func (*OpLoggerBatch) LogicalOps Uses

func (ol *OpLoggerBatch) LogicalOps() []enginepb.MVCCLogicalOp

LogicalOps returns the list of all logical MVCC operations that have been recorded by the logger.
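The role of the Safe flag in MVCCLogicalOpDetails can be sketched with a toy logger: when the caller cannot promise that the referenced memory stays valid, the logger copies it before recording the operation. Every name below is hypothetical, and only the Key field is modeled.

```go
package main

import "fmt"

// toyOpType is a hypothetical stand-in for MVCCLogicalOpType.
type toyOpType int

const (
	toyWriteValueOp toyOpType = iota
	toyWriteIntentOp
)

// toyDetails is a hypothetical, reduced form of MVCCLogicalOpDetails.
type toyDetails struct {
	Key  []byte
	Safe bool // true if Key will never be modified after logging
}

type toyLoggedOp struct {
	Type toyOpType
	Key  []byte
}

// toyOpLogger records logical operations in the order they are logged.
type toyOpLogger struct {
	ops []toyLoggedOp
}

// LogLogicalOp records the operation, copying the key unless the caller
// marked the details as safe to retain.
func (l *toyOpLogger) LogLogicalOp(op toyOpType, details toyDetails) {
	key := details.Key
	if !details.Safe {
		key = append([]byte(nil), details.Key...)
	}
	l.ops = append(l.ops, toyLoggedOp{Type: op, Key: key})
}

// LogicalOps returns everything recorded so far.
func (l *toyOpLogger) LogicalOps() []toyLoggedOp { return l.ops }

func main() {
	var l toyOpLogger
	buf := []byte("k1")
	l.LogLogicalOp(toyWriteValueOp, toyDetails{Key: buf})
	buf[1] = '2' // the caller reuses its buffer after logging
	fmt.Println(string(l.LogicalOps()[0].Key)) // prints k1
}
```

Passing the details by value, as the real signature also does, means a logger that discards operations never has to allocate anything.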

type Pebble Uses

type Pebble struct {
    // contains filtered or unexported fields
}

Pebble is a wrapper around a Pebble database instance.

func NewPebble Uses

func NewPebble(ctx context.Context, cfg PebbleConfig) (*Pebble, error)

NewPebble creates a new Pebble instance, at the specified path.

func (*Pebble) ApplyBatchRepr Uses

func (p *Pebble) ApplyBatchRepr(repr []byte, sync bool) error

ApplyBatchRepr implements the Engine interface.

func (*Pebble) ApproximateDiskBytes Uses

func (p *Pebble) ApproximateDiskBytes(from, to roachpb.Key) (uint64, error)

ApproximateDiskBytes implements the Engine interface.

func (*Pebble) Attrs Uses

func (p *Pebble) Attrs() roachpb.Attributes

Attrs implements the Engine interface.

func (*Pebble) Capacity Uses

func (p *Pebble) Capacity() (roachpb.StoreCapacity, error)

Capacity implements the Engine interface.

func (*Pebble) ClearEngineKey Uses

func (p *Pebble) ClearEngineKey(key EngineKey) error

ClearEngineKey implements the Engine interface.

func (*Pebble) ClearIntent Uses

func (p *Pebble) ClearIntent(
    key roachpb.Key, state PrecedingIntentState, txnDidNotUpdateMeta bool, txnUUID uuid.UUID,
) error

ClearIntent implements the Engine interface.

func (*Pebble) ClearIterRange Uses

func (p *Pebble) ClearIterRange(iter MVCCIterator, start, end roachpb.Key) error

ClearIterRange implements the Engine interface.

func (*Pebble) ClearMVCC Uses

func (p *Pebble) ClearMVCC(key MVCCKey) error

ClearMVCC implements the Engine interface.

func (*Pebble) ClearMVCCRange Uses

func (p *Pebble) ClearMVCCRange(start, end MVCCKey) error

ClearMVCCRange implements the Engine interface.

func (*Pebble) ClearMVCCRangeAndIntents Uses

func (p *Pebble) ClearMVCCRangeAndIntents(start, end roachpb.Key) error

ClearMVCCRangeAndIntents implements the Engine interface.

func (*Pebble) ClearRawRange Uses

func (p *Pebble) ClearRawRange(start, end roachpb.Key) error

ClearRawRange implements the Engine interface.

func (*Pebble) ClearUnversioned Uses

func (p *Pebble) ClearUnversioned(key roachpb.Key) error

ClearUnversioned implements the Engine interface.

func (*Pebble) Close Uses

func (p *Pebble) Close()

Close implements the Engine interface.

func (*Pebble) Closed Uses

func (p *Pebble) Closed() bool

Closed implements the Engine interface.

func (*Pebble) Compact Uses

func (p *Pebble) Compact() error

Compact implements the Engine interface.

func (*Pebble) CompactRange Uses

func (p *Pebble) CompactRange(start, end roachpb.Key, forceBottommost bool) error

CompactRange implements the Engine interface.

func (*Pebble) Create Uses

func (p *Pebble) Create(name string) (fs.File, error)

Create implements the FS interface.

func (*Pebble) CreateCheckpoint Uses

func (p *Pebble) CreateCheckpoint(dir string) error

CreateCheckpoint implements the Engine interface.

func (*Pebble) CreateWithSync Uses

func (p *Pebble) CreateWithSync(name string, bytesPerSync int) (fs.File, error)

CreateWithSync implements the FS interface.

func (*Pebble) ExportMVCCToSst Uses

func (p *Pebble) ExportMVCCToSst(
    startKey, endKey roachpb.Key,
    startTS, endTS hlc.Timestamp,
    exportAllRevisions bool,
    targetSize, maxSize uint64,
    io IterOptions,
) ([]byte, roachpb.BulkOpSummary, roachpb.Key, error)

ExportMVCCToSst is part of the Reader interface.

func (*Pebble) Flush Uses

func (p *Pebble) Flush() error

Flush implements the Engine interface.

func (*Pebble) GetAuxiliaryDir Uses

func (p *Pebble) GetAuxiliaryDir() string

GetAuxiliaryDir implements the Engine interface.

func (*Pebble) GetCompactionStats Uses

func (p *Pebble) GetCompactionStats() string

GetCompactionStats implements the Engine interface.

func (*Pebble) GetEncryptionRegistries Uses

func (p *Pebble) GetEncryptionRegistries() (*EncryptionRegistries, error)

GetEncryptionRegistries implements the Engine interface.

func (*Pebble) GetEnvStats Uses

func (p *Pebble) GetEnvStats() (*EnvStats, error)

GetEnvStats implements the Engine interface.

func (*Pebble) GetMetrics Uses

func (p *Pebble) GetMetrics() (*Metrics, error)

GetMetrics implements the Engine interface.

func (*Pebble) InMem Uses

func (p *Pebble) InMem() bool

InMem returns true if the receiver is an in-memory engine and false otherwise.

func (*Pebble) IngestExternalFiles Uses

func (p *Pebble) IngestExternalFiles(ctx context.Context, paths []string) error

IngestExternalFiles implements the Engine interface.

func (*Pebble) Link Uses

func (p *Pebble) Link(oldname, newname string) error

Link implements the FS interface.

func (*Pebble) List Uses

func (p *Pebble) List(name string) ([]string, error)

List implements the FS interface.

func (*Pebble) LogData Uses

func (p *Pebble) LogData(data []byte) error

LogData implements the Engine interface.

func (*Pebble) LogLogicalOp Uses

func (p *Pebble) LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails)

LogLogicalOp implements the Engine interface.

func (*Pebble) MVCCGet Uses

func (p *Pebble) MVCCGet(key MVCCKey) ([]byte, error)

MVCCGet implements the Engine interface.

func (*Pebble) MVCCGetProto Uses

func (p *Pebble) MVCCGetProto(
    key MVCCKey, msg protoutil.Message,
) (ok bool, keyBytes, valBytes int64, err error)

MVCCGetProto implements the Engine interface.

func (*Pebble) MVCCIterate Uses

func (p *Pebble) MVCCIterate(
    start, end roachpb.Key, iterKind MVCCIterKind, f func(MVCCKeyValue) error,
) error

MVCCIterate implements the Engine interface.

func (*Pebble) Merge Uses

func (p *Pebble) Merge(key MVCCKey, value []byte) error

Merge implements the Engine interface.

func (*Pebble) MkdirAll Uses

func (p *Pebble) MkdirAll(name string) error

MkdirAll implements the FS interface.

func (*Pebble) NewBatch Uses

func (p *Pebble) NewBatch() Batch

NewBatch implements the Engine interface.

func (*Pebble) NewEngineIterator Uses

func (p *Pebble) NewEngineIterator(opts IterOptions) EngineIterator

NewEngineIterator implements the Engine interface.

func (*Pebble) NewMVCCIterator Uses

func (p *Pebble) NewMVCCIterator(iterKind MVCCIterKind, opts IterOptions) MVCCIterator

NewMVCCIterator implements the Engine interface.

func (*Pebble) NewReadOnly Uses

func (p *Pebble) NewReadOnly() ReadWriter

NewReadOnly implements the Engine interface.

func (*Pebble) NewSnapshot Uses

func (p *Pebble) NewSnapshot() Reader

NewSnapshot implements the Engine interface.

func (*Pebble) NewWriteOnlyBatch Uses

func (p *Pebble) NewWriteOnlyBatch() Batch

NewWriteOnlyBatch implements the Engine interface.

func (*Pebble) Open Uses

func (p *Pebble) Open(name string) (fs.File, error)

Open implements the FS interface.

func (*Pebble) OpenDir Uses

func (p *Pebble) OpenDir(name string) (fs.File, error)

OpenDir implements the FS interface.

func (*Pebble) PreIngestDelay Uses

func (p *Pebble) PreIngestDelay(ctx context.Context)

PreIngestDelay implements the Engine interface.

func (*Pebble) PutEngineKey Uses

func (p *Pebble) PutEngineKey(key EngineKey, value []byte) error

PutEngineKey implements the Engine interface.

func (*Pebble) PutIntent Uses

func (p *Pebble) PutIntent(
    key roachpb.Key,
    value []byte,
    state PrecedingIntentState,
    txnDidNotUpdateMeta bool,
    txnUUID uuid.UUID,
) error

PutIntent implements the Engine interface.

func (*Pebble) PutMVCC Uses

func (p *Pebble) PutMVCC(key MVCCKey, value []byte) error

PutMVCC implements the Engine interface.

func (*Pebble) PutUnversioned Uses

func (p *Pebble) PutUnversioned(key roachpb.Key, value []byte) error

PutUnversioned implements the Engine interface.

func (*Pebble) ReadFile Uses

func (p *Pebble) ReadFile(filename string) ([]byte, error)

ReadFile implements the Engine interface.

func (*Pebble) Remove Uses

func (p *Pebble) Remove(filename string) error

Remove implements the FS interface.

func (*Pebble) RemoveAll Uses

func (p *Pebble) RemoveAll(dir string) error

RemoveAll implements the Engine interface.

func (*Pebble) RemoveDir Uses

func (p *Pebble) RemoveDir(name string) error

RemoveDir implements the FS interface.

func (*Pebble) Rename Uses

func (p *Pebble) Rename(oldname, newname string) error

Rename implements the FS interface.

func (*Pebble) SingleClearEngineKey Uses

func (p *Pebble) SingleClearEngineKey(key EngineKey) error

SingleClearEngineKey implements the Engine interface.

func (*Pebble) Stat Uses

func (p *Pebble) Stat(name string) (os.FileInfo, error)

Stat implements the FS interface.

func (*Pebble) String Uses

func (p *Pebble) String() string

func (*Pebble) Type Uses

func (p *Pebble) Type() enginepb.EngineType

Type implements the Engine interface.

func (*Pebble) WriteFile Uses

func (p *Pebble) WriteFile(filename string, data []byte) error

WriteFile writes data to a file in this engine's env.

type PebbleConfig Uses

type PebbleConfig struct {
    // StorageConfig contains storage configs for all storage engines.
    base.StorageConfig
    // Pebble specific options.
    Opts *pebble.Options
}

PebbleConfig holds all configuration parameters and knobs used in setting up a new Pebble instance.

type PebbleFileRegistry Uses

type PebbleFileRegistry struct {

    // The FS to write the file registry file.
    FS  vfs.FS

    // The directory used by the DB. It is used to construct the name of the file registry file and
    // to turn absolute path names of files in this directory into relative path names. The latter
    // is done for compatibility with the file registry implemented for RocksDB, even though it
    // currently requires some potentially non-portable filepath manipulation.
    DBDir string

    // Is the DB read only.
    ReadOnly bool
    // contains filtered or unexported fields
}

PebbleFileRegistry keeps track of files for the data-FS and store-FS for Pebble (see encrypted_fs.go for high-level comment).

It is created even when file registry is disabled, so that it can be used to ensure that a registry file did not exist previously, since that would indicate that disabling the registry can cause data loss.

func (*PebbleFileRegistry) GetFileEntry Uses

func (r *PebbleFileRegistry) GetFileEntry(filename string) *enginepb.FileEntry

GetFileEntry gets the file entry corresponding to filename, if there is one, else returns nil.

func (*PebbleFileRegistry) Load Uses

func (r *PebbleFileRegistry) Load() error

Load loads the contents of the file registry from a file if the file exists; otherwise it is a no-op. It must be called at most once, before the other functions.

func (*PebbleFileRegistry) MaybeDeleteEntry Uses

func (r *PebbleFileRegistry) MaybeDeleteEntry(filename string) error

MaybeDeleteEntry deletes the entry for filename, if it exists, and persists the registry, if changed.

func (*PebbleFileRegistry) MaybeLinkEntry Uses

func (r *PebbleFileRegistry) MaybeLinkEntry(src, dst string) error

MaybeLinkEntry copies the entry under src to dst, if src exists. If src does not exist, but dst exists, dst is deleted. Persists the registry if changed.

func (*PebbleFileRegistry) MaybeRenameEntry Uses

func (r *PebbleFileRegistry) MaybeRenameEntry(src, dst string) error

MaybeRenameEntry moves the entry under src to dst, if src exists. If src does not exist, but dst exists, dst is deleted. Persists the registry if changed.
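The link and rename rules can be modeled with a plain map. In this sketch toyRegistry is a hypothetical type whose entries are strings, and the boolean result stands in for "the registry changed and would need to be persisted"; persistence itself is not modeled.

```go
package main

import "fmt"

// toyRegistry is a hypothetical in-memory model of filename => entry.
type toyRegistry map[string]string

// maybeLink copies the entry under src to dst if src exists. If src does
// not exist but dst does, dst is deleted. It reports whether the map changed.
func (r toyRegistry) maybeLink(src, dst string) bool {
	if entry, ok := r[src]; ok {
		r[dst] = entry
		return true
	}
	if _, ok := r[dst]; ok {
		delete(r, dst)
		return true
	}
	return false
}

// maybeRename moves the entry under src to dst if src exists. If src does
// not exist but dst does, dst is deleted. It reports whether the map changed.
func (r toyRegistry) maybeRename(src, dst string) bool {
	if entry, ok := r[src]; ok {
		delete(r, src)
		r[dst] = entry
		return true
	}
	if _, ok := r[dst]; ok {
		delete(r, dst)
		return true
	}
	return false
}

func main() {
	r := toyRegistry{"000001.sst": "encrypted", "stale.sst": "encrypted"}
	fmt.Println(r.maybeLink("000001.sst", "copy.sst"))   // prints true
	fmt.Println(r.maybeRename("missing.sst", "stale.sst")) // prints true
	fmt.Println(len(r))                                  // prints 2
}
```

Deleting dst when src has no entry keeps the registry from carrying a stale entry for a file that has just been overwritten by an unregistered one.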

func (*PebbleFileRegistry) SetFileEntry Uses

func (r *PebbleFileRegistry) SetFileEntry(filename string, entry *enginepb.FileEntry) error

SetFileEntry sets filename => entry in the registry map and persists the registry.

type PrecedingIntentState Uses

type PrecedingIntentState int

PrecedingIntentState is information needed when writing or clearing an intent for a transaction. It specifies the state of the intent that was there before this write (for the specified transaction).

const (
    // ExistingIntentInterleaved specifies that there is an existing intent and
    // that it is interleaved.
    ExistingIntentInterleaved PrecedingIntentState = iota
    // ExistingIntentSeparated specifies that there is an existing intent and
    // that it is separated (in the lock table key space).
    ExistingIntentSeparated
    // NoExistingIntent specifies that there isn't an existing intent.
    NoExistingIntent
)

func (PrecedingIntentState) String Uses

func (is PrecedingIntentState) String() string

type ReadWriter Uses

type ReadWriter interface {
    Reader
    Writer
}

ReadWriter is the read/write interface to an engine's data.

type Reader Uses

type Reader interface {
    // Close closes the reader, freeing up any outstanding resources. Note that
    // various implementations have slightly different behaviors. In particular,
    // Distinct() batches release their parent batch for future use while
    // Engines, Snapshots and Batches free the associated C++ resources.
    Close()
    // Closed returns true if the reader has been closed or is not usable.
    // Objects backed by this reader (e.g. Iterators) can check this to ensure
    // that they are not using a closed engine. Intended for use within package
    // engine; exported to enable wrappers to exist in other packages.
    Closed() bool
    // ExportMVCCToSst exports changes to the keyrange [startKey, endKey) over the
    // interval (startTS, endTS]. Passing exportAllRevisions exports
    // every revision of a key for the interval, otherwise only the latest value
    // within the interval is exported. Deletions are included if all revisions are
    // requested or if the start.Timestamp is non-zero. Returns the bytes of an
    // SSTable containing the exported keys, the size of exported data, or an error.
    //
    // If targetSize is positive, it indicates that the export should produce SSTs
    // which are roughly target size. Specifically, it will return an SST such that
    // the last key is responsible for meeting or exceeding the targetSize. If the
    // resumeKey is non-nil then the data size of the returned sst will be greater
    // than or equal to the targetSize.
    //
    // If maxSize is positive, it is an absolute maximum on byte size for the
    // returned sst. If it is the case that the versions of the last key will lead
    // to an SST that exceeds maxSize, an error will be returned. This parameter
    // exists to prevent creating SSTs which are too large to be used.
    //
    // This function looks at MVCC versions and intents, and returns an error if an
    // intent is found.
    ExportMVCCToSst(
        startKey, endKey roachpb.Key, startTS, endTS hlc.Timestamp,
        exportAllRevisions bool, targetSize uint64, maxSize uint64,
        io IterOptions,
    ) (sst []byte, _ roachpb.BulkOpSummary, resumeKey roachpb.Key, _ error)
    // MVCCGet returns the value for the given key, nil otherwise. Semantically, it
    // behaves as if an iterator with MVCCKeyAndIntentsIterKind was used.
    //
    // Deprecated: use storage.MVCCGet instead.
    MVCCGet(key MVCCKey) ([]byte, error)
    // MVCCGetProto fetches the value at the specified key and unmarshals it
    // using a protobuf decoder. Returns true on success or false if the
    // key was not found. On success, returns the length in bytes of the
    // key and the value. Semantically, it behaves as if an iterator with
    // MVCCKeyAndIntentsIterKind was used.
    //
    // Deprecated: use MVCCIterator.ValueProto instead.
    MVCCGetProto(key MVCCKey, msg protoutil.Message) (ok bool, keyBytes, valBytes int64, err error)
    // MVCCIterate scans from the start key to the end key (exclusive), invoking the
    // function f on each key value pair. If f returns an error or if the scan
    // itself encounters an error, the iteration will stop and return the error.
    // If the first result of f is true, the iteration stops and returns a nil
    // error. Note that this method is not expected to take into account the
    // timestamp of the end key; all MVCCKeys at end.Key are considered excluded
    // in the iteration.
    MVCCIterate(start, end roachpb.Key, iterKind MVCCIterKind, f func(MVCCKeyValue) error) error
    // NewMVCCIterator returns a new instance of an MVCCIterator over this
    // engine. The caller must invoke MVCCIterator.Close() when finished
    // with the iterator to free resources.
    NewMVCCIterator(iterKind MVCCIterKind, opts IterOptions) MVCCIterator
    // NewEngineIterator returns a new instance of an EngineIterator over this
    // engine. The caller must invoke EngineIterator.Close() when finished
    // with the iterator to free resources. The caller can change IterOptions
    // after this function returns.
    NewEngineIterator(opts IterOptions) EngineIterator
}

Reader is the read interface to an engine's data.

type RocksDBBatchBuilder Uses

type RocksDBBatchBuilder struct {
    // contains filtered or unexported fields
}

RocksDBBatchBuilder is used to construct the RocksDB batch representation. From the RocksDB code, the representation of a batch is:

WriteBatch::rep_ :=
   sequence: fixed64
   count: fixed32
   data: record[count]
record :=
   kTypeValue varstring varstring
   kTypeDeletion varstring
   [...] (see BatchType)
varstring :=
   len: varint32
   data: uint8[len]

The RocksDBBatchBuilder code currently only supports kTypeValue (BatchTypeValue), kTypeDeletion (BatchTypeDeletion), kTypeMerge (BatchTypeMerge), and kTypeSingleDeletion (BatchTypeSingleDeletion) operations. Before a batch is written to the RocksDB write-ahead-log, the sequence number is 0. The "fixed32" format is little endian.

The keys encoded into the batch are MVCC keys: a string key with a timestamp suffix. MVCC keys are encoded as:

<key>[<wall_time>[<logical>[<flags>]]]<#timestamp-bytes>

The <wall_time>, <logical>, and <flags> portions of the key are encoded as 64-bit, 32-bit, and 8-bit big-endian integers, respectively. A custom RocksDB comparator is used to maintain the desired ordering as these keys do not sort lexicographically correctly.
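
A minimal sketch of the nested layout described above, assuming that the trailing byte is simply the number of timestamp bytes emitted. The function name `encodeMVCCKeySketch` is hypothetical, and the real encoding may include details (such as sentinel bytes or a different way of computing the trailing count) that this text does not specify.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeMVCCKeySketch follows the layout
//   <key>[<wall_time>[<logical>[<flags>]]]<#timestamp-bytes>
// Optional parts are nested: flags imply logical, logical implies
// wall_time. The trailing byte holds the number of timestamp bytes
// written (an assumption of this sketch).
func encodeMVCCKeySketch(key []byte, wallTime uint64, logical uint32, flags uint8) []byte {
	var ts [13]byte
	n := 0
	if wallTime != 0 || logical != 0 || flags != 0 {
		binary.BigEndian.PutUint64(ts[0:8], wallTime)
		n = 8
		if logical != 0 || flags != 0 {
			binary.BigEndian.PutUint32(ts[8:12], logical)
			n = 12
			if flags != 0 {
				ts[12] = flags
				n = 13
			}
		}
	}
	out := make([]byte, 0, len(key)+n+1)
	out = append(out, key...)
	out = append(out, ts[:n]...)
	return append(out, byte(n))
}

func main() {
	fmt.Printf("%x\n", encodeMVCCKeySketch([]byte("k"), 0, 0, 0))
	fmt.Printf("%x\n", encodeMVCCKeySketch([]byte("k"), 5, 0, 0))
	fmt.Printf("%x\n", encodeMVCCKeySketch([]byte("k"), 5, 7, 0))
}
```

Because the timestamp suffix has a variable length and follows the user key directly, plain byte-wise comparison of these encodings does not give the desired order, which is why a custom comparator is needed.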

TODO(bilal): This struct exists mostly as a historic artifact. Transition the remaining few test uses of this struct over to pebble.Batch, and remove it entirely.

func (*RocksDBBatchBuilder) Finish Uses

func (b *RocksDBBatchBuilder) Finish() []byte

Finish returns the constructed batch representation. After calling Finish, the builder may be used to construct another batch, but the returned []byte is only valid until the next builder method is called.

func (*RocksDBBatchBuilder) Len Uses

func (b *RocksDBBatchBuilder) Len() int

Len returns the number of bytes currently in the repr under construction.

func (*RocksDBBatchBuilder) Put Uses

func (b *RocksDBBatchBuilder) Put(key MVCCKey, value []byte)

Put sets the given key to the value provided.

It is safe to modify the contents of the arguments after Put returns.

type RocksDBBatchReader Uses

type RocksDBBatchReader struct {
    // contains filtered or unexported fields
}

RocksDBBatchReader is used to iterate the entries in a RocksDB batch representation.

Example:

    r, err := NewRocksDBBatchReader(...)
    if err != nil {
        return err
    }
    for r.Next() {
        switch r.BatchType() {
        case BatchTypeDeletion:
            fmt.Printf("delete(%x)", r.Key())
        case BatchTypeValue:
            fmt.Printf("put(%x,%x)", r.Key(), r.Value())
        case BatchTypeMerge:
            fmt.Printf("merge(%x,%x)", r.Key(), r.Value())
        case BatchTypeSingleDeletion:
            fmt.Printf("single_delete(%x)", r.Key())
        case BatchTypeRangeDeletion:
            fmt.Printf("delete_range(%x,%x)", r.Key(), r.Value())
        }
    }
    if err := r.Error(); err != nil {
        return err
    }

func NewRocksDBBatchReader Uses

func NewRocksDBBatchReader(repr []byte) (*RocksDBBatchReader, error)

NewRocksDBBatchReader creates a RocksDBBatchReader from the given repr and verifies the header.
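
To make the header check concrete, here is a self-contained sketch of decoding a batch representation with the layout documented for RocksDBBatchBuilder. It is not the RocksDBBatchReader code: `decodeBatch`, `readVarstring`, and `entry` are hypothetical names, only value and deletion records are handled, and the tag values 0x1 and 0x0 are assumptions.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// headerSize covers sequence (fixed64) plus count (fixed32).
const headerSize = 12

// entry is one decoded record.
type entry struct {
	tag        byte
	key, value []byte
}

// readVarstring decodes one varstring and returns it with the rest.
func readVarstring(data []byte) (s, rest []byte, err error) {
	n, w := binary.Uvarint(data)
	if w <= 0 || uint64(len(data)-w) < n {
		return nil, nil, errors.New("malformed varstring")
	}
	return data[w : w+int(n)], data[w+int(n):], nil
}

// decodeBatch checks the header, decodes the records, and compares
// the number of decoded records with the declared count.
func decodeBatch(repr []byte) ([]entry, error) {
	if len(repr) < headerSize {
		return nil, errors.New("batch repr too small for header")
	}
	count := binary.LittleEndian.Uint32(repr[8:12])
	data := repr[headerSize:]
	var out []entry
	for len(data) > 0 {
		e := entry{tag: data[0]}
		data = data[1:]
		var err error
		switch e.tag {
		case 0x0: // deletion: key only (assumed tag value)
			e.key, data, err = readVarstring(data)
		case 0x1: // value: key then value (assumed tag value)
			if e.key, data, err = readVarstring(data); err == nil {
				e.value, data, err = readVarstring(data)
			}
		default:
			err = fmt.Errorf("unsupported record tag %#x", e.tag)
		}
		if err != nil {
			return nil, err
		}
		out = append(out, e)
	}
	if uint32(len(out)) != count {
		return nil, fmt.Errorf("count mismatch: header %d, decoded %d", count, len(out))
	}
	return out, nil
}

func main() {
	repr := []byte{
		0, 0, 0, 0, 0, 0, 0, 0, // sequence
		2, 0, 0, 0, // count, little endian
		0x1, 1, 'a', 2, 'x', 'y', // put(a, xy)
		0x0, 1, 'b', // delete(b)
	}
	entries, err := decodeBatch(repr)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("tag=%d key=%q value=%q\n", e.tag, e.key, e.value)
	}
}
```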

func (*RocksDBBatchReader) BatchType Uses

func (r *RocksDBBatchReader) BatchType() BatchType

BatchType returns the type of the current batch entry.

func (*RocksDBBatchReader) Count Uses

func (r *RocksDBBatchReader) Count() int

Count returns the declared number of entries in the batch.

func (*RocksDBBatchReader) EngineKey Uses

func (r *RocksDBBatchReader) EngineKey() (EngineKey, error)

EngineKey returns the EngineKey for the current batch entry.

func (*RocksDBBatchReader) Error Uses

func (r *RocksDBBatchReader) Error() error

Error returns the error, if any, which the iterator encountered.

func (*RocksDBBatchReader) Key Uses

func (r *RocksDBBatchReader) Key() []byte

Key returns the key of the current batch entry.

func (*RocksDBBatchReader) MVCCEndKey Uses

func (r *RocksDBBatchReader) MVCCEndKey() (MVCCKey, error)

MVCCEndKey returns the MVCC end key of the current batch entry.

func (*RocksDBBatchReader) MVCCKey Uses

func (r *RocksDBBatchReader) MVCCKey() (MVCCKey, error)

MVCCKey returns the MVCC key of the current batch entry.

func (*RocksDBBatchReader) Next Uses

func (r *RocksDBBatchReader) Next() bool

Next advances to the next entry in the batch, returning false when the batch is empty.

func (*RocksDBBatchReader) Value Uses

func (r *RocksDBBatchReader) Value() []byte

Value returns the value of the current batch entry. Value panics if the BatchType is BatchTypeDeletion.

type RowCounter Uses

type RowCounter struct {
    roachpb.BulkOpSummary
    // contains filtered or unexported fields
}

RowCounter is a helper that counts how many distinct rows appear in the KVs that it is shown via `Count`. Note: the `DataSize` field of the BulkOpSummary is *not* populated by this and should be set separately.

func (*RowCounter) Count Uses

func (r *RowCounter) Count(key roachpb.Key) error

Count examines each key passed to it and increments the running count when it sees a key that belongs to a new row.

type SSTWriter Uses

type SSTWriter struct {

    // DataSize tracks the total key and value bytes added so far.
    DataSize int64
    // contains filtered or unexported fields
}

SSTWriter writes SSTables.

func MakeBackupSSTWriter Uses

func MakeBackupSSTWriter(f writeCloseSyncer) SSTWriter

MakeBackupSSTWriter creates a new SSTWriter tailored for backup SSTs. These SSTs have bloom filters disabled and format set to LevelDB.

func MakeIngestionSSTWriter Uses

func MakeIngestionSSTWriter(f writeCloseSyncer) SSTWriter

MakeIngestionSSTWriter creates a new SSTWriter tailored for ingestion SSTs. These SSTs have bloom filters enabled (as set in DefaultPebbleOptions) and format set to RocksDBv2.

func (*SSTWriter) ApplyBatchRepr Uses

func (fw *SSTWriter) ApplyBatchRepr(repr []byte, sync bool) error

ApplyBatchRepr implements the Writer interface.

func (*SSTWriter) ClearEngineKey Uses

func (fw *SSTWriter) ClearEngineKey(key EngineKey) error

ClearEngineKey implements the Writer interface. An error is returned if it is not greater than any previous point key passed to this Writer (according to the comparator configured during writer creation). `Close` cannot have been called.

func (*SSTWriter) ClearIntent Uses

func (fw *SSTWriter) ClearIntent(
    key roachpb.Key, state PrecedingIntentState, txnDidNotUpdateMeta bool, txnUUID uuid.UUID,
) error

ClearIntent implements the Writer interface. An error is returned if it is not greater than any previous point key passed to this Writer (according to the comparator configured during writer creation). `Close` cannot have been called.

func (*SSTWriter) ClearIterRange Uses

func (fw *SSTWriter) ClearIterRange(iter MVCCIterator, start, end roachpb.Key) error

ClearIterRange implements the Writer interface.

func (*SSTWriter) ClearMVCC Uses

func (fw *SSTWriter) ClearMVCC(key MVCCKey) error

ClearMVCC implements the Writer interface. An error is returned if it is not greater than any previous point key passed to this Writer (according to the comparator configured during writer creation). `Close` cannot have been called.

func (*SSTWriter) ClearMVCCRange Uses

func (fw *SSTWriter) ClearMVCCRange(start, end MVCCKey) error

ClearMVCCRange implements the Writer interface.

func (*SSTWriter) ClearMVCCRangeAndIntents Uses

func (fw *SSTWriter) ClearMVCCRangeAndIntents(start, end roachpb.Key) error

ClearMVCCRangeAndIntents implements the Writer interface.

func (*SSTWriter) ClearRawRange Uses

func (fw *SSTWriter) ClearRawRange(start, end roachpb.Key) error

ClearRawRange implements the Writer interface.

func (*SSTWriter) ClearUnversioned Uses

func (fw *SSTWriter) ClearUnversioned(key roachpb.Key) error

ClearUnversioned implements the Writer interface. An error is returned if it is not greater than any previous point key passed to this Writer (according to the comparator configured during writer creation). `Close` cannot have been called.

func (*SSTWriter) Close Uses

func (fw *SSTWriter) Close()

Close finishes and frees memory and other resources. Close is idempotent.

func (*SSTWriter) Finish Uses

func (fw *SSTWriter) Finish() error

Finish finalizes the writer and returns the constructed file's contents, since the last call to Truncate (if any). At least one kv entry must have been added.

func (*SSTWriter) LogData Uses

func (fw *SSTWriter) LogData(data []byte) error

LogData implements the Writer interface.

func (*SSTWriter) LogLogicalOp Uses

func (fw *SSTWriter) LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails)

LogLogicalOp implements the Writer interface.

func (*SSTWriter) Merge Uses

func (fw *SSTWriter) Merge(key MVCCKey, value []byte) error

Merge implements the Writer interface.

func (*SSTWriter) Put Uses

func (fw *SSTWriter) Put(key MVCCKey, value []byte) error

Put puts a kv entry into the sstable being built. An error is returned if it is not greater than any previously added entry (according to the comparator configured during writer creation). `Close` cannot have been called.

TODO(sumeer): Put has been removed from the Writer interface, but there are many callers of this SSTWriter method. Fix those callers and remove.

func (*SSTWriter) PutEngineKey Uses

func (fw *SSTWriter) PutEngineKey(key EngineKey, value []byte) error

PutEngineKey implements the Writer interface. An error is returned if it is not greater than any previously added entry (according to the comparator configured during writer creation). `Close` cannot have been called.

func (*SSTWriter) PutIntent Uses

func (fw *SSTWriter) PutIntent(
    key roachpb.Key,
    value []byte,
    state PrecedingIntentState,
    txnDidNotUpdateMeta bool,
    txnUUID uuid.UUID,
) error

PutIntent implements the Writer interface. An error is returned if it is not greater than any previously added entry (according to the comparator configured during writer creation). `Close` cannot have been called.

func (*SSTWriter) PutMVCC Uses

func (fw *SSTWriter) PutMVCC(key MVCCKey, value []byte) error

PutMVCC implements the Writer interface. An error is returned if it is not greater than any previously added entry (according to the comparator configured during writer creation). `Close` cannot have been called.
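
The ordering requirement shared by the Put* and Clear* methods can be sketched with a toy writer. This is not SSTWriter: `orderedWriter` is a hypothetical type, `bytes.Compare` stands in for the comparator configured at writer creation, and the size accounting only mirrors the description of the DataSize field.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// orderedWriter is a hypothetical writer that accepts keys only in
// strictly increasing order, as an sstable under construction does.
type orderedWriter struct {
	last     []byte
	hasLast  bool
	closed   bool
	dataSize int64 // total key and value bytes added so far
}

// put rejects a key that is not greater than the previous one.
func (w *orderedWriter) put(key, value []byte) error {
	if w.closed {
		return errors.New("writer is closed")
	}
	if w.hasLast && bytes.Compare(key, w.last) <= 0 {
		return fmt.Errorf("key %q is not greater than previous key %q", key, w.last)
	}
	// Copy the key so the caller may modify its slice afterwards.
	w.last = append(w.last[:0], key...)
	w.hasLast = true
	w.dataSize += int64(len(key) + len(value))
	return nil
}

// close marks the writer unusable; calling it twice is harmless.
func (w *orderedWriter) close() { w.closed = true }

func main() {
	w := &orderedWriter{}
	fmt.Println(w.put([]byte("a"), []byte("1")))
	fmt.Println(w.put([]byte("b"), []byte("22")))
	fmt.Println(w.put([]byte("a"), []byte("3"))) // out of order
	w.close()
	fmt.Println(w.put([]byte("c"), []byte("4"))) // after close
	fmt.Println("bytes:", w.dataSize)
}
```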

func (*SSTWriter) PutUnversioned Uses

func (fw *SSTWriter) PutUnversioned(key roachpb.Key, value []byte) error

PutUnversioned implements the Writer interface. An error is returned if it is not greater than any previously added entry (according to the comparator configured during writer creation). `Close` cannot have been called.

func (*SSTWriter) SingleClearEngineKey Uses

func (fw *SSTWriter) SingleClearEngineKey(key EngineKey) error

SingleClearEngineKey implements the Writer interface.

type SSTableInfo Uses

type SSTableInfo struct {
    Level int
    Size  int64
    Start MVCCKey
    End   MVCCKey
}

SSTableInfo contains metadata about a single sstable. Note this mirrors the C.DBSSTable struct contents for compatibility with RocksDB.

type SSTableInfos Uses

type SSTableInfos []SSTableInfo

SSTableInfos is a slice of SSTableInfo structures.

func (SSTableInfos) Len Uses

func (s SSTableInfos) Len() int

func (SSTableInfos) Less Uses

func (s SSTableInfos) Less(i, j int) bool

func (SSTableInfos) Swap Uses

func (s SSTableInfos) Swap(i, j int)

type SSTableInfosByLevel Uses

type SSTableInfosByLevel struct {
    // contains filtered or unexported fields
}

SSTableInfosByLevel maintains slices of SSTableInfo objects, one per level. The slice for each level contains the SSTableInfo objects for SSTables at that level, sorted by start key.

func NewSSTableInfosByLevel Uses

func NewSSTableInfosByLevel(s SSTableInfos) SSTableInfosByLevel

NewSSTableInfosByLevel returns a new SSTableInfosByLevel object based on the supplied SSTableInfos slice.

func (*SSTableInfosByLevel) MaxLevel Uses

func (s *SSTableInfosByLevel) MaxLevel() int

MaxLevel returns the maximum level for which there are SSTables.

func (*SSTableInfosByLevel) MaxLevelSpanOverlapsContiguousSSTables Uses

func (s *SSTableInfosByLevel) MaxLevelSpanOverlapsContiguousSSTables(span roachpb.Span) int

MaxLevelSpanOverlapsContiguousSSTables returns the maximum level at which the specified key span overlaps either none, one, or at most two contiguous SSTables. Level 0 is returned if no level qualifies.

This is useful when considering when to merge two compactions. In this case, the method is called with the "gap" between the two spans to be compacted. When the result is that the gap span touches at most two SSTables at a high level, it suggests that merging the two compactions is a good idea (as the up to two SSTables touched by the gap span, due to containing endpoints of the existing compactions, would be rewritten anyway).

As an example, consider the following sstables in a small database:

Level 0.

{Level: 0, Size: 20, Start: key("a"), End: key("z")},
{Level: 0, Size: 15, Start: key("a"), End: key("k")},

Level 2.

{Level: 2, Size: 200, Start: key("a"), End: key("j")},
{Level: 2, Size: 100, Start: key("k"), End: key("o")},
{Level: 2, Size: 100, Start: key("r"), End: key("t")},

Level 6.

{Level: 6, Size: 201, Start: key("a"), End: key("c")},
{Level: 6, Size: 200, Start: key("d"), End: key("f")},
{Level: 6, Size: 300, Start: key("h"), End: key("r")},
{Level: 6, Size: 405, Start: key("s"), End: key("z")},

- The span "a"-"c" overlaps only a single SSTable at the max level (L6). That's great, so we definitely want to compact that.

- The span "s"-"t" overlaps zero SSTables at the max level (L6). Again, great! That means we're going to compact the 3rd L2 SSTable and maybe push that directly to L6.
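
The per-level grouping and the "at most two SSTables" test can be approximated with the sketch below, which uses the example tables listed above. It is a simplification, not the package's algorithm: the names are hypothetical, keys are plain strings, table end keys are treated as inclusive and span end keys as exclusive, and level 0 is never examined (all assumptions). Within a level whose tables are sorted and disjoint, any tables that overlap one span are necessarily adjacent, so counting overlaps is enough here.

```go
package main

import (
	"fmt"
	"sort"
)

// sstInfo is a hypothetical, string-keyed stand-in for SSTableInfo.
type sstInfo struct {
	level      int
	start, end string
}

// byLevel groups tables per level and sorts each level by start key.
func byLevel(tables []sstInfo) [][]sstInfo {
	maxLevel := 0
	for _, t := range tables {
		if t.level > maxLevel {
			maxLevel = t.level
		}
	}
	levels := make([][]sstInfo, maxLevel+1)
	for _, t := range tables {
		levels[t.level] = append(levels[t.level], t)
	}
	for _, l := range levels {
		l := l
		sort.Slice(l, func(i, j int) bool { return l[i].start < l[j].start })
	}
	return levels
}

// overlapCount counts tables intersecting [spanStart, spanEnd).
func overlapCount(tables []sstInfo, spanStart, spanEnd string) int {
	n := 0
	for _, t := range tables {
		if t.start < spanEnd && spanStart <= t.end {
			n++
		}
	}
	return n
}

// maxLevelAtMostTwo returns the highest level above 0 at which the
// span overlaps at most two tables, or 0 if no level qualifies.
func maxLevelAtMostTwo(levels [][]sstInfo, spanStart, spanEnd string) int {
	for lvl := len(levels) - 1; lvl > 0; lvl-- {
		if overlapCount(levels[lvl], spanStart, spanEnd) <= 2 {
			return lvl
		}
	}
	return 0
}

func main() {
	levels := byLevel([]sstInfo{
		{0, "a", "z"}, {0, "a", "k"},
		{2, "a", "j"}, {2, "k", "o"}, {2, "r", "t"},
		{6, "a", "c"}, {6, "d", "f"}, {6, "h", "r"}, {6, "s", "z"},
	})
	fmt.Println(maxLevelAtMostTwo(levels, "a", "c"))
	fmt.Println(maxLevelAtMostTwo(levels, "s", "t"))
	fmt.Println(maxLevelAtMostTwo(levels, "a", "z"))
}
```

Note that in this sketch an empty level trivially qualifies, so a span that overlaps many tables at the deepest level is reported at the next level up.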

type SimpleMVCCIterator Uses

type SimpleMVCCIterator interface {
    // Close frees up resources held by the iterator.
    Close()
    // SeekGE advances the iterator to the first key in the engine which
    // is >= the provided key.
    SeekGE(key MVCCKey)
    // Valid must be called after any call to Seek(), Next(), Prev(), or
    // similar methods. It returns (true, nil) if the iterator points to
    // a valid key (it is undefined to call Key(), Value(), or similar
    // methods unless Valid() has returned (true, nil)). It returns
    // (false, nil) if the iterator has moved past the end of the valid
    // range, or (false, err) if an error has occurred. Valid() will
    // never return true with a non-nil error.
    Valid() (bool, error)
    // Next advances the iterator to the next key/value in the
    // iteration. After this call, Valid() will be true if the
    // iterator was not positioned at the last key.
    Next()
    // NextKey advances the iterator to the next MVCC key. This operation is
    // distinct from Next which advances to the next version of the current key
    // or the next key if the iterator is currently located at the last version
    // for a key. NextKey must not be used to switch iteration direction from
    // reverse iteration to forward iteration.
    NextKey()
    // UnsafeKey returns the same value as Key, but the memory is invalidated on
    // the next call to {Next,NextKey,Prev,SeekGE,SeekLT,Close}.
    UnsafeKey() MVCCKey
    // UnsafeValue returns the same value as Value, but the memory is
    // invalidated on the next call to {Next,NextKey,Prev,SeekGE,SeekLT,Close}.
    UnsafeValue() []byte
}

SimpleMVCCIterator is an interface for iterating over key/value pairs in an engine. SimpleMVCCIterator implementations are thread safe unless otherwise noted. SimpleMVCCIterator is a subset of the functionality offered by MVCCIterator.

func MakeMultiIterator Uses

func MakeMultiIterator(iters []SimpleMVCCIterator) SimpleMVCCIterator

MakeMultiIterator creates an iterator that multiplexes SimpleMVCCIterators. The caller is responsible for closing the passed iterators after closing the returned multiIterator.

If two iterators have an entry with exactly the same key and timestamp, the one with a higher index in this constructor arg is preferred. The other is skipped.
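
The tie-breaking rule can be sketched as a merge over sorted slices. This is an illustration only, not the multiIterator implementation: `mergeEntry`, `entryLess`, and `mergeSorted` are hypothetical names, and each input is assumed to be sorted and free of internal duplicates.

```go
package main

import "fmt"

// mergeEntry is a hypothetical key/timestamp/value triple.
type mergeEntry struct {
	key   string
	ts    int64
	value string
}

// entryLess orders by key ascending, then timestamp descending.
func entryLess(a, b mergeEntry) bool {
	if a.key != b.key {
		return a.key < b.key
	}
	return a.ts > b.ts
}

// mergeSorted merges sorted inputs. When several inputs hold an
// entry with the same key and timestamp, the one from the input
// with the highest index is emitted and the others are skipped.
func mergeSorted(inputs [][]mergeEntry) []mergeEntry {
	pos := make([]int, len(inputs))
	var out []mergeEntry
	for {
		best := -1
		for i := range inputs {
			if pos[i] >= len(inputs[i]) {
				continue
			}
			// A candidate that is smaller than or equal to the
			// current best replaces it, so ties go to the later
			// (higher-index) input.
			if best == -1 || !entryLess(inputs[best][pos[best]], inputs[i][pos[i]]) {
				best = i
			}
		}
		if best == -1 {
			return out
		}
		chosen := inputs[best][pos[best]]
		out = append(out, chosen)
		// Advance every input positioned at the same key and timestamp.
		for i := range inputs {
			if pos[i] < len(inputs[i]) {
				cur := inputs[i][pos[i]]
				if cur.key == chosen.key && cur.ts == chosen.ts {
					pos[i]++
				}
			}
		}
	}
}

func main() {
	merged := mergeSorted([][]mergeEntry{
		{{"a", 2, "old"}, {"b", 1, "x"}},
		{{"a", 2, "new"}, {"c", 1, "y"}},
	})
	for _, e := range merged {
		fmt.Println(e.key, e.ts, e.value)
	}
}
```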

func NewMemSSTIterator Uses

func NewMemSSTIterator(data []byte, verify bool) (SimpleMVCCIterator, error)

NewMemSSTIterator returns a `SimpleMVCCIterator` for an in-memory sstable. It's compatible with sstables written by `RocksDBSstFileWriter` and Pebble's `sstable.Writer`, and assumes the keys use Cockroach's MVCC format.

func NewSSTIterator Uses

func NewSSTIterator(path string) (SimpleMVCCIterator, error)

NewSSTIterator returns a `SimpleMVCCIterator` for the sstable file at the given path. It's compatible with sstables written by `RocksDBSstFileWriter` and Pebble's `sstable.Writer`, and assumes the keys use Cockroach's MVCC format.

type Version Uses

type Version struct {
    Version storageVersion
}

Version stores all the version information for all stores and is used as the format for the version file.

type Writer Uses

type Writer interface {
    // ApplyBatchRepr atomically applies a set of batched updates. The repr is
    // created by calling Repr() on a batch. Using this method is equivalent to
    // constructing and committing a batch whose Repr() equals repr. If sync is
    // true, the batch is synchronously written to disk. It is an error to
    // specify sync=true if the Writer is a Batch.
    //
    // It is safe to modify the contents of the arguments after ApplyBatchRepr
    // returns.
    ApplyBatchRepr(repr []byte, sync bool) error

    // ClearMVCC removes the item from the db with the given MVCCKey. It
    // requires that the timestamp is non-empty (see
    // {ClearUnversioned,ClearIntent} if the timestamp is empty). Note that
    // clear actually removes entries from the storage engine, rather than
    // inserting MVCC tombstones.
    //
    // It is safe to modify the contents of the arguments after it returns.
    ClearMVCC(key MVCCKey) error
    // ClearUnversioned removes an unversioned item from the db. It is for use
    // with inline metadata (not intents) and other unversioned keys (like
    // Range-ID local keys).
    //
    // It is safe to modify the contents of the arguments after it returns.
    ClearUnversioned(key roachpb.Key) error
    // ClearIntent removes an intent from the db. Unlike
    // {ClearMVCC,ClearUnversioned} this is a higher-level method that may make
    // changes in parts of the key space that are not only a function of the
    // input, and may choose to use a single-clear under the covers.
    // txnDidNotUpdateMeta allows for performance optimization when set to true,
    // and has semantics defined in MVCCMetadata.TxnDidNotUpdateMeta (it can
    // be conservatively set to false).
    // REQUIRES: state is ExistingIntentInterleaved or ExistingIntentSeparated.
    //
    // It is safe to modify the contents of the arguments after it returns.
    //
    // TODO(sumeer): after the full transition to separated locks, measure the
    // cost of a PutIntent implementation, where there is an existing intent,
    // that does a <single-clear, put> pair. If there isn't a performance
    // decrease, we can stop tracking txnDidNotUpdateMeta and still optimize
    // ClearIntent by always doing single-clear.
    ClearIntent(
        key roachpb.Key, state PrecedingIntentState, txnDidNotUpdateMeta bool, txnUUID uuid.UUID) error
    // ClearEngineKey removes the item from the db with the given EngineKey.
    // Note that clear actually removes entries from the storage engine. This is
    // a general-purpose and low-level method that should be used sparingly,
    // only when the other Clear* methods are not applicable.
    //
    // It is safe to modify the contents of the arguments after it returns.
    ClearEngineKey(key EngineKey) error

    // ClearRawRange removes a set of entries, from start (inclusive) to end
    // (exclusive). It can be applied to a range consisting of MVCCKeys or the
    // more general EngineKeys -- it simply uses the roachpb.Key parameters as
    // the Key field of an EngineKey. Similar to the other Clear* methods,
    // this method actually removes entries from the storage engine.
    //
    // Note that when used on batches, subsequent reads may not reflect the result
    // of the ClearRawRange.
    //
    // It is safe to modify the contents of the arguments after it returns.
    ClearRawRange(start, end roachpb.Key) error
    // ClearMVCCRangeAndIntents removes MVCC keys and intents from start (inclusive)
    // to end (exclusive). This is a higher-level method that handles both
    // interleaved and separated intents. Similar to the other Clear* methods,
    // this method actually removes entries from the storage engine.
    //
    // Note that when used on batches, subsequent reads may not reflect the result
    // of the ClearMVCCRangeAndIntents.
    //
    // It is safe to modify the contents of the arguments after it returns.
    ClearMVCCRangeAndIntents(start, end roachpb.Key) error
    // ClearMVCCRange removes MVCC keys from start (inclusive) to end
    // (exclusive). It should not be expected to clear intents, though may clear
    // interleaved intents that it encounters. It is meant for efficiently
    // clearing a subset of versions of a key, since the parameters are MVCCKeys
    // and not roachpb.Keys. Similar to the other Clear* methods, this method
    // actually removes entries from the storage engine.
    //
    // Note that when used on batches, subsequent reads may not reflect the result
    // of the ClearMVCCRange.
    //
    // It is safe to modify the contents of the arguments after it returns.
    ClearMVCCRange(start, end MVCCKey) error

    // ClearIterRange removes a set of entries, from start (inclusive) to end
    // (exclusive). Similar to Clear and ClearRange, this method actually
    // removes entries from the storage engine. Unlike ClearRange, the entries
    // to remove are determined by iterating over iter and per-key storage
    // tombstones (not MVCC tombstones) are generated. If the MVCCIterator was
    // constructed using MVCCKeyAndIntentsIterKind, any separated intents/locks
    // will also be cleared.
    //
    // It is safe to modify the contents of the arguments after ClearIterRange
    // returns.
    ClearIterRange(iter MVCCIterator, start, end roachpb.Key) error

    // Merge is a high-performance write operation used for values which are
    // accumulated over several writes. Multiple values can be merged
    // sequentially into a single key; a subsequent read will return a "merged"
    // value which is computed from the original merged values. We only
    // support Merge for keys with no version.
    //
    // Merge currently provides specialized behavior for three data types:
    // integers, byte slices, and time series observations. Merged integers are
    // summed, acting as a high-performance accumulator.  Byte slices are simply
    // concatenated in the order they are merged. Time series observations
    // (stored as byte slices with a special tag on the roachpb.Value) are
    // combined with specialized logic beyond that of simple byte slices.
    //
    // It is safe to modify the contents of the arguments after Merge returns.
    Merge(key MVCCKey, value []byte) error

    // PutMVCC sets the given key to the value provided. It requires that the
    // timestamp is non-empty (see {PutUnversioned,PutIntent} if the timestamp
    // is empty).
    //
    // It is safe to modify the contents of the arguments after Put returns.
    PutMVCC(key MVCCKey, value []byte) error
    // PutUnversioned sets the given key to the value provided. It is for use
    // with inline metadata (not intents) and other unversioned keys (like
    // Range-ID local keys).
    //
    // It is safe to modify the contents of the arguments after Put returns.
    PutUnversioned(key roachpb.Key, value []byte) error
    // PutIntent puts an intent at the given key to the value provided. This is
    // a higher-level method that may make changes in parts of the key space
    // that are not only a function of the input key, and may explicitly clear
    // the preceding intent. txnDidNotUpdateMeta defines what happened prior to
    // this put, and allows for performance optimization when set to true, and
    // has semantics defined in MVCCMetadata.TxnDidNotUpdateMeta (it can be
    // conservatively set to false).
    //
    // It is safe to modify the contents of the arguments after Put returns.
    PutIntent(
        key roachpb.Key, value []byte, state PrecedingIntentState, txnDidNotUpdateMeta bool,
        txnUUID uuid.UUID) error
    // PutEngineKey sets the given key to the value provided. This is a
    // general-purpose and low-level method that should be used sparingly,
    // only when the other Put* methods are not applicable.
    //
    // It is safe to modify the contents of the arguments after Put returns.
    PutEngineKey(key EngineKey, value []byte) error

    // LogData adds the specified data to the RocksDB WAL. The data is
    // uninterpreted by RocksDB (i.e. not added to the memtable or sstables).
    //
    // It is safe to modify the contents of the arguments after LogData returns.
    LogData(data []byte) error
    // LogLogicalOp logs the specified logical mvcc operation with the provided
    // details to the writer, if it has logical op logging enabled. For most
    // Writer implementations, this is a no-op.
    LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails)

    // SingleClearEngineKey removes the most recent write to the item from the db
    // with the given key. Whether older writes of the item will come back
    // to life if not also removed with SingleClear is undefined. See the
    // following:
    //   https://github.com/facebook/rocksdb/wiki/Single-Delete
    // for details on the SingleDelete operation that this method invokes. Note
    // that clear actually removes entries from the storage engine, rather than
    // inserting MVCC tombstones. This is a low-level interface that must not be
    // called from outside the storage package. It is part of the interface
    // because there are structs that wrap Writer and implement the Writer
    // interface, that are not part of the storage package.
    //
    // It is safe to modify the contents of the arguments after it returns.
    SingleClearEngineKey(key EngineKey) error
}

Writer is the write interface to an engine's data.
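
The accumulation behavior described for Merge (integers summed, byte slices concatenated in merge order) can be sketched as follows. `mergeValue` and `mergeSketch` are hypothetical; the time series case is omitted, and rejecting a merge of two different kinds is an assumption of this sketch rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// mergeValue is a hypothetical value that is either an integer or
// a byte slice.
type mergeValue struct {
	isInt bool
	i     int64
	b     []byte
}

// mergeSketch combines an existing value with an update: integers
// are summed and byte slices are concatenated in merge order.
func mergeSketch(existing, update mergeValue) (mergeValue, error) {
	if existing.isInt != update.isInt {
		// Assumption of this sketch: mixing kinds is rejected.
		return mergeValue{}, errors.New("cannot merge values of different kinds")
	}
	if existing.isInt {
		return mergeValue{isInt: true, i: existing.i + update.i}, nil
	}
	merged := append(append([]byte(nil), existing.b...), update.b...)
	return mergeValue{b: merged}, nil
}

func main() {
	sum := mergeValue{isInt: true}
	for _, n := range []int64{1, 2, 3} {
		sum, _ = mergeSketch(sum, mergeValue{isInt: true, i: n})
	}
	fmt.Println(sum.i)

	var buf mergeValue
	for _, s := range []string{"ab", "cd"} {
		buf, _ = mergeSketch(buf, mergeValue{b: []byte(s)})
	}
	fmt.Println(string(buf.b))
}
```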

Directories

Path    Synopsis
cloud
cloudimpl
cloudimpl/filetable
enginepb
fs
metamorphic

Package storage imports 53 packages and is imported by 399 packages. Updated 2020-12-01.