cockroach: github.com/cockroachdb/cockroach/pkg/storage/engine

package engine

import "github.com/cockroachdb/cockroach/pkg/storage/engine"

Package engine provides low-level storage. It interacts with storage backends (e.g. LevelDB, RocksDB, etc.) via the Engine interface. At one level higher, MVCC provides multi-version concurrency control capability on top of an Engine instance.

The Engine interface provides an API for key-value stores. InMem implements an in-memory engine using a sorted map. RocksDB implements an engine for data stored to local disk using RocksDB, a variant of LevelDB.

MVCC provides a multi-version concurrency control system on top of an engine. MVCC is the basis for Cockroach's support for distributed transactions. It is intended for direct use from storage.Range objects.

Notes on MVCC architecture

Each MVCC value contains a metadata key/value pair and one or more version key/value pairs. The MVCC metadata key is the actual key for the value, using the util/encoding.EncodeBytes scheme. The MVCC metadata value is of type MVCCMetadata and contains the most recent version timestamp and an optional roachpb.Transaction message. If set, the most recent version of the MVCC value is a transactional "intent". It also contains some information on the size of the most recent version's key and value for efficient stat counter computations. Note that it is not necessary to explicitly store the MVCC metadata as its contents can be reconstructed from the most recent versioned value as long as an intent is not present. The implementation takes advantage of this and deletes the MVCC metadata when possible.

Each MVCC version key/value pair has a key which is also binary-encoded, but is suffixed with a decreasing, big-endian encoding of the timestamp (eight bytes for the nanosecond wall time, followed by four bytes for the logical time except for meta key value pairs, for which the timestamp is implicit). The MVCC version value is a message of type roachpb.Value. A deletion is indicated by an empty value. Note that an empty roachpb.Value will encode to a non-empty byte slice. The decreasing encoding on the timestamp sorts the most recent version directly after the metadata key, which is treated specially by the RocksDB comparator (by making the zero timestamp sort first). This increases the likelihood that an Engine.Get() of the MVCC metadata will get the same block containing the most recent version, even if there are many versions. We rely on getting the MVCC metadata key/value and then using it to directly get the MVCC version using the metadata's most recent version timestamp. This avoids using an expensive merge iterator to scan the most recent version. It also allows us to leverage RocksDB's bloom filters.

The following is an example of the sort order for MVCC key/value pairs:

...
keyA: MVCCMetadata of keyA
keyA_Timestamp_n: value of version_n
keyA_Timestamp_n-1: value of version_n-1
...
keyA_Timestamp_0: value of version_0
keyB: MVCCMetadata of keyB

The binary encoding used on the MVCC keys allows arbitrary keys to be stored in the map (no restrictions on intermediate nil-bytes, for example), while still sorting lexicographically and guaranteeing that all timestamp-suffixed MVCC version keys sort consecutively with the metadata key. We use an escape-based encoding which transforms all nul ("\x00") characters in the key and is terminated with the sequence "\x00\x01", which is guaranteed to not occur elsewhere in the encoded value. See util/encoding/encoding.go for more details.
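
As a concrete illustration of this ordering, here is a minimal sketch (illustrative key and timestamps) that encodes a metadata key and two versioned keys with EncodeKey and checks with MVCCKeyCompare that they sort as in the example above:

import (
    "fmt"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func keyOrderingExample() {
    key := roachpb.Key("keyA")
    // Metadata key: the user key with no timestamp suffix.
    meta := engine.EncodeKey(engine.MakeMVCCMetadataKey(key))
    // Two versions of the same key; the newer one must sort first.
    v2 := engine.EncodeKey(engine.MVCCKey{Key: key, Timestamp: hlc.Timestamp{WallTime: 2}})
    v1 := engine.EncodeKey(engine.MVCCKey{Key: key, Timestamp: hlc.Timestamp{WallTime: 1}})

    fmt.Println(engine.MVCCKeyCompare(meta, v2) < 0) // true: metadata sorts first
    fmt.Println(engine.MVCCKeyCompare(v2, v1) < 0)   // true: timestamps sort in decreasing order
}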

We considered inlining the most recent MVCC version in the MVCCMetadata. This would reduce the storage overhead of storing the same key twice (which is small due to block compression), and the runtime overhead of two separate DB lookups. On the other hand, all writes that create a new version of an existing key would incur a double write as the previous value is moved out of the MVCCMetadata into its versioned key. Preliminary benchmarks have not shown enough performance improvement to justify this change, although we may revisit this decision if it turns out that multiple versions of the same key are rare in practice.

However, we do allow inlining in order to use the MVCC interface to store non-versioned values. It turns out that not everything which Cockroach needs to store would be efficient or possible using MVCC. Examples include transaction records, abort span entries, stats counters, time series data, and system-local config values. However, supporting a mix of encodings is problematic in terms of resulting complexity. So Cockroach treats an MVCC timestamp of zero to mean an inlined, non-versioned value. These values are replaced if they exist on a Put operation and are cleared from the engine on a delete. Importantly, zero-timestamped MVCC values may be merged, as is necessary for stats counters and time series data.

Index

Package Files

batch.go disk_map.go doc.go engine.go gc.go in_mem.go merge.go multi_iterator.go mvcc.go mvcc_logical_ops.go rocksdb.go rocksdb_64bit.go rocksdb_error.go rocksdb_error_dict.go rocksdb_jemalloc.go slice.go slice_go1.9.go sst_iterator.go temp_dir.go temp_engine.go version.go

Constants

const (
    // RecommendedMaxOpenFiles is the recommended value for RocksDB's
    // max_open_files option.
    RecommendedMaxOpenFiles = 10000
    // MinimumMaxOpenFiles is the minimum value that RocksDB's max_open_files
    // option can be set to. While this should be set as high as possible, the
    // minimum total for a single store node must be under 2048 for Windows
    // compatibility. See:
    // https://wpdev.uservoice.com/forums/266908-command-prompt-console-bash-on-ubuntu-on-windo/suggestions/17310124-add-ability-to-change-max-number-of-open-files-for
    MinimumMaxOpenFiles = 1700
)
const (
    // MVCCVersionTimestampSize is the size of the timestamp portion of MVCC
    // version keys (used to update stats).
    MVCCVersionTimestampSize int64 = 12
)

Variables

var (
    // MVCCKeyMax is a maximum mvcc-encoded key value which sorts after
    // all other keys.
    MVCCKeyMax = MakeMVCCMetadataKey(roachpb.KeyMax)
    // NilKey is the nil MVCCKey.
    NilKey = MVCCKey{}
)
var MVCCComparer = &pebble.Comparer{
    Compare: MVCCKeyCompare,
    AbbreviatedKey: func(k []byte) uint64 {
        key, _, ok := enginepb.SplitMVCCKey(k)
        if !ok {
            return 0
        }
        return pebble.DefaultComparer.AbbreviatedKey(key)
    },

    Format: func(k []byte) fmt.Formatter {
        decoded, err := DecodeMVCCKey(k)
        if err != nil {
            return mvccKeyFormatter{err: err}
        }
        return mvccKeyFormatter{key: decoded}
    },

    Separator: func(dst, a, b []byte) []byte {
        return append(dst, a...)
    },

    Successor: func(dst, a []byte) []byte {
        return append(dst, a...)
    },
    Split: func(k []byte) int {
        if len(k) == 0 {
            return len(k)
        }

        // The final byte of an encoded MVCC key is the length of its
        // timestamp suffix; slicing that suffix (and the length byte) off
        // leaves the user key, which pebble uses as the prefix for bloom
        // filters and prefix iteration.
        tsLen := int(k[len(k)-1])
        keyPartEnd := len(k) - 1 - tsLen
        if keyPartEnd < 0 {
            return len(k)
        }
        return keyPartEnd
    },

    Name: "cockroach_comparator",
}

MVCCComparer is a pebble.Comparer object that implements MVCC-specific comparator settings for use with Pebble.

TODO(itsbilal): Move this to a new file pebble.go.

func CheckForKeyCollisions Uses

func CheckForKeyCollisions(existingIter Iterator, sstIter Iterator) (enginepb.MVCCStats, error)

CheckForKeyCollisions indicates if the two iterators collide on any keys.

func CleanupTempDirs Uses

func CleanupTempDirs(recordPath string) error

CleanupTempDirs removes all directories listed in the record file specified by recordPath. It should be invoked before creating any new temporary directories to clean up abandoned temporary directories. It should also be invoked when a newly created temporary directory is no longer needed and needs to be removed from the record file.

func ClearRangeWithHeuristic Uses

func ClearRangeWithHeuristic(eng Reader, writer Writer, start, end MVCCKey) error

ClearRangeWithHeuristic clears the keys from start (inclusive) to end (exclusive). Depending on the number of keys, it will either use ClearRange or ClearRangeIter.

func ComputeStatsGo Uses

func ComputeStatsGo(
    iter SimpleIterator, start, end MVCCKey, nowNanos int64, callbacks ...func(MVCCKey, []byte) error,
) (enginepb.MVCCStats, error)

ComputeStatsGo scans the underlying engine from start to end keys and computes stats counters based on the values. This method is used after a range is split to recompute stats for each subrange. The start key is always adjusted to avoid counting local keys in the event stats are being recomputed for the first range (i.e. the one with start key == KeyMin). The nowNanos arg specifies the wall time in nanoseconds since the epoch and is used to compute the total age of all intents.

Most codepaths will be computing stats on a RocksDB iterator, which is implemented in c++, so iter.ComputeStats will save several cgo calls per kv processed. (Plus, on equal footing, the c++ implementation is slightly faster.) ComputeStatsGo is here for codepaths that have a pure-go implementation of SimpleIterator.

When optional callbacks are specified, they are invoked for each physical key-value pair (i.e. not for implicit meta records), and iteration is aborted on the first error returned from any of them.

Callbacks must copy any data they intend to hold on to.

This implementation must match engine/db.cc:MVCCComputeStatsInternal.
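
As a rough sketch of how ComputeStatsGo is invoked (the bounds are illustrative, and any Reader whose iterator lacks a native ComputeStats could be handled the same way):

import (
    "time"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
)

func rangeStats(r engine.Reader, startKey, endKey roachpb.Key) (enginepb.MVCCStats, error) {
    iter := r.NewIterator(engine.IterOptions{UpperBound: endKey})
    defer iter.Close()
    // nowNanos feeds the age computations for intents and GC-able data.
    return engine.ComputeStatsGo(
        iter,
        engine.MakeMVCCMetadataKey(startKey),
        engine.MakeMVCCMetadataKey(endKey),
        time.Now().UnixNano(),
    )
}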

func CreateTempDir Uses

func CreateTempDir(parentDir, prefix string, stopper *stop.Stopper) (string, error)

CreateTempDir creates a temporary directory with a prefix under the given parentDir and returns the absolute path of the temporary directory. It is advised to invoke CleanupTempDirs before creating new temporary directories in cases where the disk is completely full.
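
CleanupTempDirs, CreateTempDir and RecordTempDir (documented below) are typically used together; a hedged sketch of that lifecycle, with an illustrative record-file name:

import (
    "path/filepath"

    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/util/stop"
)

func setUpTempDir(parentDir string, stopper *stop.Stopper) (string, error) {
    recordPath := filepath.Join(parentDir, "temp-dirs-record.txt") // illustrative name
    // Remove any directories abandoned by a previous process.
    if err := engine.CleanupTempDirs(recordPath); err != nil {
        return "", err
    }
    tempPath, err := engine.CreateTempDir(parentDir, "cockroach-temp", stopper)
    if err != nil {
        return "", err
    }
    // Record the new directory so a later CleanupTempDirs can find it.
    if err := engine.RecordTempDir(recordPath, tempPath); err != nil {
        return "", err
    }
    return tempPath, nil
}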

func EncodeKey Uses

func EncodeKey(key MVCCKey) []byte

EncodeKey encodes an engine.MVCCKey into the RocksDB representation. This encoding must match the encoding in engine/db.cc:EncodeKey().

func EncodeKeyToBuf Uses

func EncodeKeyToBuf(buf []byte, key MVCCKey) []byte

EncodeKeyToBuf encodes an engine.MVCCKey into the RocksDB representation. This encoding must match the encoding in engine/db.cc:EncodeKey().

func ExportToSst Uses

func ExportToSst(
    ctx context.Context, e Reader, start, end MVCCKey, exportAllRevisions bool, io IterOptions,
) ([]byte, roachpb.BulkOpSummary, error)

ExportToSst exports changes to the keyrange [start.Key, end.Key) over the interval (start.Timestamp, end.Timestamp]. Passing exportAllRevisions exports every revision of a key for the interval, otherwise only the latest value within the interval is exported. Deletions are included if all revisions are requested or if the start.Timestamp is non-zero. Returns the bytes of an SSTable containing the exported keys, the size of exported data, or an error.
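
A minimal sketch of exporting every revision of a key span over a time interval; the bounds are illustrative, and the IterOptions time hints are optional per the IterOptions documentation below:

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func exportSpan(
    ctx context.Context, e engine.Reader, from, to roachpb.Key, startTS, endTS hlc.Timestamp,
) ([]byte, error) {
    // start/end carry both the key bounds and the (startTS, endTS] time interval.
    start := engine.MVCCKey{Key: from, Timestamp: startTS}
    end := engine.MVCCKey{Key: to, Timestamp: endTS}
    sst, _, err := engine.ExportToSst(ctx, e, start, end, true /* exportAllRevisions */, engine.IterOptions{
        UpperBound:       to,
        MinTimestampHint: startTS,
        MaxTimestampHint: endTS,
    })
    return sst, err
}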

func InitRocksDBLogger Uses

func InitRocksDBLogger(ctx context.Context)

InitRocksDBLogger initializes the logger to use for RocksDB log messages. If not called, WARNING, ERROR, and FATAL logs will be output to the normal CockroachDB log.

func IsIntentOf Uses

func IsIntentOf(meta *enginepb.MVCCMetadata, txn *roachpb.Transaction) bool

IsIntentOf returns true if the meta record is an intent of the supplied transaction.

func IsValidSplitKey Uses

func IsValidSplitKey(key roachpb.Key) bool

IsValidSplitKey returns whether the key is a valid split key. Certain key ranges cannot be split (the meta1 span and the system DB span); split keys chosen within any of these ranges are considered invalid. A split key equal to Meta2KeyMax (\x03\xff\xff) is also considered invalid.

func MVCCBlindConditionalPut Uses

func MVCCBlindConditionalPut(
    ctx context.Context,
    engine Writer,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    expVal *roachpb.Value,
    allowIfDoesNotExist CPutMissingBehavior,
    txn *roachpb.Transaction,
) error

MVCCBlindConditionalPut is a fast-path of MVCCConditionalPut. See the MVCCConditionalPut comments for details of the semantics. MVCCBlindConditionalPut skips retrieving the existing metadata for the key, requiring the caller to guarantee that no versions of the key currently exist.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCBlindInitPut Uses

func MVCCBlindInitPut(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    failOnTombstones bool,
    txn *roachpb.Transaction,
) error

MVCCBlindInitPut is a fast-path of MVCCInitPut. See the MVCCInitPut comments for details of the semantics. MVCCBlindInitPut skips retrieving the existing metadata for the key, requiring the caller to guarantee that no versions of the key currently exist.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCBlindPut Uses

func MVCCBlindPut(
    ctx context.Context,
    engine Writer,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    txn *roachpb.Transaction,
) error

MVCCBlindPut is a fast-path of MVCCPut. See the MVCCPut comments for details of the semantics. MVCCBlindPut skips retrieving the existing metadata for the key, requiring the caller to guarantee that no versions of the key currently exist in order for stats to be updated properly. If a previous version of the key does exist, it is up to the caller to properly account for its existence when updating the stats.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCBlindPutProto Uses

func MVCCBlindPutProto(
    ctx context.Context,
    engine Writer,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    msg protoutil.Message,
    txn *roachpb.Transaction,
) error

MVCCBlindPutProto sets the given key to the protobuf-serialized byte string of msg and the provided timestamp. See MVCCBlindPut for a discussion on this fast-path and when it is appropriate to use.

func MVCCClearTimeRange Uses

func MVCCClearTimeRange(
    ctx context.Context,
    batch ReadWriter,
    ms *enginepb.MVCCStats,
    key, endKey roachpb.Key,
    startTime, endTime hlc.Timestamp,
    maxBatchSize int64,
) (*roachpb.Span, error)

MVCCClearTimeRange clears all MVCC versions within the span [key, endKey) which have timestamps in the span (startTime, endTime]. This can have the apparent effect of "reverting" the range to startTime if all of the older revisions of cleared keys are still available (i.e. have not been GC'ed).

Long runs of keys that all qualify for clearing will be cleared via a single clear-range operation. Once maxBatchSize Clear and ClearRange operations are hit during iteration, the next matching key is instead returned in the resumeSpan. It is possible to exceed maxBatchSize by up to the size of the buffer of keys selected for deletion but not yet flushed (as done to detect long runs for cleaning in a single ClearRange).

This function does not handle the stats computations needed to determine the correct incremental deltas of clearing these keys (and whether or not doing so changes the live and gc keys), so the caller is responsible for recomputing stats over the resulting span if needed.

func MVCCConditionalPut Uses

func MVCCConditionalPut(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    expVal *roachpb.Value,
    allowIfDoesNotExist CPutMissingBehavior,
    txn *roachpb.Transaction,
) error

MVCCConditionalPut sets the value for a specified key only if the expected value matches. If not, it returns a ConditionFailedError containing the actual value.

The condition check reads a value from the key using the same operational timestamp as we use to write a value.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.
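
A minimal non-transactional sketch (key, values and timestamp are illustrative) showing how the expected value and CPutMissingBehavior interact:

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func conditionalWrite(ctx context.Context, rw engine.ReadWriter, ts hlc.Timestamp) error {
    ms := &enginepb.MVCCStats{}
    expected := roachpb.MakeValueFromString("old")
    // With CPutAllowIfMissing the put also succeeds when no value exists yet;
    // with CPutFailIfMissing a missing key is a condition failure.
    return engine.MVCCConditionalPut(
        ctx, rw, ms, roachpb.Key("flag"), ts, roachpb.MakeValueFromString("new"),
        &expected, engine.CPutAllowIfMissing, nil /* txn */)
}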

func MVCCDelete Uses

func MVCCDelete(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    txn *roachpb.Transaction,
) error

MVCCDelete marks the key deleted so that it will not be returned in future get responses.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCDeleteRange Uses

func MVCCDeleteRange(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    key, endKey roachpb.Key,
    max int64,
    timestamp hlc.Timestamp,
    txn *roachpb.Transaction,
    returnKeys bool,
) ([]roachpb.Key, *roachpb.Span, int64, error)

MVCCDeleteRange deletes the range of key/value pairs specified by start and end keys. It returns the keys deleted when returnKeys is set, the next span to resume from, and the number of keys deleted. The returned resume span is nil if max keys aren't processed.

func MVCCFindSplitKey Uses

func MVCCFindSplitKey(
    ctx context.Context, engine Reader, key, endKey roachpb.RKey, targetSize int64,
) (roachpb.Key, error)

MVCCFindSplitKey finds a key from the given span such that the left side of the split is roughly targetSize bytes. The returned key will never be chosen from the key ranges listed in keys.NoSplitSpans.

func MVCCGarbageCollect Uses

func MVCCGarbageCollect(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    keys []roachpb.GCRequest_GCKey,
    timestamp hlc.Timestamp,
) error

MVCCGarbageCollect creates an iterator on the engine and, in parallel, iterates through the keys listed for garbage collection in the keys slice. The engine iterator is seeked in turn to each listed key, clearing all values with timestamps less than or equal to the expiration. The timestamp parameter is used to compute the intent age on GC.

func MVCCGet Uses

func MVCCGet(
    ctx context.Context, eng Reader, key roachpb.Key, timestamp hlc.Timestamp, opts MVCCGetOptions,
) (*roachpb.Value, *roachpb.Intent, error)

MVCCGet returns the most recent value for the specified key whose timestamp is less than or equal to the supplied timestamp. If no such value exists, nil is returned instead.

In tombstones mode, if the most recent value is a deletion tombstone, the result will be a non-nil roachpb.Value whose RawBytes field is nil. Otherwise, a deletion tombstone results in a nil roachpb.Value.

In inconsistent mode, if an intent is encountered, it will be placed in the dedicated return parameter. By contrast, in consistent mode, an intent will generate a WriteIntentError with the intent embedded within, and the intent result parameter will be nil.

Note that transactional gets must be consistent. Put another way, only non-transactional gets may be inconsistent.
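
A minimal sketch of a consistent, non-transactional write followed by a read against an in-memory engine; the key, value, cache size and timestamps are illustrative:

import (
    "context"
    "fmt"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func getExample(ctx context.Context) error {
    eng := engine.NewInMem(roachpb.Attributes{}, 1<<20 /* cache size */)
    defer eng.Close()

    key := roachpb.Key("a")
    ms := &enginepb.MVCCStats{}
    if err := engine.MVCCPut(ctx, eng, ms, key, hlc.Timestamp{WallTime: 1},
        roachpb.MakeValueFromString("value"), nil /* txn */); err != nil {
        return err
    }
    // Read at a later timestamp: the most recent version at or below the read
    // timestamp is returned; intent is nil in consistent mode.
    value, intent, err := engine.MVCCGet(ctx, eng, key, hlc.Timestamp{WallTime: 2}, engine.MVCCGetOptions{})
    if err != nil {
        return err
    }
    fmt.Println(value, intent)
    return nil
}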

func MVCCGetAsTxn Uses

func MVCCGetAsTxn(
    ctx context.Context,
    engine Reader,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    txnMeta enginepb.TxnMeta,
) (*roachpb.Value, *roachpb.Intent, error)

MVCCGetAsTxn constructs a temporary transaction from the given transaction metadata and calls MVCCGet as that transaction. This method is required only for reading intents of a transaction when only its metadata is known and should rarely be used.

The read is carried out without the chance of uncertainty restarts.

func MVCCGetProto Uses

func MVCCGetProto(
    ctx context.Context,
    engine Reader,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    msg protoutil.Message,
    opts MVCCGetOptions,
) (bool, error)

MVCCGetProto fetches the value at the specified key and unmarshals it into msg if msg is non-nil. Returns true on success or false if the key was not found.

See the documentation for MVCCGet for the semantics of the MVCCGetOptions.

func MVCCIncrement Uses

func MVCCIncrement(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    txn *roachpb.Transaction,
    inc int64,
) (int64, error)

MVCCIncrement fetches the value for key, and assuming the value is an "integer" type, increments it by inc and stores the new value. The newly incremented value is returned.

An initial value is read from the key using the same operational timestamp as we use to write a value.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCInitPut Uses

func MVCCInitPut(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    failOnTombstones bool,
    txn *roachpb.Transaction,
) error

MVCCInitPut sets the value for a specified key if the key doesn't exist. It returns a ConditionFailedError when the write fails or if the key exists with an existing value that is different from the supplied value. If failOnTombstones is set to true, tombstones count as mismatched values and will cause a ConditionFailedError.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

func MVCCIterate Uses

func MVCCIterate(
    ctx context.Context,
    engine Reader,
    key, endKey roachpb.Key,
    timestamp hlc.Timestamp,
    opts MVCCScanOptions,
    f func(roachpb.KeyValue) (bool, error),
) ([]roachpb.Intent, error)

MVCCIterate iterates over the key range [start,end). At each step of the iteration, f() is invoked with the current key/value pair. If f returns true (done) or an error, the iteration stops and the error is propagated. If the reverse flag is set, the iterator is moved in reverse order.

func MVCCKeyCompare Uses

func MVCCKeyCompare(a, b []byte) int

MVCCKeyCompare compares cockroach keys, including the MVCC timestamps. It assumes these are the keys cockroach usually works with, i.e. "user" keys from the point of view of RocksDB.

func MVCCMerge Uses

func MVCCMerge(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
) error

MVCCMerge implements a merge operation. Merge adds integer values, concatenates undifferentiated byte slice values, and efficiently combines time series observations if the roachpb.Value tag value indicates the value byte slice is of type TIMESERIES.

func MVCCPut Uses

func MVCCPut(
    ctx context.Context,
    eng ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    value roachpb.Value,
    txn *roachpb.Transaction,
) error

MVCCPut sets the value for a specified key. It will save the value with different versions according to its timestamp and update the key metadata. The timestamp must be passed as a parameter; using the Timestamp field on the value results in an error.

Note that, when writing transactionally, the txn's timestamps dictate the timestamp of the operation, and the timestamp parameter is confusing and redundant. See the comment on mvccPutInternal for details.

If the timestamp is specified as hlc.Timestamp{}, the value is inlined instead of being written as a timestamp-versioned value. A zero timestamp write to a key precludes a subsequent write using a non-zero timestamp and vice versa. Inlined values require only a single row and never accumulate more than a single value. Successive zero timestamp writes to a key replace the value and deletes clear the value. In addition, zero timestamp values may be merged.
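
A minimal sketch contrasting a versioned write with an inline (zero timestamp) write; keys, values and the timestamp are illustrative, and a given key must not mix the two styles:

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func putExamples(ctx context.Context, rw engine.ReadWriter) error {
    ms := &enginepb.MVCCStats{}
    // Versioned write: creates a new version at WallTime 10.
    if err := engine.MVCCPut(ctx, rw, ms, roachpb.Key("versioned"),
        hlc.Timestamp{WallTime: 10}, roachpb.MakeValueFromString("v1"), nil /* txn */); err != nil {
        return err
    }
    // Inline write: hlc.Timestamp{} stores a single non-versioned value that
    // later zero-timestamp writes simply replace.
    return engine.MVCCPut(ctx, rw, ms, roachpb.Key("inline"),
        hlc.Timestamp{}, roachpb.MakeValueFromString("meta"), nil /* txn */)
}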

func MVCCPutProto Uses

func MVCCPutProto(
    ctx context.Context,
    engine ReadWriter,
    ms *enginepb.MVCCStats,
    key roachpb.Key,
    timestamp hlc.Timestamp,
    txn *roachpb.Transaction,
    msg protoutil.Message,
) error

MVCCPutProto sets the given key to the protobuf-serialized byte string of msg and the provided timestamp.

func MVCCResolveWriteIntent Uses

func MVCCResolveWriteIntent(
    ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, intent roachpb.Intent,
) error

MVCCResolveWriteIntent either commits or aborts (rolls back) an extant write intent for a given txn according to the commit parameter. ResolveWriteIntent will skip write intents of other txns.

Transaction epochs deserve a bit of explanation. The epoch for a transaction is incremented on transaction retries. A transaction retry is different from an abort. Retries can occur in SSI transactions when the commit timestamp is not equal to the proposed transaction timestamp. On a retry, the epoch is incremented instead of creating an entirely new transaction. This allows the intents that were written on previous runs to serve as locks which prevent concurrent reads from further incrementing the timestamp cache, making further transaction retries less likely.

Because successive retries of a transaction may end up writing to different keys, the epochs serve to classify which intents get committed in the event the transaction succeeds (all those with epoch matching the commit epoch), and which intents get aborted, even if the transaction succeeds.

TODO(tschottdorf): encountered a bug in which a Txn committed with its original timestamp after laying down intents at higher timestamps. Doesn't look like this code here caught that. Shouldn't resolve intents when they're not at the timestamp the Txn mandates them to be.

func MVCCResolveWriteIntentRange Uses

func MVCCResolveWriteIntentRange(
    ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, intent roachpb.Intent, max int64,
) (int64, *roachpb.Span, error)

MVCCResolveWriteIntentRange commits or aborts (rolls back) the range of write intents specified by start and end keys for a given txn. ResolveWriteIntentRange will skip write intents of other txns. Returns the number of intents resolved and a resume span if the max keys limit was exceeded.

func MVCCResolveWriteIntentRangeUsingIter Uses

func MVCCResolveWriteIntentRangeUsingIter(
    ctx context.Context,
    engine ReadWriter,
    iterAndBuf IterAndBuf,
    ms *enginepb.MVCCStats,
    intent roachpb.Intent,
    max int64,
) (int64, *roachpb.Span, error)

MVCCResolveWriteIntentRangeUsingIter commits or aborts (rolls back) the range of write intents specified by start and end keys for a given txn. ResolveWriteIntentRange will skip write intents of other txns. Returns the number of intents resolved and a resume span if the max keys limit was exceeded.

func MVCCResolveWriteIntentUsingIter Uses

func MVCCResolveWriteIntentUsingIter(
    ctx context.Context,
    engine ReadWriter,
    iterAndBuf IterAndBuf,
    ms *enginepb.MVCCStats,
    intent roachpb.Intent,
) error

MVCCResolveWriteIntentUsingIter is a variant of MVCCResolveWriteIntent that uses iterator and buffer passed as parameters (e.g. when used in a loop).

func MVCCScan Uses

func MVCCScan(
    ctx context.Context,
    engine Reader,
    key, endKey roachpb.Key,
    max int64,
    timestamp hlc.Timestamp,
    opts MVCCScanOptions,
) ([]roachpb.KeyValue, *roachpb.Span, []roachpb.Intent, error)

MVCCScan scans the key range [key, endKey) in the provided engine up to some maximum number of results in ascending order. If it hits max, it returns a "resume span" to be used in the next call to this function. If the limit is not hit, the resume span will be nil. Otherwise, it will be the sub-span of [key, endKey) that has not been scanned.

For an unbounded scan, specify a max of MaxInt64. A max of zero means to return no keys at all, which is probably not what you intend.

TODO(benesch): Evaluate whether our behavior when max is zero still makes sense. See #8084 for historical context.

Only keys with a timestamp less than or equal to the supplied timestamp will be included in the scan results. If a transaction is provided and the scan encounters a value with a timestamp between the supplied timestamp and the transaction's max timestamp, an uncertainty error will be returned.

In tombstones mode, if the most recent value for a key is a deletion tombstone, the scan result will contain a roachpb.KeyValue for that key whose RawBytes field is nil. Otherwise, the key-value pair will be omitted from the result entirely.

When scanning inconsistently, any encountered intents will be placed in the dedicated result parameter. By contrast, when scanning consistently, any encountered intents will cause the scan to return a WriteIntentError with the intents embedded within, and the intents result parameter will be nil. In this case a resume span will be returned; this is the only case in which a resume span is returned alongside a non-nil error.

Note that transactional scans must be consistent. Put another way, only non-transactional scans may be inconsistent.
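
A minimal sketch of paginating through a key range by following the resume span; the page size is illustrative:

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func scanAll(
    ctx context.Context, r engine.Reader, start, end roachpb.Key, ts hlc.Timestamp,
) ([]roachpb.KeyValue, error) {
    var results []roachpb.KeyValue
    for {
        kvs, resume, _, err := engine.MVCCScan(ctx, r, start, end, 100 /* max */, ts, engine.MVCCScanOptions{})
        if err != nil {
            return nil, err
        }
        results = append(results, kvs...)
        if resume == nil {
            return results, nil
        }
        // Continue from the unscanned remainder of [start, end).
        start, end = resume.Key, resume.EndKey
    }
}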

func MVCCScanToBytes Uses

func MVCCScanToBytes(
    ctx context.Context,
    engine Reader,
    key, endKey roachpb.Key,
    max int64,
    timestamp hlc.Timestamp,
    opts MVCCScanOptions,
) ([]byte, int64, *roachpb.Span, []roachpb.Intent, error)

MVCCScanToBytes is like MVCCScan, but it returns the results in a byte array.

func MakeValue Uses

func MakeValue(meta enginepb.MVCCMetadata) roachpb.Value

MakeValue returns the inline value.

func MergeInternalTimeSeriesData Uses

func MergeInternalTimeSeriesData(
    mergeIntoNil, usePartialMerge bool, sources ...roachpb.InternalTimeSeriesData,
) (roachpb.InternalTimeSeriesData, error)

MergeInternalTimeSeriesData exports the engine's C++ merge logic for InternalTimeSeriesData to higher level packages. This is intended primarily for consumption by high level testing of time series functionality. If mergeIntoNil is true, then the initial state of the merge is taken to be 'nil' and the first operand is merged into nil. If false, the first operand is taken to be the initial state of the merge. If usePartialMerge is true, the operands are merged together using a partial merge operation first, and are then merged in to the initial state. This can combine with mergeIntoNil: the initial state is either 'nil' or the first operand.

func NewPebbleTempEngine Uses

func NewPebbleTempEngine(
    tempStorage base.TempStorageConfig, storeSpec base.StoreSpec,
) (diskmap.Factory, error)

NewPebbleTempEngine creates a new engine for DistSQL processors to use when the working set is larger than can be stored in memory.

func NewTempEngine Uses

func NewTempEngine(
    tempStorage base.TempStorageConfig, storeSpec base.StoreSpec,
) (diskmap.Factory, error)

NewTempEngine creates a new engine for DistSQL processors to use when the working set is larger than can be stored in memory.

func PutProto Uses

func PutProto(
    engine Writer, key MVCCKey, msg protoutil.Message,
) (keyBytes, valBytes int64, err error)

PutProto sets the given key to the protobuf-serialized byte string of msg. Returns the length in bytes of the key and the value.

Deprecated: use MVCCPutProto instead.

func RecordTempDir Uses

func RecordTempDir(recordPath, tempPath string) error

RecordTempDir records tempPath to the record file specified by recordPath to facilitate cleanup of the temporary directory on subsequent startups.

func RocksDBBatchCount Uses

func RocksDBBatchCount(repr []byte) (int, error)

RocksDBBatchCount provides an efficient way to get the count of mutations in a RocksDB Batch representation.

func RunLDB Uses

func RunLDB(args []string)

RunLDB runs RocksDB's ldb command-line tool. The passed command-line arguments should not include argv[0].

func RunSSTDump Uses

func RunSSTDump(args []string)

RunSSTDump runs RocksDB's sst_dump command-line tool. The passed command-line arguments should not include argv[0].

func SetRocksDBOpenHook Uses

func SetRocksDBOpenHook(fn unsafe.Pointer)

SetRocksDBOpenHook sets the DBOpenHook function that will be called during RocksDB initialization. It is intended to be called by CCL code.

func WriteSyncNoop Uses

func WriteSyncNoop(ctx context.Context, eng Engine) error

WriteSyncNoop carries out a synchronous no-op write to the engine.

type Batch Uses

type Batch interface {
    ReadWriter
    // Commit atomically applies any batched updates to the underlying
    // engine. This is a noop unless the batch was created via NewBatch(). If
    // sync is true, the batch is synchronously committed to disk.
    Commit(sync bool) error
    // Distinct returns a view of the existing batch which only sees writes that
    // were performed before the Distinct batch was created. That is, the
    // returned batch will not read its own writes, but it will read writes to
    // the parent batch performed before the call to Distinct(), except if the
    // parent batch is a WriteOnlyBatch, in which case the Distinct() batch will
    // read from the underlying engine.
    //
    // The returned
    // batch needs to be closed before using the parent batch again. This is used
    // as an optimization to avoid flushing mutations buffered by the batch in
    // situations where we know all of the batched operations are for distinct
    // keys.
    //
    // TODO(tbg): it seems insane that you cannot read from a WriteOnlyBatch but
    // you can read from a Distinct on top of a WriteOnlyBatch but randomly don't
    // see the batch at all. I was personally just bitten by this.
    Distinct() ReadWriter
    // Empty returns whether the batch has been written to or not.
    Empty() bool
    // Len returns the size of the underlying representation of the batch.
    // Because of the batch header, the size of the batch is never 0 and should
    // not be used interchangeably with Empty. The method avoids the memory copy
    // that Repr imposes, but it still may require flushing the batch's mutations.
    Len() int
    // Repr returns the underlying representation of the batch and can be used to
    // reconstitute the batch on a remote node using Writer.ApplyBatchRepr().
    Repr() []byte
}

Batch is the interface for batch specific operations.
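
A minimal sketch of applying several writes atomically through a Batch; keys, values, timestamps and the sync choice are illustrative:

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func atomicWrites(ctx context.Context, eng engine.Engine) error {
    batch := eng.NewBatch()
    defer batch.Close()

    ms := &enginepb.MVCCStats{}
    for _, k := range []string{"a", "b", "c"} {
        if err := engine.MVCCPut(ctx, batch, ms, roachpb.Key(k),
            hlc.Timestamp{WallTime: 1}, roachpb.MakeValueFromString("value"), nil /* txn */); err != nil {
            return err
        }
    }
    // All three puts become visible atomically; sync=true also waits for the
    // commit to be durable on disk.
    return batch.Commit(true /* sync */)
}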

type BatchType Uses

type BatchType byte

BatchType represents the type of an entry in an encoded RocksDB batch.

const (
    BatchTypeDeletion BatchType = 0x0
    BatchTypeValue    BatchType = 0x1
    BatchTypeMerge    BatchType = 0x2
    BatchTypeLogData  BatchType = 0x3
    // BatchTypeColumnFamilyDeletion       BatchType = 0x4
    // BatchTypeColumnFamilyValue          BatchType = 0x5
    // BatchTypeColumnFamilyMerge          BatchType = 0x6
    BatchTypeSingleDeletion BatchType = 0x7
    // BatchTypeColumnFamilySingleDeletion BatchType = 0x8
    // BatchTypeBeginPrepareXID            BatchType = 0x9
    // BatchTypeEndPrepareXID              BatchType = 0xA
    // BatchTypeCommitXID                  BatchType = 0xB
    // BatchTypeRollbackXID                BatchType = 0xC
    // BatchTypeNoop                       BatchType = 0xD
    // BatchTypeColumnFamilyRangeDeletion  BatchType = 0xE
    BatchTypeRangeDeletion BatchType = 0xF
)

These constants come from rocksdb/db/dbformat.h.

type CPutMissingBehavior Uses

type CPutMissingBehavior bool

CPutMissingBehavior describes the handling of a non-existing expected value.

const (
    // CPutAllowIfMissing is used to indicate a CPut can also succeed when the
    // expected entry does not exist.
    CPutAllowIfMissing CPutMissingBehavior = true
    // CPutFailIfMissing is used to indicate the existing value must match the
    // expected value exactly i.e. if a value is expected, it must exist.
    CPutFailIfMissing CPutMissingBehavior = false
)

type DBFile Uses

type DBFile interface {
    // Append appends data to this DBFile.
    Append(data []byte) error
    // Close closes this DBFile.
    Close() error
    // Sync synchronously flushes this DBFile's data to disk.
    Sync() error
}

DBFile is an interface for interacting with DBWritableFile in RocksDB.

type EncryptionRegistries Uses

type EncryptionRegistries struct {
    // FileRegistry is the list of files with encryption status.
    // serialized storage/engine/enginepb/file_registry.proto::FileRegistry
    FileRegistry []byte
    // KeyRegistry is the list of keys, scrubbed of actual key data.
    // serialized ccl/storageccl/engineccl/enginepbccl/key_registry.proto::DataKeysRegistry
    KeyRegistry []byte
}

EncryptionRegistries contains the encryption-related registries; both are serialized protobufs.

type Engine Uses

type Engine interface {
    ReadWriter
    // Attrs returns the engine/store attributes.
    Attrs() roachpb.Attributes
    // Capacity returns capacity details for the engine's available storage.
    Capacity() (roachpb.StoreCapacity, error)
    // Flush causes the engine to write all in-memory data to disk
    // immediately.
    Flush() error
    // GetStats retrieves stats from the engine.
    GetStats() (*Stats, error)
    // GetEnvStats retrieves stats about the engine's environment
    // For RocksDB, this includes details of at-rest encryption.
    GetEnvStats() (*EnvStats, error)
    // GetAuxiliaryDir returns a path under which files can be stored
    // persistently, and from which data can be ingested by the engine.
    //
    // Not thread safe.
    GetAuxiliaryDir() string
    // NewBatch returns a new instance of a batched engine which wraps
    // this engine. Batched engines accumulate all mutations and apply
    // them atomically on a call to Commit().
    NewBatch() Batch
    // NewReadOnly returns a new instance of a ReadWriter that wraps
    // this engine. This wrapper panics when unexpected operations (e.g., write
    // operations) are executed on it and caches iterators to avoid the overhead
    // of creating multiple iterators for batched reads.
    NewReadOnly() ReadWriter
    // NewWriteOnlyBatch returns a new instance of a batched engine which wraps
    // this engine. A write-only batch accumulates all mutations and applies them
    // atomically on a call to Commit(). Read operations return an error.
    //
    // TODO(peter): This should return a WriteBatch interface, but there are mild
    // complications in both defining that interface and implementing it. In
    // particular, Batch.Close would no longer come from Reader and we'd need to
    // refactor a bunch of code in rocksDBBatch.
    NewWriteOnlyBatch() Batch
    // NewSnapshot returns a new instance of a read-only snapshot
    // engine. Snapshots are instantaneous and, as long as they're
    // released relatively quickly, inexpensive. Snapshots are released
    // by invoking Close(). Note that snapshots must not be used after the
    // original engine has been stopped.
    NewSnapshot() Reader
    // IngestExternalFiles atomically links a slice of files into the RocksDB
    // log-structured merge-tree. skipWritingSeqNo = true may be passed iff this
    // rocksdb will never be read by versions prior to 5.16. Otherwise, if it is
    // false, ingestion may modify the files (including the underlying file in the
    // case of hard-links) when allowFileModifications is true. See additional
    // comments in db.cc's IngestExternalFile explaining modification behavior.
    IngestExternalFiles(ctx context.Context, paths []string, skipWritingSeqNo, allowFileModifications bool) error
    // PreIngestDelay offers an engine the chance to backpressure ingestions.
    // When called, it may choose to block if the engine determines that it is in
    // or approaching a state where further ingestions may risk its health.
    PreIngestDelay(ctx context.Context)
    // ApproximateDiskBytes returns an approximation of the on-disk size for the given key span.
    ApproximateDiskBytes(from, to roachpb.Key) (uint64, error)
    // CompactRange ensures that the specified range of key value pairs is
    // optimized for space efficiency. The forceBottommost parameter ensures
    // that the key range is compacted all the way to the bottommost level of
    // SSTables, which is necessary to pick up changes to bloom filters.
    CompactRange(start, end roachpb.Key, forceBottommost bool) error
    // OpenFile opens a DBFile with the given filename.
    OpenFile(filename string) (DBFile, error)
    // ReadFile reads the content from the file with the given filename in this RocksDB's env.
    ReadFile(filename string) ([]byte, error)
    // DeleteFile deletes the file with the given filename from this RocksDB's env.
    // If the file with given filename doesn't exist, return os.ErrNotExist.
    DeleteFile(filename string) error
    // DeleteDirAndFiles deletes the directory and any files it contains but
    // not subdirectories from this RocksDB's env. If dir does not exist,
    // DeleteDirAndFiles returns nil (no error).
    DeleteDirAndFiles(dir string) error
    // LinkFile creates 'newname' as a hard link to 'oldname'. This is done using
    // the engine implementation. For RocksDB, this means using the Env responsible for the file
    // which may handle extra logic (eg: copy encryption settings for EncryptedEnv).
    LinkFile(oldname, newname string) error
    // CreateCheckpoint creates a checkpoint of the engine in the given directory,
    // which must not exist. The directory should be on the same file system so
    // that hard links can be used.
    CreateCheckpoint(dir string) error
}

Engine is the interface that wraps the core operations of a key/value store.

type EnvStats Uses

type EnvStats struct {
    // TotalFiles is the total number of files reported by rocksdb.
    TotalFiles uint64
    // TotalBytes is the total size of files reported by rocksdb.
    TotalBytes uint64
    // ActiveKeyFiles is the number of files using the active data key.
    ActiveKeyFiles uint64
    // ActiveKeyBytes is the size of files using the active data key.
    ActiveKeyBytes uint64
    // EncryptionType is an enum describing the active encryption algorithm.
    // See: ccl/storageccl/engineccl/enginepbccl/key_registry.proto
    EncryptionType int32
    // EncryptionStatus is a serialized enginepbccl/stats.proto::EncryptionStatus protobuf.
    EncryptionStatus []byte
}

EnvStats is a set of RocksDB env stats, including encryption status.

type GarbageCollector Uses

type GarbageCollector struct {
    Threshold hlc.Timestamp
    // contains filtered or unexported fields
}

GarbageCollector GCs MVCC key/values using a zone-specific GC policy that allows either the union or intersection of a maximum number of versions and a maximum age.

func MakeGarbageCollector Uses

func MakeGarbageCollector(now hlc.Timestamp, policy config.GCPolicy) GarbageCollector

MakeGarbageCollector allocates and returns a new GC, with expiration computed based on current time and policy.TTLSeconds.

func (GarbageCollector) Filter Uses

func (gc GarbageCollector) Filter(keys []MVCCKey, values [][]byte) (int, hlc.Timestamp)

Filter makes decisions about garbage collection based on the garbage collection policy for batches of values for the same key. Returns the index of the first key to be GC'd and the timestamp including, and after which, all values should be garbage collected. If no values should be GC'd, returns -1 for the index and the zero timestamp. Keys must be in descending time order. Values deleted at or before the returned timestamp can be deleted without invalidating any reads in the time interval (gc.expiration, \infinity).

The GC keeps all values (including deletes) above the expiration time, plus the first value before or at the expiration time. This allows reads to be guaranteed as described above. However if this were the only rule, then if the most recent write was a delete, it would never be removed. Thus, when a deleted value is the most recent before expiration, it can be deleted. This would still allow for the tombstone bugs in #6227, so in the future we will add checks that disallow writes before the last GC expiration time.
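
A hedged sketch of how a GarbageCollector might be driven for the versions of one key (the policy is illustrative, and the keys slice must already be in descending timestamp order as described above):

import (
    "github.com/cockroachdb/cockroach/pkg/config"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func gcCandidates(
    now hlc.Timestamp, keys []engine.MVCCKey, values [][]byte,
) (int, hlc.Timestamp) {
    // Keep roughly 25 hours of history; TTLSeconds drives the expiration
    // computed by MakeGarbageCollector.
    gc := engine.MakeGarbageCollector(now, config.GCPolicy{TTLSeconds: 25 * 60 * 60})
    // Returns the index of the first GC-able version (or -1) and the timestamp
    // at or before which versions may be removed.
    return gc.Filter(keys, values)
}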

type InMem Uses

type InMem struct {
    *RocksDB
}

InMem wraps RocksDB and configures it for in-memory only storage.

func NewInMem Uses

func NewInMem(attrs roachpb.Attributes, cacheSize int64) InMem

NewInMem allocates and returns a new, opened InMem engine. The caller must call the engine's Close method when the engine is no longer needed.

FIXME(tschottdorf): make the signature similar to NewRocksDB (require a cfg).

type IterAndBuf Uses

type IterAndBuf struct {
    // contains filtered or unexported fields
}

IterAndBuf is used to pass iterators and buffers between MVCC* calls, allowing reuse without the callers needing to know the particulars.

func GetBufUsingIter Uses

func GetBufUsingIter(iter Iterator) IterAndBuf

GetBufUsingIter returns an IterAndBuf using the supplied iterator.

func GetIterAndBuf Uses

func GetIterAndBuf(engine Reader, opts IterOptions) IterAndBuf

GetIterAndBuf returns an IterAndBuf for passing into various MVCC* methods.

func (IterAndBuf) Cleanup Uses

func (b IterAndBuf) Cleanup()

Cleanup must be called to release the resources when done.

type IterOptions Uses

type IterOptions struct {
    // If Prefix is true, Seek will use the user-key prefix of
    // the supplied MVCC key to restrict which sstables are searched,
    // but iteration (using Next) over keys without the same user-key
    // prefix will not work correctly (keys may be skipped).
    Prefix bool
    // LowerBound gives this iterator an inclusive lower bound. Attempts to
    // SeekReverse or Prev to a key that is strictly less than the bound will
    // invalidate the iterator.
    LowerBound roachpb.Key
    // UpperBound gives this iterator an exclusive upper bound. Attempts to Seek
    // or Next to a key that is greater than or equal to the bound will invalidate
    // the iterator. UpperBound must be provided unless Prefix is true, in which
    // case the end of the prefix will be used as the upper bound.
    UpperBound roachpb.Key
    // If WithStats is true, the iterator accumulates RocksDB performance
    // counters over its lifetime which can be queried via `Stats()`.
    WithStats bool
    // MinTimestampHint and MaxTimestampHint, if set, indicate that keys outside
    // of the time range formed by [MinTimestampHint, MaxTimestampHint] do not
    // need to be presented by the iterator. The underlying iterator may be able
    // to efficiently skip over keys outside of the hinted time range, e.g., when
    // an SST indicates that it contains no keys within the time range.
    //
    // Note that time bound hints are strictly a performance optimization, and
    // iterators with time bounds hints will frequently return keys outside of the
    // [start, end] time range. If you must guarantee that you never see a key
    // outside of the time bounds, perform your own filtering.
    MinTimestampHint, MaxTimestampHint hlc.Timestamp
}

IterOptions contains options used to create an Iterator.

For performance, every Iterator must specify either Prefix or UpperBound.

type Iterator Uses

type Iterator interface {
    SimpleIterator

    // SeekReverse advances the iterator to the first key in the engine which
    // is <= the provided key.
    SeekReverse(key MVCCKey)
    // Prev moves the iterator backward to the previous key/value
    // in the iteration. After this call, Valid() will be true if the
    // iterator was not positioned at the first key.
    Prev()
    // PrevKey moves the iterator backward to the previous MVCC key. This
    // operation is distinct from Prev which moves the iterator backward to the
    // prev version of the current key or the prev key if the iterator is
    // currently located at the first version for a key.
    PrevKey()
    // Key returns the current key.
    Key() MVCCKey
    // Value returns the current value as a byte slice.
    Value() []byte
    // ValueProto unmarshals the value the iterator is currently
    // pointing to using a protobuf decoder.
    ValueProto(msg protoutil.Message) error
    // ComputeStats scans the underlying engine from start to end keys and
    // computes stats counters based on the values. This method is used after a
    // range is split to recompute stats for each subrange. The start key is
    // always adjusted to avoid counting local keys in the event stats are being
    // recomputed for the first range (i.e. the one with start key == KeyMin).
    // The nowNanos arg specifies the wall time in nanoseconds since the
    // epoch and is used to compute the total age of all intents.
    ComputeStats(start, end MVCCKey, nowNanos int64) (enginepb.MVCCStats, error)
    // FindSplitKey finds a key from the given span such that the left side of
    // the split is roughly targetSize bytes. The returned key will never be
    // chosen from the key ranges listed in keys.NoSplitSpans and will always
    // sort equal to or after minSplitKey.
    FindSplitKey(start, end, minSplitKey MVCCKey, targetSize int64) (MVCCKey, error)
    // MVCCGet is the internal implementation of the family of package-level
    // MVCCGet functions.
    //
    // There is little reason to use this function directly. Use the package-level
    // MVCCGet, or one of its variants, instead.
    MVCCGet(
        key roachpb.Key, timestamp hlc.Timestamp, opts MVCCGetOptions,
    ) (*roachpb.Value, *roachpb.Intent, error)
    // MVCCScan is the internal implementation of the family of package-level
    // MVCCScan functions. The notable difference is that key/value pairs are
    // returned raw, as a buffer of varint-prefixed slices, alternating from key
    // to value, where numKVs specifies the number of pairs in the buffer.
    //
    // There is little reason to use this function directly. Use the package-level
    // MVCCScan, or one of its variants, instead.
    MVCCScan(
        start, end roachpb.Key, max int64, timestamp hlc.Timestamp, opts MVCCScanOptions,
    ) (kvData []byte, numKVs int64, resumeSpan *roachpb.Span, intents []roachpb.Intent, err error)
    // SetUpperBound installs a new upper bound for this iterator.
    SetUpperBound(roachpb.Key)

    Stats() IteratorStats
}

Iterator is an interface for iterating over key/value pairs in an engine. Iterator implementations are thread safe unless otherwise noted.

type IteratorStats Uses

type IteratorStats struct {
    InternalDeleteSkippedCount int
    TimeBoundNumSSTs           int
}

IteratorStats is returned from (Iterator).Stats.

type MVCCGetOptions Uses

type MVCCGetOptions struct {
    // See the documentation for MVCCGet for information on these parameters.
    Inconsistent bool
    Tombstones   bool
    // TODO(nvanbenschoten): Remove all references to IgnoreSequence in 20.1.
    IgnoreSequence bool
    Txn            *roachpb.Transaction
}

MVCCGetOptions bundles options for the MVCCGet family of functions.

type MVCCKey Uses

type MVCCKey struct {
    Key       roachpb.Key
    Timestamp hlc.Timestamp
}

MVCCKey is a versioned key, distinguished from roachpb.Key with the addition of a timestamp.

func DecodeMVCCKey Uses

func DecodeMVCCKey(encodedKey []byte) (MVCCKey, error)

DecodeMVCCKey decodes an engine.MVCCKey from its serialized representation. This decoding must match engine/db.cc:DecodeKey().

func MVCCScanDecodeKeyValue Uses

func MVCCScanDecodeKeyValue(repr []byte) (key MVCCKey, value []byte, orepr []byte, err error)

MVCCScanDecodeKeyValue decodes a key/value pair returned in an MVCCScan "batch" (this is not the RocksDB batch repr format), returning both the key/value and the suffix of data remaining in the batch.

func MakeMVCCMetadataKey Uses

func MakeMVCCMetadataKey(key roachpb.Key) MVCCKey

MakeMVCCMetadataKey creates an MVCCKey from a roachpb.Key.

func (MVCCKey) EncodedSize Uses

func (k MVCCKey) EncodedSize() int

EncodedSize returns the size of the MVCCKey when encoded.

func (MVCCKey) Equal Uses

func (k MVCCKey) Equal(l MVCCKey) bool

Equal returns whether two keys are identical.

func (MVCCKey) Format Uses

func (k MVCCKey) Format(f fmt.State, c rune)

Format implements the fmt.Formatter interface.

func (MVCCKey) IsValue Uses

func (k MVCCKey) IsValue() bool

IsValue returns true iff the timestamp is non-zero.

func (MVCCKey) Len Uses

func (k MVCCKey) Len() int

Len returns the size of the MVCCKey when encoded. Implements the pebble.Encodeable interface.

TODO(itsbilal): Reconcile this with EncodedSize. Would require updating MVCC stats tests to reflect the more accurate lengths provided by this function.

func (MVCCKey) Less Uses

func (k MVCCKey) Less(l MVCCKey) bool

Less compares two keys.

func (MVCCKey) Next Uses

func (k MVCCKey) Next() MVCCKey

Next returns the next key.

func (MVCCKey) String Uses

func (k MVCCKey) String() string

String returns a string-formatted version of the key.

type MVCCKeyValue Uses

type MVCCKeyValue struct {
    Key   MVCCKey
    Value []byte
}

MVCCKeyValue contains the raw bytes of the value for a key.

func Scan Uses

func Scan(engine Reader, start, end MVCCKey, max int64) ([]MVCCKeyValue, error)

Scan returns up to max key/value objects starting from start (inclusive) and ending at end (non-inclusive). Specify max=0 for unbounded scans.

type MVCCLogicalOpDetails Uses

type MVCCLogicalOpDetails struct {
    Txn       enginepb.TxnMeta
    Key       roachpb.Key
    Timestamp hlc.Timestamp

    // Safe indicates that the values in this struct will never be invalidated
    // at a later point. If the details object cannot promise that its values
    // will never be invalidated, an OpLoggerBatch will make a copy of all
    // references before adding it to the log. TestMVCCOpLogWriter fails without
    // this.
    Safe bool
}

MVCCLogicalOpDetails contains details about the occurrence of an MVCC logical operation.

type MVCCLogicalOpType Uses

type MVCCLogicalOpType int

MVCCLogicalOpType is an enum with values corresponding to each of the enginepb.MVCCLogicalOp variants.

LogLogicalOp takes an MVCCLogicalOpType and a corresponding MVCCLogicalOpDetails instead of an enginepb.MVCCLogicalOp variant for two reasons. First, it serves as a form of abstraction so that callers of the method don't need to construct protos themselves. More importantly, it also avoids allocations in the common case where Writer.LogLogicalOp is a no-op. This makes LogLogicalOp essentially free for cases where logical op logging is disabled.

const (
    // MVCCWriteValueOpType corresponds to the MVCCWriteValueOp variant.
    MVCCWriteValueOpType MVCCLogicalOpType = iota
    // MVCCWriteIntentOpType corresponds to the MVCCWriteIntentOp variant.
    MVCCWriteIntentOpType
    // MVCCUpdateIntentOpType corresponds to the MVCCUpdateIntentOp variant.
    MVCCUpdateIntentOpType
    // MVCCCommitIntentOpType corresponds to the MVCCCommitIntentOp variant.
    MVCCCommitIntentOpType
    // MVCCAbortIntentOpType corresponds to the MVCCAbortIntentOp variant.
    MVCCAbortIntentOpType
)

type MVCCScanOptions Uses

type MVCCScanOptions struct {

    // See the documentation for MVCCScan for information on these parameters.
    Inconsistent bool
    Tombstones   bool
    // TODO(nvanbenschoten): Remove all references to IgnoreSequence in 20.1.
    IgnoreSequence bool
    Reverse        bool
    Txn            *roachpb.Transaction
}

MVCCScanOptions bundles options for the MVCCScan family of functions.

type OpLoggerBatch Uses

type OpLoggerBatch struct {
    Batch
    // contains filtered or unexported fields
}

OpLoggerBatch records a log of logical MVCC operations.

func NewOpLoggerBatch Uses

func NewOpLoggerBatch(b Batch) *OpLoggerBatch

NewOpLoggerBatch creates a new batch that logs logical mvcc operations and wraps the provided batch.

func (*OpLoggerBatch) Distinct Uses

func (ol *OpLoggerBatch) Distinct() ReadWriter

Distinct implements the Batch interface.

func (*OpLoggerBatch) LogLogicalOp Uses

func (ol *OpLoggerBatch) LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails)

LogLogicalOp implements the Writer interface.

func (*OpLoggerBatch) LogicalOps Uses

func (ol *OpLoggerBatch) LogicalOps() []enginepb.MVCCLogicalOp

LogicalOps returns the list of all logical MVCC operations that have been recorded by the logger.
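
A minimal sketch of wrapping a batch so the logical MVCC operations produced by its writes can be inspected after commit; key, value and timestamp are illustrative:

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/roachpb"
    "github.com/cockroachdb/cockroach/pkg/storage/engine"
    "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
    "github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func loggedWrite(ctx context.Context, eng engine.Engine) ([]enginepb.MVCCLogicalOp, error) {
    ol := engine.NewOpLoggerBatch(eng.NewBatch())
    defer ol.Close()

    ms := &enginepb.MVCCStats{}
    if err := engine.MVCCPut(ctx, ol, ms, roachpb.Key("a"),
        hlc.Timestamp{WallTime: 1}, roachpb.MakeValueFromString("v"), nil /* txn */); err != nil {
        return nil, err
    }
    if err := ol.Commit(false /* sync */); err != nil {
        return nil, err
    }
    // LogicalOps returns the operations recorded while the batch was written.
    return ol.LogicalOps(), nil
}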

type ReadWriter Uses

type ReadWriter interface {
    Reader
    Writer
}

ReadWriter is the read/write interface to an engine's data.

type Reader Uses

type Reader interface {
    // Close closes the reader, freeing up any outstanding resources. Note that
    // various implementations have slightly different behaviors. In particular,
    // Distinct() batches release their parent batch for future use while
    // Engines, Snapshots and Batches free the associated C++ resources.
    Close()
    // Closed returns true if the reader has been closed or is not usable.
    // Objects backed by this reader (e.g. Iterators) can check this to ensure
    // that they are not using a closed engine. Intended for use within package
    // engine; exported to enable wrappers to exist in other packages.
    Closed() bool
    // Get returns the value for the given key, or nil if the key is not present.
    //
    // Deprecated: use MVCCGet instead.
    Get(key MVCCKey) ([]byte, error)
    // GetProto fetches the value at the specified key and unmarshals it
    // using a protobuf decoder. Returns true on success or false if the
    // key was not found. On success, returns the length in bytes of the
    // key and the value.
    //
    // Deprecated: use Iterator.ValueProto instead.
    GetProto(key MVCCKey, msg protoutil.Message) (ok bool, keyBytes, valBytes int64, err error)
    // Iterate scans from the start key to the end key (exclusive), invoking the
    // function f on each key value pair. If f returns an error or if the scan
    // itself encounters an error, the iteration will stop and return the error.
    // If the first result of f is true, the iteration stops and returns a nil
    // error.
    Iterate(start, end MVCCKey, f func(MVCCKeyValue) (stop bool, err error)) error
    // NewIterator returns a new instance of an Iterator over this
    // engine. The caller must invoke Iterator.Close() when finished
    // with the iterator to free resources.
    NewIterator(opts IterOptions) Iterator
}

Reader is the read interface to an engine's data.
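
For illustration only, a minimal sketch of the Iterate contract described above (imports elided); the bounds are placeholders and the ten-pair cutoff is arbitrary.

// Sketch: visit at most ten key/value pairs in ["a", "z"), stopping early by
// returning stop=true from the callback.
func iterateExample(r Reader) error {
    n := 0
    return r.Iterate(
        MVCCKey{Key: roachpb.Key("a")},
        MVCCKey{Key: roachpb.Key("z")},
        func(kv MVCCKeyValue) (bool, error) {
            n++
            fmt.Println(kv.Key)
            return n >= 10, nil // stop == true ends iteration with a nil error
        })
}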

type RocksDB Uses

type RocksDB struct {
    // contains filtered or unexported fields
}

RocksDB is a wrapper around a RocksDB database instance.

func NewRocksDB Uses

func NewRocksDB(cfg RocksDBConfig, cache RocksDBCache) (*RocksDB, error)

NewRocksDB allocates and returns a new RocksDB object. This creates options and opens the database. If the database doesn't yet exist at the specified directory, one is initialized from scratch. The caller must call the engine's Close method when the engine is no longer needed.
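
A rough sketch of opening an engine (the directory parameter, the settings, and the 128 MiB cache size are illustrative choices, not recommendations; imports elided):

// Sketch: open a RocksDB engine backed by a 128 MiB block cache.
func openExample(dir string, st *cluster.Settings) (*RocksDB, error) {
    cache := NewRocksDBCache(128 << 20)
    // Engines attached to the cache keep it alive; release our own reference.
    defer cache.Release()
    return NewRocksDB(RocksDBConfig{
        Dir:      dir,
        Settings: st,
    }, cache)
}

As noted above, the caller must call Close on the returned engine when it is no longer needed.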

func (*RocksDB) ApplyBatchRepr Uses

func (r *RocksDB) ApplyBatchRepr(repr []byte, sync bool) error

ApplyBatchRepr atomically applies a set of batched updates. Created by calling Repr() on a batch. Using this method is equivalent to constructing and committing a batch whose Repr() equals repr.

It is safe to modify the contents of the arguments after ApplyBatchRepr returns.
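
As a hedged illustration of the Repr/ApplyBatchRepr round trip (the Batch methods used here, NewBatch, Repr, and Close, are assumed from elsewhere in this package's documentation; imports elided):

// Sketch: stage a mutation in a batch, then apply its serialized
// representation to the engine; per the note above, this is equivalent to
// committing the batch itself.
func applyReprExample(eng Engine) error {
    b := eng.NewBatch()
    defer b.Close()
    if err := b.Put(MVCCKey{Key: roachpb.Key("a")}, []byte("v")); err != nil {
        return err
    }
    return eng.ApplyBatchRepr(b.Repr(), true /* sync */)
}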

func (*RocksDB) ApproximateDiskBytes Uses

func (r *RocksDB) ApproximateDiskBytes(from, to roachpb.Key) (uint64, error)

ApproximateDiskBytes returns the approximate on-disk size of the specified key range.

func (*RocksDB) Attrs Uses

func (r *RocksDB) Attrs() roachpb.Attributes

Attrs returns the list of attributes describing this engine. This may include a specification of disk type (e.g. hdd, ssd, fio, etc.) and potentially other labels to identify important attributes of the engine.

func (*RocksDB) Capacity Uses

func (r *RocksDB) Capacity() (roachpb.StoreCapacity, error)

Capacity queries the underlying file system for disk capacity information.

func (*RocksDB) Clear Uses

func (r *RocksDB) Clear(key MVCCKey) error

Clear removes the item from the db with the given key.

It is safe to modify the contents of the arguments after Clear returns.

func (*RocksDB) ClearIterRange Uses

func (r *RocksDB) ClearIterRange(iter Iterator, start, end MVCCKey) error

ClearIterRange removes a set of entries, from start (inclusive) to end (exclusive).

It is safe to modify the contents of the arguments after ClearIterRange returns.

func (*RocksDB) ClearRange Uses

func (r *RocksDB) ClearRange(start, end MVCCKey) error

ClearRange removes a set of entries, from start (inclusive) to end (exclusive).

It is safe to modify the contents of the arguments after ClearRange returns.

func (*RocksDB) Close Uses

func (r *RocksDB) Close()

Close closes the database by deallocating the underlying handle.

func (*RocksDB) Closed Uses

func (r *RocksDB) Closed() bool

Closed returns true if the engine is closed.

func (*RocksDB) Compact Uses

func (r *RocksDB) Compact() error

Compact forces compaction over the entire database.

func (*RocksDB) CompactRange Uses

func (r *RocksDB) CompactRange(start, end roachpb.Key, forceBottommost bool) error

CompactRange forces compaction over a specified range of keys in the database.

func (*RocksDB) CreateCheckpoint Uses

func (r *RocksDB) CreateCheckpoint(dir string) error

CreateCheckpoint creates a RocksDB checkpoint in the given directory (which must not exist). The directory should be located on the same file system; otherwise, copies of all the data are made instead of hard links, which is very expensive.

func (*RocksDB) DeleteDirAndFiles Uses

func (r *RocksDB) DeleteDirAndFiles(dir string) error

DeleteDirAndFiles deletes the directory and any files it contains (but not subdirectories) from this RocksDB's env. If dir does not exist, DeleteDirAndFiles returns nil (no error).

func (*RocksDB) DeleteFile Uses

func (r *RocksDB) DeleteFile(filename string) error

DeleteFile deletes the file with the given filename from this RocksDB's env. If the file doesn't exist, os.ErrNotExist is returned.

func (*RocksDB) Flush Uses

func (r *RocksDB) Flush() error

Flush causes RocksDB to write all in-memory data to disk immediately.

func (*RocksDB) Get Uses

func (r *RocksDB) Get(key MVCCKey) ([]byte, error)

Get returns the value for the given key.

func (*RocksDB) GetAuxiliaryDir Uses

func (r *RocksDB) GetAuxiliaryDir() string

GetAuxiliaryDir returns the auxiliary storage path for this engine.

func (*RocksDB) GetCompactionStats Uses

func (r *RocksDB) GetCompactionStats() string

GetCompactionStats returns the internal RocksDB compaction stats. See https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#rocksdb-statistics.

func (*RocksDB) GetEncryptionRegistries Uses

func (r *RocksDB) GetEncryptionRegistries() (*EncryptionRegistries, error)

GetEncryptionRegistries returns the file and key registries when encryption is enabled on the store.

func (*RocksDB) GetEnvStats Uses

func (r *RocksDB) GetEnvStats() (*EnvStats, error)

GetEnvStats returns stats for the RocksDB env. This may include encryption stats.

func (*RocksDB) GetProto Uses

func (r *RocksDB) GetProto(
    key MVCCKey, msg protoutil.Message,
) (ok bool, keyBytes, valBytes int64, err error)

GetProto fetches the value at the specified key and unmarshals it.

func (*RocksDB) GetSSTables Uses

func (r *RocksDB) GetSSTables() SSTableInfos

GetSSTables retrieves metadata about this engine's live sstables.

func (*RocksDB) GetSortedWALFiles Uses

func (r *RocksDB) GetSortedWALFiles() ([]WALFileInfo, error)

GetSortedWALFiles retrieves information about all of the write-ahead log files in this engine in order from oldest to newest.

func (*RocksDB) GetStats Uses

func (r *RocksDB) GetStats() (*Stats, error)

GetStats retrieves stats from this engine's RocksDB instance and returns it in a new instance of Stats.

func (*RocksDB) GetTickersAndHistograms Uses

func (r *RocksDB) GetTickersAndHistograms() (*enginepb.TickersAndHistograms, error)

GetTickersAndHistograms retrieves maps of all RocksDB tickers and histograms. It differs from `GetStats` by getting _every_ ticker and histogram, and by not getting anything else (DB properties, for example).

func (*RocksDB) GetUserProperties Uses

func (r *RocksDB) GetUserProperties() (enginepb.SSTUserPropertiesCollection, error)

GetUserProperties fetches the user properties stored in each sstable's metadata.

func (*RocksDB) IngestExternalFiles Uses

func (r *RocksDB) IngestExternalFiles(
    ctx context.Context, paths []string, skipWritingSeqNo, allowFileModifications bool,
) error

IngestExternalFiles atomically links a slice of files into the RocksDB log-structured merge-tree.

func (*RocksDB) Iterate Uses

func (r *RocksDB) Iterate(start, end MVCCKey, f func(MVCCKeyValue) (bool, error)) error

Iterate iterates from start to end keys, invoking f on each key/value pair. See engine.Iterate for details.

func (*RocksDB) LinkFile Uses

func (r *RocksDB) LinkFile(oldname, newname string) error

LinkFile creates 'newname' as a hard link to 'oldname'. This uses the Env responsible for the file, which may handle extra logic (e.g. copying encryption settings for EncryptedEnv).

func (*RocksDB) LogData Uses

func (r *RocksDB) LogData(data []byte) error

LogData is part of the Writer interface.

It is safe to modify the contents of the arguments after LogData returns.

func (*RocksDB) LogLogicalOp Uses

func (r *RocksDB) LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails)

LogLogicalOp is part of the Writer interface.

func (*RocksDB) Merge Uses

func (r *RocksDB) Merge(key MVCCKey, value []byte) error

Merge implements the RocksDB merge operator using the function goMergeInit to initialize missing values and goMerge to merge the old and the given value into a new value, which is then stored under key. Currently 64-bit counter logic is implemented. See the documentation of goMerge and goMergeInit for details.

It is safe to modify the contents of the arguments after Merge returns.

func (*RocksDB) NewBatch Uses

func (r *RocksDB) NewBatch() Batch

NewBatch returns a new batch wrapping this rocksdb engine.

func (*RocksDB) NewIterator Uses

func (r *RocksDB) NewIterator(opts IterOptions) Iterator

NewIterator returns an iterator over this rocksdb engine.

func (*RocksDB) NewReadOnly Uses

func (r *RocksDB) NewReadOnly() ReadWriter

NewReadOnly returns a new ReadWriter wrapping this rocksdb engine.

func (*RocksDB) NewSnapshot Uses

func (r *RocksDB) NewSnapshot() Reader

NewSnapshot creates a snapshot handle from engine and returns a read-only rocksDBSnapshot engine.

func (*RocksDB) NewWriteOnlyBatch Uses

func (r *RocksDB) NewWriteOnlyBatch() Batch

NewWriteOnlyBatch returns a new write-only batch wrapping this rocksdb engine.

func (*RocksDB) OpenFile Uses

func (r *RocksDB) OpenFile(filename string) (DBFile, error)

OpenFile opens a DBFile, which is essentially a rocksdb WritableFile with the given filename, in this RocksDB's env.

func (*RocksDB) PreIngestDelay Uses

func (r *RocksDB) PreIngestDelay(ctx context.Context)

PreIngestDelay may choose to block for some duration if L0 has an excessive number of files in it or if PendingCompactionBytesEstimate is elevated. It is intended to be called before ingesting a new SST, since we'd rather backpressure the bulk operation adding SSTs than slow down the whole RocksDB instance and impact all foreground traffic by adding too many files to it. After the number of L0 files exceeds the configured limit, it gradually begins delaying more for each additional file in L0 over the limit until hitting its configured (via settings) maximum delay. If the pending compaction limit is exceeded, it waits for the maximum delay.

func (*RocksDB) Put Uses

func (r *RocksDB) Put(key MVCCKey, value []byte) error

Put sets the given key to the value provided.

It is safe to modify the contents of the arguments after Put returns.

func (*RocksDB) ReadFile Uses

func (r *RocksDB) ReadFile(filename string) ([]byte, error)

ReadFile reads the content from a file with the given filename. The file must have been opened through Engine.OpenFile. Otherwise an error will be returned.

func (*RocksDB) SingleClear Uses

func (r *RocksDB) SingleClear(key MVCCKey) error

SingleClear removes the most recent item from the db with the given key.

It is safe to modify the contents of the arguments after SingleClear returns.

func (*RocksDB) String Uses

func (r *RocksDB) String() string

String formatter.

func (*RocksDB) WriteFile Uses

func (r *RocksDB) WriteFile(filename string, data []byte) error

WriteFile writes data to a file in this RocksDB's env.

type RocksDBBatchBuilder Uses

type RocksDBBatchBuilder struct {
    // contains filtered or unexported fields
}

RocksDBBatchBuilder is used to construct the RocksDB batch representation. From the RocksDB code, the representation of a batch is:

WriteBatch::rep_ :=
   sequence: fixed64
   count: fixed32
   data: record[count]
record :=
   kTypeValue varstring varstring
   kTypeDeletion varstring
   [...] (see BatchType)
varstring :=
   len: varint32
   data: uint8[len]

The RocksDBBatchBuilder code currently only supports kTypeValue (BatchTypeValue), kTypeDeletion (BatchTypeDeletion), kTypeMerge (BatchTypeMerge), and kTypeSingleDeletion (BatchTypeSingleDeletion) operations. Before a batch is written to the RocksDB write-ahead-log, the sequence number is 0. The "fixed32" format is little endian.

The keys encoded into the batch are MVCC keys: a string key with a timestamp suffix. MVCC keys are encoded as:

<key>[<wall_time>[<logical>]]<#timestamp-bytes>

The <wall_time> and <logical> portions of the key are encoded as 64 and 32-bit big-endian integers. A custom RocksDB comparator is used to maintain the desired ordering as these keys do not sort lexicographically correctly. Note that the encoding of these keys needs to match up with the encoding in rocksdb/db.cc:EncodeKey().
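
To make the layout concrete, here is a rough, non-authoritative sketch that follows the prose above literally (imports elided; MVCCKey's field names are assumed from elsewhere in the package). The canonical encoder is rocksdb/db.cc:EncodeKey(), which may include details, such as sentinel bytes, that are not spelled out here.

// Sketch only: follow the <key>[<wall_time>[<logical>]]<#timestamp-bytes>
// layout described above. Not the canonical encoder (see rocksdb/db.cc).
func encodeMVCCKeySketch(k MVCCKey) []byte {
    buf := append([]byte(nil), k.Key...)
    tsLen := 0
    if k.Timestamp.WallTime != 0 || k.Timestamp.Logical != 0 {
        var wall [8]byte
        binary.BigEndian.PutUint64(wall[:], uint64(k.Timestamp.WallTime)) // 64-bit big-endian wall time
        buf = append(buf, wall[:]...)
        tsLen += 8
        if k.Timestamp.Logical != 0 {
            var logical [4]byte
            binary.BigEndian.PutUint32(logical[:], uint32(k.Timestamp.Logical)) // 32-bit big-endian logical
            buf = append(buf, logical[:]...)
            tsLen += 4
        }
    }
    return append(buf, byte(tsLen)) // trailing count of timestamp bytes
}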

func (*RocksDBBatchBuilder) ApplyRepr Uses

func (b *RocksDBBatchBuilder) ApplyRepr(repr []byte) error

ApplyRepr applies the mutations in repr to the current batch.

It is safe to modify the contents of the arguments after ApplyRepr returns.

func (*RocksDBBatchBuilder) Clear Uses

func (b *RocksDBBatchBuilder) Clear(key MVCCKey)

Clear removes the item from the db with the given key.

It is safe to modify the contents of the arguments after Clear returns.

func (*RocksDBBatchBuilder) Count Uses

func (b *RocksDBBatchBuilder) Count() uint32

Count returns the count of memtable-modifying operations in this batch.

func (*RocksDBBatchBuilder) Finish Uses

func (b *RocksDBBatchBuilder) Finish() []byte

Finish returns the constructed batch representation. After calling Finish, the builder may be used to construct another batch, but the returned []byte is only valid until the next builder method is called.
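
A rough sketch of the builder lifecycle (imports elided), assuming the zero value of RocksDBBatchBuilder is ready to use since no constructor is listed: build a few mutations, take the repr with Finish, and hand it to a consumer of batch representations such as ApplyBatchRepr or NewRocksDBBatchReader.

// Sketch: build a batch representation and apply it to an engine.
func buildAndApplyExample(dst Engine) error {
    var b RocksDBBatchBuilder // zero value assumed usable
    b.Put(MVCCKey{Key: roachpb.Key("a")}, []byte("v1"))
    b.Clear(MVCCKey{Key: roachpb.Key("b")})
    repr := b.Finish() // valid only until the next builder method is called
    return dst.ApplyBatchRepr(repr, false /* sync */)
}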

func (*RocksDBBatchBuilder) Len Uses

func (b *RocksDBBatchBuilder) Len() int

Len returns the number of bytes currently in the under construction repr.

func (*RocksDBBatchBuilder) LogData Uses

func (b *RocksDBBatchBuilder) LogData(data []byte)

LogData adds a blob of log data to the batch. It will be written to the WAL, but otherwise uninterpreted by RocksDB.

It is safe to modify the contents of the arguments after LogData returns.

func (*RocksDBBatchBuilder) Merge Uses

func (b *RocksDBBatchBuilder) Merge(key MVCCKey, value []byte)

Merge is a high-performance write operation used for values which are accumulated over several writes. Multiple values can be merged sequentially into a single key; a subsequent read will return a "merged" value which is computed from the original merged values.

It is safe to modify the contents of the arguments after Merge returns.

func (*RocksDBBatchBuilder) Put Uses

func (b *RocksDBBatchBuilder) Put(key MVCCKey, value []byte)

Put sets the given key to the value provided.

It is safe to modify the contents of the arguments after Put returns.

func (*RocksDBBatchBuilder) SingleClear Uses

func (b *RocksDBBatchBuilder) SingleClear(key MVCCKey)

SingleClear removes the most recent item from the db with the given key.

It is safe to modify the contents of the arguments after SingleClear returns.

type RocksDBBatchReader Uses

type RocksDBBatchReader struct {
    // contains filtered or unexported fields
}

RocksDBBatchReader is used to iterate the entries in a RocksDB batch representation.

Example:

r, err := NewRocksDBBatchReader(...)
if err != nil {
    return err
}
for r.Next() {
    switch r.BatchType() {
    case BatchTypeDeletion:
        fmt.Printf("delete(%x)", r.Key())
    case BatchTypeValue:
        fmt.Printf("put(%x,%x)", r.Key(), r.Value())
    case BatchTypeMerge:
        fmt.Printf("merge(%x,%x)", r.Key(), r.Value())
    case BatchTypeSingleDeletion:
        fmt.Printf("single_delete(%x)", r.Key())
    case BatchTypeRangeDeletion:
        fmt.Printf("delete_range(%x,%x)", r.Key(), r.Value())
    }
}
if err := r.Error(); err != nil {
    return err
}

func NewRocksDBBatchReader Uses

func NewRocksDBBatchReader(repr []byte) (*RocksDBBatchReader, error)

NewRocksDBBatchReader creates a RocksDBBatchReader from the given repr and verifies the header.

func (*RocksDBBatchReader) BatchType Uses

func (r *RocksDBBatchReader) BatchType() BatchType

BatchType returns the type of the current batch entry.

func (*RocksDBBatchReader) Count Uses

func (r *RocksDBBatchReader) Count() int

Count returns the declared number of entries in the batch.

func (*RocksDBBatchReader) Error Uses

func (r *RocksDBBatchReader) Error() error

Error returns the error, if any, which the iterator encountered.

func (*RocksDBBatchReader) Key Uses

func (r *RocksDBBatchReader) Key() []byte

Key returns the key of the current batch entry.

func (*RocksDBBatchReader) MVCCEndKey Uses

func (r *RocksDBBatchReader) MVCCEndKey() (MVCCKey, error)

MVCCEndKey returns the MVCC end key of the current batch entry.

func (*RocksDBBatchReader) MVCCKey Uses

func (r *RocksDBBatchReader) MVCCKey() (MVCCKey, error)

MVCCKey returns the MVCC key of the current batch entry.

func (*RocksDBBatchReader) Next Uses

func (r *RocksDBBatchReader) Next() bool

Next advances to the next entry in the batch, returning false when the batch is empty.

func (*RocksDBBatchReader) Value Uses

func (r *RocksDBBatchReader) Value() []byte

Value returns the value of the current batch entry. Value panics if the BatchType is BatchTypeDeletion.

type RocksDBCache Uses

type RocksDBCache struct {
    // contains filtered or unexported fields
}

RocksDBCache is a wrapper around C.DBCache

func NewRocksDBCache Uses

func NewRocksDBCache(cacheSize int64) RocksDBCache

NewRocksDBCache creates a new cache of the specified size. Note that the cache is refcounted internally and starts out with a refcount of one (i.e. Release() should be called after having used the cache).

func (RocksDBCache) Release Uses

func (c RocksDBCache) Release()

Release releases the cache. Note that the cache will continue to be used until all of the RocksDB engines it was attached to have been closed, and that RocksDB engines which use it auto-release when they close.

type RocksDBConfig Uses

type RocksDBConfig struct {
    Attrs roachpb.Attributes
    // Dir is the data directory for this store.
    Dir string
    // If true, creating the instance fails if the target directory does not hold
    // an initialized RocksDB instance.
    //
    // Makes no sense for in-memory instances.
    MustExist bool
    // ReadOnly will open the database in read only mode if set to true.
    ReadOnly bool
    // MaxSizeBytes is used for calculating free space and making rebalancing
    // decisions. Zero indicates that there is no maximum size.
    MaxSizeBytes int64
    // MaxOpenFiles controls the maximum number of file descriptors RocksDB
    // creates. If MaxOpenFiles is zero, this is set to DefaultMaxOpenFiles.
    MaxOpenFiles uint64
    // WarnLargeBatchThreshold controls if a log message is printed when a
    // WriteBatch takes longer than WarnLargeBatchThreshold. If it is set to
    // zero, no log messages are ever printed.
    WarnLargeBatchThreshold time.Duration
    // Settings instance for cluster-wide knobs.
    Settings *cluster.Settings
    // UseFileRegistry is true if the file registry is needed (eg: encryption-at-rest).
    // This may force the store version to versionFileRegistry if currently lower.
    UseFileRegistry bool
    // RocksDBOptions contains RocksDB specific options using a semicolon
    // separated key-value syntax ("key1=value1; key2=value2").
    RocksDBOptions string
    // ExtraOptions is a serialized protobuf set by Go CCL code and passed through
    // to C CCL code.
    ExtraOptions []byte
}

RocksDBConfig holds all configuration parameters and knobs used in setting up a new RocksDB instance.

type RocksDBError Uses

type RocksDBError struct {
    // contains filtered or unexported fields
}

A RocksDBError wraps an error returned from a RocksDB operation.

func (*RocksDBError) Error Uses

func (err *RocksDBError) Error() string

Error implements the error interface.

func (RocksDBError) SafeMessage Uses

func (err RocksDBError) SafeMessage() string

SafeMessage implements log.SafeMessager. RocksDB errors are not very well-structured and we additionally only pass a stringified representation from C++ to Go. The error usually takes the form "<typeStr>: [<subtypeStr>] <msg>" where `<typeStr>` is generated from an enum and <subtypeStr> is rarely used. <msg> usually contains the bulk of information and follows no particular rules.

To extract safe messages from these errors, we keep a dictionary generated from the RocksDB source code and report verbatim all words from the dictionary (masking out the rest which in particular includes paths).

The originating RocksDB error type is defined in c-deps/rocksdb/util/status.cc.

type RocksDBSstFileReader Uses

type RocksDBSstFileReader struct {
    // contains filtered or unexported fields
}

RocksDBSstFileReader allows iteration over a number of non-overlapping sstables exported by `RocksDBSstFileWriter`.

func MakeRocksDBSstFileReader Uses

func MakeRocksDBSstFileReader() RocksDBSstFileReader

MakeRocksDBSstFileReader creates a RocksDBSstFileReader backed by an in-memory RocksDB instance.

func (*RocksDBSstFileReader) Close Uses

func (fr *RocksDBSstFileReader) Close()

Close finishes the reader.

func (*RocksDBSstFileReader) IngestExternalFile Uses

func (fr *RocksDBSstFileReader) IngestExternalFile(data []byte) error

IngestExternalFile links a file with the given contents into a database. See the RocksDB documentation on `IngestExternalFile` for the various restrictions on what can be added.

func (*RocksDBSstFileReader) Iterate Uses

func (fr *RocksDBSstFileReader) Iterate(
    start, end MVCCKey, f func(MVCCKeyValue) (bool, error),
) error

Iterate iterates over the keys between start inclusive and end exclusive, invoking f() on each key/value pair.

func (*RocksDBSstFileReader) NewIterator Uses

func (fr *RocksDBSstFileReader) NewIterator(opts IterOptions) Iterator

NewIterator returns an iterator over this sst reader.

type RocksDBSstFileWriter Uses

type RocksDBSstFileWriter struct {

    // DataSize tracks the total key and value bytes added so far.
    DataSize int64
    // contains filtered or unexported fields
}

RocksDBSstFileWriter creates a file suitable for importing with RocksDBSstFileReader. It implements the Writer interface.

func MakeRocksDBSstFileWriter Uses

func MakeRocksDBSstFileWriter() (RocksDBSstFileWriter, error)

MakeRocksDBSstFileWriter creates a new RocksDBSstFileWriter with the default configuration.

func (*RocksDBSstFileWriter) ApplyBatchRepr Uses

func (fw *RocksDBSstFileWriter) ApplyBatchRepr(repr []byte, sync bool) error

ApplyBatchRepr implements the Writer interface.

func (*RocksDBSstFileWriter) Clear Uses

func (fw *RocksDBSstFileWriter) Clear(key MVCCKey) error

Clear implements the Writer interface. Note that it inserts a tombstone rather than actually removing the entry from the storage engine. An error is returned if the key is not greater than any previous key used in Put or Clear (according to the comparator configured during writer creation). Close cannot have been called.

func (*RocksDBSstFileWriter) ClearIterRange Uses

func (fw *RocksDBSstFileWriter) ClearIterRange(iter Iterator, start, end MVCCKey) error

ClearIterRange implements the Writer interface.

NOTE: This method is fairly expensive as it performs a Cgo call for every key deleted.

func (*RocksDBSstFileWriter) ClearRange Uses

func (fw *RocksDBSstFileWriter) ClearRange(start, end MVCCKey) error

ClearRange implements the Writer interface. Note that it inserts a range deletion tombstone rather than actually removing the entries from the storage engine. It can be called at any time with respect to Put and Clear.

func (*RocksDBSstFileWriter) Close Uses

func (fw *RocksDBSstFileWriter) Close()

Close finishes and frees memory and other resources. Close is idempotent.

func (*RocksDBSstFileWriter) Finish Uses

func (fw *RocksDBSstFileWriter) Finish() ([]byte, error)

Finish finalizes the writer and returns the constructed file's contents. At least one kv entry must have been added.
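
To tie the writer and reader together, a hedged round-trip sketch (keys, values, and bounds are placeholders; imports elided):

// Sketch: write two keys into an sstable, then read them back through an
// in-memory reader.
func sstRoundTripExample() error {
    fw, err := MakeRocksDBSstFileWriter()
    if err != nil {
        return err
    }
    defer fw.Close()
    // Keys must be added in increasing order.
    if err := fw.Put(MVCCKey{Key: roachpb.Key("a")}, []byte("1")); err != nil {
        return err
    }
    if err := fw.Put(MVCCKey{Key: roachpb.Key("b")}, []byte("2")); err != nil {
        return err
    }
    data, err := fw.Finish()
    if err != nil {
        return err
    }

    fr := MakeRocksDBSstFileReader()
    defer fr.Close()
    if err := fr.IngestExternalFile(data); err != nil {
        return err
    }
    return fr.Iterate(MVCCKey{Key: roachpb.Key("a")}, MVCCKey{Key: roachpb.Key("z")},
        func(kv MVCCKeyValue) (bool, error) {
            fmt.Printf("%s = %s\n", kv.Key, kv.Value)
            return false, nil
        })
}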

func (*RocksDBSstFileWriter) LogData Uses

func (fw *RocksDBSstFileWriter) LogData(data []byte) error

LogData implements the Writer interface.

func (*RocksDBSstFileWriter) LogLogicalOp Uses

func (fw *RocksDBSstFileWriter) LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails)

LogLogicalOp implements the Writer interface.

func (*RocksDBSstFileWriter) Merge Uses

func (fw *RocksDBSstFileWriter) Merge(key MVCCKey, value []byte) error

Merge implements the Writer interface.

func (*RocksDBSstFileWriter) Put Uses

func (fw *RocksDBSstFileWriter) Put(key MVCCKey, value []byte) error

Put implements the Writer interface. It puts a kv entry into the sstable being built. An error is returned if the key is not greater than any previous key used in Put or Clear (according to the comparator configured during writer creation). Close cannot have been called.

func (*RocksDBSstFileWriter) SingleClear Uses

func (fw *RocksDBSstFileWriter) SingleClear(key MVCCKey) error

SingleClear implements the Writer interface.

func (*RocksDBSstFileWriter) Truncate Uses

func (fw *RocksDBSstFileWriter) Truncate() ([]byte, error)

Truncate truncates the writer's current memory buffer and returns the contents it contained. May be called multiple times. The function may not truncate and return all keys if the underlying RocksDB blocks have not been flushed. Close cannot have been called.

type SSTableInfo Uses

type SSTableInfo struct {
    Level int
    Size  int64
    Start MVCCKey
    End   MVCCKey
}

SSTableInfo contains metadata about a single sstable. Note this mirrors the C.DBSSTable struct contents.

type SSTableInfos Uses

type SSTableInfos []SSTableInfo

SSTableInfos is a slice of SSTableInfo structures.

func (SSTableInfos) Len Uses

func (s SSTableInfos) Len() int

func (SSTableInfos) Less Uses

func (s SSTableInfos) Less(i, j int) bool

func (SSTableInfos) ReadAmplification Uses

func (s SSTableInfos) ReadAmplification() int

ReadAmplification returns RocksDB's worst case read amplification, which is the number of level-0 sstables plus the number of levels, other than level 0, with at least one sstable.

This definition comes from here: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#level-style-compaction

func (SSTableInfos) String Uses

func (s SSTableInfos) String() string

func (SSTableInfos) Swap Uses

func (s SSTableInfos) Swap(i, j int)

type SSTableInfosByLevel Uses

type SSTableInfosByLevel struct {
    // contains filtered or unexported fields
}

SSTableInfosByLevel maintains slices of SSTableInfo objects, one per level. The slice for each level contains the SSTableInfo objects for SSTables at that level, sorted by start key.

func NewSSTableInfosByLevel Uses

func NewSSTableInfosByLevel(s SSTableInfos) SSTableInfosByLevel

NewSSTableInfosByLevel returns a new SSTableInfosByLevel object based on the supplied SSTableInfos slice.

func (*SSTableInfosByLevel) MaxLevel Uses

func (s *SSTableInfosByLevel) MaxLevel() int

MaxLevel returns the maximum level for which there are SSTables.

func (*SSTableInfosByLevel) MaxLevelSpanOverlapsContiguousSSTables Uses

func (s *SSTableInfosByLevel) MaxLevelSpanOverlapsContiguousSSTables(span roachpb.Span) int

MaxLevelSpanOverlapsContiguousSSTables returns the maximum level at which the specified key span overlaps either none, one, or at most two contiguous SSTables. Level 0 is returned if no level qualifies.

This is useful when considering when to merge two compactions. In this case, the method is called with the "gap" between the two spans to be compacted. When the result is that the gap span touches at most two SSTables at a high level, it suggests that merging the two compactions is a good idea (as the up to two SSTables touched by the gap span, due to containing endpoints of the existing compactions, would be rewritten anyway).

As an example, consider the following sstables in a small database:

Level 0.

{Level: 0, Size: 20, Start: key("a"), End: key("z")},
{Level: 0, Size: 15, Start: key("a"), End: key("k")},

Level 2.

{Level: 2, Size: 200, Start: key("a"), End: key("j")},
{Level: 2, Size: 100, Start: key("k"), End: key("o")},
{Level: 2, Size: 100, Start: key("r"), End: key("t")},

Level 6.

{Level: 6, Size: 201, Start: key("a"), End: key("c")},
{Level: 6, Size: 200, Start: key("d"), End: key("f")},
{Level: 6, Size: 300, Start: key("h"), End: key("r")},
{Level: 6, Size: 405, Start: key("s"), End: key("z")},

- The span "a"-"c" overlaps only a single SSTable at the max level

(L6). That's great, so we definitely want to compact that.

- The span "s"-"t" overlaps zero SSTables at the max level (L6).

Again, great! That means we're going to compact the 3rd L2
SSTable and maybe push that directly to L6.
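
The example above can be exercised directly; the following hedged sketch rebuilds the same sstable list and queries it (the key helper is just a local convenience for this sketch, and the MVCCKey literal assumes its exported Key field; imports elided):

// Sketch: build the level-indexed view from the example above and query it.
func overlapExample() {
    key := func(s string) MVCCKey { return MVCCKey{Key: roachpb.Key(s)} } // helper for this sketch
    infos := SSTableInfos{
        {Level: 0, Size: 20, Start: key("a"), End: key("z")},
        {Level: 0, Size: 15, Start: key("a"), End: key("k")},
        {Level: 2, Size: 200, Start: key("a"), End: key("j")},
        {Level: 2, Size: 100, Start: key("k"), End: key("o")},
        {Level: 2, Size: 100, Start: key("r"), End: key("t")},
        {Level: 6, Size: 201, Start: key("a"), End: key("c")},
        {Level: 6, Size: 200, Start: key("d"), End: key("f")},
        {Level: 6, Size: 300, Start: key("h"), End: key("r")},
        {Level: 6, Size: 405, Start: key("s"), End: key("z")},
    }
    byLevel := NewSSTableInfosByLevel(infos)
    span := roachpb.Span{Key: roachpb.Key("a"), EndKey: roachpb.Key("c")}
    fmt.Println(byLevel.MaxLevelSpanOverlapsContiguousSSTables(span)) // 6, per the first bullet above
    fmt.Println(infos.ReadAmplification())                           // 2 L0 files + 2 non-empty other levels = 4
}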

type SimpleIterator Uses

type SimpleIterator interface {
    // Close frees up resources held by the iterator.
    Close()
    // Seek advances the iterator to the first key in the engine which
    // is >= the provided key.
    Seek(key MVCCKey)
    // Valid must be called after any call to Seek(), Next(), Prev(), or
    // similar methods. It returns (true, nil) if the iterator points to
    // a valid key (it is undefined to call Key(), Value(), or similar
    // methods unless Valid() has returned (true, nil)). It returns
    // (false, nil) if the iterator has moved past the end of the valid
    // range, or (false, err) if an error has occurred. Valid() will
    // never return true with a non-nil error.
    Valid() (bool, error)
    // Next advances the iterator to the next key/value in the
    // iteration. After this call, Valid() will be true if the
    // iterator was not positioned at the last key.
    Next()
    // NextKey advances the iterator to the next MVCC key. This operation is
    // distinct from Next which advances to the next version of the current key
    // or the next key if the iterator is currently located at the last version
    // for a key.
    NextKey()
    // UnsafeKey returns the same value as Key, but the memory is invalidated on
    // the next call to {Next,Prev,Seek,SeekReverse,Close}.
    UnsafeKey() MVCCKey
    // UnsafeValue returns the same value as Value, but the memory is
    // invalidated on the next call to {Next,Prev,Seek,SeekReverse,Close}.
    UnsafeValue() []byte
}

SimpleIterator is an interface for iterating over key/value pairs in an engine. SimpleIterator implementations are thread safe unless otherwise noted. SimpleIterator is a subset of the functionality offered by Iterator.

func MakeMultiIterator Uses

func MakeMultiIterator(iters []SimpleIterator) SimpleIterator

MakeMultiIterator creates an iterator that multiplexes SimpleIterators. The caller is responsible for closing the passed iterators after closing the returned multiIterator.

If two iterators have an entry with exactly the same key and timestamp, the one with a higher index in this constructor arg is preferred. The other is skipped.

func NewMemSSTIterator Uses

func NewMemSSTIterator(data []byte, verify bool) (SimpleIterator, error)

NewMemSSTIterator returns a SimpleIterator for a leveldb format sstable in memory. It's compatible with sstables output by RocksDBSstFileWriter, which means the keys are CockroachDB mvcc keys and they each have the RocksDB trailer (of seqno & value type).

func NewSSTIterator Uses

func NewSSTIterator(path string) (SimpleIterator, error)

NewSSTIterator returns a SimpleIterator for a leveldb formatted sstable on disk. It's compatible with sstables output by RocksDBSstFileWriter, which means the keys are CockroachDB mvcc keys and they each have the RocksDB trailer (of seqno & value type).
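
A hedged sketch combining the pieces above (imports elided): multiplex two in-memory sstables (for example, the output of RocksDBSstFileWriter.Finish) and walk the merged view through the SimpleIterator contract.

// Sketch: multiplex two in-memory sstables and walk the merged view.
func multiIterExample(sst1, sst2 []byte) error {
    it1, err := NewMemSSTIterator(sst1, false /* verify */)
    if err != nil {
        return err
    }
    defer it1.Close()
    it2, err := NewMemSSTIterator(sst2, false /* verify */)
    if err != nil {
        return err
    }
    defer it2.Close()

    it := MakeMultiIterator([]SimpleIterator{it1, it2}) // it2 wins exact key/timestamp ties
    defer it.Close()
    it.Seek(MVCCKey{Key: roachpb.Key("a")}) // placeholder start key
    for {
        ok, err := it.Valid()
        if err != nil || !ok {
            return err
        }
        fmt.Printf("%s = %x\n", it.UnsafeKey(), it.UnsafeValue())
        it.Next()
    }
}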

type Stats Uses

type Stats struct {
    BlockCacheHits                 int64
    BlockCacheMisses               int64
    BlockCacheUsage                int64
    BlockCachePinnedUsage          int64
    BloomFilterPrefixChecked       int64
    BloomFilterPrefixUseful        int64
    MemtableTotalSize              int64
    Flushes                        int64
    Compactions                    int64
    TableReadersMemEstimate        int64
    PendingCompactionBytesEstimate int64
    L0FileCount                    int64
}

Stats is a set of RocksDB stats. These are all described in the RocksDB documentation.

Currently, we collect stats from the following sources:

1. RocksDB's internal "tickers" (i.e. counters). They're defined in rocksdb/statistics.h
2. DBEventListener, which implements RocksDB's EventListener interface.
3. rocksdb::DB::GetProperty().

This is a good resource describing RocksDB's memory-related stats: https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB

type Version Uses

type Version struct {
    Version storageVersion
}

Version stores all the version information for all stores and is used as the format for the version file.

type WALFileInfo Uses

type WALFileInfo struct {
    LogNumber int64
    Size      int64
}

WALFileInfo contains metadata about a single write-ahead log file. Note this mirrors the C.DBWALFile struct.

type WithSSTables Uses

type WithSSTables interface {
    Engine
    // GetSSTables retrieves metadata about this engine's live sstables.
    GetSSTables() SSTableInfos
}

WithSSTables extends the Engine interface with a method to get info on all SSTables in use.

type Writer Uses

type Writer interface {
    // ApplyBatchRepr atomically applies a set of batched updates. Created by
    // calling Repr() on a batch. Using this method is equivalent to constructing
    // and committing a batch whose Repr() equals repr. If sync is true, the
    // batch is synchronously written to disk. It is an error to specify
    // sync=true if the Writer is a Batch.
    //
    // It is safe to modify the contents of the arguments after ApplyBatchRepr
    // returns.
    ApplyBatchRepr(repr []byte, sync bool) error
    // Clear removes the item from the db with the given key. Note that clear
    // actually removes entries from the storage engine, rather than inserting
    // tombstones.
    //
    // It is safe to modify the contents of the arguments after Clear returns.
    Clear(key MVCCKey) error
    // SingleClear removes the most recent write to the item from the db with
    // the given key. Whether older version of the item will come back to life
    // if not also removed with SingleClear is undefined. See the following:
    //   https://github.com/facebook/rocksdb/wiki/Single-Delete
    // for details on the SingleDelete operation that this method invokes. Note
    // that clear actually removes entries from the storage engine, rather than
    // inserting tombstones.
    //
    // It is safe to modify the contents of the arguments after SingleClear
    // returns.
    SingleClear(key MVCCKey) error
    // ClearRange removes a set of entries, from start (inclusive) to end
    // (exclusive). Similar to Clear, this method actually removes entries from
    // the storage engine.
    //
    // Note that when used on batches, subsequent reads may not reflect the result
    // of the ClearRange.
    //
    // It is safe to modify the contents of the arguments after ClearRange
    // returns.
    ClearRange(start, end MVCCKey) error
    // ClearIterRange removes a set of entries, from start (inclusive) to end
    // (exclusive). Similar to Clear and ClearRange, this method actually removes
    // entries from the storage engine. Unlike ClearRange, the entries to remove
    // are determined by iterating over iter and per-key tombstones are
    // generated.
    //
    // It is safe to modify the contents of the arguments after ClearIterRange
    // returns.
    ClearIterRange(iter Iterator, start, end MVCCKey) error
    // Merge is a high-performance write operation used for values which are
    // accumulated over several writes. Multiple values can be merged
    // sequentially into a single key; a subsequent read will return a "merged"
    // value which is computed from the original merged values.
    //
    // Merge currently provides specialized behavior for three data types:
    // integers, byte slices, and time series observations. Merged integers are
    // summed, acting as a high-performance accumulator.  Byte slices are simply
    // concatenated in the order they are merged. Time series observations
    // (stored as byte slices with a special tag on the roachpb.Value) are
    // combined with specialized logic beyond that of simple byte slices.
    //
    // The logic for merges is written in db.cc in order to be compatible with
    // RocksDB.
    //
    // It is safe to modify the contents of the arguments after Merge returns.
    Merge(key MVCCKey, value []byte) error
    // Put sets the given key to the value provided.
    //
    // It is safe to modify the contents of the arguments after Put returns.
    Put(key MVCCKey, value []byte) error
    // LogData adds the specified data to the RocksDB WAL. The data is
    // uninterpreted by RocksDB (i.e. not added to the memtable or sstables).
    //
    // It is safe to modify the contents of the arguments after LogData returns.
    LogData(data []byte) error
    // LogLogicalOp logs the specified logical mvcc operation with the provided
    // details to the writer, if it has logical op logging enabled. For most
    // Writer implementations, this is a no-op.
    LogLogicalOp(op MVCCLogicalOpType, details MVCCLogicalOpDetails)
}

Writer is the write interface to an engine's data.

Directories

Path        Synopsis
enginepb

Package engine imports 49 packages and is imported by 49 packages. Updated 2019-09-18.