dagaggregator

v0.3.0
Published: Aug 29, 2021 License: Apache-2.0, MIT Imports: 14 Imported by: 8

README

go-dagaggregator-unixfs

A stateless aggregator for organizing arbitrary IPLD dags within a UnixFS hierarchy

This library provides functions and a convention for aggregating multiple arbitrary DAGs into a single superstructure, while preserving sufficient metadata for basic navigation with pathing IPLD selectors.

Typical use case

Users who want to store relatively small DAGs (below about 8GiB) on Filecoin often find it difficult to have their deals accepted, even in the presence of Fil+. One solution is to group the root CIDs of multiple unrelated DAGs into an "aggregate structure", which is then used to make a deal with a specific miner. Naturally, the resulting structure has a new root CID, which complicates both discovery and retrieval. In addition, certain limits need to be respected, otherwise such a structure can become unwieldy in other IPLD contexts like IPFS.

This library provides basic solutions to the above problems.

Spec

The test-fixture demonstrating everything below can be found at https://dweb.link/ipfs/bafybeib62b4ukyzjcj7d2h4mbzjgg7l6qiz3ma4vb4b2bawmcauf5afvua

Grouping UnixFS structure

A "dag aggregation" UnixFS directory has the following structure:

  • The first entry is a manifest file in ND-JSON format (detailed in the next section).
  • Every CID is represented as a base32 CIDv1. Any Qm... CIDv0 is upgraded to CIDv1, as this operation is lossless.
  • For browsability and recognizability, and to stay within bitswap limits, each directory sublevel is named by combining the leading 3 characters of a CIDv1 with its 2 (first level) or 4 (second level) trailing characters, based on the calculations in the Limits section below.

This results in a directory that looks roughly like:

- bafyAggregateRootCid
  - @AggregateManifest.ndjson
  - baf...aa
    - baf...aaaa
    - baf...abaa
    …
    - baf...77aa
  - baf...ab
    - baf...aaab
    …
    - baf...77ab
  …
  - baf...77
    - baf...7777
        - bafymbzacid777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777
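
The CIDv0 to CIDv1 upgrade mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the code this package uses: `decodeBase58` and `upgradeCidV0` are hypothetical helper names, and the sketch assumes the usual CIDv0 layout (a base58btc-encoded sha2-256 multihash, implicitly dag-pb). The input is the DagCidV0 value from the sample DAG entry in the manifest section.

```go
package main

import (
	"encoding/base32"
	"errors"
	"fmt"
	"math/big"
	"strings"
)

const base58Alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// decodeBase58 decodes a base58btc string into raw bytes.
func decodeBase58(s string) ([]byte, error) {
	n := new(big.Int)
	radix := big.NewInt(58)
	for _, r := range s {
		idx := strings.IndexRune(base58Alphabet, r)
		if idx < 0 {
			return nil, errors.New("invalid base58 character")
		}
		n.Mul(n, radix)
		n.Add(n, big.NewInt(int64(idx)))
	}
	out := n.Bytes()
	// Each leading '1' in the text stands for one leading zero byte.
	for i := 0; i < len(s) && s[i] == '1'; i++ {
		out = append([]byte{0}, out...)
	}
	return out, nil
}

// upgradeCidV0 (hypothetical helper) re-wraps the multihash of a CIDv0
// as a base32 CIDv1. No hashing takes place, so nothing is lost.
func upgradeCidV0(v0 string) (string, error) {
	mh, err := decodeBase58(v0)
	if err != nil {
		return "", err
	}
	// A CIDv0 is a bare multihash: 0x12 (sha2-256), 0x20 (32 bytes), digest.
	if len(mh) != 34 || mh[0] != 0x12 || mh[1] != 0x20 {
		return "", errors.New("not a CIDv0 sha2-256 multihash")
	}
	// CIDv1 = version 0x01 + codec 0x70 (dag-pb) + the same multihash.
	raw := append([]byte{0x01, 0x70}, mh...)
	enc := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(raw)
	// The multibase prefix 'b' marks lower-case, unpadded base32.
	return "b" + strings.ToLower(enc), nil
}

func main() {
	v1, err := upgradeCidV0("QmQy6xmJhrcC5QLboAcGFcAE1tC8CrwDVkrHdEYJkLscrQ")
	if err != nil {
		panic(err)
	}
	fmt.Println(v1)
}
```

In real code a CID library is the better choice; the sketch only shows why the upgrade is a pure re-encoding.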

Manifest format

The Aggregate Manifest is an ND-JSON file consisting of 3 types of records.

Preamble

This is always the first record of the manifest. It signals how to parse the rest of the manifest.

{
  "RecordType": "DagAggregatePreamble",
  "Version": 1
}

Summary

This is the second record within the manifest. It contains the entry count and various other metadata.

{
  "RecordType": "DagAggregateSummary",
  "EntryCount": 4,
  "EntriesSortedBy": "DagCidV1",
  "Description": "Aggregate of non-related DAGs, produced by github.com/filecoin-project/go-dagaggregator-unixfs"
}

Individual DAG Entries

The rest of the manifest records contain information about the included DAGs, one per record.

{
  "RecordType": "DagAggregateEntry",
  "DagCidV1": "bafybeibhbx3y6tnn7q4gpsous6apnobft5jybvroiepdsmvps2lmycjjxu",
  "DagCidV0": "QmQy6xmJhrcC5QLboAcGFcAE1tC8CrwDVkrHdEYJkLscrQ",
  "DagSize": 42,
  "NodeCount": 1,
  "PathPrefixes": [ "baf...xu", "baf...jjxu" ],
  "PathIndexes": [ 2, 0, 0 ]
}

PathPrefixes contains the parent directories for this particular DAG, based on the chosen parts of the target CID. They can currently be derived from DagCidV1, but are included for ease of navigation.
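
As an illustration of that derivation, the sketch below rebuilds the two PathPrefixes values from a DagCidV1 string using the naming rule from the directory structure section (3 leading characters, then the 2 or 4 trailing ones). `pathPrefixes` is a hypothetical helper written for this example, not a function exported by the package.

```go
package main

import "fmt"

// pathPrefixes (hypothetical helper) derives the first- and second-level
// directory names for a base32 CIDv1 string.
func pathPrefixes(cidV1 string) ([2]string, error) {
	// 3 leading + 4 trailing characters must not overlap.
	if len(cidV1) < 7 {
		return [2]string{}, fmt.Errorf("CID %q is too short", cidV1)
	}
	head := cidV1[:3]
	return [2]string{
		head + "..." + cidV1[len(cidV1)-2:], // first level: 2 trailing characters
		head + "..." + cidV1[len(cidV1)-4:], // second level: 4 trailing characters
	}, nil
}

func main() {
	p, err := pathPrefixes("bafybeibhbx3y6tnn7q4gpsous6apnobft5jybvroiepdsmvps2lmycjjxu")
	if err != nil {
		panic(err)
	}
	fmt.Println(p[0], p[1])
}
```

For the sample entry above this yields the same two values that appear in its PathPrefixes field.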

PathIndexes contains the 0-based position of each entry within its corresponding parent directory. It is provided to make partial retrievals possible with simple pathing selectors, as described in this lotus changeset.
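
A consumer of the manifest might read it along the following lines. This is a sketch under stated assumptions: the `record` struct is a local, hypothetical type holding only a subset of the documented fields, and `sampleManifest` is an abbreviated manifest assembled from the records shown above (with EntryCount set to 1 to match its single entry). It relies on `json.Decoder` accepting a stream of consecutive JSON values, which is what ND-JSON is.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// record is a local, hypothetical catch-all for the three record types.
// Fields that a given record does not carry stay at their zero value.
type record struct {
	RecordType   string
	Version      uint32
	EntryCount   int
	DagCidV1     string
	PathPrefixes []string
	PathIndexes  []int
}

// Abbreviated sample built from the records documented above.
const sampleManifest = `{"RecordType":"DagAggregatePreamble","Version":1}
{"RecordType":"DagAggregateSummary","EntryCount":1,"EntriesSortedBy":"DagCidV1"}
{"RecordType":"DagAggregateEntry","DagCidV1":"bafybeibhbx3y6tnn7q4gpsous6apnobft5jybvroiepdsmvps2lmycjjxu","PathPrefixes":["baf...xu","baf...jjxu"],"PathIndexes":[2,0,0]}
`

// readManifest decodes one JSON value after another until EOF and checks
// that the first record is a version 1 preamble.
func readManifest(r io.Reader) ([]record, error) {
	var out []record
	dec := json.NewDecoder(r)
	for {
		var rec record
		err := dec.Decode(&rec)
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		out = append(out, rec)
	}
	if len(out) == 0 || out[0].RecordType != "DagAggregatePreamble" {
		return nil, errors.New("manifest does not start with a preamble")
	}
	if out[0].Version != 1 {
		return nil, fmt.Errorf("unsupported manifest version %d", out[0].Version)
	}
	return out, nil
}

// readManifestString is a convenience wrapper for in-memory manifests.
func readManifestString(s string) ([]record, error) {
	return readManifest(strings.NewReader(s))
}

func main() {
	recs, err := readManifestString(sampleManifest)
	if err != nil {
		panic(err)
	}
	for _, rec := range recs {
		if rec.RecordType == "DagAggregateEntry" {
			fmt.Println(rec.DagCidV1, rec.PathPrefixes, rec.PathIndexes)
		}
	}
}
```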

Limits in consideration

  • A car file accepted by the Filecoin network can currently have a maximum payload size of about 65,024MiB ( 64<<30 / 128 * 127 ). For the sake of argument, let's assume a future with 128GiB sectors, which gives an exact maximum payload size of 136,365,211,648 bytes. Assuming ~60 bytes for a car header, the shortest safe CID representation (a CIDv0 sha2-256 at 34 bytes), and a payload of 1024 bytes per block ( ridiculously small NFTs ), this gives (136365211648 - 60) / ( 2 + 34 + 1024 ) ≈ 128,646,426, i.e. an upper bound of 2^27 individual CIDs that could be in a deal and all be individually addressable.

  • The longest textual representation of a "tenuously common" CID would be a b-multibase base32 representation of a blake2b-512, which clocks in at 113 bytes. This in turn means that a typical UnixFS directory can contain N such names without sharding before going over the 1MiB libp2p limit, where 1048576 = ( 4b dir hdr ) + N * ( 2b pbframe hdr + 2b CID prefix + 70b CID + 2b name prefix + 113b text-CID + ~5 bytes for prefixed size ), i.e. N ≈ 5405. To be super-conservative, assume a target of 2^12 entries per "aggregate shard". With the more reasonable 256-bit hashes, we arrive at ~9363 names, which translates to 2^13 entries per shard.

  • The common textual representation of CIDs is base32, each character of which represents exactly 5 bits. This means that sharding on 2 base32 characters gives a rough distribution of 2^10 per shard, fitting comfortably within the above considerations.

The limits described above, combined with the perfect distribution of hashes within CIDs, mean that one can safely store a "super-sector" full of addressable CIDs by having 2 layers of directories, the first layer "sharded" by 2 base32 characters, the second layer by 4 base32 characters, and the final container having the full CIDs pointing to the content: 2^(10+10+10) > 2^(27). This does not even take into account the vast overestimation of sector and CID sizes.
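
The arithmetic in this section can be re-checked with a few lines of integer math. The sketch below only restates the figures quoted above; `limits` is a name chosen for this example, and all divisions round down.

```go
package main

import "fmt"

// limits recomputes the figures from the "Limits in consideration" section.
func limits() (maxPayload, maxBlocks, namesPerDir, treeCapacity uint64) {
	// Usable payload of a hypothetical 128GiB sector: 127 out of every 128 bytes.
	maxPayload = (128 << 30) / 128 * 127
	// ~60 byte car header, then per block: 2 byte frame + 34 byte CID + 1024 byte payload.
	maxBlocks = (maxPayload - 60) / (2 + 34 + 1024)
	// 1MiB directory block: 4 byte header, then ~194 bytes per blake2b-512 link.
	namesPerDir = ((1 << 20) - 4) / (2 + 2 + 70 + 2 + 113 + 5)
	// Two directory layers plus the final container, about 2^10 entries each.
	treeCapacity = 1 << (10 + 10 + 10)
	return
}

func main() {
	payload, blocks, names, capacity := limits()
	fmt.Println("max payload bytes:  ", payload)
	fmt.Println("max 1KiB blocks:    ", blocks)
	fmt.Println("names per directory:", names)
	fmt.Println("tree capacity:      ", capacity, "vs 2^27 =", uint64(1)<<27)
}
```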

Lead Maintainer

Peter Rabbitson

License

SPDX-License-Identifier: Apache-2.0 OR MIT

Documentation

Constants

const AggregateManifestFilename = `@AggregateManifest.ndjson`

AggregateManifestFilename must be a name go-unixfs will sort first in the final structure. Since ADL-free selectors-over-names are currently difficult, one can instead simply say "I want the cid of the first link of the root structure" and be reasonably confident they will get to this file.
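
The reason the leading `@` works can be shown with a plain string sort. The sketch assumes that directory links end up ordered by byte-wise name comparison (an assumption of this example, consistent with the sorting behaviour described above): `@` is byte 0x40, which sorts before the lower-case `b` (0x62) that starts every base32 CIDv1 shard name. `firstName` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// firstName (hypothetical helper) returns the name that comes first under
// byte-wise string ordering, or "" for an empty list.
func firstName(names []string) string {
	if len(names) == 0 {
		return ""
	}
	sorted := append([]string(nil), names...) // keep the caller's slice intact
	sort.Strings(sorted)
	return sorted[0]
}

func main() {
	names := []string{"baf...aa", "baf...77", "@AggregateManifest.ndjson", "baf...ab"}
	fmt.Println(firstName(names))
}
```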

Variables

var CurrentManifestPreamble = ManifestPreamble{
	Version:    1,
	RecordType: DagAggregatePreamble,
}

CurrentManifestPreamble is always encoded as the very first line within the AggregateManifestFilename.

Functions

func EncodeManifestJSON added in v0.2.0

func EncodeManifestJSON(aggregateManifestEntries []*ManifestDagEntry, jsonFile io.Writer) error

EncodeManifestJSON turns a set of manifest entries into the final NDJSON file included at the root of the aggregate structure.
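
To make the shape of that output concrete, here is a rough stand-alone sketch of writing the three record types as ND-JSON with `encoding/json`. It is not the package's implementation: the `preamble`, `summary` and `entry` types are local mirrors of the documented records (optional fields omitted), `writeManifest` and `manifestLines` are hypothetical helpers, and the Description value is made up for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Local mirrors of the documented manifest records.
type preamble struct {
	RecordType string
	Version    uint32
}

type summary struct {
	RecordType      string
	EntryCount      int
	EntriesSortedBy string
	Description     string
}

type entry struct {
	RecordType   string
	DagCidV1     string
	PathPrefixes [2]string
	PathIndexes  [3]int
}

// writeManifest emits the preamble, the summary, then one record per entry.
// json.Encoder terminates every value with a newline, which yields ND-JSON.
func writeManifest(w io.Writer, entries []entry) error {
	enc := json.NewEncoder(w)
	if err := enc.Encode(preamble{RecordType: "DagAggregatePreamble", Version: 1}); err != nil {
		return err
	}
	sum := summary{
		RecordType:      "DagAggregateSummary",
		EntryCount:      len(entries),
		EntriesSortedBy: "DagCidV1",
		Description:     "example aggregate", // made-up value for this sketch
	}
	if err := enc.Encode(sum); err != nil {
		return err
	}
	for _, e := range entries {
		if err := enc.Encode(e); err != nil {
			return err
		}
	}
	return nil
}

// manifestLines renders the manifest in memory and splits it into lines.
func manifestLines(entries []entry) ([]string, error) {
	var buf bytes.Buffer
	if err := writeManifest(&buf, entries); err != nil {
		return nil, err
	}
	return strings.Split(strings.TrimSuffix(buf.String(), "\n"), "\n"), nil
}

func main() {
	lines, err := manifestLines([]entry{{
		RecordType:   "DagAggregateEntry",
		DagCidV1:     "bafybeibhbx3y6tnn7q4gpsous6apnobft5jybvroiepdsmvps2lmycjjxu",
		PathPrefixes: [2]string{"baf...xu", "baf...jjxu"},
		PathIndexes:  [3]int{2, 0, 0},
	}})
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```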

Types

type AggregateDagEntry

type AggregateDagEntry struct {
	RootCid                   cid.Cid
	UniqueBlockCount          uint64 // optional amount of blocks in dag, recorded in manifest
	UniqueBlockCumulativeSize uint64 // optional dag size, used as the TSize in the unixfs link entry and recorded in manifest
}

type ManifestDagEntry

type ManifestDagEntry struct {
	RecordType   RecordType
	DagCidV1     string
	DagCidV0     string    `json:",omitempty"`
	DagSize      *uint64   `json:",omitempty"`
	NodeCount    *uint64   `json:",omitempty"`
	PathPrefixes [2]string // not repeating the DagCid as segment#3 - too long
	PathIndexes  [3]int
	// contains filtered or unexported fields
}

func Aggregate

func Aggregate(ctx context.Context, ds ipldformat.DAGService, toAggregate []AggregateDagEntry) (aggregateRoot cid.Cid, aggregateManifestEntries []*ManifestDagEntry, err error)

Aggregate de-duplicates and orders the supplied list of `AggregateDagEntry`-es and adds them into a two-level UnixFSv1 directory structure. The intermediate blocks comprising the directory tree and the manifest json file are written to the supplied DAGService. No "temporary blocks" are produced in the process: everything written to the DAGService is part of the final DAG capped by the final `aggregateRoot`.

type ManifestPreamble

type ManifestPreamble struct {
	RecordType RecordType
	Version    uint32
}

type ManifestSummary

type ManifestSummary struct {
	RecordType      RecordType
	EntryCount      int
	EntriesSortedBy string
	Description     string
}

type RecordType

type RecordType string

const (
	DagAggregatePreamble RecordType = `DagAggregatePreamble`
	DagAggregateSummary  RecordType = `DagAggregateSummary`
	DagAggregateEntry    RecordType = `DagAggregateEntry`
)

Directories

Path Synopsis
cmd
lib
rambs
Package rambs is an implementation of a blockstore, keeping records indexed by full CIDs instead of just multihashes.