graphsplit

package module
v0.5.1
Published: Jul 7, 2022 License: MIT Imports: 37 Imported by: 0

README

Go-graphsplit

A tool for splitting a large dataset into graph slices to make deals on the Filecoin network.

When storing a large dataset, we need to split it into smaller pieces that fit within the sector size, which is generally 32 GiB or 64 GiB.

If we pack the data into a large tarball, chunk it into small pieces, and then make storage deals with miners for those pieces, the storage side is quite efficient: it lets us store hundreds of TiB of data in a month. However, this approach makes retrieval difficult. Even to retrieve a single small file, we would first have to retrieve and download every piece of the tarball, reassemble and decompress it, and then locate the specific file we need.

Graphsplit solves this problem. It takes advantage of the IPLD protocol, follows the UnixFS data structures, and regards the dataset (or one of its sub-directories) as a big graph, then cuts it into small graphs. Each small graph keeps as much of the original file system structure as possible and is packed into a car file. If a data piece contains a complete file that we need to retrieve, we only need its payload CID to retrieve it through the lotus client, fetch it back, and get the file. In addition, Graphsplit creates a manifest.csv that records the mapping between the graph slice name, the payload CID, the piece CID, and the inner file structure.

Another advantage of Graphsplit is that it works well with IPFS. For example, if you build an IPFS website as your deal UI, the inner file structure of each data piece can be shown on it, making it easier for users to retrieve and download the data they stored.

Build

git clone https://github.com/filedrive-team/go-graphsplit.git

cd go-graphsplit

# get submodules
git submodule update --init --recursive

# build filecoin-ffi
make ffi

make

Usage

See the workflow of graphsplit below.

Splitting a dataset:

# car-dir: output folder for the split pieces, each in the form of a .car file
# slice-size: size of each piece, in bytes
# parallel: number of goroutines to run when building IPLD nodes
# graph-name: prefix used for the names of the smaller pieces
# calc-commp: whether to calculate the piece CID; defaults to false. Be careful: a lot of CPU, memory, and time will be consumed if the slice size is very large.
# parent-path: usually the same as /path/to/dataset; it is only used to work out relative paths when building the IPLD graph
./graphsplit chunk \
--car-dir=path/to/car-dir \
--slice-size=17179869184 \
--parallel=2 \
--graph-name=gs-test \
--calc-commp=false \
--parent-path=/path/to/dataset \
/path/to/dataset

Notes: A manifest.csv will be created to record the mapping between the graph slice name, the payload CID, and the slice's inner structure, as follows:

cat /path/to/car-dir/manifest.csv
payload_cid,filename,detail
Qm...,graph-slice-name.car,inner-structure-json

If --calc-commp=true is set, two additional fields will be added to manifest.csv:

cat /path/to/car-dir/manifest.csv
payload_cid,filename,piece_cid,piece_size,detail
Qm...,graph-slice-name.car,baga...,16646144,inner-structure-json
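
The same split can also be driven from Go through the library API documented below. A minimal sketch using Chunk and CSVCallback; the import path is assumed from the repository URL, the paths and size are placeholders, and CommPCallback(carDir) can be swapped in to also record piece_cid and piece_size:

package main

import (
	"context"
	"log"

	"github.com/filedrive-team/go-graphsplit"
)

func main() {
	ctx := context.Background()
	carDir := "/path/to/car-dir"
	dataset := "/path/to/dataset"

	// CSVCallback writes a manifest.csv row for every finished graph slice.
	cb := graphsplit.CSVCallback(carDir)

	// 17179869184 bytes = 16 GiB per slice; 2 goroutines build IPLD nodes.
	// parent-path and the target path are the same here, as in the CLI example.
	if err := graphsplit.Chunk(ctx, 17179869184, dataset, dataset, carDir, "gs-test", 2, cb); err != nil {
		log.Fatal(err)
	}
}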

Import a car file into IPFS:

ipfs dag import /path/to/car-dir/car-file

Restore files:

# car-path: a .car file, or a directory of .car files
# output-dir: directory where the restored files will be written
# parallel: number of goroutines to run when restoring
./graphsplit restore \
--car-path=/path/to/car-path \
--output-dir=/path/to/output-dir \
--parallel=2
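
Restoring can also be driven from Go via CarTo and Merge (see the index below). A minimal sketch; per their signatures these functions return nothing, and the role given to Merge here, re-joining files that span multiple slices, is an assumption:

package main

import "github.com/filedrive-team/go-graphsplit"

func main() {
	carPath := "/path/to/car-path"     // a .car file or a directory of .car files
	outputDir := "/path/to/output-dir" // where restored files are written

	// Unpack the car file(s) back into regular files and directories.
	graphsplit.CarTo(carPath, outputDir, 2)

	// Presumably re-joins files that were split across multiple graph slices.
	graphsplit.Merge(outputDir, 2)
}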

PieceCID Calculation for a single car file:

# Calculate the piece CID for a single car file
./graphsplit commP /path/to/carfile

Contribute

PRs are welcome!

License

MIT

Documentation

Index

Constants

const UnixfsChunkSize uint64 = 1 << 20
const UnixfsLinksPerLevel = 1 << 10
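
These defaults give 1 MiB leaf chunks and up to 1024 links per UnixFS node. As a rough sketch of how such values typically feed a go-unixfs balanced DAG builder (this mirrors common IPFS importer usage and is not necessarily graphsplit's exact internal code):

package example

import (
	"io"

	"github.com/filedrive-team/go-graphsplit"
	"github.com/ipfs/go-cid"
	chunker "github.com/ipfs/go-ipfs-chunker"
	ipld "github.com/ipfs/go-ipld-format"
	"github.com/ipfs/go-unixfs/importer/balanced"
	ihelper "github.com/ipfs/go-unixfs/importer/helpers"
)

// buildFileDag splits r into 1 MiB leaves and links them into a balanced
// UnixFS DAG with at most 1024 links per internal node.
func buildFileDag(r io.Reader, ds ipld.DAGService, cb cid.Builder) (ipld.Node, error) {
	params := ihelper.DagBuilderParams{
		Dagserv:    ds,
		Maxlinks:   graphsplit.UnixfsLinksPerLevel, // 1 << 10
		CidBuilder: cb,
	}
	db, err := params.New(chunker.NewSizeSplitter(r, int64(graphsplit.UnixfsChunkSize))) // 1 << 20
	if err != nil {
		return nil, err
	}
	return balanced.Layout(db)
}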

Variables

This section is empty.

Functions

func BuildFileNode

func BuildFileNode(item Finfo, bufDs ipld.DAGService, cidBuilder cid.Builder) (node ipld.Node, err error)

func BuildIpldGraph

func BuildIpldGraph(ctx context.Context, fileList []Finfo, graphName, parentPath, carDir string, parallel int, cb GraphBuildCallback)

func CarTo

func CarTo(carPath, outputDir string, parallel int)

func Chunk

func Chunk(ctx context.Context, sliceSize int64, parentPath, targetPath, carDir, graphName string, parallel int, cb GraphBuildCallback) error

func ExistDir

func ExistDir(path string) bool

func GenGraphName

func GenGraphName(graphName string, sliceCount, sliceTotal int) string

func GetFileList

func GetFileList(args []string) (fileList []string, err error)

func GetFileListAsync

func GetFileListAsync(args []string) chan Finfo

func GetGraphCount

func GetGraphCount(args []string, sliceSize int64) int

func Import

func Import(ctx context.Context, path string, st car.Store) (cid.Cid, error)

func Merge

func Merge(dir string, parallel int)

func NodeWriteTo

func NodeWriteTo(nd files.Node, fpath string) error

Types

type CommPRet

type CommPRet struct {
	Root cid.Cid
	Size abi.UnpaddedPieceSize
}

type FSBuilder

type FSBuilder struct {
	// contains filtered or unexported fields
}

func NewFSBuilder

func NewFSBuilder(root *dag.ProtoNode, ds ipld.DAGService) *FSBuilder

func (*FSBuilder) Build

func (b *FSBuilder) Build() (*fsNode, error)

type Finfo

type Finfo struct {
	Path      string
	Name      string
	Info      os.FileInfo
	SeekStart int64
	SeekEnd   int64
}

type GraphBuildCallback

type GraphBuildCallback interface {
	OnSuccess(node ipld.Node, graphName, fsDetail string)
	OnError(error)
}

func CSVCallback

func CSVCallback(carDir string) GraphBuildCallback

func CommPCallback

func CommPCallback(carDir string) GraphBuildCallback

func ErrCallback

func ErrCallback() GraphBuildCallback
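
CSVCallback, CommPCallback, and ErrCallback return the ready-made callbacks. A minimal sketch of a custom callback that could be passed to Chunk or BuildIpldGraph; the logging behavior here is purely illustrative:

package example

import (
	"log"

	ipld "github.com/ipfs/go-ipld-format"
)

// logCallback is a hypothetical GraphBuildCallback that only logs results.
type logCallback struct{}

func (logCallback) OnSuccess(node ipld.Node, graphName, fsDetail string) {
	log.Printf("built %s: payload CID %s, detail: %s", graphName, node.Cid(), fsDetail)
}

func (logCallback) OnError(err error) {
	log.Printf("graph build failed: %v", err)
}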

type Manifest

type Manifest struct {
	PayloadCid string `csv:"payload_cid"`
	Filename   string `csv:"filename"`
}

manifest

type PieceInfo

type PieceInfo struct {
	PayloadCid string `csv:"payload_cid"`
	Filename   string `csv:"filename"`
	PieceCid   string `csv:"piece_cid"`
	PieceSize  uint64 `csv:"piece_size"`
}

piece info
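
The csv struct tags indicate that these types map one-to-one onto rows of manifest.csv. A minimal sketch that reads a commP-enabled manifest with the standard library (field order follows the header shown in the README above):

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"strconv"
)

func main() {
	f, err := os.Open("/path/to/car-dir/manifest.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		log.Fatal(err)
	}

	// rows[0] is the header: payload_cid,filename,piece_cid,piece_size,detail
	for _, row := range rows[1:] {
		size, _ := strconv.ParseUint(row[3], 10, 64)
		fmt.Printf("%s -> payload %s, piece %s (%d bytes)\n", row[1], row[0], row[2], size)
	}
}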

Directories

Path Synopsis
cmd
