pamutil

package
v0.0.0-...-d966d87
Warning

This package is not in the latest version of its module.

Published: Aug 18, 2020 License: Apache-2.0 Imports: 15 Imported by: 1

Documentation

Constants

const DefaultVersion = "PAM2"

DefaultVersion is the string embedded in ShardIndex.version.

const ShardIndexMagic = uint64(0x725c7226be794c60)

ShardIndexMagic is the value of ShardIndex.Magic.

Variables

This section is empty.

Functions

func BlockIntersectsRange

func BlockIntersectsRange(startAddr, endAddr biopb.Coord, userRange biopb.CoordRange) bool

BlockIntersectsRange checks if userRange and [startAddr, endAddr] intersect.
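The intersection test can be illustrated with a minimal, self-contained sketch. The `Coord` and `CoordRange` types below are hypothetical stand-ins for biopb.Coord and biopb.CoordRange, and the sketch assumes that coordinates compare lexicographically and that userRange is half-open, [Start, Limit), while the block is closed. It is not the package's implementation.

```go
package main

import "fmt"

// Coord is a hypothetical stand-in for biopb.Coord.
type Coord struct{ RefID, Pos, Seq int32 }

// CoordRange is a hypothetical stand-in for biopb.CoordRange.
// This sketch assumes it is half-open: [Start, Limit).
type CoordRange struct{ Start, Limit Coord }

// less reports whether a sorts strictly before b, comparing
// RefID, then Pos, then Seq.
func less(a, b Coord) bool {
	if a.RefID != b.RefID {
		return a.RefID < b.RefID
	}
	if a.Pos != b.Pos {
		return a.Pos < b.Pos
	}
	return a.Seq < b.Seq
}

// intersects reports whether the closed block [start, end] shares at
// least one coordinate with the half-open range [r.Start, r.Limit).
func intersects(start, end Coord, r CoordRange) bool {
	if less(end, r.Start) {
		return false // the block ends before the range begins
	}
	if !less(start, r.Limit) {
		return false // the block begins at or after the exclusive limit
	}
	return true
}

func main() {
	r := CoordRange{Start: Coord{1, 100, 0}, Limit: Coord{1, 200, 0}}
	fmt.Println(intersects(Coord{1, 50, 0}, Coord{1, 100, 0}, r))  // true
	fmt.Println(intersects(Coord{1, 200, 0}, Coord{1, 300, 0}, r)) // false
	fmt.Println(intersects(Coord{0, 0, 0}, Coord{1, 99, 0}, r))    // false
}
```

The second call returns false only because the sketch treats Limit as exclusive; under a closed-range convention it would intersect.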

func CoordPathString

func CoordPathString(r biopb.Coord) string

CoordPathString generates a string that can be embedded in a pathname. Use ParsePath() to parse such a string.

func CoordRangePathString

func CoordRangePathString(r biopb.CoordRange) string

CoordRangePathString returns a string that can be used as part of a pathname.

func FieldDataPath

func FieldDataPath(dir string, recRange biopb.CoordRange, field string) string

FieldDataPath returns the path of the file storing data for the given record range and the field.

func GenerateReadShards

func GenerateReadShards(
	opts GenerateReadShardsOpts,
	indexes []ShardIndex) ([]biopb.CoordRange, error)

GenerateReadShards returns a list of biopb.CoordRanges. The biopb.CoordRanges can be passed to NewReader for parallel, sharded record reads. The returned list satisfies the following conditions.

  1. The ranges in the list fill opts.Range (or the UniversalRange if not set) exactly, without an overlap or a gap.
  2. The length of the list is at least nShards. The length may exceed nShards because this function tries to split a range at a rowshard boundary.
  3. The byte size of the file region(s) that cover each biopb.CoordRange is roughly the same.
  4. The ranges are sorted in increasing order of biopb.Coord.

opts.NumShards specifies the number of shards. It should generally be zero, in which case the function picks an appropriate default.
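The tiling and size-balancing conditions can be illustrated with a simplified greedy splitter. This is only a sketch under stated assumptions: the `block` and `shard` types are hypothetical, rows are plain integers rather than biopb.Coord values, and cuts are made only at block boundaries. The actual function may balance shards differently.

```go
package main

import "fmt"

// block is a hypothetical index entry: a half-open row interval
// [start, limit) and the number of file bytes it occupies.
type block struct {
	start, limit int
	bytes        int64
}

// shard is a half-open row interval [start, limit).
type shard struct{ start, limit int }

// splitByBytes cuts contiguous, sorted blocks into shards of roughly
// bytesPerShard bytes each. Cuts happen only at block boundaries, so
// the shards tile the input without an overlap or a gap.
func splitByBytes(blocks []block, bytesPerShard int64) []shard {
	if len(blocks) == 0 {
		return nil
	}
	var shards []shard
	cur := shard{start: blocks[0].start}
	var acc int64
	for i, b := range blocks {
		acc += b.bytes
		cur.limit = b.limit
		last := i == len(blocks)-1
		// Close the shard once it is large enough, or at the end.
		if acc >= bytesPerShard || last {
			shards = append(shards, cur)
			cur = shard{start: b.limit}
			acc = 0
		}
	}
	return shards
}

func main() {
	blocks := []block{{0, 10, 40}, {10, 20, 70}, {20, 30, 30}, {30, 40, 90}, {40, 50, 20}}
	fmt.Println(splitByBytes(blocks, 100)) // [{0 20} {20 40} {40 50}]
}
```

Each shard starts where the previous one ends, which is what lets the result be handed to independent readers without records being read twice or skipped.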

func NewShardIndex

func NewShardIndex(shardRange biopb.CoordRange, h *sam.Header) biopb.PAMShardIndex

NewShardIndex creates a new PAMShardIndex object with the given arguments.

func ReadShardIndex

func ReadShardIndex(ctx context.Context, dir string, recRange biopb.CoordRange) (index biopb.PAMShardIndex, err error)

ReadShardIndex reads the index file, "dir/<recRange>.index".

func Remove

func Remove(dir string) error

Remove deletes the files in the given PAM directory. It returns an error if any of the existing files cannot be deleted.

func ShardIndexPath

func ShardIndexPath(dir string, recRange biopb.CoordRange) string

ShardIndexPath returns the path of the shard index file.

func ValidateCoordRange

func ValidateCoordRange(r *biopb.CoordRange) error

ValidateCoordRange validates "r" and normalizes its fields, if necessary. In particular, if the range fields are all zeros, the range is replaced by UniversalRange.
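The normalization step can be sketched as follows. The types and the `universalRange` value are hypothetical placeholders (the real UniversalRange is defined elsewhere in the module and its field values are not shown here), and the ordering check is an assumed example of a validation rule rather than the package's documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Coord and CoordRange are hypothetical stand-ins for biopb.Coord and
// biopb.CoordRange.
type Coord struct{ RefID, Pos, Seq int32 }
type CoordRange struct{ Start, Limit Coord }

// universalRange is a placeholder for the package's UniversalRange.
// Its limit values are an assumption made for this sketch.
var universalRange = CoordRange{
	Limit: Coord{RefID: math.MaxInt32, Pos: math.MaxInt32},
}

// after reports whether a sorts strictly after b.
func after(a, b Coord) bool {
	if a.RefID != b.RefID {
		return a.RefID > b.RefID
	}
	if a.Pos != b.Pos {
		return a.Pos > b.Pos
	}
	return a.Seq > b.Seq
}

// validate replaces an all-zero range with universalRange in place.
// The start-after-limit check is an assumed validation rule.
func validate(r *CoordRange) error {
	if *r == (CoordRange{}) {
		*r = universalRange
		return nil
	}
	if after(r.Start, r.Limit) {
		return errors.New("range start sorts after range limit")
	}
	return nil
}

func main() {
	var r CoordRange
	err := validate(&r)
	fmt.Println(r == universalRange, err) // true <nil>

	bad := CoordRange{Start: Coord{2, 0, 0}, Limit: Coord{1, 0, 0}}
	fmt.Println(validate(&bad) != nil) // true
}
```

Because the function takes a pointer and may rewrite the range, callers should use the value after the call rather than a copy made before it.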

func WriteShardIndex

func WriteShardIndex(ctx context.Context, dir string, coordRange biopb.CoordRange, msg *biopb.PAMShardIndex) error

WriteShardIndex serializes "msg" into a single-block recordio file "dir/<coordRange>.index". Existing contents of the file are clobbered.

Types

type FileInfo

type FileInfo struct {
	// Path is the value passed to ParsePath.
	Path string

	// Type is the type of the file. For "dir/0:0,46:1653469.mapq", the type
	// is FileTypeFieldData. For "dir/0:0,46:1653469.index", the type is
	// FileTypeShardIndex.
	Type FileType

	// Field stores the field part of the filename. Field=="mapq" if the pathname
	// is "dir/0:0,46:1653469.mapq". It is meaningful iff Type ==
	// FileTypeFieldData.
	Field string

	// Dir is the directory under which the file is stored. Dir="dir" if the
	// pathname is "dir/0:0,46:1653469.mapq".
	Dir string
	// Range is the record range that the file stores. Range={Start:{0,0},
	// Limit:{46,1653469}} if the pathname is "dir/0:0,46:1653469.mapq".
	Range biopb.CoordRange
}

FileInfo is the result of parsing a pathname.

A PAM pathname looks like "dir/0:0,46:1653469.mapq" or "dir/0:0,46:1653469.index".
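To make the pathname layout concrete, here is a small standalone parser for names of the form shown above. It is a sketch, not ParsePath itself: the `coord` and `parsed` types and the `parsePAMPath` helper are hypothetical, and it assumes each coordinate is written as two decimal integers separated by a colon, as in the examples. Real PAM names may use other coordinate spellings that this sketch rejects.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
	"strings"
)

// coord and parsed are hypothetical types for this sketch.
type coord struct{ refID, pos int32 }

type parsed struct {
	dir, ext     string
	start, limit coord
}

// parseCoord parses "refid:pos", e.g. "46:1653469".
func parseCoord(s string) (coord, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 2 {
		return coord{}, fmt.Errorf("malformed coordinate %q", s)
	}
	ref, err := strconv.ParseInt(parts[0], 10, 32)
	if err != nil {
		return coord{}, err
	}
	pos, err := strconv.ParseInt(parts[1], 10, 32)
	if err != nil {
		return coord{}, err
	}
	return coord{int32(ref), int32(pos)}, nil
}

// parsePAMPath splits "dir/<start>,<limit>.<ext>" into its parts.
func parsePAMPath(p string) (parsed, error) {
	dir, base := path.Split(p)
	dot := strings.LastIndex(base, ".")
	if dot < 0 {
		return parsed{}, fmt.Errorf("no extension in %q", p)
	}
	ends := strings.Split(base[:dot], ",")
	if len(ends) != 2 {
		return parsed{}, fmt.Errorf("malformed range in %q", p)
	}
	start, err := parseCoord(ends[0])
	if err != nil {
		return parsed{}, err
	}
	limit, err := parseCoord(ends[1])
	if err != nil {
		return parsed{}, err
	}
	return parsed{dir: path.Clean(dir), ext: base[dot+1:], start: start, limit: limit}, nil
}

func main() {
	info, err := parsePAMPath("dir/0:0,46:1653469.mapq")
	if err != nil {
		panic(err)
	}
	fmt.Println(info.dir, info.ext, info.start, info.limit) // dir mapq {0 0} {46 1653469}
}
```

An extension of "index" would correspond to a shard index file, and any other extension to the data file of the field with that name.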

func ChooseIndexFilesInRange

func ChooseIndexFilesInRange(allIndexFiles []FileInfo, recRange biopb.CoordRange) ([]FileInfo, error)

ChooseIndexFilesInRange returns the subset of allIndexFiles that overlap recRange. REQUIRES: allIndexFiles[i].Type == FileTypeShardIndex for all i.

func FindIndexFilesInRange

func FindIndexFilesInRange(ctx context.Context, dir string, recRange biopb.CoordRange) ([]FileInfo, error)

FindIndexFilesInRange lists all *.index files that store a record that intersects "recRange".

func ListIndexes

func ListIndexes(ctx context.Context, dir string) ([]FileInfo, error)

ListIndexes lists shard index files found for the given PAM files. The returned list is sorted by position.

func ParsePath

func ParsePath(path string) (FileInfo, error)

ParsePath parses a PAM path into its constituent parts. For example, ParsePath("foo/0:1,3:4.index") will result in FileInfo{Path: "foo/0:1,3:4.index", Type: FileTypeShardIndex, Dir: "foo", Range: {biopb.Coord{0,1,0}, biopb.Coord{3,4,0}}}.

type FileType

type FileType int

FileType defines the type of the file, either data or index.

const (
	// FileTypeUnknown is a sentinel
	FileTypeUnknown FileType = iota
	// FileTypeShardIndex represents a *.index file
	FileTypeShardIndex
	// FileTypeFieldData represents a *.<fieldname> file
	FileTypeFieldData
)

type GenerateReadShardsOpts

type GenerateReadShardsOpts struct {
	// Range defines an optional row shard range. Only records in this range will
	// be returned by Scan() and Read(). If Range is unset, the universal range is
	// assumed. See also ReadOpts.Range.
	Range biopb.CoordRange

	// SplitMappedCoords allows GenerateReadShards to split mapped reads of
	// the same <refid, alignment position> into multiple shards. Setting
	// this flag true will cause shard size to be more even, but the caller
	// must be able to handle split reads.
	SplitMappedCoords bool
	// SplitUnmappedCoords allows GenerateReadShards to split unmapped
	// reads into multiple shards. Setting this flag true will cause shard
	// size to be more even, but the caller must be able to handle split
	// unmapped reads.
	SplitUnmappedCoords bool
	// AlwaysSplitMappedAndUnmappedCoords forces a shard boundary at the start
	// of unmapped reads. If this flag is false, a shard may contain both
	// mapped and unmapped reads.
	AlwaysSplitMappedAndUnmappedCoords bool

	// BytesPerShard is the target shard size, in bytes across all fields.  If
	// this field is set, NumShards is ignored.
	BytesPerShard int64
	// NumShards specifies the number of shards to create. This field is ignored
	// if BytesPerShard>0. If neither BytesPerShard nor NumShards is set,
	// runtime.NumCPU()*4 shards will be created.
	NumShards int
}

GenerateReadShardsOpts defines options to GenerateReadShards.
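The precedence between BytesPerShard and NumShards described in the field comments can be sketched as below. The `opts` type and `targetShards` helper are hypothetical, and the ceiling division used to turn a byte target into a shard count is an assumption; the comments only state which field wins and what the default is.

```go
package main

import (
	"fmt"
	"runtime"
)

// opts mirrors only the two sizing fields of GenerateReadShardsOpts.
type opts struct {
	BytesPerShard int64
	NumShards     int
}

// targetShards picks a shard count: BytesPerShard wins if set, then
// NumShards, then the documented default of runtime.NumCPU()*4.
func targetShards(o opts, totalBytes int64) int {
	switch {
	case o.BytesPerShard > 0:
		// Assumed conversion: round up so no shard exceeds the target.
		n := int((totalBytes + o.BytesPerShard - 1) / o.BytesPerShard)
		if n < 1 {
			n = 1
		}
		return n
	case o.NumShards > 0:
		return o.NumShards
	default:
		return runtime.NumCPU() * 4
	}
}

func main() {
	fmt.Println(targetShards(opts{BytesPerShard: 100, NumShards: 7}, 250)) // 3
	fmt.Println(targetShards(opts{NumShards: 7}, 250))                     // 7
	fmt.Println(targetShards(opts{}, 250) == runtime.NumCPU()*4)           // true
}
```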

type ShardIndex

type ShardIndex struct {
	// Range is the coordinate range that this object represents. Records and indexes from the
	// source PAM that don't intersect this range were ignored.
	Range biopb.CoordRange
	// ApproxFileBytes is an estimate of the total file size of records in Range (in the
	// underlying PAM)
	ApproxFileBytes int64
	// Blocks is a sequence of index entries from one PAM field that span Range.
	Blocks []biopb.PAMBlockIndexEntry
}

ShardIndex holds data derived from the index information of one PAM file. It is used by the sharder.

func ReadIndexes

func ReadIndexes(ctx context.Context, path string, rng biopb.CoordRange, fields []string) ([]ShardIndex, error)

ReadIndexes reads the ShardIndexes for the PAM file at path, within rng. If the PAM contains no records in rng, it returns an empty slice.
