bystro-vcf

command module

v0.0.0-...-a3bed25 Published: Apr 25, 2024 License: Apache-2.0 Imports: 17 Imported by: 0

Repository

github.com/bystrogenomics/bystro-vcf

README

bystro-vcf

TL;DR

Annotate VCF files at millions of variants per minute. Saturates pigz/gunzip on a 4-core CPU.

go get github.com/akotlar/bystro-vcf && go install $_;

pigz -p 1 -d -c in.vcf.gz | bystro-vcf --keepId --keepInfo | pigz -c - > output

Description

Performs several important functions:

Splits multiallelics and MNP alleles, keeping track of each allele's index with respect to the original alleles for downstream INFO property segregation
Performs QC on variants: checks that alleles contain only A, C, T, G, that padding bases match the reference, and more
Allows filtering of variants by any number of FILTER properties (by default allows PASS/. variants)
Normalizes indel representations by removing padding and left-shifting alleles to their parsimonious representations
Calculates whether a site is a transition, a transversion, or neither
Processes all available samples
- calculates homozygosity, heterozygosity, missingness
- labels samples as homozygous, heterozygous, or missing
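
The transition/transversion step in the list above can be illustrated with a minimal Go sketch. It is not taken from the bystro-vcf source: the function name classifyTrTv and the numeric encoding (0 = neither, 1 = transition, 2 = transversion) are assumptions made for illustration, chosen only to match the three values the trTv output column can take.

```go
package main

import "fmt"

// Hypothetical encoding, assumed for illustration; the README only
// states that trTv is one of 0, 1, or 2.
const (
	neither      = 0
	transition   = 1
	transversion = 2
)

func isPurine(b byte) bool     { return b == 'A' || b == 'G' }
func isPyrimidine(b byte) bool { return b == 'C' || b == 'T' }

// classifyTrTv labels a single-base substitution. Anything that is not
// a substitution between two distinct single ACTG bases is "neither".
func classifyTrTv(ref, alt string) int {
	if len(ref) != 1 || len(alt) != 1 || ref == alt {
		return neither
	}
	r, a := ref[0], alt[0]
	if !(isPurine(r) || isPyrimidine(r)) || !(isPurine(a) || isPyrimidine(a)) {
		return neither
	}
	// Same chemical class (purine<->purine or pyrimidine<->pyrimidine)
	// is a transition; crossing classes is a transversion.
	if isPurine(r) == isPurine(a) {
		return transition
	}
	return transversion
}

func main() {
	pairs := [][2]string{{"A", "G"}, {"A", "C"}, {"AT", "A"}}
	for _, p := range pairs {
		fmt.Printf("%s>%s trTv=%d\n", p[0], p[1], classifyTrTv(p[0], p[1]))
	}
}
```

Indels and multi-base alleles fall into the "neither" bucket in this sketch, which is one plausible reading of the third category.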

Publication

bystro-vcf is used to pre-process VCF files for Bystro (github)

If you use bystro-vcf please cite https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1387-3

Performance

Millions of variants/rows per minute. Performance depends on the number of samples.

Ex:

Amazon i3.2xlarge (4 core), 1K Genomes Phase 3 (2,504 samples): chromosome 1 (6.2M variants) in ~2 minutes 45s

Runs at roughly the pigz -p 1 streaming decompression limit (97% CPU, 2% sys post-Meltdown/Spectre).

==> ( time pigz -d -c -p 1 ../../../mnt/annotator/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bystro-vcf  &> /dev/null ; )

real	2m45.134s
user	16m20.512s
sys	0m25.940s

Installation

go get github.com/akotlar/bystro-vcf && go install $_;

Use

Via pipe:

pigz -d -c in.vcf.gz | bystro-vcf --keepId --keepInfo --allowFilter "PASS,." | pigz -c - > out.gz

Via the --in argument:

bystro-vcf --in in.vcf --keepId --keepInfo --allowFilter "PASS,." > out
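
For callers that want to drive the tool from a Go program rather than a shell pipeline, the pipe invocation above can be approximated as follows. This is a sketch under stated assumptions: it expects a bystro-vcf binary on PATH, uses only the flags shown above, and the helper names bystroCommand and run are hypothetical. It decompresses with the standard library's single-threaded compress/gzip instead of pigz, so it will likely be slower than the shell version.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"os/exec"
)

// bystroCommand builds the same invocation as the pipe example:
// the VCF arrives on stdin and annotated rows leave on stdout.
func bystroCommand(stdin io.Reader, stdout io.Writer) *exec.Cmd {
	cmd := exec.Command("bystro-vcf", "--keepId", "--keepInfo", "--allowFilter", "PASS,.")
	cmd.Stdin = stdin
	cmd.Stdout = stdout
	cmd.Stderr = os.Stderr // log messages default to stderr
	return cmd
}

// run streams a gzip-compressed VCF through bystro-vcf.
func run(inPath string) error {
	f, err := os.Open(inPath)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	return bystroCommand(gz, os.Stdout).Run()
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: prog in.vcf.gz")
		return
	}
	if err := run(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```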

Output

chrom <String>   pos <Int>   type <String[SNP|DEL|INS|MULTIALLELIC]>    ref <String>    alt <String>    trTv <Int[0|1|2]>     heterozygotes <String>     heterozygosity <Float64>    homozygotes <String>     homozygosity <Float64>     missingGenos <String>    missingness <Float64>    sampleMaf <Float64>    id <String?>    alleleIndex <Int?>   info <String?>

Optional arguments

--keepId <Bool>

Retain the "ID" field in the output.

--keepInfo <Bool>

Retain the "INFO" field in the output.

Since multiallelics are decomposed, an "alleleIdx" field is added to the output. It contains the 0-based index of that allele in the multiallelic.
This is necessary for downstream programs to decompose the INFO field per allele.

Adds 2 output fields, which follow missingGenos, or id if --keepId is set:

alleleIdx contains the index of the allele in a split multiallelic. 0 by default.
info contains the entire INFO string.
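
A downstream program might use the allele index to pick per-allele values out of the INFO string roughly as follows. This is a sketch, not part of bystro-vcf: the function infoForAllele is hypothetical, and the set of per-allele keys (those declared Number=A in the VCF header) is assumed to be supplied by the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// infoForAllele parses an INFO string into key/value pairs. For keys
// listed in perAllele, it keeps only the value at alleleIdx (0-based).
// Flag entries without "=" are stored with an empty value.
func infoForAllele(info string, alleleIdx int, perAllele map[string]bool) map[string]string {
	out := make(map[string]string)
	for _, entry := range strings.Split(info, ";") {
		if entry == "" {
			continue
		}
		key, val, hasVal := strings.Cut(entry, "=")
		if !hasVal {
			out[key] = ""
			continue
		}
		if perAllele[key] {
			vals := strings.Split(val, ",")
			// Leave the value untouched if the index is out of range.
			if alleleIdx >= 0 && alleleIdx < len(vals) {
				val = vals[alleleIdx]
			}
		}
		out[key] = val
	}
	return out
}

func main() {
	// Hypothetical INFO string for a site with two alternate alleles.
	info := "AC=3,5;AF=0.1,0.2;DP=100;DB"
	perAllele := map[string]bool{"AC": true, "AF": true}

	second := infoForAllele(info, 1, perAllele)
	fmt.Println(second["AC"], second["AF"], second["DP"])
}
```

Site-level keys such as DP pass through unchanged, while AC and AF are reduced to the value for the requested allele.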

--allowFilter <String>

Which FILTER values to keep. Comma separated. Defaults to "PASS,.".

If passed "" (empty string) or "*" (wildcard) will allow all FILTER values.

Similar to the -f, --apply-filters LIST option of bcftools (https://samtools.github.io/bcftools/bcftools.html)

--excludeFilter <String>

Which FILTER values to exclude. Comma separated. Defaults to ""

Opposite of the -f, --apply-filters LIST option of bcftools (https://samtools.github.io/bcftools/bcftools.html)
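
The documented behaviour of --allowFilter and --excludeFilter can be modelled with a small Go sketch. The type filterRule and its methods are hypothetical. Two things are assumed rather than documented: that exclusion takes precedence when a value appears in both lists, and that the FILTER column is compared as a single token.

```go
package main

import (
	"fmt"
	"strings"
)

// filterRule holds the parsed --allowFilter and --excludeFilter values.
type filterRule struct {
	allowAll bool
	allowed  map[string]bool
	excluded map[string]bool
}

// newFilterRule parses comma-separated lists. An allow list of "" or "*"
// allows every FILTER value, as described above.
func newFilterRule(allow, exclude string) filterRule {
	r := filterRule{allowed: map[string]bool{}, excluded: map[string]bool{}}
	if allow == "" || allow == "*" {
		r.allowAll = true
	} else {
		for _, v := range strings.Split(allow, ",") {
			r.allowed[strings.TrimSpace(v)] = true
		}
	}
	if exclude != "" {
		for _, v := range strings.Split(exclude, ",") {
			r.excluded[strings.TrimSpace(v)] = true
		}
	}
	return r
}

// keep reports whether a record with the given FILTER value is retained.
// Assumption: an excluded value is dropped even if it is also allowed.
func (r filterRule) keep(filter string) bool {
	if r.excluded[filter] {
		return false
	}
	return r.allowAll || r.allowed[filter]
}

func main() {
	defaults := newFilterRule("PASS,.", "")
	for _, f := range []string{"PASS", ".", "q10"} {
		fmt.Printf("FILTER=%s keep=%v\n", f, defaults.keep(f))
	}
}
```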

--in /path/to/uncompressedFile.vcf

Path to an uncompressed VCF input file. Defaults to stdin

--out <String>

Send the output here instead of STDOUT

--err /path/to/log.txt

Where to store log messages. Defaults to stderr

--emptyField "!"

Which value to assign to missing data. Defaults to !

--fieldDelimiter ";"

Which delimiter to use when joining multiple values. Defaults to ;
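
A consumer of the output can combine --emptyField and --fieldDelimiter to recover sample lists from a row. The sketch below assumes tab-separated columns in the order given in the Output section, with no optional fields, and the default values "!" and ";". The helper splitSamples and the example row are hypothetical; the row is made up for illustration and is not real program output.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSamples turns a joined sample field into a slice. A field equal
// to the empty-field marker means no samples and yields nil.
func splitSamples(field, emptyField, delimiter string) []string {
	if field == "" || field == emptyField {
		return nil
	}
	return strings.Split(field, delimiter)
}

func main() {
	const emptyField = "!"    // default of --emptyField
	const fieldDelimiter = ";" // default of --fieldDelimiter

	// Hypothetical row: chrom, pos, type, ref, alt, trTv, heterozygotes,
	// heterozygosity, homozygotes, homozygosity, missingGenos,
	// missingness, sampleMaf.
	row := "chr1\t100\tSNP\tA\tG\t1\tNA1;NA2\t0.5\t!\t0\t!\t0\t0.25"
	cols := strings.Split(row, "\t")

	hets := splitSamples(cols[6], emptyField, fieldDelimiter)
	homs := splitSamples(cols[8], emptyField, fieldDelimiter)
	fmt.Println("heterozygotes:", hets, "count:", len(hets))
	fmt.Println("homozygotes count:", len(homs))
}
```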
