indexcov

package
v0.2.6 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 14, 2024 License: MIT Imports: 37 Imported by: 1

README

indexcov

Quickly estimate coverage from a whole-genome bam or cram index. A bam index has 16KB resolution so that's what this gives, but it provides what appears to be a high-quality coverage estimate in seconds per genome.

The output is scaled to around 1. So a long stretch with values of 1.5 would be a heterozygous duplication. This is useful as a quick QC to get coverage values across the genome.

In our tests, we can estimate depth across 60X genomes for 30 samples in 30 seconds.

Interactive HTML plots of depth are output for each chromosome. Live examples of the interactive output are available here

Usage

goleft indexcov --directory my-project-dir/ *.bam

This will create a number of text files described in the Files section below.

In addition, it will write a few .html files containing interactive plots.

For example, if we view the $prefix-indexcov-depth-X.html file for X chromosome we can see a nice separation of samples by sex except at the PAR at the left:

X Example

That plot is taken directly from the HTML output by indexcov.

Using that separation, indexcov infers the copy-number of the sex chromosomes, outputs a stub .ped/.fam file with that information, and makes a plot like this one:

Sex Example Where here the males and females separate by the X and Y chromosomes perfectly.

In some cases, we have found XXY and XYY samples this way.

indexcov will output a coverage (ROC) plot that shows how much of the genome is coverage at at given (scaled) depth. This is output to a $prefix-depth-roc.html file and looks like: ROC Example

Here we can see that one sample has much lower coverage than the rest, and we can hover and determine the exact sample.

Finally, indexcov will output a $prefix-indexcov-bins.html file with a point per sample. Samples with high values on the y-axis have very uneven coverage (this will affect SV calling). Samples with high values on There x-axis have many missing bins (likely truncated bam files).

CRAM

CRAM indexes are supported. Since there is not a full CRAM parser available in go yet, indexcov uses only the .crai files and requires a .fasta.fai to be sent via --fai so that it knows the chromosome names around lengths. The sample names are inferred from the file names. crai resolution is often much less than 100KB (compared to) 16KB for the bam index, but it is sufficient to find large-scale differences in coverage.

Example usage with cram looks like:

goleft indexcov --extranormalize -d output/ --fai h human_g1k_v37.fasta.fai /path/to/*.crai

note that the .fai (not the fasta) is required and that the files are .crai (not cram).

The --extranormalize flag greatly improves the results on CRAM (crai) files.

How It Works

The bam index stores a linear index for each chromosome indicating the file (and virtual) offset for every 16,384 bases in that chromosome. Since we know the total number of 16,384 base intervals in the index and the size of the bam file (from the last file offset stored in the index), we know the average size (in bytes) of taken by each 16,384 base chunk. So, we iterate over each (16KB) element in the linear index, subtract the previous file offset, and scale by the expected (average) size. This gives the scaled value for each 16,384-base chunk. There are many ways that this value can be off, but, in practice, it works well as a rough estimate.

Because of this indexcov is of less-use on exome or targetted capture, but those will be very fast to run with goleft depth anyway.

Files

In addition to the interactive HTML files, indexcov outputs a number of text files:

  • $prefix-indexcov.ped: a .ped/.fam file with the inferred sex in the appropriate column if the sex chromosomes were found. the CNX and CNY columns indicating the floating-point estimate of copy-number for those chromosomes. bins.out: how many bins had a coverage value outside of (0.85, 1.15). high values can indicate high-bias samples. bins.lo: number of bins with value < 0.15. high values indicate missing data. bins.hi: number of bins with value > 1.15. bins.in: number of bins with value inside of (0.85, 1.15) p.out: bins.out/bins.in PC1...PC5: PCA projections calculated with depth of autosomes.

  • $prefix-indexcov.roc: tab-delimited columns of chrom, scaled coverage cutoff, and $n_samples columns where each indicates the proportion of 16KB blocks at or above that scaled coverage value.

  • $prefix-indexcov.bed.gz: a bed file with columns of chrom, start, end, and a column per sample where the values indicate there scaled coverage for that sample in that 16KB chunk.

Documentation

Index

Constants

View Source
const (
	// TileWidth is the length of the interval tiling used
	// in BAI and tabix indexes.
	TileWidth = 0x4000

	// StatsDummyBin is the bin number of the reference
	// statistics bin used in BAI and tabix indexes.
	StatsDummyBin = 0x924a
)

Variables

View Source
var MaxCN = float32(8)

MaxCN is the maximum normalized value.

View Source
var Ploidy = 2

Ploidy indicates the expected ploidy of the samples.

Functions

func CountsAtDepth added in v0.1.9

func CountsAtDepth(depths []float32, counts []int)

CountsAtDepth calculates the count of items in depths that are at 100 * d

func CountsROC added in v0.1.9

func CountsROC(counts []int) []float32

CountsROC returns a slice that indicates the cumulative proportion of 16KB chunks that were at least (normalized) depth given by their index.

func GetCN added in v0.1.9

func GetCN(depths [][]float32) []float64

GetCN returns an float per sample estimating the number of copies of that chromosome. It is a very crude estimate, but that's what indexcov is and it tends to work well.

func GetShortName added in v0.1.14

func GetShortName(b string, isCrai bool) (string, error)

func Main

func Main()

Main is called from the goleft dispatcher

func ReadFai added in v0.1.14

func ReadFai(path string, chrom string) []*sam.Reference

ReadFai returns a slit of references from the fai path. If chrom is "" all chromosomes are returned.

func RefsFromBam added in v0.1.14

func RefsFromBam(path string, chrom string) []*sam.Reference

Types

type Index added in v0.1.9

type Index struct {
	*bam.Index
	// contains filtered or unexported fields
}

Index wraps a bam.Index to cache calculated values.

func ReadIndex added in v0.1.14

func ReadIndex(path string) *Index

ReadIndex returns an Index pointer from the specified bam or crai path.

func (*Index) NormalizedDepth added in v0.1.9

func (x *Index) NormalizedDepth(refID int) []float32

NormalizedDepth returns a list of numbers for the normalized depth of the given region. Values are scaled to have a mean of 1. If end is 0, the full chromosome is returned.

func (*Index) Path added in v0.2.1

func (i *Index) Path() string

func (*Index) Sizes added in v0.1.14

func (i *Index) Sizes() [][]int64

Sizes returns the size of each block in slices of chromosomes.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL