vcf

Version: v1.1.0
Published: Nov 17, 2015 | License: BSD-3-Clause | Imports: 7 | Imported by: 0

README

vcf

vcf is a Go package that parses data from an io.Reader adhering to the Variant Call Format v4.2 Specification.

Data is read asynchronously and returned through two channels: one for correctly parsed variants and one for variants whose parsing failed. Proper initialization and buffering of these channels is the responsibility of the client.

This package is still a work in progress and is subject to change at any time without notice. Releases will follow Semantic Versioning 2.0.0. The major version is still v0, reflecting the early stage of development this package is in.

INFO

Currently, parsing can handle Samples, optional fields such as ID, Quality and Filter, as well as the INFO field. INFO is exposed in two ways:

  • As a map[string]interface{} exposing every key found in the INFO field of each variant, without any treatment. Key-value pairs are added to this map as they appear. Keys that carry no value, such as DB, are stored with the boolean value true.
  • As a series of sub-fields listed in section 1.4.1, item 8 of the VCF 4.2 spec. These sub-fields are provided on a best-effort basis. Failure to parse one of them only sets its corresponding pointer to nil and does not generate an error. The raw data can always be found in the map.
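
The raw map described in the first bullet can be pictured with a minimal sketch. The helper name parseInfo is hypothetical and this is not the package's implementation; it assumes values are kept as unparsed strings and that keys without a value map to true.

```go
package main

import (
	"fmt"
	"strings"
)

// parseInfo is a hypothetical helper that turns a raw INFO column into a map.
// Keys with a value keep it as an unparsed string (an assumption);
// flag keys without a value, such as DB, are stored as the boolean true.
func parseInfo(info string) map[string]interface{} {
	out := make(map[string]interface{})
	if info == "" || info == "." {
		return out // "." marks a missing INFO column
	}
	for _, field := range strings.Split(info, ";") {
		if field == "" {
			continue
		}
		kv := strings.SplitN(field, "=", 2)
		if len(kv) == 2 {
			out[kv[0]] = kv[1]
		} else {
			out[kv[0]] = true
		}
	}
	return out
}

func main() {
	info := parseInfo("DP=5;AF=1.00;DB")
	fmt.Println(info["DP"], info["AF"], info["DB"])
}
```

Because the values are interface{}, a caller needs a type assertion (or a type switch) before using a value as a string or a bool.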

Genotype fields

Genotype fields (section 1.4.2 of the spec) do not receive the same treatment yet. They are separated by sample, but the only representation available is a raw map. Easy access to sub-fields is intended in the future.
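
As an illustration of what a per-sample raw map looks like, here is a small sketch that pairs the keys of a FORMAT column with the values of one sample column. The function name parseSample is hypothetical, and the package may build its maps differently.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSample is a hypothetical helper that pairs the colon-separated keys of
// the FORMAT column with the colon-separated values of a single sample column.
func parseSample(format, sample string) map[string]string {
	keys := strings.Split(format, ":")
	values := strings.Split(sample, ":")
	out := make(map[string]string, len(keys))
	for i, key := range keys {
		if i >= len(values) {
			break // the sample column may omit trailing fields
		}
		out[key] = values[i]
	}
	return out
}

func main() {
	s := parseSample("GT:AD:DP", "0/1:3,2:5")
	fmt.Println(s["GT"], s["AD"], s["DP"])
}
```

All values stay strings, so a field such as DP still has to be converted with strconv by the caller.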

Structural variants

Structural variants have not been addressed as of version 0.1.0.

License

This software uses the BSD 3-Clause License.


Copyright (c) 2015, Mendelics Análise Genômica S.A. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Documentation

Overview

Package vcf provides an API for parsing genomic data compliant with the Variant Call Format 4.2 Specification.

This API is built with channels, assuming asynchronous computation. Successfully parsed variants are sent to the consumer through one channel as soon as they are read, and variants that fail to parse are sent through a second channel.

Example

Channels should be initialized and passed to the ToChannel function. The client should not close the channels. This happens inside ToChannel when the input is exhausted.

validVariants := make(chan *Variant, 100)      // buffered channel for correctly parsed variants
invalidVariants := make(chan InvalidLine, 100) // buffered channel for variants that fail to parse

filename := "example_vcfs/test.vcf"

vcfFile, err := os.Open(filename)
if err != nil {
	log.Fatalln("can't open file", filename)
}
defer vcfFile.Close()

go func() {
	err := ToChannel(vcfFile, validVariants, invalidVariants)
	if err != nil {
		log.Fatalln(err)
	}
}()

go func() {
	// consume invalid variants channel asynchronously
	for invalid := range invalidVariants {
		fmt.Println("failed to parse line", invalid.Line, "with error", invalid.Err)
	}
}()

for variant := range validVariants {
	fmt.Println(variant)
	if variant.Qual != nil {
		fmt.Println("Quality:", *variant.Qual)
	}
	fmt.Println("Filter:", variant.Filter)
	fmt.Println("Allele Count:", *variant.AlleleCount)
	fmt.Println("Allele Frequency:", *variant.AlleleFrequency)
	fmt.Println("Total Alleles:", *variant.TotalAlleles)
	fmt.Println("Depth:", *variant.Depth)
	fmt.Println("Mapping Quality:", *variant.MappingQuality)
	fmt.Println("MAPQ0 Reads:", *variant.MAPQ0Reads)

	rawInfo := variant.Info
	vqslod := rawInfo["VQSLOD"]
	fmt.Println("VQSLOD:", vqslod)
}
Output:

Chromosome: 1 Position: 762588 Reference: G Alternative: C
Quality: 40
Filter: PASS
Allele Count: 2
Allele Frequency: 1
Total Alleles: 2
Depth: 5
Mapping Quality: 43.32
MAPQ0 Reads: 0
VQSLOD: 1.18
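
The example above dereferences the optional pointer fields directly, which works for this input but would panic on a variant where one of them is nil. A minimal sketch of nil-safe access follows; the variant struct is a local stand-in for two fields of the package's Variant type, and the helpers intOr and floatOr are hypothetical.

```go
package main

import "fmt"

// variant is a local stand-in holding two of the optional pointer fields.
type variant struct {
	Qual  *float64
	Depth *int
}

// intOr returns the value p points to, or def when p is nil.
func intOr(p *int, def int) int {
	if p == nil {
		return def
	}
	return *p
}

// floatOr returns the value p points to, or def when p is nil.
func floatOr(p *float64, def float64) float64 {
	if p == nil {
		return def
	}
	return *p
}

func main() {
	depth := 5
	parsed := variant{Depth: &depth} // Qual stays nil, as for a '.' in the QUAL column
	fmt.Println("Depth:", intOr(parsed.Depth, -1))
	fmt.Println("Quality:", floatOr(parsed.Qual, -1))
}
```

Whether a sentinel default or an explicit nil check is more appropriate depends on whether the caller needs to distinguish "missing" from a real value.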

Functions

func SampleIDs

func SampleIDs(reader io.Reader) ([]string, error)

SampleIDs reads a VCF header from an io.Reader and returns a slice with all the sample IDs contained in that header. If the header contains no samples, a nil slice is returned.

Example
filename := "example_vcfs/testsamples.vcf"
vcfFile, err := os.Open(filename)
if err != nil {
	log.Fatalln("can't open file", filename)
}
defer vcfFile.Close()

sampleIDs, err := SampleIDs(vcfFile)
if err == nil && sampleIDs != nil {
	for i, sample := range sampleIDs {
		fmt.Printf("sample %d: %s\n", i, sample)
	}
}
Output:

sample 0: 111222

func ToChannel

func ToChannel(reader io.Reader, output chan<- *Variant, invalids chan<- InvalidLine) error

ToChannel reads from an io.Reader and sends every parsed variant to an already initialized channel. Variants whose parsing fails go to a separate channel for invalid lines. If either channel is full, ToChannel blocks until the consumer reads from it, so the consumer must provide enough buffer space or drain both channels concurrently. Both channels are closed when the reader is fully scanned.
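
Since a full channel blocks the producer, one way to stay safe regardless of buffer size is to drain both channels concurrently and wait for both consumers to finish. The sketch below uses a hypothetical stand-in producer (produce) and a local invalidLine type instead of the real ToChannel and InvalidLine, so it runs without the package; the validity check is deliberately simplistic.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// invalidLine is a local stand-in for the package's InvalidLine type.
type invalidLine struct {
	Line string
	Err  error
}

// produce is a hypothetical stand-in for ToChannel: it routes each line to one
// of the two channels and closes both once the input is exhausted.
func produce(lines []string, valid chan<- string, invalid chan<- invalidLine) {
	defer close(valid)
	defer close(invalid)
	for _, line := range lines {
		if strings.Count(line, "\t") < 7 { // fewer than 8 tab-separated fields
			invalid <- invalidLine{Line: line, Err: fmt.Errorf("expected at least 8 fields")}
			continue
		}
		valid <- line
	}
}

// drain consumes both channels concurrently and returns how many items
// arrived on each. The channels are unbuffered on purpose, to show that
// concurrent draining works without relying on buffer capacity.
func drain(lines []string) (nValid, nInvalid int) {
	valid := make(chan string)
	invalid := make(chan invalidLine)

	go produce(lines, valid, invalid)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for range invalid {
			nInvalid++
		}
	}()

	for range valid {
		nValid++
	}
	wg.Wait() // make sure the invalid-line consumer has finished too
	return nValid, nInvalid
}

func main() {
	lines := []string{
		"1\t762588\t.\tG\tC\t40\tPASS\tDP=5",
		"not a variant line",
	}
	v, i := drain(lines)
	fmt.Println("valid:", v, "invalid:", i)
}
```

Waiting on the WaitGroup before returning also avoids exiting while invalid lines are still being reported.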

Types

type InvalidLine

type InvalidLine struct {
	Line string
	Err  error
}

InvalidLine represents a VCF line that could not be parsed. It encapsulates the problematic line with its corresponding error.

type SVType (added in v1.1.0)

type SVType int
const (
	Deletion SVType = iota
	Duplication
	Insertion
	Inversion
	CopyNumberVariation
	TandemDuplication
	DeletionMobileElement
	InsertionMobileElement
	Breakend
)

type Variant

type Variant struct {
	// Required fields
	Chrom string
	Pos   int
	Ref   string
	Alt   string

	ID string

	// Qual is a pointer so that it can be set to nil when it is a dot '.'
	Qual *float64

	Filter string

	// Info is a map containing all the keys present in the INFO field, with their corresponding value.
	// For keys without corresponding values, the value is a `true` bool.
	// No attempt at parsing is made on this field, data is raw.
	// The only exception is for multiple alternatives data. These are reported separately for each variant.
	Info map[string]interface{}

	// Genotype fields for each sample
	Samples []map[string]string

	// Optional info fields. These are the reserved fields listed in the VCF 4.2 spec, section 1.4.1, number 8.
	// The parsing is lenient: if the fields do not conform to the expected type listed here, they will be set to nil.
	// The fields are meant as helpers for common scenarios, since the generic usage is covered by the Info map.
	// Definitions used in the metadata section of the header are not used.
	AncestralAllele *string
	Depth           *int
	AlleleFrequency *float64
	AlleleCount     *int
	TotalAlleles    *int
	End             *int
	MAPQ0Reads      *int
	NumberOfSamples *int
	MappingQuality  *float64
	Cigar           *string
	InDBSNP         *bool
	InHapmap2       *bool
	InHapmap3       *bool
	IsSomatic       *bool
	IsValidated     *bool
	In1000G         *bool
	BaseQuality     *float64
	StrandBias      *float64

	// Structural variants
	Imprecise                        *bool
	Novel                            *bool
	StructuralVariantType            *SVType
	StructuralVariantLength          *int
	ConfidenceIntervalAroundPosition *int
	ConfidenceIntervalAroundEnd      *int
}

Variant is a struct representing the fields specified in the VCF 4.2 spec.

When a variant is generated through the API of the vcf package, the required fields are guaranteed to be valid; otherwise parsing of that variant fails and the failure is reported.

Multiple alternatives are parsed as separate instances of the type Variant. All other fields are optional and will not cause parsing failures if missing or non-conformant.
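
To make the handling of multiple alternatives concrete, the sketch below splits a comma-separated ALT column into one record per allele. The variant struct and splitAlternatives are hypothetical stand-ins; the real package additionally has to assign per-allele INFO values to each instance, which is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// variant is a local stand-in with only the required fields.
type variant struct {
	Chrom string
	Pos   int
	Ref   string
	Alt   string
}

// splitAlternatives is a hypothetical helper returning one variant per
// alternative allele listed in a comma-separated ALT column.
func splitAlternatives(chrom string, pos int, ref, alt string) []variant {
	alts := strings.Split(alt, ",")
	out := make([]variant, 0, len(alts))
	for _, a := range alts {
		out = append(out, variant{Chrom: chrom, Pos: pos, Ref: ref, Alt: a})
	}
	return out
}

func main() {
	for _, v := range splitAlternatives("1", 762588, "G", "C,T") {
		fmt.Println(v.Chrom, v.Pos, v.Ref, v.Alt)
	}
}
```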

func (*Variant) String

func (v *Variant) String() string

String returns a representation of the variant key (the fields Chrom, Pos, Ref and Alt). It makes *Variant satisfy fmt.Stringer.
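
A Stringer along these lines can be sketched on a local stand-in type. The output format is inferred from the first line of the example output earlier on this page and may not match the package's implementation exactly.

```go
package main

import "fmt"

// variant is a local stand-in with only the fields that form the variant key.
type variant struct {
	Chrom string
	Pos   int
	Ref   string
	Alt   string
}

// String formats the variant key; the layout is an assumption based on the
// example output shown in this documentation.
func (v *variant) String() string {
	return fmt.Sprintf("Chromosome: %s Position: %d Reference: %s Alternative: %s",
		v.Chrom, v.Pos, v.Ref, v.Alt)
}

// Compile-time check that *variant satisfies fmt.Stringer.
var _ fmt.Stringer = (*variant)(nil)

func main() {
	v := &variant{Chrom: "1", Pos: 762588, Ref: "G", Alt: "C"}
	fmt.Println(v) // fmt uses the String method automatically
}
```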
