vcfgo

v0.0.0-...-51f8e80
Published: Sep 13, 2021 License: MIT Imports: 16 Imported by: 0

README

vcfgo is a golang library to read, write and manipulate files in the variant call format.

This implementation is a fork of github.com/brentp/vcfgo. It was created to allow for reading VCF files that did not comply with the strict header parsing rules of the original vcfgo. This version allows for additional fields to be specified for INFO and FORMAT meta-information lines (as allowed in the VCFv4.3 spec). It also allows the fields to occur in any order.

Because of these changes, this package is similar to the original vcfgo package but is NOT a drop-in replacement for it. In general it allows for looser parsing of VCF headers. The documentation below is as per the original package except where necessary to reflect the changes made.

vcfgo

-- import "github.com/brentp/vcfgo"

Package vcfgo implements a Reader and Writer for variant call format. It eases reading, filtering, and modifying VCFs, even if they are not to spec. Example:

Usage

f, err := os.Open("examples/test.auto_dom.no_parents.vcf")
if err != nil {
    panic(err)
}
rdr, err := vcfgo.NewReader(f, false)
if err != nil {
    panic(err)
}
for {
    variant := rdr.Read()
    if variant == nil {
        break
    }
    fmt.Printf("%s\t%d\t%s\t%v\n", variant.Chromosome, variant.Pos, variant.Ref(), variant.Alt())
    // Info values are typed according to the header, so check the lookup
    // error and the type assertion before using the value.
    if dp, err := variant.Info().Get("DP"); err == nil {
        if depth, ok := dp.(int); ok {
            fmt.Printf("depth: %d\n", depth)
        }
    }
    sample := variant.Samples[0]
    // we can get the PL field as a list (-1 is default in case of missing value)
    PL, err := variant.GetGenotypeField(sample, "PL", -1)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%v\n", PL)
    _ = sample.DP
}
// report any errors accumulated while parsing
fmt.Fprintln(os.Stderr, rdr.Error())

Status

vcfgo is well-tested, but still in development. It tries to tolerate errors while still reporting them: after every rdr.Read() call, the caller can check rdr.Error() and get feedback on the errors. Execution does not stop unless the caller explicitly chooses to stop it.

Info and sample fields are pre-parsed and stored as map[string]interface{}, so callers will have to type-assert to the appropriate type upon retrieval.
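
As an illustration of the type assertion mentioned above, here is a minimal sketch that does not use vcfgo itself. The map stands in for a parsed INFO field, and the helper name asInt and the sample keys and values are assumptions made for this example only.

```go
package main

import "fmt"

// asInt is a hypothetical helper (not part of vcfgo): it converts a value
// pulled out of a map[string]interface{} into an int and reports whether
// the conversion was possible.
func asInt(v interface{}) (int, bool) {
	switch n := v.(type) {
	case int:
		return n, true
	case []int:
		// A multi-valued field: use the first element if there is one.
		if len(n) > 0 {
			return n[0], true
		}
	}
	return 0, false
}

func main() {
	// Stand-in for a parsed INFO field; keys and values are made up.
	info := map[string]interface{}{"DP": 35, "AF": 0.5, "AC": []int{2, 1}}

	for _, key := range []string{"DP", "AF", "AC", "XX"} {
		if n, ok := asInt(info[key]); ok {
			fmt.Printf("%s=%d\n", key, n)
		} else {
			fmt.Printf("%s: not an int\n", key)
		}
	}
}
```

Using the two-value form of the assertion (or a type switch) avoids a panic when a field is absent or has a different type than expected.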

type Header
type Header struct {
	SampleNames   []string
	Infos         map[string]*Info
	SampleFormats map[string]*SampleFormat
	Filters       map[string]string
	Extras        map[string]string
	FileFormat    string
	// contig id maps to a map of length, URL, etc.
	Contigs map[string]map[string]string
}

Header holds all the type and format information for the variants.

func NewHeader
func NewHeader() *Header

NewHeader returns a Header with the requisite allocations.

type Info
type Info struct {
	Id          string
	Description string
	Number      string // A G R . ''
	Type        string // STRING INTEGER FLOAT FLAG CHARACTER UNKNOWN
}

Info holds the Info and Format fields

func (*Info) String
func (i *Info) String() string

String returns a string representation.

type InfoMap
type InfoMap map[string]interface{}

InfoMap holds the parsed Info field which can contain floats, ints and lists thereof.

func (InfoMap) String
func (m InfoMap) String() string

String returns a string that matches the original info field.

type Reader
type Reader struct {
	Header *Header

	LineNumber int
}

Reader holds information about the current line number (for errors) and the VCF header that indicates the structure of records.

func NewReader
func NewReader(r io.Reader, lazySamples bool) (*Reader, error)

NewReader returns a Reader.

func (*Reader) Clear
func (vr *Reader) Clear()

Clear empties the cache of errors.

func (*Reader) Error
func (vr *Reader) Error() error

Error() aggregates the multiple errors that can occur into a single object.

func (*Reader) Read
func (vr *Reader) Read() *Variant

Read returns a pointer to a Variant. After each read, the caller is expected to check Reader.Error().

type SampleFormat
type SampleFormat Info

SampleFormat holds the type info for Format fields.

func (*SampleFormat) String
func (i *SampleFormat) String() string

String returns a string representation.

type SampleGenotype
type SampleGenotype struct {
	Phased bool
	GT     []int
	DP     int
	GL     []float32
	GQ     int
	MQ     int
	Fields map[string]string
}

SampleGenotype holds the information about a sample. Several fields are pre-parsed, but all fields are kept in Fields as well.

func NewSampleGenotype
func NewSampleGenotype() *SampleGenotype

NewSampleGenotype allocates the internals and returns a SampleGenotype

func (*SampleGenotype) String
func (sg *SampleGenotype) String(fields []string) string

String returns the string representation of the sample field.

type VCFError
type VCFError struct {
	Msgs  []string
	Lines []int
}

VCFError satisfies the error interface and allows multiple errors. This is useful because, for example, on a single line, every sample may have a field that doesn't match the description in the header. We want to keep parsing but also let the caller know about the error.

func NewVCFError
func NewVCFError() *VCFError

NewVCFError allocates the needed ingredients.

func (*VCFError) Add
func (e *VCFError) Add(err error, line int)

Add adds an error and the line number within the vcf where the error took place.

func (*VCFError) Clear
func (e *VCFError) Clear()

Clear empties the Messages

func (*VCFError) Error
func (e *VCFError) Error() string

Error returns a string with all errors delimited by newlines.

func (*VCFError) IsEmpty
func (e *VCFError) IsEmpty() bool

IsEmpty returns true if there are no errors stored.

type Variant
type Variant struct {
	Chromosome string
	Pos        uint64
	Id         string
	Ref        string
	Alt        []string
	Quality    float32
	Filter     string
	Info       InfoMap
	Format     []string
	Samples    []*SampleGenotype
	Header     *Header
	LineNumber int
}

Variant holds the information about a single site. It is analogous to a row in a VCF file.

func (*Variant) GetGenotypeField
func (v *Variant) GetGenotypeField(g *SampleGenotype, field string, missing interface{}) (interface{}, error)

GetGenotypeField uses the information from the header to parse the correct type from a genotype field. It returns an interface that can be asserted to the expected type.

func (*Variant) String
func (v *Variant) String() string

String gives a string representation of a variant

type Writer
type Writer struct {
	io.Writer
	Header *Header
}

Writer allows writing VCF files.

func NewWriter
func NewWriter(w io.Writer, h *Header) (*Writer, error)

NewWriter returns a writer after writing the header.

func (*Writer) WriteVariant
func (w *Writer) WriteVariant(v *Variant)

WriteVariant writes a single variant

Documentation

Overview

Package vcfgo implements a Reader and Writer for variant call format. It eases reading, filtering, and modifying VCFs, even if they are not to spec. Example:

f, err := os.Open("examples/test.auto_dom.no_parents.vcf")
if err != nil {
	panic(err)
}
rdr, err := vcfgo.NewReader(f, false)
if err != nil {
	panic(err)
}
for {
	variant := rdr.Read()
	if variant == nil {
		break
	}
	fmt.Printf("%s\t%d\t%s\t%v\n", variant.Chromosome, variant.Pos, variant.Ref(), variant.Alt())
	// Info() values are typed according to the header; assert before comparing.
	if dp, err := variant.Info().Get("DP"); err == nil {
		if depth, ok := dp.(int); ok {
			fmt.Printf("%v\n", depth > 10)
		}
	}
	sample := variant.Samples[0]
	// we can get the PL field as a list (-1 is default in case of missing value)
	pl, err := variant.GetGenotypeField(sample, "PL", -1)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%v\n", pl)
	_ = sample.DP
}
fmt.Fprintln(os.Stderr, rdr.Error())
Example
package main

import (
	"fmt"
	"os"

	"github.com/brentp/vcfgo"
)

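func main() {
	f, err := os.Open("examples/test.auto_dom.no_parents.vcf")
	if err != nil {
		panic(err)
	}
	rdr, err := vcfgo.NewReader(f, false)
	if err != nil {
		panic(err)
	}
	for {
		variant := rdr.Read()
		if variant == nil {
			break
		}
		fmt.Printf("%s\t%d\t%s\t%v\n", variant.Chromosome, variant.Pos, variant.Ref(), variant.Alt())
		// Check the lookup error and the type before comparing the depth.
		if dp, err := variant.Info().Get("DP"); err == nil {
			if depth, ok := dp.(int); ok {
				fmt.Printf("%v\n", depth > 10)
			}
		}
	}
	// Report any errors accumulated while parsing.
	fmt.Fprintln(os.Stderr, rdr.Error())
}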

Constants

const MISSING_VAL = 256

MISSING_VAL is used for the quality score, which ranges from 0 to 255 but may also be given as "." (missing).
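
One way a sentinel like this can be used is sketched below. The helper parseQual and the lowercase constant missingVal are hypothetical names for this example; how the package itself applies MISSING_VAL internally is not shown in this documentation.

```go
package main

import (
	"fmt"
	"strconv"
)

// missingVal mirrors the documented MISSING_VAL constant (256), a value
// outside the 0 to 255 range stated for the quality score.
const missingVal = 256

// parseQual is a hypothetical helper: it maps "." to the sentinel and
// parses anything else as a 32-bit float.
func parseQual(s string) (float32, error) {
	if s == "." {
		return missingVal, nil
	}
	q, err := strconv.ParseFloat(s, 32)
	if err != nil {
		return 0, err
	}
	return float32(q), nil
}

func main() {
	for _, s := range []string{"29.5", "."} {
		q, err := parseQual(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if q == missingVal {
			fmt.Printf("%q: quality is missing\n", s)
		} else {
			fmt.Printf("%q: quality is %v\n", s, q)
		}
	}
}
```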

Variables

var (
	ErrKeyNotFound  = errors.New("vcfgo: key not found")
	ErrDuplicateKey = errors.New("vcfgo: key cannot be added multiple times")
	ErrLinePattern  = errors.New("vcfgo: unexpected header line")
)

Functions

func ItoS

func ItoS(k string, v interface{}) string

Types

type Header struct {
	// Added by composition so functions which take this type as a
	// receiver will work with a Header as receiver.
	sync.RWMutex

	// Mandatory first header line for all VCF files.
	FileFormat string

	// Parsed from #CHROM line.
	SampleNames []string

	// This holds an array of meta-information lines
	// in the order in which they were observed in the original header.
	// It does not hold the fileformat meta line, which is parsed
	// separately, nor does it hold the #CHROM line.
	Lines []*MetaLine

	// I think these are all headed to the scrap heap once I have the
	// Structured and Unstructured lists (maps?) working.
	Infos         map[string]*Info
	SampleFormats map[string]*SampleFormat
	Filters       map[string]string
	Extras        []string
	// Contigs is a list of maps of length, URL, etc.
	Contigs []map[string]string
	// ##SAMPLE
	Samples   map[string]string
	Pedigrees []string
}

Header holds a heap of valuable annotation without which it is very difficult to make sense of the variant records. While the meta-information lines are all optional (except for fileformat), a VCF without a substantial header is very difficult to use. At the absolute minimum the header should contain a FILTER line for each string used in the FILTER field (column 6, 0-based numbering), an INFO line for each element in the INFO field (column 7) and a FORMAT line for each element in the FORMAT field (column 8).

func NewHeader

func NewHeader() *Header

NewHeader returns a Header with the requisite allocations.

func (*Header) GetLineByTypeAndId

func (h *Header) GetLineByTypeAndId(t string, id string) (*MetaLine, error)

Returns the MetaLine in the Header that matches the supplied type, e.g. `INFO`, `FORMAT`, `fileDate`, and that has the supplied ID. Note that this will only work for structured meta-info lines. By definition, within a type, there can only be one record with a given ID, so an error is returned if more than one MetaLine is found. Also note that the type and ID matching are both case sensitive, and that the return value is a pointer to the MetaLine held by the Header, so if you change it, you change the original.

func (*Header) GetLinesByType

func (h *Header) GetLinesByType(t string) []*MetaLine

Returns all MetaLines in the Header that match the supplied type, e.g. `INFO`, `FORMAT`, `fileDate`. Note that the type matching is case sensitive so `info` and `INFO` are not interchangeable. Also note that the array returned is of pointers to the MetaLines held by the Header so if you change them, you change the originals.
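
The two caveats above (case-sensitive matching and shared pointers) can be demonstrated without the package. In this sketch the type metaLine and the function linesByKey are hypothetical stand-ins, not the package's own types.

```go
package main

import "fmt"

// metaLine is a hypothetical, simplified stand-in for a header line.
type metaLine struct {
	Key   string
	Value string
}

// linesByKey returns pointers to the stored lines whose Key matches
// exactly (case sensitive). The pointers are shared with the input slice.
func linesByKey(lines []*metaLine, key string) []*metaLine {
	var out []*metaLine
	for _, l := range lines {
		if l.Key == key {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	all := []*metaLine{
		{Key: "INFO", Value: "ID=DP"},
		{Key: "fileDate", Value: "20210913"},
	}

	fmt.Println(len(linesByKey(all, "info"))) // no match: wrong case

	got := linesByKey(all, "INFO")
	got[0].Value = "ID=AF" // edits the line held in all as well
	fmt.Println(all[0].Value)
}
```

If an independent copy is needed, dereference the pointer and copy the struct before modifying it.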

func (*Header) ParseSamples

func (h *Header) ParseSamples(v *Variant) error

Force parsing of the sample fields.

type Info

type Info struct {
	Id          string
	Description string
	Number      string // A G R . ''
	Type        string // STRING INTEGER FLOAT FLAG CHARACTER UNKNOWN
	// contains filtered or unexported fields
}

Info holds the Info and Format fields

func NewInfo

func NewInfo() *Info

NewInfo allocates the internals and returns a *Info

func NewInfoFromString

func NewInfoFromString(s string) (*Info, error)

NewInfoFromString parses a key=value string and returns a *Info. Note that the string is not a full INFO line from the header but just that portion between < and > in the INFO line. For example `ID=DP,Number=1,Type=Integer,Description="Total Depth"` not `##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">`
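
To make the expected input concrete, here is a minimal sketch of splitting such a string into key=value pairs while leaving commas inside double quotes alone. The function splitKV is hypothetical and is not the parser this package uses; in particular it only handles double quotes.

```go
package main

import (
	"fmt"
	"strings"
)

// splitKV is a hypothetical helper. It splits the text found between
// < and > of a structured meta line into key=value pairs. It returns the
// keys in their original order plus a key-to-value map with surrounding
// double quotes removed from the values.
func splitKV(s string) ([]string, map[string]string, error) {
	var keys []string
	vals := make(map[string]string)
	inQuote := false
	start := 0
	for i := 0; i <= len(s); i++ {
		if i < len(s) {
			if s[i] == '"' {
				inQuote = !inQuote
			}
			if s[i] != ',' || inQuote {
				continue
			}
		}
		// i is now either a separating comma or the end of the string.
		pair := s[start:i]
		start = i + 1
		eq := strings.IndexByte(pair, '=')
		if eq < 1 {
			return nil, nil, fmt.Errorf("malformed pair %q", pair)
		}
		key := pair[:eq]
		if _, dup := vals[key]; dup {
			return nil, nil, fmt.Errorf("duplicate key %q", key)
		}
		keys = append(keys, key)
		vals[key] = strings.Trim(pair[eq+1:], `"`)
	}
	return keys, vals, nil
}

func main() {
	in := `ID=DP,Number=1,Type=Integer,Description="Total Depth, raw"`
	keys, vals, err := splitKV(in)
	if err != nil {
		panic(err)
	}
	for _, k := range keys {
		fmt.Printf("%s -> %s\n", k, vals[k])
	}
}
```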

func (*Info) GetKV

func (i *Info) GetKV(k string) (*KV, error)

GetKV returns the KV for a given key. If the key does not exist, an ErrKeyNotFound error is returned.

func (*Info) GetValue

func (i *Info) GetValue(k string) (string, error)

GetValue returns the value for a given key. If the key does not exist, an ErrKeyNotFound error is returned.

func (*Info) String

func (i *Info) String() string

String returns a string representation.

type InfoByte

type InfoByte struct {
	Info []byte
	// contains filtered or unexported fields
}

InfoByte holds the INFO field in a Variant record line. Called by Reader as each variant record in the VCF file is parsed.

func NewInfoByte

func NewInfoByte(info []byte, h *Header) *InfoByte

func (*InfoByte) Add

func (i *InfoByte) Add(key string, value interface{})

func (InfoByte) Bytes

func (i InfoByte) Bytes() []byte

func (InfoByte) Contains

func (i InfoByte) Contains(key string) bool

func (*InfoByte) Delete

func (i *InfoByte) Delete(key string)

func (InfoByte) Get

func (i InfoByte) Get(key string) (interface{}, error)

Get a value from the bytes typed according to the header.

func (InfoByte) Keys

func (i InfoByte) Keys() []string

func (InfoByte) SGet

func (i InfoByte) SGet(key string) []byte

func (*InfoByte) Set

func (i *InfoByte) Set(key string, value interface{}) error

func (InfoByte) String

func (i InfoByte) String() string

func (*InfoByte) UpdateHeader

func (i *InfoByte) UpdateHeader(key string, value interface{})

type KV

type KV struct {
	Key   string
	Value string

	// 0-based index of where this KV appeared in the original
	// string. It is used to recreate meta lines as strings with the
	// key=value pairs in the same order as they were in the original.
	Index int

	// Quote character ([`'"]) if any that was used for the value of the
	// key-value pair. The spec does not state that double quotes must
	// be used for all quoting but it may be so. In any case, we can cope
	// with any of the 3 quoting characters shown above. Quote is empty
	// if the Value was not quoted.
	Quote rune
}

KV holds a key=value pair. It can be used for structured meta-information lines such as INFO, FORMAT and FILTER. See section 1.4 from the VCFv4.3 specification (version 27 Jul 2021; retrieved 2021-09-05) at: https://samtools.github.io/hts-specs/VCFv4.3.pdf

type MetaLine

type MetaLine struct {
	LineNumber int

	// MetaType defaults to Unstructured. You can manually set this
	// value but it's best not to. Let the package do the work.
	MetaType MetaType

	// The basic XXX= value which is present in both Structured and
	// Unstructured MetaLines.
	LineKey string

	// Value is only used in Unstructured MetaLines - Structured
	// MetaLines use KVs and Order instead.
	Value string

	// KVs and Order contain the key=value items (as KV) from a
	// Structured MetaLine plus the order in which they occurred in the
	// OgString or the order in which they were added with AddKV().
	// The Order is obeyed by String()
	KVs   map[string]*KV
	Order []string

	// OgString is only available if the MetaLine was created via
	// NewMetaLineFromString().
	OgString string
}

MetaLine is designed to hold information from both structured and unstructured meta-information lines from the VCF header. Different fields are set for the different MetaTypes: KVs and Order will only be set for structured lines, and Value will only be set for unstructured lines.

func NewMetaLine

func NewMetaLine() *MetaLine

NewMetaLine returns a pointer to a MetaLine. By default, the MetaType is Unstructured. If you use the AddKV() function, MetaType will be automatically converted to Structured.

func NewMetaLineFromString

func NewMetaLineFromString(s string) (*MetaLine, error)

NewMetaLineFromString matches the input string against the pattern for Structured and Unstructured MetaLines and returns a MetaLine. If neither pattern matches, it returns an error.
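
The structured/unstructured distinction can be sketched with two regular expressions. The patterns and the function classify below are assumptions written for this example; the patterns the package actually uses may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Assumed pattern for a structured line: ##KEY=<...>
	structuredRe = regexp.MustCompile(`^##([^=]+)=<(.*)>$`)
	// Assumed pattern for an unstructured line: ##KEY=VALUE
	unstructuredRe = regexp.MustCompile(`^##([^=]+)=(.*)$`)
)

// classify is a hypothetical helper. It tries the structured pattern
// first, because every structured line also matches the unstructured one.
func classify(line string) (kind, key, value string, err error) {
	if m := structuredRe.FindStringSubmatch(line); m != nil {
		return "Structured", m[1], m[2], nil
	}
	if m := unstructuredRe.FindStringSubmatch(line); m != nil {
		return "Unstructured", m[1], m[2], nil
	}
	return "", "", "", fmt.Errorf("unexpected header line: %q", line)
}

func main() {
	lines := []string{
		`##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">`,
		`##fileDate=20210913`,
		"#CHROM\tPOS\tID",
	}
	for _, l := range lines {
		kind, key, value, err := classify(l)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: key=%s value=%s\n", kind, key, value)
	}
}
```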

func (*MetaLine) GetValue

func (m *MetaLine) GetValue(k string) string

GetValue takes a key and returns the value for that key from the MetaLine's key=value set. Only meaningful for Structured MetaLines and will always return an empty string for Unstructured MetaLines.

func (*MetaLine) String

func (m *MetaLine) String() (string, error)

String returns a string representation.

type MetaType

type MetaType int

MetaType is an enum for the type of a header meta-information line.

const (
	Unstructured MetaType = iota // EnumIndex = 0
	Structured                   // EnumIndex = 1
)

Declare related constants for each MetaType starting with index 0

func (MetaType) EnumIndex

func (m MetaType) EnumIndex() int

EnumIndex returns the integer index of the MetaType.

func (MetaType) String

func (m MetaType) String() string

String returns the name of the MetaType.
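
For readers unfamiliar with the iota enum pattern, a self-contained sketch follows. The lowercase type metaType is a stand-in defined here, and the exact strings returned by the package's own String method are an assumption.

```go
package main

import "fmt"

// metaType is a local stand-in for the documented MetaType.
type metaType int

const (
	unstructured metaType = iota // 0
	structured                   // 1
)

// EnumIndex returns the underlying integer value.
func (m metaType) EnumIndex() int { return int(m) }

// String returns a readable name; the names used here are assumed.
func (m metaType) String() string {
	switch m {
	case unstructured:
		return "Unstructured"
	case structured:
		return "Structured"
	}
	return fmt.Sprintf("metaType(%d)", int(m))
}

func main() {
	var m metaType // the zero value is unstructured
	fmt.Println(m, m.EnumIndex())
	m = structured
	fmt.Println(m, m.EnumIndex())
}
```

Because the zero value of the type is the first constant, a freshly allocated value is unstructured, which is consistent with the default described for NewMetaLine.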

type Reader

type Reader struct {
	Header *Header

	LineNumber int
	// contains filtered or unexported fields
}

Reader holds information about the current line number (for errors) and the VCF header that indicates the structure of records.

func NewReader

func NewReader(r io.Reader, lazySamples bool) (*Reader, error)

NewReader returns a Reader. If lazySamples is true, then the user will have to call Header.ParseSamples() in order to access sample info.

func NewWithHeader

func NewWithHeader(r io.Reader, h *Header, lazySamples bool) (*Reader, error)

func (*Reader) AddFormatToHeader

func (vr *Reader) AddFormatToHeader(id string, num string, stype string, desc string)

AddFormatToHeader adds a FORMAT field to the header.

func (*Reader) AddInfoToHeader

func (vr *Reader) AddInfoToHeader(id string, num string, stype string, desc string)

AddInfoToHeader adds an INFO field to the header.

func (*Reader) Clear

func (vr *Reader) Clear()

Clear empties the cache of errors.

func (*Reader) Close

func (vr *Reader) Close() error

func (*Reader) Error

func (vr *Reader) Error() error

Error() aggregates the multiple errors that can occur into a single object.

func (*Reader) GetHeaderType

func (vr *Reader) GetHeaderType(field string) string

func (*Reader) Parse

func (vr *Reader) Parse(fields [][]byte) *Variant

func (*Reader) Read

func (vr *Reader) Read() *Variant

Read returns a pointer to a Variant. After each read, the caller is expected to check Reader.Error().

type SampleFormat

type SampleFormat Info

SampleFormat holds the type info for Format fields.

func (*SampleFormat) GetKV

func (s *SampleFormat) GetKV(k string) (*KV, error)

GetKV returns the KV for a given key. If the key does not exist, an ErrKeyNotFound error is returned.

func (*SampleFormat) GetValue

func (s *SampleFormat) GetValue(k string) (string, error)

GetValue returns the value for a given key. If the key does not exist, an ErrKeyNotFound error is returned.

func (*SampleFormat) String

func (s *SampleFormat) String() string

String returns a string representation.

type SampleGenotype

type SampleGenotype struct {
	Phased bool
	GT     []int
	DP     int
	GL     []float64
	GQ     int
	MQ     int
	Fields map[string]string
}

SampleGenotype holds the information about a sample. Several fields are pre-parsed, but all fields are kept in Fields as well.
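
As an illustration of what pre-parsing the GT field into Phased and GT might involve, here is a minimal sketch. The function parseGT is hypothetical, and representing a missing allele (".") as -1 is an assumption of this example rather than documented behavior of the package.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGT is a hypothetical helper. It splits a genotype string such as
// "0/1" or "0|1" into allele indexes and reports whether it is phased
// (separated by '|'). A missing allele "." is stored as -1 (assumption).
func parseGT(s string) ([]int, bool, error) {
	phased := strings.Contains(s, "|")
	parts := strings.FieldsFunc(s, func(r rune) bool {
		return r == '/' || r == '|'
	})
	alleles := make([]int, 0, len(parts))
	for _, p := range parts {
		if p == "." {
			alleles = append(alleles, -1)
			continue
		}
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, false, fmt.Errorf("bad allele %q in GT %q", p, s)
		}
		alleles = append(alleles, n)
	}
	return alleles, phased, nil
}

func main() {
	for _, gt := range []string{"0/1", "1|1", "./."} {
		alleles, phased, err := parseGT(gt)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: alleles=%v phased=%v\n", gt, alleles, phased)
	}
}
```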

func NewSampleGenotype

func NewSampleGenotype() *SampleGenotype

NewSampleGenotype allocates the internals and returns a *SampleGenotype

func (*SampleGenotype) AltDepths

func (s *SampleGenotype) AltDepths() ([]int, error)

AltDepths returns the depths of the alternates for this sample

func (*SampleGenotype) RefDepth

func (s *SampleGenotype) RefDepth() (int, error)

RefDepth returns the depth of the reference allele for this sample

func (*SampleGenotype) String

func (sg *SampleGenotype) String(fields []string) string

String returns the string representation of the sample field.

type VCFError

type VCFError struct {
	Msgs  []string
	Lines []int
}

VCFError satisfies the error interface and allows multiple errors. This is useful because, for example, on a single line, every sample may have a field that doesn't match the description in the header. We want to keep parsing but also let the caller know about the error.

func NewVCFError

func NewVCFError() *VCFError

NewVCFError allocates the needed ingredients.

func (*VCFError) Add

func (e *VCFError) Add(err error, line int)

Add adds an error and the line number within the vcf where the error took place.

func (*VCFError) Clear

func (e *VCFError) Clear()

Clear empties the Messages

func (*VCFError) Error

func (e *VCFError) Error() string

Error returns a string with all errors delimited by newlines.

func (*VCFError) IsEmpty

func (e *VCFError) IsEmpty() bool

IsEmpty returns true if there are no errors stored.

type Variant

type Variant struct {
	Chromosome string
	Pos        uint64
	Id_        string
	Reference  string
	Alternate  []string
	Quality    float32
	Filter     string
	Info_      interfaces.Info
	Format     []string
	Samples    []*SampleGenotype

	Header     *Header
	LineNumber int
	// contains filtered or unexported fields
}

Variant holds the information about a single site. It is analogous to a row in a VCF file.

func SplitAlts

func SplitAlts(v *Variant) []*Variant

func (*Variant) Alt

func (v *Variant) Alt() []string

func (*Variant) CIEnd

func (v *Variant) CIEnd() (uint32, uint32, bool)

CIEnd reports the left and right bounds of the end of an SV using the CIEND tag. The result is in BED format, so the end is incremented by 1. For example, if there is no CIEND tag, the return value is v.End() - 1, v.End().

func (*Variant) CIPos

func (v *Variant) CIPos() (uint32, uint32, bool)

CIPos reports the left and right bounds of the start of an SV using the CIPOS tag. The result is in BED format, so the end is incremented by 1. For example, if there is no CIPOS tag, the return value is v.Start(), v.Start() + 1.

func (*Variant) Chrom

func (v *Variant) Chrom() string

Chrom returns the chromosome name.

func (*Variant) End

func (v *Variant) End() uint32

End returns the 0-based start + the length of the reference allele.

func (*Variant) GetGenotypeField

func (v *Variant) GetGenotypeField(g *SampleGenotype, field string, missing interface{}) (interface{}, error)

GetGenotypeField uses the information from the header to parse the correct type from a genotype field. It returns an interface that can be asserted to the expected type.

func (*Variant) Id

func (v *Variant) Id() string

func (*Variant) Info

func (v *Variant) Info() interfaces.Info

func (*Variant) Ref

func (v *Variant) Ref() string

func (*Variant) Start

func (v *Variant) Start() uint32

Start returns the 0-based start
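
A small worked example of the coordinate arithmetic described for Start and End: VCF positions are 1-based, so the 0-based start is Pos - 1, and the end is that start plus the length of the reference allele. The free functions below are hypothetical stand-ins for the methods, written only from the descriptions given here.

```go
package main

import "fmt"

// start converts a 1-based VCF position to a 0-based start.
// Hypothetical stand-in for Variant.Start, based on its description.
func start(pos uint64) uint32 {
	return uint32(pos - 1)
}

// end is the 0-based start plus the length of the reference allele.
// Hypothetical stand-in for Variant.End, based on its description.
func end(pos uint64, ref string) uint32 {
	return start(pos) + uint32(len(ref))
}

func main() {
	// A 3-base reference allele at 1-based position 100 covers
	// the half-open, 0-based interval [99, 102).
	pos, ref := uint64(100), "ACG"
	fmt.Printf("start=%d end=%d\n", start(pos), end(pos, ref))
}
```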

func (*Variant) String

func (v *Variant) String() string

String gives a string representation of a variant

type Writer

type Writer struct {
	io.Writer
	Header *Header
}

Writer allows writing VCF files.

func NewWriter

func NewWriter(w io.Writer, h *Header) (*Writer, error)

NewWriter returns a writer after writing the header.

func (*Writer) WriteVariant

func (w *Writer) WriteVariant(v *Variant)

WriteVariant writes a single variant
