repeatgenome

package
v0.0.0-...-69c58a0
Published: Nov 17, 2015 License: ISC Imports: 18 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DebugSeq

func DebugSeq()

func Less

func Less(a, b []byte) bool

Returns a bool describing whether the first TextSeq is lexicographically smaller than the second.
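As an illustration of the comparison described above, here is a minimal sketch that assumes a TextSeq is a plain []byte of base characters. The name lessSketch is hypothetical, and the package's own Less may be implemented differently.

```go
package main

import (
	"bytes"
	"fmt"
)

// lessSketch reports whether a sorts before b lexicographically.
// It is a hypothetical stand-in for Less, not the package's actual code.
func lessSketch(a, b []byte) bool {
	return bytes.Compare(a, b) < 0
}

func main() {
	// The sequences first differ at index 2, where 'g' < 't'.
	fmt.Println(lessSketch([]byte("acgt"), []byte("acta")))
	// Equal sequences are not "less than" each other.
	fmt.Println(lessSketch([]byte("acgt"), []byte("acgt")))
}
```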

func TSRevComp

func TSRevComp(seq []byte) []byte

Returns the reverse complement of the supplied TextSeq. This drains memory and should therefore not be used outside of debugging and printing.
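A reverse complement over a text sequence might look like the sketch below. It assumes lowercase a, c, g, t input and maps any other byte to 'n'. Both choices are assumptions of this example, and revCompSketch is a hypothetical name. The fresh slice allocated on every call suggests why this kind of helper is best kept to debugging and printing.

```go
package main

import "fmt"

// revCompSketch returns the reverse complement of a lowercase text sequence.
// Hypothetical stand-in for TSRevComp; unknown bytes become 'n' (an assumption).
func revCompSketch(seq []byte) []byte {
	out := make([]byte, len(seq)) // one new allocation per call
	for i, b := range seq {
		var c byte
		switch b {
		case 'a':
			c = 't'
		case 'c':
			c = 'g'
		case 'g':
			c = 'c'
		case 't':
			c = 'a'
		default:
			c = 'n'
		}
		// Write the complement at the mirrored position.
		out[len(seq)-1-i] = c
	}
	return out
}

func main() {
	fmt.Println(string(revCompSketch([]byte("aacg"))))
}
```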

Types

type Chroms

type Chroms map[string](map[string][]byte)

A 2-dimensional map used to represent a newly-parsed FASTA-formatted reference genome.

type ClassID

type ClassID uint16

A type synonym representing a ClassNode by ID. Used to space-efficiently store a read's classification.

type ClassNode

type ClassNode struct {
	Name     string
	ID       ClassID
	Class    []string
	Parent   *ClassNode
	Children []*ClassNode
	Repeat   *Repeat
}

ClassNode.Name - This ClassNode's fully qualified name, excluding root.

ClassNode.ID - A unique ID starting at 0 that we assign (not included in RepeatMasker output). Root has ID 0.

ClassNode.Class - This ClassNode's name cut on "/". This likely isn't necessary, and may be removed in the future.

ClassNode.Parent - A pointer to this ClassNode's parent in the ancestry tree. It should be nil for root and only for root.

ClassNode.Children - A slice containing pointers to all of this ClassNode's children in the tree.

ClassNode.Repeat - A pointer to this ClassNode's corresponding Repeat, if it has one. This field is of dubious value.

func (*ClassNode) Size

func (classNode *ClassNode) Size() uint64

Returns the sum of the sizes of all repeat instances in the supplied ClassNode's subtree.
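The subtree sum can be pictured as a simple recursion over Children. The sketch below uses a simplified, hypothetical node type in which each node carries the total size of its own repeat instances directly. The real ClassNode reaches that figure through its Repeat field.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for ClassNode.
type node struct {
	name     string
	ownSize  uint64 // total bases in this node's own repeat instances (assumed field)
	children []*node
}

// subtreeSize sums ownSize over n and all of its descendants.
func subtreeSize(n *node) uint64 {
	if n == nil {
		return 0
	}
	total := n.ownSize
	for _, c := range n.children {
		total += subtreeSize(c)
	}
	return total
}

func main() {
	l1 := &node{name: "LINE/L1", ownSize: 250}
	line := &node{name: "LINE", ownSize: 100, children: []*node{l1}}
	sine := &node{name: "SINE", ownSize: 40}
	root := &node{name: "root", children: []*node{line, sine}}
	fmt.Println(subtreeSize(line), subtreeSize(root))
}
```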

type ClassNodes

type ClassNodes []*ClassNode

func (ClassNodes) Write

func (classNodes ClassNodes) Write(filename string) error

type ClassTree

type ClassTree struct {
	ClassNodes map[string](*ClassNode)
	NodesByID  []*ClassNode
	Root       *ClassNode
}

ClassTree.ClassNodes - Maps a fully qualified class name (excluding root) to that class's ClassNode struct, if it exists. This is slower than ClassTree.NodesByID, and should only be used when necessary.

ClassTree.NodesByID - A slice of pointers to all ClassNode structs, indexed by ID. This should be the default means of accessing a ClassNode.

ClassTree.Root - A pointer to the ClassTree's root, which has name "root" and ID 0. We explicitly create this - it isn't present in the RepeatMasker output.

func (*ClassTree) PrintBranches

func (classTree *ClassTree) PrintBranches()

Doesn't print leaves. Prevents the terminal from being flooded with Unknowns, Others, and Simple Repeats.

func (*ClassTree) PrintTree

func (classTree *ClassTree) PrintTree()

type Config

type Config struct {
	Name       string
	Debug      bool
	CPUProfile bool
	MemProfile bool
	WriteLib   bool
	ForceGen   bool
	WriteStats bool
}

A value of type Config is passed to the New() function, which constructs and returns a new RepeatGenome.

type JSONNode

type JSONNode struct {
	Name     string      `json:"name"`
	Size     uint64      `json:"size"`
	Children []*JSONNode `json:"children"`
}

Used only for recursively writing the JSON representation of the ClassTree.
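To show what the struct tags above produce, this sketch marshals a tiny hand-built tree with encoding/json. The JSONNode definition is copied from the listing. The encodeTree helper and the sample values are hypothetical. Note that a nil Children slice is encoded as null rather than an empty array.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// JSONNode mirrors the struct shown in the listing above.
type JSONNode struct {
	Name     string      `json:"name"`
	Size     uint64      `json:"size"`
	Children []*JSONNode `json:"children"`
}

// encodeTree is a hypothetical helper that serializes a tree to a JSON string.
func encodeTree(root *JSONNode) (string, error) {
	out, err := json.Marshal(root)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	root := &JSONNode{
		Name:     "root",
		Size:     3,
		Children: []*JSONNode{{Name: "LINE", Size: 2}},
	}
	s, err := encodeTree(root)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```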

type KRespPair

type KRespPair struct {
	KmerInt uint64
	Repeat  *Repeat
}

type Kmer

type Kmer [10]byte

This is what is stored by the main Kraken data structure, RepeatGenome.Kmers. The first eight bytes are the integer representation of the kmer's sequence (type KmerInt). The last two bytes are the class ID (type ClassID).
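The ten-byte layout can be sketched with encoding/binary as follows. The byte order is an assumption (little-endian is used here purely for illustration), and the lowercase type and method names are hypothetical so they are not mistaken for the package's own accessors.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// kmer is a hypothetical stand-in for Kmer: 8 bytes of sequence, 2 bytes of class ID.
type kmer [10]byte

// setInt stores the kmer's integer form in bytes 0-7 (little-endian is an assumption).
func (k *kmer) setInt(v uint64) { binary.LittleEndian.PutUint64(k[0:8], v) }

// intVal reads the kmer's integer form back from bytes 0-7.
func (k *kmer) intVal() uint64 { return binary.LittleEndian.Uint64(k[0:8]) }

// setClassID stores the class ID in bytes 8-9.
func (k *kmer) setClassID(id uint16) { binary.LittleEndian.PutUint16(k[8:10], id) }

// classID reads the class ID back from bytes 8-9.
func (k *kmer) classID() uint16 { return binary.LittleEndian.Uint16(k[8:10]) }

func main() {
	var k kmer
	k.setInt(27)
	k.setClassID(513)
	fmt.Println(k.intVal(), k.classID(), len(k))
}
```

Packing both values into a fixed-size array keeps each entry at exactly ten bytes, with no padding or pointers, which matters when the slice holds a very large number of kmers.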

func (Kmer) ClassID

func (kmer Kmer) ClassID() ClassID

A more declarative and modifiable accessor function. While it would almost certainly be inlined, this is such a performance-critical operation that this function isn't currently used.

func (Kmer) Int

func (kmer Kmer) Int() uint64

A more declarative and modifiable accessor function. While it would almost certainly be inlined, this is such a performance-critical operation that this function isn't currently used.

func (*Kmer) SetClassID

func (kmer *Kmer) SetClassID(classID ClassID)

func (*Kmer) SetInt

func (kmer *Kmer) SetInt(kmerInt uint64)

type KmerInts

type KmerInts []uint64

A two-bits-per-base sequence of up to 31 bases, with low-order bits occupied first.

00 = 'a'
01 = 'c'
10 = 'g'
11 = 't'

The definition of KmerInt was previously here, but I reverted to uint64 for simplicity.
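One way to read the encoding is sketched below: each base is shifted into the bottom of a uint64, so a short sequence ends up right-aligned in the low-order bits. That interpretation of "low-order bits occupied first" is an assumption, as is the encodeKmer name. Only the two-bit codes come from the table above.

```go
package main

import "fmt"

// encodeKmer packs up to 31 lowercase bases into a uint64, two bits per base.
// Hypothetical sketch: the sequence is right-aligned in the low-order bits.
func encodeKmer(seq []byte) (uint64, error) {
	if len(seq) > 31 {
		return 0, fmt.Errorf("sequence of %d bases exceeds 31", len(seq))
	}
	var v uint64
	for i, b := range seq {
		var code uint64
		switch b {
		case 'a':
			code = 0
		case 'c':
			code = 1
		case 'g':
			code = 2
		case 't':
			code = 3
		default:
			return 0, fmt.Errorf("invalid base %q at index %d", b, i)
		}
		// Make room for the next base, then place its code in the two lowest bits.
		v = v<<2 | code
	}
	return v, nil
}

func main() {
	v, err := encodeKmer([]byte("acgt"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d %08b\n", v, v)
}
```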

type Kmers

type Kmers []Kmer

func (Kmers) Len

func (kmers Kmers) Len() int

func (Kmers) Less

func (kmers Kmers) Less(i, j int) bool

func (Kmers) Swap

func (kmers Kmers) Swap(i, j int)

type MRespPair

type MRespPair struct {
	MinInt uint32
	Repeat *Repeat
}

type MinInts

type MinInts []uint32

A two-bits-per-base sequence of up to 15 bases, with low-order bits occupied first.

The definition of MinInt was previously here, but I reverted to uint32 for simplicity.

func (MinInts) Len

func (minInts MinInts) Len() int

func (MinInts) Less

func (minInts MinInts) Less(i, j int) bool

func (MinInts) Swap

func (minInts MinInts) Swap(i, j int)
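The Len, Less, and Swap methods together satisfy sort.Interface. The sketch below shows the pattern on a hypothetical minInts type with a plain numeric Less. The ordering used by the package's own Less is not stated here, so numeric order is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// minInts is a hypothetical stand-in for MinInts.
type minInts []uint32

func (m minInts) Len() int           { return len(m) }
func (m minInts) Less(i, j int) bool { return m[i] < m[j] } // numeric order is assumed
func (m minInts) Swap(i, j int)      { m[i], m[j] = m[j], m[i] }

// sortedCopy returns a sorted copy of in, leaving the input untouched.
func sortedCopy(in []uint32) []uint32 {
	out := make(minInts, len(in))
	copy(out, in)
	sort.Sort(out)
	return out
}

func main() {
	fmt.Println(sortedCopy([]uint32{42, 7, 19, 7}))
}
```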

type PKmers

type PKmers []*Kmer

func (PKmers) Len

func (pkmers PKmers) Len() int

Needed for sort.Interface.

func (PKmers) Less

func (pkmers PKmers) Less(i, j int) bool

func (PKmers) Swap

func (pkmers PKmers) Swap(i, j int)

type ReadResponse

type ReadResponse struct {
	Seq       []byte
	ClassNode *ClassNode
}

The type sent back from read-classifying goroutines of RepeatGenome.ClassifyReads()

func (ReadResponse) HangingSize

func (readResp ReadResponse) HangingSize() uint64

Returns the number of base pairs from which the supplied read could have originated, assuming that its classification was correct. This is done in terms of Kraken-Q logic, meaning that there is at least one kmer shared between the repeat reference and the read. Therefore, the read must overlap a repeat reference from the classified subtree by at least k bases. This function is used to calculate the probability of correct classification assuming random selection, and the amount to which a classification narrows a read's potential origin.

type ReadSAM

type ReadSAM struct {
	TextSeq  []byte
	SeqName  string
	StartInd uint64
}

func GetReadSAMs

func GetReadSAMs(readsDirPath string) (error, []ReadSAM)

Passes all file names in the dir to parseReadSAMs and returns the concatenated results.

type ReadSAMRepeat

type ReadSAMRepeat struct {
	ReadSAM ReadSAM
	Repeat  *Repeat
}

type ReadSAMResponse

type ReadSAMResponse struct {
	ReadSAM   ReadSAM
	ClassNode *ClassNode
}

type ReducePair

type ReducePair struct {
	LcaPtr *ClassID
	Set    Kmers
}

type Repeat

type Repeat struct {
	ID        uint64
	Name      string
	ClassList []string
	ClassNode *ClassNode
	Instances []*bioutils.Match
}

Repeat.ID - A unique ID that we assign (not included in RepeatMasker output). Because these are assigned in the order in which they are encountered in <genome name>.fa.out, they are not compatible across even different versions of the same reference genome. This may change.

Repeat.Name - The repeat's fully qualified name, excluding root.

Repeat.ClassList - A slice of this Repeat's class ancestry from the top of the tree down, excluding root.

Repeat.ClassNode - A pointer to the ClassNode which corresponds to this repeat.

Repeat.Instances - A slice of pointers to all matches that are instances of this repeat.

func (*Repeat) Print

func (repeat *Repeat) Print()

func (*Repeat) Size

func (repeat *Repeat) Size() uint64

Returns the sum of the sizes of all of a repeat sequence type's instances.

type RepeatGenome

type RepeatGenome struct {
	Name string

	Kmers      Kmers
	MinOffsets []int64
	MinCounts  []uint32
	SortedMins MinInts
	Matches    bioutils.Matches
	ClassTree  ClassTree
	Repeats    Repeats
	RepeatMap  map[string]*Repeat
	// contains filtered or unexported fields
}

RepeatGenome.Name - The name of the reference genome, such as "dm3" or "hg38". This is used to name created directories, and to find directories and files that may be read from, such as a stored Kraken library and reference sequences.

RepeatGenome.chroms - A 2-dimensional map mapping a chromosome name to a map of its sequence names to their sequences (in text form). Actual 2-dimensional mapping is currently impossible because of RepeatMasker's 1-dimensional output.

RepeatGenome.Kmers - A slice of all Kmers, sorted primarily by minimizer and secondarily by lexicographical value.

RepeatGenome.MinOffsets - Maps a minimizer to its offset in the Kmers slice, or -1 if no kmers of this minimizer were stored.

RepeatGenome.MinCounts - Maps a minimizer to the number of stored kmers associated with it.

RepeatGenome.SortedMins - A sorted slice of all minimizers of stored kmers.

RepeatGenome.Matches - All matches, indexed by their assigned IDs.

RepeatGenome.ClassTree - Contains all information used for LCA determination and read classification. It may eventually be collapsed into RepeatGenome, as accessing it is rather verbose.

RepeatGenome.Repeats - A slice of all repeats, indexed by their assigned IDs.

RepeatGenome.RepeatMap - Maps a fully qualified repeat name, excluding root, to its struct.
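The relationship between Kmers, MinOffsets, and MinCounts can be sketched as an index lookup. This example assumes that both slices are indexed directly by the minimizer's integer value and that kmers sharing a minimizer sit contiguously in the kmer slice. The kmersForMin helper and the sample data are hypothetical, and plain uint64 values stand in for Kmer entries.

```go
package main

import "fmt"

// kmersForMin returns the run of kmers stored under minimizer m,
// or nil if none were stored. Hypothetical helper under the assumptions above.
func kmersForMin(kmers []uint64, minOffsets []int64, minCounts []uint32, m uint32) []uint64 {
	if int(m) >= len(minOffsets) || int(m) >= len(minCounts) {
		return nil
	}
	off := minOffsets[m]
	if off < 0 { // -1 marks a minimizer with no stored kmers
		return nil
	}
	return kmers[off : off+int64(minCounts[m])]
}

func main() {
	// Kmers grouped by minimizer: two for minimizer 0, one for 2, three for 3.
	kmers := []uint64{10, 11, 20, 30, 31, 32}
	minOffsets := []int64{0, -1, 2, 3}
	minCounts := []uint32{2, 0, 1, 3}

	fmt.Println(kmersForMin(kmers, minOffsets, minCounts, 3))
	fmt.Println(kmersForMin(kmers, minOffsets, minCounts, 1))
}
```

Because each minimizer's kmers are stored sorted, a caller could then binary-search within the returned run instead of searching the whole slice.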

func New

func New(config Config) (error, *RepeatGenome)

func (*RepeatGenome) AvgPossPercentGenome

func (rg *RepeatGenome) AvgPossPercentGenome(resps []ReadResponse, strict bool) float64

Returns the average percent of the genome a read from the given set could have originated from, assuming their classification was correct. This is used to estimate how much the classification assisted us in locating reads' origins. The more specific and helpful the classifications are, the lower the percentage will be. Uses a cumulative average to prevent overflow.
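The cumulative average mentioned above can be computed incrementally, so that no running sum ever has to hold the total of all values. A minimal sketch follows, with a hypothetical cumulativeAvg function operating on plain float64 inputs rather than ReadResponse values.

```go
package main

import "fmt"

// cumulativeAvg computes the mean of xs one value at a time,
// updating the running average instead of accumulating a sum.
func cumulativeAvg(xs []float64) float64 {
	var avg float64
	for i, x := range xs {
		// Move the average toward x by 1/n of the difference.
		avg += (x - avg) / float64(i+1)
	}
	return avg
}

func main() {
	fmt.Println(cumulativeAvg([]float64{1, 2, 3}))
}
```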

func (*RepeatGenome) GetClassChan

func (rg *RepeatGenome) GetClassChan(reads [][]byte, useLCA bool) chan ReadResponse

Dispatches as many read-classifying goroutines as there are CPUs, giving each a subslice of the slice of reads provided. Each read-classifying goroutine is given a unique response chan. These are then merged into a single response chan, which is the return value. The useLCA parameter determines whether to use Quick or LCA read classification logic.
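The dispatch-and-merge pattern described above might be sketched as follows. This is a simplified illustration, not the package's code: workers send straight to the merged channel instead of each owning a separate response channel, and the response type and classify function are hypothetical placeholders.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// response is a hypothetical stand-in for ReadResponse.
type response struct {
	read  []byte
	class string
}

// classify is a placeholder classifier used only for this sketch.
func classify(read []byte) string {
	if len(read) > 0 && read[0] == 'a' {
		return "classA"
	}
	return "other"
}

// classChan splits reads into roughly one chunk per CPU, classifies each
// chunk in its own goroutine, and returns a channel carrying all results.
func classChan(reads [][]byte) chan response {
	workers := runtime.NumCPU()
	merged := make(chan response)
	var wg sync.WaitGroup

	chunk := (len(reads) + workers - 1) / workers // ceiling division
	for start := 0; start < len(reads); start += chunk {
		end := start + chunk
		if end > len(reads) {
			end = len(reads)
		}
		wg.Add(1)
		go func(sub [][]byte) {
			defer wg.Done()
			for _, r := range sub {
				merged <- response{read: r, class: classify(r)}
			}
		}(reads[start:end])
	}

	// Close the merged channel once every worker has finished.
	go func() {
		wg.Wait()
		close(merged)
	}()
	return merged
}

func main() {
	reads := [][]byte{[]byte("acg"), []byte("tta"), []byte("agg")}
	for resp := range classChan(reads) {
		fmt.Printf("%s -> %s\n", resp.read, resp.class)
	}
}
```

Results arrive in whatever order the goroutines produce them, so a caller that needs the original order has to carry an index with each response.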

func (*RepeatGenome) GetKmerMap

func (rg *RepeatGenome) GetKmerMap() (int, int, map[uint64]*Repeat)

func (*RepeatGenome) GetMatchSpans

func (rg *RepeatGenome) GetMatchSpans() map[string]matchSpans

func (*RepeatGenome) GetMinMap

func (rg *RepeatGenome) GetMinMap() (int, int, map[uint32]*Repeat)

func (*RepeatGenome) GetProcReads

func (rg *RepeatGenome) GetProcReads() (error, [][]byte)

A rather hairy function that classifies all reads in ./<genome-name>-reads/*.proc, if any exist. .proc files are our own creation for ease of parsing and testing. They contain one lowercase read sequence per line, and nothing else. We have a script that will convert FASTQ files to .proc files:

github.com/mmcco/bioinformatics/blob/master/scripts/format-FASTA-reads.py

This is generally really easy to do. However, we will use a FASTQ reader when we get past the initial testing phase. This could be done concurrently, considering how many disk accesses there are.

func (*RepeatGenome) GetReads

func (rg *RepeatGenome) GetReads() (error, [][]byte)

func (*RepeatGenome) KmerClassifyRead

func (rg *RepeatGenome) KmerClassifyRead(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)

func (*RepeatGenome) KmerClassifyReadVerb

func (rg *RepeatGenome) KmerClassifyReadVerb(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)

func (*RepeatGenome) KmersGBSize

func (rg *RepeatGenome) KmersGBSize() float64

Returns the size in gigabytes of the supplied RepeatGenome's Kmers field.

func (*RepeatGenome) LCA_ClassifyReads

func (rg *RepeatGenome) LCA_ClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)

Classifies each read in a slice of reads, stored as type []byte. The read and its classification are returned through responseChan. In the future, the reads may be of type Seq. However, this currently seems to be the fastest way of doing things. This version returns the LCA of all recognized kmers' classifications.
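Finding the lowest common ancestor (LCA) of two classifications can be done with the Parent pointers alone. The sketch below uses a hypothetical, stripped-down node type and assumes both nodes belong to the same tree. It shows the general technique rather than the package's own implementation.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for ClassNode.
type node struct {
	name   string
	parent *node // nil only for the root
}

// depth counts the edges from n up to the root.
func depth(n *node) int {
	d := 0
	for n.parent != nil {
		n = n.parent
		d++
	}
	return d
}

// lca returns the lowest common ancestor of a and b.
// Both nodes are assumed to be in the same tree.
func lca(a, b *node) *node {
	da, db := depth(a), depth(b)
	// Lift the deeper node until both sit at the same depth.
	for da > db {
		a = a.parent
		da--
	}
	for db > da {
		b = b.parent
		db--
	}
	// Climb in lockstep until the paths meet.
	for a != b {
		a = a.parent
		b = b.parent
	}
	return a
}

func main() {
	root := &node{name: "root"}
	line := &node{name: "LINE", parent: root}
	l1 := &node{name: "LINE/L1", parent: line}
	l2 := &node{name: "LINE/L2", parent: line}
	sine := &node{name: "SINE", parent: root}

	fmt.Println(lca(l1, l2).name, lca(l1, sine).name)
}
```

A read whose kmers map to several classes would fold this pairwise operation over all of them, ending at the most specific class that contains every hit.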

func (*RepeatGenome) MinClassifyRead

func (rg *RepeatGenome) MinClassifyRead(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)

func (*RepeatGenome) MinClassifyReadVerb

func (rg *RepeatGenome) MinClassifyReadVerb(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)

func (*RepeatGenome) PercentRepeats

func (rg *RepeatGenome) PercentRepeats() float64

Returns the percent of a RepeatGenome's reference bases that are contained in a repeat instance. It makes the assumption that no base is contained in more than one repeat instance.

func (*RepeatGenome) PercentTrueClassifications

func (rg *RepeatGenome) PercentTrueClassifications(responses []ReadSAMResponse, useStrict bool) float64

Determines whether a read overlaps any repeat instances in the given ClassNode's subtree. If the argument useStrict is true, the read must be entirely contained in a reference repeat instance (classic Kraken logic). Otherwise, the read must overlap a reference repeat instance by at least k bases.
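The two acceptance rules can be illustrated with a small interval check. The sketch assumes half-open [start, end) coordinates on a single reference sequence. The readMatches name and its parameters are hypothetical.

```go
package main

import "fmt"

// readMatches reports whether a read at [readStart, readEnd) is accepted
// against a repeat instance at [repStart, repEnd).
// strict: the read must lie entirely inside the instance.
// otherwise: the two intervals must share at least k bases.
func readMatches(readStart, readEnd, repStart, repEnd, k uint64, strict bool) bool {
	if strict {
		return readStart >= repStart && readEnd <= repEnd
	}
	// Compute the bounds of the intersection.
	lo := readStart
	if repStart > lo {
		lo = repStart
	}
	hi := readEnd
	if repEnd < hi {
		hi = repEnd
	}
	return hi > lo && hi-lo >= k
}

func main() {
	// A 100-base read that overlaps the instance by its last 40 bases.
	fmt.Println(readMatches(60, 160, 120, 400, 31, false))
	fmt.Println(readMatches(60, 160, 120, 400, 31, true))
}
```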

func (*RepeatGenome) PrintChromInfo

func (refGenome *RepeatGenome) PrintChromInfo()

func (*RepeatGenome) QuickClassifyReads

func (rg *RepeatGenome) QuickClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)

Classifies each read in a slice of reads, stored as type []byte. The read and its classification are returned through responseChan. In the future, the reads may be of type Seq. However, this currently seems to be the fastest way of doing things. This version uses the first recognized kmer for classification - the Kraken-Q technique.

func (*RepeatGenome) ReadKraken

func (rg *RepeatGenome) ReadKraken(infile *os.File) error

Has a lot of error handling, but the logic is pretty simple.

func (*RepeatGenome) RepeatIsCorrect

func (rg *RepeatGenome) RepeatIsCorrect(readSAMRepeat ReadSAMRepeat, strict bool) bool

func (*RepeatGenome) RunDebugTests

func (rg *RepeatGenome) RunDebugTests()

func (*RepeatGenome) Size

func (repeatGenome *RepeatGenome) Size() uint64

Returns the total number of bases in a RepeatGenome's reference chromosomes.

func (*RepeatGenome) SplitChromsK

func (rg *RepeatGenome) SplitChromsK() (chan KRespPair, chan KRespPair)

func (*RepeatGenome) SplitChromsM

func (rg *RepeatGenome) SplitChromsM() (chan MRespPair, chan MRespPair)

func (*RepeatGenome) WriteClassJSON

func (rg *RepeatGenome) WriteClassJSON(useCumSize, printLeaves bool) error

Writes a JSON representation of the class tree. Used by the JavaScript visualization, among other things. Currently, each node is associated with a value "size", the number of kmers associated with it. useCumSize determines whether the kmer count is cumulative, counting all kmers in its subtree.

func (*RepeatGenome) WriteKraken

func (rg *RepeatGenome) WriteKraken() error

func (*RepeatGenome) WriteStatData

func (rg *RepeatGenome) WriteStatData() error

type Repeats

type Repeats []*Repeat

func (Repeats) Write

func (repeats Repeats) Write(filename string) error

type ResponsePair

type ResponsePair struct {
	Kmer   Kmer
	MinInt uint32
}

The type returned by RepeatGenome.getMatchKmers(), which processes raw kmers. The LCA contained in the Kmer value is not the Kmer's final LCA, but simply the ClassNode ID of the match this instance of the Kmer came from.

type Seq

type Seq struct {
	Bytes []byte
	Len   uint64
}

Each base is represented by two bits. High-order bits are occupied first. Remember that Seq.Len is the number of bases contained, while len(Seq.Bytes) is the number of bytes necessary to represent them.

func GetSeq

func GetSeq(textSeq []byte) Seq

Converts a TextSeq to the more memory-efficient Seq type. Upper- and lower-case base bytes are currently supported, but stable code should immediately convert to lower-case. The logic works and is sane, but could be altered in the future for brevity and efficiency.

func (Seq) GetBase

func (seq Seq) GetBase(i uint64) uint8

Returns the i-th base of the Seq (zero-indexed).
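A packed two-bit representation with high-order bits filled first can be sketched like this. The a=0, c=1, g=2, t=3 codes are borrowed from the KmerInts description and are an assumption for Seq. The pack and getBase helpers are hypothetical and accept only lowercase input.

```go
package main

import "fmt"

// pack stores four bases per byte, filling each byte from its high-order bits down.
func pack(text []byte) ([]byte, error) {
	packed := make([]byte, (len(text)+3)/4) // round up to whole bytes
	for i, b := range text {
		var code byte
		switch b {
		case 'a':
			code = 0
		case 'c':
			code = 1
		case 'g':
			code = 2
		case 't':
			code = 3
		default:
			return nil, fmt.Errorf("invalid base %q at index %d", b, i)
		}
		// Base i lives in byte i/4; position 0 in a byte uses bits 7-6.
		packed[i/4] |= code << (6 - 2*uint(i%4))
	}
	return packed, nil
}

// getBase extracts the two-bit code of the i-th base (zero-indexed).
func getBase(packed []byte, i uint64) uint8 {
	return (packed[i/4] >> (6 - 2*(i%4))) & 3
}

func main() {
	packed, err := pack([]byte("acgta"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Five bases need two bytes; the base count must be tracked separately.
	fmt.Println(len(packed), getBase(packed, 2), getBase(packed, 3))
}
```

The final byte is only partly used whenever the base count is not a multiple of four, which is why the number of bases is tracked separately from the length of the byte slice.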

func (Seq) Print

func (seq Seq) Print()

func (Seq) Subseq

func (seq Seq) Subseq(a, b uint64) Seq

Returns the subsequence of the supplied Seq from a (inclusive) to b (exclusive), like a slice.

type Seqs

type Seqs []Seq
