markduplicates

package
v0.1.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 6, 2021 License: Apache-2.0 Imports: 37 Imported by: 0

Documentation

Overview

Command bio-mark-duplicates marks or removes duplicates from .bam

files.

This tool is meant to replicate the behavior of picard MarkDuplicates
in a more scalable, and efficient way.  The intention is to make
bio-mark-duplicates scale well with machines exceeding 16 cores.

Duplicate Marking Concepts:

At the conceptual level, this tool considers two reads A and B as
duplicates (isDuplicate(A, B)) if their:
  1) reference
  2) unclipped 5' position
  3) read direction (orientation)
are ALL identical.

Two pairs P1 and P2 are considered duplicates of each other, if
isDuplicate(P1.leftRead, P2.leftRead) and isDuplicate(P1.rightRead,
P2.rightRead).  Left vs right is determined by the unclipped 5'
position of each read in the pair.

Mapped pairs vs. Mapped-Unmapped pairs: For some read pairs, both
reads will be mapped (mapped pairs).  For other read pairs, only one
of the reads will be mapped (mapped-unmapped pairs).  A mapped pair
can be a duplicate of another mapped pair, but a mapped pair P1 may
NOT be a duplicate of a mapped-unmapped pair P2 because one read of
P2 will have no alignment position, and thus cannot be equal to one
of the mapped reads of P1.

However, the mapped read of a mapped-unmapped pair can be considered
a duplicate of one read on a mapped pair.  So in this example, P2.left
could be a duplicate of P1.left.  We call P2.left a "mate-unmapped read".

  P1: left(chr1, 1020, F) right(chr1, 1040, R)
  P2: left(chr1, 1020, F) right(chr1, 0, ?)

  P1 is not a duplicate of P2, but P2.left is a duplicate of P1.left.

After identifying the duplicates, this tool will select a primary
pair or read for each set of duplicates.  The primary will be the
duplicate with the highest score based on the sum of its base
qualities.  To break ties, a higher priority is given to reads that
appear earlier in the bam input.

In choosing a primary, pairs are given priority over mate-unmapped
reads.  So if a mate-unmapped read is found to be a duplicate of one
read in a mapped pair, the pair would always be the primary, and the
mate-unmapped read would always be the duplicate.  If two
mate-unmapped reads are duplicates of each other, but not duplicates
of a mapped pair, then the primary is chosen from the two
mate-unmapped reads.

After identifying the primary and the duplicates, this tool can be
configured to mark each read with the duplicate flag 1024, or to
remove each of the duplicate reads.

Tagging:

If the caller specifies the "tag-duplicates" parameter, the tool
will attach auxiliary tags DI, DL, DS, and DT to the output.

DI is the duplicate index of a duplicate set.  All pairs in a
duplicate set, including the primary, will have the same value for
DI; the value for DI is the file index of the left-most read of the
primary duplicate pair.  DI is not set for mate-unmapped reads.

DL is the number of library (LB aka PCR) duplicate pairs in the
duplicate set, including the primary. This is equal to the DS value
minus the number of "SQ" duplicates pairs in the duplicate set.

DS is the number of pairs in the duplicate set.  DS is not set on
mate-unmapped reads, and it also does not count mate-unmapped
duplicates.

DT is set on duplicate pairs (not the primary) and mate-unmapped
reads.  It is set to "SQ" for optical duplicates, and "LB" for all
other duplicates.

Implementation:

The implementation splits the input bam file into non-overlapping
shards, and processes each of those shards in separate workers, and
possibly in parallel with the other shards.  Each worker is
responsible for taking the reads in one shard, and outputting the
same reads in the same order (except for marking or removing
duplicates).

Matching up pairs:

To determine which pairs are duplicates, a worker must read both
reads in a pair to determine each read's 5' position and
orientation.  Each read record contains a pointer to its mate, but
it does not contain the orientation or the unclipped position which
is in the mate's flags and cigar respectively.

Matching up both reads in a pair is somewhat tricky.  For most of
the reads that live in a particular shard, the read's mate will
appear within that shard, and a worker can easily match up those
pairs.  However, some pairs will span from inside the shard to
outside the shard.  Most of these read pairs will have their reads
relatively close together, so the a worker will read an additional
pair-padding distance on either end of the shard for matching up
pairs, but even that is not enough.  Sometimes, a pair spans from
inside the shard to beyond the pair-padding; we'll call these
"distant-pairs".  To deal with distant-pairs, this tool pre-scans
the entire bam input, to identify and save the distant-pairs to
memory so that the workers can lookup the distant pair reads without
needing to seek, decompress, and read them from the bam file.  Note
that the distant-pairs allow all pairs to be resolved, so the
pair-padding is a memory optimization to avoid storing mates that
live in the pair-padding in memory for the duration of the entire
run.  Pair-padding should be chosen so that most mates will fall
into the pair-padding.

Matching up duplicates:

Duplicates are matched up using their 5' positions, and since a 5'
position can differ from the alignment's start position, each shard
needs some fuzziness at its boundaries to ensure all potential
duplicates will be compared against each other.  For example:

         shard1                  shard2
 |---------------------|-------------------------|
                    |cccc|--------------|         read1 (with clipping)
                    5    S              E         5', Begin, End

                    |c|-----------------|         read2 (with clipping)
                    5 S                 E         5', Begin, End

              clip-pad           shard2            clip-pad
           |-----------|-------------------------|-----------|

In this example, read1 and read2 are duplicates according to their 5'
position, but the two reads will reside in different shards based on
their alignment Begin positions.

To handle these overlap cases, each shard has "clip-padding" on each
side of the shard.  When searching for duplicates in each shard, the
worker considers each read that is either in the clip-padding or in
the shard.  So in the example, read1 is in shard2, and read2 is in
shard2's clip-padding, but the shard2 worker still compares the two
reads with each other, decides they are duplicates, and marks one as
a duplicate based on their scores.  Note that the worker for shard1
also compares read1 and read2 and also decides they are duplicates,
and marks one based on their scores.  Since the scoring is
deterministic, shard1 and shard2 agree on which read to mark as
duplicate.

Clip-padding and pair-padding serve different purposes.
Clip-padding is for correctness and must exceed the largest clip
distance in the input file.  Pair-padding is a memory optization.
The complete shard diagram looks like this:

 shard-pad  clip-pad            shard1            clip-pad   shard-pad
|---------|-----------|-------------------------|-----------|---------|

Output ordering:

As the workers complete each shard, they output the shard's marked
reads to an output queue that preserves the original order of the
shards.  A writer reads shards from the queue in order, and writes the
records to the bam file output.  The output queue has a maximum size,
and when full only allows a worker to insert the shard that the writer
currently needs.  This ensures that the queue does not grow too long
when a worker takes a long time to process a particular shard.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2019 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Index

Constants

View Source
const (

	// IlluminaReadName5Fields is the number of columns in a 5 field read name.
	IlluminaReadName5Fields = 5
	// IlluminaReadName5FieldsTileField is 0-based field number that
	// contains the tileName for 5 field read names.
	IlluminaReadName5FieldsTileField = 2

	// IlluminaReadName7Fields is the number of columns in a 5 field read name.
	IlluminaReadName7Fields = 7
	// IlluminaReadName7FieldsTileField is 0-based field number that
	// contains the tileName for 7 field read names.
	IlluminaReadName7FieldsTileField = 4

	// IlluminaReadName8Fields is the number of columns in a 5 field read name.
	IlluminaReadName8Fields = 8
	// IlluminaReadName8FieldsTileField is 0-based field number that
	// contains the tileName for 8 field read names.
	IlluminaReadName8FieldsTileField = 4
)

Variables

This section is empty.

Functions

func ChoosePrimary

func ChoosePrimary(entries []DuplicateEntry) int

func GetLibrary

func GetLibrary(readGroupLibrary map[string]string, record *sam.Record) string

GetLibrary returns the library for the given record's read group. If the library is not defined in readGroupLibrary, returns "Unknown Library".

func NewAux

func NewAux(name string, val interface{}) sam.Aux

func NewRecord

func NewRecord(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, mateRef *sam.Reference, cigar sam.Cigar) *sam.Record

func NewRecordAux

func NewRecordAux(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, mateRef *sam.Reference,
	cigar sam.Cigar, aux sam.Aux) *sam.Record

func NewRecordSeq

func NewRecordSeq(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, mateRef *sam.Reference,
	cigar sam.Cigar, seq, qual string) *sam.Record

func NewTestOutput

func NewTestOutput(dir string, index int, format string) string

NewTestOutput returns different string filename for the different output formats.

func ReadRecords

func ReadRecords(t *testing.T, path string) []*sam.Record

ReadRecords reads the records from path and returns them as a slice, in order.

func RunTestCases

func RunTestCases(t *testing.T, header *sam.Header, cases []TestCase)

func SetupAndMark

func SetupAndMark(ctx context.Context, provider bamprovider.Provider, opts *Opts) error

SetupAndMark does some minimal setup for validating opts, and creating provider and then runs mark().

Types

type BagProcessor

type BagProcessor func([]*IntermediateDuplicateSet) []*IntermediateDuplicateSet

BagProcessor takes the set of bags from a particular shard, and returns the same set of reads, but the reads may now be bagged differently. The new set of bags may contain more bags or fewer bags than the original set of bags.

type BagProcessorFactory

type BagProcessorFactory interface {
	Create() BagProcessor
}

BagProcessorFactory creates BagProcessors. Mark-duplciates creates one BagProcessor per goroutine.

type DuplicateEntry

type DuplicateEntry interface {
	Name() string
	BaseQScore() int
	FileIdx() uint64
}

type IndexedPair

type IndexedPair struct {
	Left  IndexedSingle
	Right IndexedSingle
}

func (IndexedPair) BaseQScore

func (p IndexedPair) BaseQScore() int

func (IndexedPair) FileIdx

func (p IndexedPair) FileIdx() uint64

func (IndexedPair) GetR1R2

func (p IndexedPair) GetR1R2() (r1 *sam.Record, r2 *sam.Record)

func (IndexedPair) Name

func (p IndexedPair) Name() string

type IndexedSingle

type IndexedSingle struct {
	R        *sam.Record
	FileIdx_ uint64
}

func (IndexedSingle) BaseQScore

func (s IndexedSingle) BaseQScore() int

func (IndexedSingle) FileIdx

func (s IndexedSingle) FileIdx() uint64

func (IndexedSingle) Name

func (s IndexedSingle) Name() string

type IntermediateDuplicateSet

type IntermediateDuplicateSet struct {
	Pairs     []DuplicateEntry
	Singles   []DuplicateEntry
	Corrected map[string]string // Maps read name to corrected UMI pair: "GAC+GAG"
}

type MarkDuplicates

type MarkDuplicates struct {
	Provider bamprovider.Provider
	Opts     *Opts
	// contains filtered or unexported fields
}

MarkDuplicates implements duplicate marking.

func (*MarkDuplicates) Mark

func (m *MarkDuplicates) Mark(shards []bam.Shard) (*MetricsCollection, error)

Mark marks the duplicates, and returns metrics, and an error if encountered.

type Metrics

type Metrics struct {

	// UnpairedReads is the number of mapped reads examined which did
	// not have a mapped mate pair, either because the read is
	// unpaired, or the read is paired to an unmapped mate.
	UnpairedReads int

	// ReadPairsExamined is the number of mapped read pairs
	// examined. (Primary, non-supplemental).
	ReadPairsExamined int

	// SecondarySupplementary is the number of reads that were either
	// secondary or supplementary.
	SecondarySupplementary int

	// UnmappedReads is the total number of unmapped reads
	// examined. (Primary, non-supplemental).
	UnmappedReads int

	// UnpairedDups is the number of fragments that were marked as duplicates.
	UnpairedDups int

	// ReadPairDups is the number of read pairs that were marked as duplicates.
	ReadPairDups int

	// ReadPairOpticalDups is the number of read pairs duplicates that
	// were caused by optical duplication. Value is always <
	// READ_PAIR_DUPLICATES, which counts all duplicates regardless of
	// source.
	ReadPairOpticalDups int
}

Metrics contains metrics from mark duplicates.

func (*Metrics) Add

func (m *Metrics) Add(other *Metrics)

Add adds the metrics in other to m.

func (*Metrics) String

func (m *Metrics) String() string

String returns a string representation of the metrics contained in m. The string can be used as metrics file output.

type MetricsCollection

type MetricsCollection struct {

	// OpticalDistance stores the number of duplicate read pairs that
	// have the given Euclidean distance.
	OpticalDistance [][]int64

	// LibraryMetrics contains per-library metrics.
	LibraryMetrics map[string]*Metrics

	// High coverage intervals and read counts.
	HighCoverageIntervals []coverageInterval
	// contains filtered or unexported fields
}

MetricsCollection contains metrics computed by Mark.

func (*MetricsCollection) AddDistance

func (mc *MetricsCollection) AddDistance(bagSize, distance int)

AddDistance increments the histogram counter for the given bagsize and distance.

func (*MetricsCollection) AddHighCovInterval

func (mc *MetricsCollection) AddHighCovInterval(interval coverageInterval)

func (*MetricsCollection) Get

func (mc *MetricsCollection) Get(library string) *Metrics

Get returns Metrics for the given library. If there is no Metrics for library yet, create one and return it.

func (*MetricsCollection) Merge

func (mc *MetricsCollection) Merge(other *MetricsCollection)

Merge per-library and optical distance metrics from other into mc.

type OpticalDetector

type OpticalDetector interface {
	// GetRecordProcessor returns a RecordProcessor that sees every
	// record in the bam input before any calls to Detect happen. The
	// OpticalDetector can use this to calculate statistics that
	// influence optical detection.
	GetRecordProcessor() bampair.RecordProcessor

	// RecordProcessorsDone should return maximum X and Y co-ordinates and
	// be called after the RecordProcessors have seen all the input records.
	RecordProcessorsDone() (int, int)

	// Detect identifies the optical duplicates in pairs and returns
	// their names in a slice. readGroupLibrary maps readGroup to
	// library name. pairs contains all the readpairs in the bag, and
	// bestIndex is an index into pairs that points to the bag's
	// primary readpair.
	Detect(readGroupLibrary map[string]string, pairs []DuplicateEntry, bestIndex int) []string
}

OpticalDetector is a general interface for optical duplicate detection.

type Opts

type Opts struct {
	// Commandline options.
	BamFile                  string
	IndexFile                string
	MetricsFile              string
	HighCoverageIntervalFile string
	TileSizeFile             string
	Format                   string
	CoverageMax              int
	ShardSize                int
	MinBases                 int
	Padding                  int
	DiskMateShards           int
	ScratchDir               string
	Parallelism              int
	QueueLength              int
	ClearExisting            bool
	RemoveDups               bool
	TagDups                  bool
	IntDI                    bool
	UseUmis                  bool
	UmiFile                  string
	ScavengeUmis             int
	EmitUnmodifiedFields     bool
	SeparateSingletons       bool
	OutputPath               string
	StrandSpecific           bool
	OpticalHistogram         string
	OpticalHistogramMax      int
	Seed                     int64

	// Data and operators derived from commandline options.
	BagProcessorFactories []BagProcessorFactory
	OpticalDetector       OpticalDetector
	KnownUmis             []byte
}

Opts for mark-duplicates.

type Orientation

type Orientation uint8

func GetR1R2Orientation

func GetR1R2Orientation(p *IndexedPair) Orientation

GetR1R2Orientation returns an orientation byte containing orientations for both R1 and R2.

type PhysicalLocation

type PhysicalLocation struct {
	Lane       int
	Surface    int
	Swath      int
	Section    int
	TileNumber int
	TileName   int
	X          int
	Y          int
}

PhysicalLocation describes a read's physical location on the flow cell. Lane, Surface, Swatch, Section, and TileNumber together specify which flowcell tile the read was found in. TileName is the 4 or 5 digit representation of the tile, e.g. 1203 means surface 1, swath 2 and tile 3. 12304 means surface 1, swath 2, section 3, and tile 4. X and Y describe the X and Y coordinates of the well within the tile.

func ParseLocation

func ParseLocation(qname string) PhysicalLocation

ParseLocation returns a physical location given an Illumina style read name. The read name must have 5, 7, or 8 fields separated by ':'. When there are 5 or 7 fields, the last three fields are tileName, X and Y. When there are 8 fields, the last four fields are tileName, X, Y, and UMI.

The tileName be formatted as a 4 or 5 digit Illumina tileName. For a description of 4 digit tile numbers, see Appendix B, section Tile Numbering in

http://support.illumina.com.cn/content/dam/illumina-support/documents/documentation/system_documentation/hiseqx/hiseq-x-system-guide-15050091-e.pdf

For a description of 5 digit tile numbers, see Appendix C, section Tile Numbering in

https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/nextseq/nextseq-550-system-guide-15069765-05.pdf

type TestCase

type TestCase struct {
	TRecords []TestRecord
	Opts     Opts
}

type TestRecord

type TestRecord struct {
	R              *sam.Record
	DupFlag        bool
	ExpectedAuxs   []sam.Aux
	UnexpectedTags []sam.Tag
}

type TileOpticalDetector

type TileOpticalDetector struct {
	OpticalDistance int
}

TileOpticalDetector detects optical duplicates with a tile. For two reads to be optical duplicates, their tile, lane, surface, library, and read orientations must be identical

func (*TileOpticalDetector) Detect

func (t *TileOpticalDetector) Detect(readGroupLibrary map[string]string, duplicates []DuplicateEntry, bestIndex int) []string

Detect implements OpticalDetector.

func (*TileOpticalDetector) GetRecordProcessor

func (t *TileOpticalDetector) GetRecordProcessor() bampair.RecordProcessor

GetRecordProcessor implements OpticalDetector.

func (*TileOpticalDetector) RecordProcessorsDone

func (t *TileOpticalDetector) RecordProcessorsDone() (int, int)

RecordProcessorsDone implements OpticalDetector.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL