irelate

package module
v0.0.0-...-5c4d9d6 Latest Latest
Published: Nov 4, 2021 License: MIT Imports: 12 Imported by: 0

README

irelate.go

Streaming relation (overlap, distance, KNN) testing of (any number of) sorted files of intervals.

Currently supports BED, BAM, GFF, VCF.


Motivation

We want to relate (e.g. by intersection or by distance) sets of intervals. For example, we may want to report the nearest gene to each of a set of ChIP-Seq peaks. BEDTools does this extremely well; irelate is an attempt to provide an API so that users can write their own tools in Go with little effort.

Design

  • data sources must support the Relatable interface (we provide parsers for common formats).
  • a user-defined function returns true if two Relatables are related. Only a small number of interval pairs are sent to be tested; IRelate handles this automatically. We provide CheckRelatedByOverlap to perform overlap testing.
  • i.Related() gives access to all of the related intervals (after they are added internally by IRelate).
  • the "API" is a for loop.
  • a parallel chrom-sweep algorithm is used that avoids problems with chromosome order and parallelizes nicely up to about a dozen CPUs.

Example

(also see main/main.go which is similar to bedtools intersect -sorted -sortout -c)

Print the number of intervals in b that overlap each interval in a:


// CheckRelatedByOverlap returns true if Relatables overlap.
func CheckRelatedByOverlap(a Relatable, b Relatable) bool {
        // note: with distance == 0 this is just overlap.
        return (b.Start() < a.End()) && (b.Chrom() == a.Chrom())
}

// Less determines the ordering of Relatables.
func Less(a Relatable, b Relatable) bool {
    if a.Chrom() != b.Chrom() {
        return a.Chrom() < b.Chrom()
    }
    return a.Start() < b.Start() // || (a.Start() == b.Start() && a.End() < b.End())
}



// a and b are channels that send Relatables.
// (errors are ignored here for brevity.)
a, _ := bix.New("intervals.bed.gz")
b, _ := bix.New("some.vcf.gz")
for interval := range IRelate(CheckRelatedByOverlap, 0, Less, a, b) {
    fmt.Printf("%s\t%d\t%d\t%d\n", interval.Chrom(), interval.Start(), interval.End(), len(interval.Related()))
}

The 2nd argument determines the query set of intervals. Here, only intervals from a (the 0th source) will be sent from IRelate. If it is set to -1, then all intervals from all sources will be sent. After the ordering function, any number of interval streams can be passed to IRelate.

If we only want to count variants with a score above a given threshold (20 here), the loop becomes:

for interval := range IRelate(CheckRelatedByOverlap, 0, Less, a, b) {
    n := 0
    for _, rel := range interval.Related() {
         // type-assert to a *Variant to get its score.
         if int(rel.(*Variant).Score()) > 20 {
             n++
         }
    }
    fmt.Printf("%s\t%d\t%d\t%d\n", interval.Chrom(), interval.Start(), interval.End(), n)
}


Note that any number of interval sources is supported even though the example uses 2. We can see the source of each interval with interval.Source(). That value is set automatically inside irelate.

This is a very simple example, but the point is that since the interface is just a function (as in CheckRelatedByOverlap) and a for loop, it is easy to create custom applications.

For example, here is the function to relate all intervals within 2KB:

// CheckRelatedBy2KB returns true if intervals are within 2KB.
func CheckRelatedBy2KB(a Relatable, b Relatable) bool {
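        distance := uint32(2000)
        // note: with distance == 0 this is just overlap.
        // distance is added to a.End() because b.Start()-distance would
        // wrap around (uint32) whenever b.Start() < distance.
        return (b.Start() < a.End()+distance) && (b.Chrom() == a.Chrom())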
}

Note that we are guaranteed that b.Start() >= a.Start() so the check is quite simple.
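
Because the coordinates are uint32, the side of the comparison on which the distance is applied matters. The following is a minimal, self-contained sketch of that point. The `region` type and the `withinDistance` helper are hypothetical stand-ins for illustration and are not part of irelate.

```go
package main

import "fmt"

// region is a hypothetical stand-in for a Relatable; only the fields
// needed for the distance check are modeled here.
type region struct {
	chrom      string
	start, end uint32
}

// withinDistance reports whether b starts within `distance` bases of the
// end of a on the same chromosome. It assumes b.start >= a.start, as in a
// sorted sweep. The distance is added to a.end rather than subtracted from
// b.start, because b.start-distance wraps around for uint32 values when
// b.start < distance.
func withinDistance(a, b region, distance uint32) bool {
	return a.chrom == b.chrom && b.start < a.end+distance
}

func main() {
	a := region{"chr1", 100, 200}
	near := region{"chr1", 1500, 1600}
	far := region{"chr1", 5000, 5100}

	fmt.Println(withinDistance(a, near, 2000)) // true
	fmt.Println(withinDistance(a, far, 2000))  // false

	// Subtracting instead wraps around: 1500-2000 as a uint32 is a very
	// large number, so the comparison fails for a region that is in range.
	fmt.Println(near.start-2000 < a.end) // false
}
```

The addition can itself overflow only when a.end is within `distance` of the uint32 maximum, which is far beyond typical chromosome lengths.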

Relatable

A key interface in irelate is:

// Relatable provides all the methods for irelate to function.
// See Interval in interval.go for a type that satisfies this interface.
// Related() likely returns and AddRelated() likely appends to a slice of
// relatables. Note that for performance reasons, Relatable should be implemented
// as a pointer to your data-structure (see Interval).
type Relatable interface {
        Chrom() string
        Start() uint32
        End() uint32
        Related() []Relatable // A slice of related Relatable's filled by IRelate
        AddRelated(Relatable) // Adds to the slice of relatables
        SetSource() uint32    // Internally marks the source (file/stream) of the Relatable
}
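
As an illustration of the pointer-receiver advice in the comment above, here is a minimal sketch of a type that stores its related intervals in a slice. The `relatable` interface below is a local, trimmed copy made for this example (the source-tracking method is left out), and `simpleInterval` is a hypothetical name; see Interval in interval.go for the package's own implementation.

```go
package main

import "fmt"

// relatable is a local, trimmed copy of the interface for illustration;
// the source-tracking method is intentionally omitted.
type relatable interface {
	Chrom() string
	Start() uint32
	End() uint32
	Related() []relatable
	AddRelated(relatable)
}

// simpleInterval is a hypothetical minimal implementation.
type simpleInterval struct {
	chrom      string
	start, end uint32
	related    []relatable
}

// Methods use pointer receivers so that AddRelated mutates the original
// value and intervals are passed around without copying.
func (i *simpleInterval) Chrom() string          { return i.chrom }
func (i *simpleInterval) Start() uint32          { return i.start }
func (i *simpleInterval) End() uint32            { return i.end }
func (i *simpleInterval) Related() []relatable   { return i.related }
func (i *simpleInterval) AddRelated(r relatable) { i.related = append(i.related, r) }

// Compile-time check that *simpleInterval satisfies the interface.
var _ relatable = (*simpleInterval)(nil)

func main() {
	gene := &simpleInterval{chrom: "chr1", start: 100, end: 500}
	peak := &simpleInterval{chrom: "chr1", start: 150, end: 180}
	gene.AddRelated(peak)
	fmt.Println(len(gene.Related())) // 1
}
```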

Performance

irelate is quite fast, but use PIRelate for parallel intersection. It is less flexible than irelate, but skips parsing of database intervals for sparse regions in the query. In addition, it has very good (automatic) parallelization.

Documentation

Overview

Streaming relation (overlap, distance, KNN) testing of (any number of) sorted files of intervals.

Constants

const MaxInt32 = int(MaxUint32 >> 1)
const MaxUint32 = ^uint32(0)
const SelfRelations = -2

Set relativeTo to the SelfRelations constant to allow reporting overlaps within a stream.

Functions

func AsQueryable

func AsQueryable(f string) (interfaces.Queryable, error)

func CheckKNN

func CheckKNN(a Relatable, b Relatable) bool

CheckKNN relates an interval to its k-nearest neighbors. The reporting function will have to do some filtering since this is only guaranteed to associate *at least* k neighbors, but it could be returning extra.
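
The filtering step mentioned above could look roughly like the sketch below: sort the related intervals by their distance to the query and keep the first k. The `span` type and the `gap` and `nearestK` helpers are hypothetical names introduced for this example; they are not part of irelate. The sketch assumes all intervals are on the same chromosome and that k >= 0.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical stand-in for a Relatable.
type span struct {
	chrom      string
	start, end uint32
}

// gap returns the number of bases separating a and b, or 0 if they
// overlap or abut. Both are assumed to be on the same chromosome.
func gap(a, b span) uint32 {
	switch {
	case b.start >= a.end:
		return b.start - a.end
	case a.start >= b.end:
		return a.start - b.end
	default:
		return 0
	}
}

// nearestK returns at most k of the related spans, closest first.
// The input slice is copied so the caller's slice is left untouched.
func nearestK(query span, related []span, k int) []span {
	out := append([]span(nil), related...)
	sort.SliceStable(out, func(i, j int) bool {
		return gap(query, out[i]) < gap(query, out[j])
	})
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	query := span{"chr1", 1000, 1100}
	related := []span{
		{"chr1", 100, 200},   // gap 800
		{"chr1", 1050, 1200}, // overlaps, gap 0
		{"chr1", 1500, 1600}, // gap 400
		{"chr1", 5000, 5100}, // gap 3900
	}
	for _, s := range nearestK(query, related, 2) {
		fmt.Println(s.start, s.end)
	}
}
```

Ties at the k-th position are broken by input order here; an application that wants to report all tied neighbors would need to extend the cut-off accordingly.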

func CheckOverlapPrefix

func CheckOverlapPrefix(a Relatable, b Relatable) bool

CheckOverlapPrefix handles chromosomes named like 'chr1' from one org and '1' from another.
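
One way to make chromosome names from the two conventions comparable is to strip a leading "chr" before comparing. The sketch below shows that idea only; `stripChr` and `sameChrom` are hypothetical helpers, and this is an assumption about the general approach rather than a description of how CheckOverlapPrefix is implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// stripChr removes a leading "chr" so that "chr1" and "1" compare equal.
func stripChr(chrom string) string {
	return strings.TrimPrefix(chrom, "chr")
}

// sameChrom reports whether two chromosome names match once any "chr"
// prefix is ignored.
func sameChrom(a, b string) bool {
	return stripChr(a) == stripChr(b)
}

func main() {
	fmt.Println(sameChrom("chr1", "1"))    // true
	fmt.Println(sameChrom("X", "chrX"))    // true
	fmt.Println(sameChrom("chr1", "chr2")) // false
}
```

Naming differences that go beyond the prefix (for example "chrM" versus "MT") are not covered by this simple approach.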

func CheckRelatedByOverlap

func CheckRelatedByOverlap(a Relatable, b Relatable) bool

CheckRelatedByOverlap returns true if Relatables overlap.

func IRelate

func IRelate(checkRelated func(a, b Relatable) bool,
	relativeTo int,
	less func(a, b Relatable) bool,
	streams ...RelatableIterator) RelatableIterator

IRelate provides the basis for flexible overlap/proximity/k-nearest neighbor testing. IRelate receives merged, ordered Relatables via streams and takes a function that checks if they are related (see CheckRelatedByOverlap). It is guaranteed that !Less(b, a) is true (we can't guarantee that Less(a, b) is true since they may have the same start). Once checkRelated returns false, it is assumed that no other `b` Relatables could possibly be related to `a`, and so `a` is sent to the returnQ. streams is a variable number of iterators that send intervals.
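
The sweep described above can be illustrated with a simplified, single-threaded sketch. It takes an already merged and sorted slice rather than streams, keeps a cache of intervals that may still be related to upcoming ones, and emits an interval once the check function returns false for it. The `iv` type, the `overlaps` check and the `sweep` function are hypothetical and written for this example; this is not irelate's actual implementation, which among other things manages a return queue.

```go
package main

import "fmt"

// iv is a hypothetical interval that records its source and relations.
type iv struct {
	chrom      string
	start, end uint32
	source     int
	related    []*iv
}

// overlaps assumes b does not sort before a.
func overlaps(a, b *iv) bool {
	return a.chrom == b.chrom && b.start < a.end
}

// sweep walks intervals that are already merged and sorted by
// (chrom, start). Intervals from relativeTo (or from every source when
// relativeTo is -1) are returned once nothing later can relate to them.
func sweep(merged []*iv, check func(a, b *iv) bool, relativeTo int) []*iv {
	var cache, out []*iv
	emit := func(a *iv) {
		if relativeTo == -1 || a.source == relativeTo {
			out = append(out, a)
		}
	}
	for _, b := range merged {
		keep := cache[:0] // filter the cache in place
		for _, a := range cache {
			if !check(a, b) {
				emit(a) // no later interval can be related to a
				continue
			}
			if a.source != b.source {
				a.related = append(a.related, b)
				b.related = append(b.related, a)
			}
			keep = append(keep, a)
		}
		cache = append(keep, b)
	}
	for _, a := range cache {
		emit(a)
	}
	return out
}

func main() {
	merged := []*iv{
		{chrom: "chr1", start: 100, end: 200, source: 0},
		{chrom: "chr1", start: 150, end: 160, source: 1},
		{chrom: "chr1", start: 180, end: 350, source: 1},
		{chrom: "chr1", start: 300, end: 400, source: 0},
		{chrom: "chr1", start: 500, end: 600, source: 1},
	}
	for _, q := range sweep(merged, overlaps, 0) {
		fmt.Println(q.start, q.end, len(q.related))
	}
}
```

In this sketch intervals are emitted in the order they leave the cache, which is not necessarily sorted order when intervals differ greatly in length.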

func Less

func Less(a Relatable, b Relatable) bool

func LessPrefix

func LessPrefix(a Relatable, b Relatable) bool

LessPrefix gives the chromosome order: 1, 10, 11 ... 19, 2, 20, 21 ...

func NaturalLess

func NaturalLess(str1, str2 string) bool

NaturalLess compares two strings using natural ordering. This means that e.g. "abc2" < "abc12".

Non-digit sequences and numbers are compared separately. The former are compared bytewise, while the latter are compared numerically (except that the number of leading zeros is used as a tie-breaker, so e.g. "2" < "02").

Limitation: only ASCII digits (0-9) are considered.
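
To make the comparison rules concrete, here is a simplified sketch of a natural-order comparison. The `naturalLess` function below is a hypothetical re-implementation written for this example, not the package's NaturalLess. One simplification is an assumption on my part: when two numeric runs have the same value, the leading-zero tie-break is applied immediately, which may differ from the library when the strings continue after the number.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

func isDigit(c byte) bool { return c >= '0' && c <= '9' }

// chunk returns the leading run of s that is all digits or all non-digits.
func chunk(s string) string {
	if s == "" {
		return ""
	}
	d := isDigit(s[0])
	i := 1
	for i < len(s) && isDigit(s[i]) == d {
		i++
	}
	return s[:i]
}

// naturalLess compares digit runs numerically and other runs bytewise.
func naturalLess(a, b string) bool {
	for a != "" && b != "" {
		ca, cb := chunk(a), chunk(b)
		a, b = a[len(ca):], b[len(cb):]
		if ca == cb {
			continue
		}
		if isDigit(ca[0]) && isDigit(cb[0]) {
			// Compare by digit count, then digit by digit, so that
			// arbitrarily long numbers do not overflow an integer type.
			na, nb := strings.TrimLeft(ca, "0"), strings.TrimLeft(cb, "0")
			if len(na) != len(nb) {
				return len(na) < len(nb)
			}
			if na != nb {
				return na < nb
			}
			return len(ca) < len(cb) // fewer leading zeros sorts first
		}
		return ca < cb
	}
	return a == "" && b != "" // a proper prefix sorts first
}

func main() {
	chroms := []string{"10", "2", "1", "X", "20"}

	sort.Strings(chroms)
	fmt.Println(chroms) // [1 10 2 20 X]

	sort.Slice(chroms, func(i, j int) bool { return naturalLess(chroms[i], chroms[j]) })
	fmt.Println(chroms) // [1 2 10 20 X]
}
```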

func NaturalLessPrefix

func NaturalLessPrefix(a Relatable, b Relatable) bool

NaturalLessPrefix gives the chromosome order: 1, 2, 3 ... 9, 10, 11 ...

func PIRelate

func PIRelate(chunk int, maxGap int, qstream interfaces.RelatableIterator, ciExtend bool, fn func(interfaces.Relatable), dbs ...interfaces.Queryable) interfaces.RelatableChannel

PIRelate implements a parallel IRelate.

func RegionToParts

func RegionToParts(region string) (string, int, int, error)

Directories

Path Synopsis
interfaces for genomic relations irelate operates on "Relatable"s which keep a slice of related intervals.
