cablastp

package module
v0.0.0-...-77647fb Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 15, 2015 License: GPL-2.0 Imports: 18 Imported by: 5

README

ABOUT
=====
CaBLASTP is a family of programs for performing compressively-accelerated
protein sequence searches based on the BLASTP family of tools (including 
PSI-BLAST and DELTA-BLAST), as well as a compression tool (cablastp-compress)
for creating searchable, compressed databases based on an input FASTA file.

If you use CaBLASTP, please cite:
Daniels N, Gallant A, Peng J, Cowen L, Baym M, Berger B
(2013) Compressive Genomics for Protein Databases. 
Submitted for publication.

CaBLASTP is licensed under the GNU public license version 2.0. If you would
like to license CaBLASTP in an environment where the GNU public license is
unacceptable (such as inclusion in a non-GPL software package) commercial
CaBLASTP licensing is available through MIT office of Technology Transfer.
Contact bab@mit.edu for more information.
Contact ndaniels@cs.tufts.edu for issues involving the code.


QUICK EXAMPLE
=============
Assuming you have Go and BLAST+ installed, here is a quick example of how to 
perform a compressively accelerated BLASTP search using a compressed database 
that has already been created.

    # Install CaBLASTP
    go get github.com/BurntSushi/cablastp/...

    # Download and extract the database. It is large and could take a while.
    # Make sure to check for a newer version!
    wget http://groups.csail.mit.edu/cb/cablastp/cablastp-nr20121212.tar.gz
    tar zxf cablastp-nr20121212.tar.gz

    # Compressive BLAST search.
    cablastp-search cablastp-nr20121212 query.fasta

There are more examples covering more use cases further down.


INSTALLATION
============
The easiest way to install is to download binaries compiled for your operating
system. No other dependencies are required (sans BLASTP+).
They can be downloaded here: http://groups.csail.mit.edu/cb/cablastp

Compiling from source is also easy; compiling CaBLASTP only requires that git 
and Go are installed. If Go is not already available via your package manager,
it can be installed from source by following the directions here:
http://golang.org/doc/install

Once Go is installed, you'll need to set your GOPATH, which is where CaBLASTP 
(and other Go packages) will be installed. We recommend running

    mkdir $HOME/go

And adding the following to your `~/.profile` or equivalent:

    export GOPATH="$HOME/go"
    export PATH="$PATH:$GOPATH/bin"

Finally, run the following command to download, compile and install CaBLASTP:

    go get github.com/BurntSushi/cablastp/...

The CaBLASTP executables should be installed in `$GOPATH/bin`.

CaBLAST has been tested against Go 1.x.


EXECUTABLES
===========
There are five binary executables in the CaBLASTP suite, also available as 
binaries for users without Go installed. They are: 

    cablastp-compress     Compresses FASTA input files (such as nr.fasta or
                          nr.gz) into a compressed database for quick searching.

    cablastp-decompress   A rarely-needed inverse of cablastp-compress.

    cablastp-search       A compressively accelerated version of BLASTP.

    cablastp-psisearch    A compressively accelerated version of PSI-BLAST.

    cablastp-deltasearch  A compressively accelerated version of DELTA-BLAST.

Every executable can be run with the `--help` flag to get a list of command 
line options.


PREREQUISITES
=============
CaBLASTP boosts BLAST+ protein search, and as such it is not completely
self-contained. It relies on BLAST+.

To use CaBLASTP, you must already have BLAST+ 2.2 or later installed, so that
the BLAST binaries are in your PATH. DELTA-BLAST requires BLAST+ 2.2.26 or 
later and we recommend 2.2.27. DELTA-BLAST also requires an RPS database 
configured per NCBI's instructions.

We provide binaries for Mac OS X (64-bit intel, tested on OS X 10.8.2 and
built with Go 1.0.3) and Linux (64-bit intel/AMD, tested on Linux kernel 3.6.1 
and Go 1.1.1). With Go installed, CaBLASTP should work on Microsoft Windows but 
is untested.

You do not need the Go compiler installed to use the binary distributions of
CaBLASTP.


ADDITIONAL FILES
================
As compression is compute-intensive, we provide an already-compressed database
based on NCBI's NR from December 12, 2012, which we will update thrice yearly.
Since the CaBLASTP compressed database format is actually a directory 
structure, we provide it as a .tar.gz file, so should be unarchived with 
`tar zxf cablastp-nr20121212.tar.gz`.

The result will be a directory, 'cablastp-nr20121212', which contains the 
various files necessary for CaBLASTP to run.

Should you wish to create your own compressed database, you would use the
cablastp-compress binary. The database we provide was created with:

    cablastp-compress --ext-seed-size 0 --match-seq-id-threshold 70 
                      --ext-seq-id-threshold 60 --max-seeds 20 -p 40
                      cablastp-nr20121212 nr.fasta

Several of the command-line arguments are tuning parameters that affect the
run-time performance of compression.

The --max-seeds argument caps the size of the seeds table to, in this case, 20
gigabytes. Compressing large databases can require a great deal of RAM. A
significantly smaller cap will harm compression.

The --ext-seed-size argument allows for larger k-mer seeds without the memory
overhead associated with the larger size, by greedily requiring the additional
residues to be exact matches.

The --match-seq-id-threshold argument sets the sequence identity percentage
required for a match during compression.

The --ext-seq-id-threshold argument sets the sequence identity percentage
required for a single instance of extension during compression.

The -p argument simply sets the number of processor cores used during 
compression, and bears no relevance to the resulting compressed database.

In this case, the input file is `nr.fasta`, and the output name for the
compressed database is `cablastp-nr20121212`.
Note that the compressed database is actually a directory that will be created
by `cablastp-compress`.


USAGE
=====
Run cablastp-compress -help, cablastp-deltasearch -help, cablastp-search -help, 
or cablastp-psisearch -help for detailed help as to command-line arguments.


EXAMPLES
========
To perform a compressively accelerated DELTA-BLAST search, you might do:

    cablastp-deltasearch -rpspath /path/to/cdd_delta
                         /path/to/cablastp_database /path/to/query.fasta

where:

    /path/to/cdd_delta is the local file path to your conserved domain
    database (required for standard delta-blast as well)

    /path/to/cablastp_database is the local file path to your cablastp 
    compressed database (it will be the path to cablastp-nr20121212 if you are 
    using the provided December, 2012 database)

    /path/to/query.fasta is simply the local file path to the FASTA file you 
    wish to use as a query.

To perform a compressively accelerated BLASTP search, you might do:

    cablastp-search /path/to/cablastp_database /path/to/query.fasta

where:

    /path/to/cablastp_database is the local file path to the cablastp 
    compressed database, and

    /path/to/query.fasta is the local file path to the FASTA file you wish to 
    use as a query.

Arguments the user wishes to pass to the underlying BLAST program, such as
adjusting the output format or the E-value threshold, may be passed via the
`--blast-args` flag.

For example, to specify XML output, one might run:

    cablastp-search /path/to/cablastp_database /path/to/query.fasta
                    --blast-args -outfmt 5

Where `-outfmt 5` is, as indicated in the NCBI blastp user guide, the 
command-line argument for XML output.


REPORTING BUGS
==============
If you find any bugs or have any problems using CaBLASTP, please submit a bug
report on our issue tracker:

    https://github.com/BurntSushi/cablastp/issues

Documentation

Index

Constants

View Source
const (
	FileCoarseFasta      = "coarse.fasta"
	FileCoarseFastaIndex = "coarse.fasta.index"
	FileCoarseLinks      = "coarse.links"
	FileCoarsePlainLinks = "coarse.links.plain"
	FileCoarseLinksIndex = "coarse.links.index"
	FileCoarseSeeds      = "coarse.seeds"
	FileCoarsePlainSeeds = "coarse.seeds.plain"
)

Hard-coded file names for different pieces of a cablastp database.

View Source
const (
	FileCompressed = "compressed"
	FileIndex      = "compressed.index"
)
View Source
const (
	FileParams      = "params"
	FileBlastCoarse = "blastdb-coarse"
	FileBlastFine   = "blastdb-fine"
)
View Source
const (
	ModSubstitution = iota
	ModDeletion
	ModInsertion
)

Variables

View Source
var (
	SeedAlphaSize        = len(blosum.Alphabet62)
	SeedAlphaNums        = make([]int, 26)
	ReverseSeedAlphaNums = make([]byte, 26)
)

SeedAlphaNums is a map to assign *valid* amino acid resiudes contiunous values so that base-N arithmetic can be performed on them. (Where N = SeedAlphaSize.) Invalid amino acid resiudes map to -1 and will produce a panic.

View Source
var DefaultDBConf = &DBConf{
	MinMatchLen:         40,
	MatchKmerSize:       4,
	GappedWindowSize:    25,
	UngappedWindowSize:  10,
	ExtSeqIdThreshold:   60,
	MatchSeqIdThreshold: 70,
	MatchExtend:         30,
	MapSeedSize:         6,
	ExtSeedSize:         0,
	LowComplexity:       10,
	SeedLowComplexity:   6,
	SavePlain:           false,
	ReadOnly:            true,
	BlastMakeBlastDB:    "makeblastdb",
	BlastDBSize:         0,
}
View Source
var (
	Verbose = false
)

Functions

func Exec

func Exec(cmd *exec.Cmd) error

Exec runs a command created with 'Command' in the os/exec package, and converts anything reported to stderr to a Go error value.

Note that if the command returns successfully, the error is guaranteed to be nil.

func IsLowComplexity

func IsLowComplexity(residues []byte, offset, window int) bool

IsLowComplexity detects whether the residue at the given offset is in a region of low complexity, where low complexity is defined as a window where every residue is the same (no variation in composition).

func PrintFlagDefaults

func PrintFlagDefaults()

func ReadOriginalSeqs

func ReadOriginalSeqs(
	fileName string,
	ignore []byte,
) (chan ReadOriginalSeq, error)

ReadOriginalSeqs reads a FASTA formatted file and returns a channel that each new sequence is sent to.

func SeqIdentity

func SeqIdentity(seq1, seq2 []byte) int

SeqIdentity computes the sequence identity of two byte slices. The number returned is an integer in the range 0-100, inclusive. SeqIdentity returns zero if the lengths of both seq1 and seq2 are zero.

If the lengths of seq1 and seq2 are not equal, SeqIdentity will panic.

func Vprint

func Vprint(s string)

func Vprintf

func Vprintf(format string, v ...interface{})

func Vprintln

func Vprintln(s string)

Types

type CoarseDB

type CoarseDB struct {
	Seqs  []*CoarseSeq
	Seeds Seeds

	// File pointers to each file in the "coarse" part of a cablastp database.
	FileFasta      *os.File
	FileFastaIndex *os.File
	FileSeeds      *os.File
	FileLinks      *os.File
	FileLinksIndex *os.File
	// contains filtered or unexported fields
}

CoarseDB represents a set of unique sequences that comprise the "coarse" database. Sequences in the coarse database, combined with information in the compressed database, are used to re-create the original sequences.

func (*CoarseDB) Add

func (coarsedb *CoarseDB) Add(oseq []byte) (int, *CoarseSeq)

Add takes an original sequence, converts it to a coarse sequence, and adds it as a new coarse sequence to the coarse database. Seeds are also generated for each K-mer in the sequence. The resulting coarse sequence is returned along with its sequence identifier.

func (*CoarseDB) CoarseSeqGet

func (coarsedb *CoarseDB) CoarseSeqGet(i uint) *CoarseSeq

CoarseSeqGet is a thread-safe way to retrieve a sequence with index `i` from the coarse database.

func (*CoarseDB) Expand

func (coarsedb *CoarseDB) Expand(
	comdb *CompressedDB, id, start, end int) ([]OriginalSeq, error)

Expand will follow all links to compressed sequences for the coarse sequence at index `id` and return a slice of decompressed sequences.

func (*CoarseDB) NumSequences

func (coarsedb *CoarseDB) NumSequences() int

NumRequences returns the number of sequences in the coarse database based on the file size of the coarse database index.

func (*CoarseDB) ReadCoarseSeq

func (coarsedb *CoarseDB) ReadCoarseSeq(id int) (*CoarseSeq, error)

ReadCoarseSeq reads the coarse sequence with identifier 'id' from disk, using the fasta index. (If a coarse sequence has already been read, it is returned from cache to save trips to disk.)

TODO: Note that this does *not* recover links typically found in a coarse sequence, although it probably should to avoid doing it in CoarseDB.Expand.

type CoarseSeq

type CoarseSeq struct {
	Links *LinkToCompressed
	// contains filtered or unexported fields
}

referenceSeq embeds a sequence and serves as a typing mechanism to distguish reference sequences in the compressed database with original sequences from the input FASTA file.

func NewCoarseSeq

func NewCoarseSeq(id int, name string, residues []byte) *CoarseSeq

func NewFastaCoarseSeq

func NewFastaCoarseSeq(id int, s seq.Sequence) *CoarseSeq
func (rseq *CoarseSeq) AddLink(link *LinkToCompressed)

func (CoarseSeq) FastaSeq

func (s CoarseSeq) FastaSeq() seq.Sequence

FastaSeq returns a new seq.Sequence from TuftsBCB/seq.

func (CoarseSeq) Len

func (seq CoarseSeq) Len() int

Len retuns the number of residues in this sequence.

func (*CoarseSeq) NewSubSequence

func (rseq *CoarseSeq) NewSubSequence(start, end uint) *CoarseSeq

func (CoarseSeq) String

func (seq CoarseSeq) String() string

String returns a string (fasta) representation of this sequence. If this sequence is a subsequence, then the range of the subsequence (with respect to the original sequence) is also printed.

type CompressedDB

type CompressedDB struct {
	// File pointers to be used in reading/writing compressed databases.
	File  *os.File
	Index *os.File
	// contains filtered or unexported fields
}

A CompressedDB corresponds to a list of all original sequences compressed by replacing regions of sequences that are redundant with pointers to similar regions in the coarse database. Each pointer includes an offset and an edit script, which allows complete recovery of the original sequence.

N.B. A compressed database doesn't keep an in memory representation of all compressed sequences. In particular, writing to a compressed database always corresponds to writing a compressed sequence to disk. And reading from a compressed database always corresponds to reading a sequence from disk (unless it has been cached in 'seqCache').

func (*CompressedDB) NumSequences

func (comdb *CompressedDB) NumSequences() int

NumSequences returns the number of sequences in the compressed database using the file size of the index.

func (*CompressedDB) ReadNextSeq

func (comdb *CompressedDB) ReadNextSeq(
	coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)

func (*CompressedDB) ReadSeq

func (comdb *CompressedDB) ReadSeq(
	coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)

func (*CompressedDB) SeqGet

func (comdb *CompressedDB) SeqGet(
	coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)

SeqGet reads a sequence from the compressed database, and decompressed it using the coarse database provided. The decompressed sequence is then added to cache.

If the sequence has already been decompressed, the decompressed sequence from cache is returned.

SeqGet will panic if it is called while a compressed database is open for writing.

func (*CompressedDB) Write

func (comdb *CompressedDB) Write(cseq CompressedSeq)

Write queues a new compressed sequence to be written to disk.

type CompressedSeq

type CompressedSeq struct {
	// A sequence number.
	Id int

	// Name is an uncompressed string from the original FASTA header.
	Name string

	// Links is an ordered lists of links to portions of the reference
	// database. When all links are followed, the concatenation of each
	// sequence corresponding to each link equals the entire original sequence.
	Links []LinkToCoarse
}

CompressedSeq corresponds to the components of a compressed sequence.

func NewCompressedSeq

func NewCompressedSeq(id int, name string) CompressedSeq

NewCompressedSeq creates a CompressedSeq value using the name provided. The Link slice is initialized but empty.

func (*CompressedSeq) Add

func (cseq *CompressedSeq) Add(link LinkToCoarse)

Add will add a LinkToCoarse to the end of the CompressedSeq's Links list.

func (CompressedSeq) Decompress

func (cseq CompressedSeq) Decompress(coarse *CoarseDB) (OriginalSeq, error)

Decompress decompresses a particular compressed sequence using the given coarse sequence. Namely, all of the links are followed and all of the edit scripts are "applied" to recover the original sequence.

func (CompressedSeq) String

func (cseq CompressedSeq) String() string

type DB

type DB struct {
	// An embedded configuration.
	*DBConf

	// The path to the directory on disk.
	Path string

	// The name of the database. This corresponds to the basename of the path.
	Name string

	// The compressed database component.
	ComDB *CompressedDB

	// The coarse database component.
	CoarseDB *CoarseDB
	// contains filtered or unexported fields
}

A DB represents a cablastp database, which has three main components: a coarse database, a compressed database and a configuration file.

A DB can be opened either for writing/appending (compression) or for reading (decompression).

func NewReadDB

func NewReadDB(dir string) (*DB, error)

NewReadDB opens a cablastp database for reading. An error is returned if there is a problem accessing any of the files on disk.

Also, if the 'makeblastdb' or 'blastp' executales are not found, then an error is returned.

func NewWriteDB

func NewWriteDB(appnd bool, conf *DBConf, dir string) (*DB, error)

NewWriteDB creates a new cablastp database, and prepares it for writing (or opens an existing database and prepares it for appending if 'appnd' is set).

An error is returned if there is a problem accessing any of the files in the database.

It is an error to open a database for writing that already exists if 'appnd' is not set.

'conf' should be a database configuration, typically defined (initially) from command line parameters. Note that if 'appnd' is set, then the configuration will be read from disk---only options explicitly set via the command line will be overwritten.

func (*DB) ReadClose

func (db *DB) ReadClose()

ReadClose closes all appropriate files after reading from a database.

func (*DB) Save

func (db *DB) Save() error

Save will write the contents of the database to disk. This should be called after compression is complete.

After the database is saved, a blastp database is created from the coarse database.

N.B. The compressed database is written as each sequence is processed, so this call will only save the coarse database. This may take a *very* long time if the database is not read only (since the seeds table has to be written).

func (*DB) WriteClose

func (db *DB) WriteClose()

WriteClose closes all appropriate files after writing to a database.

type DBConf

type DBConf struct {
	MinMatchLen         int
	MatchKmerSize       int
	GappedWindowSize    int
	UngappedWindowSize  int
	ExtSeqIdThreshold   int
	MatchSeqIdThreshold int
	MatchExtend         int
	MapSeedSize         int
	ExtSeedSize         int
	LowComplexity       int
	SeedLowComplexity   int
	SavePlain           bool
	ReadOnly            bool
	BlastMakeBlastDB    string
	BlastDBSize         uint64
}

func LoadDBConf

func LoadDBConf(r io.Reader) (conf *DBConf, err error)

func (*DBConf) FlagMerge

func (flagConf *DBConf) FlagMerge(fileConf *DBConf) (*DBConf, error)

func (DBConf) Write

func (dbConf DBConf) Write(w io.Writer) error

type EditScript

type EditScript struct {
	// contains filtered or unexported fields
}

func NewEditScript

func NewEditScript(alignment [2][]byte) *EditScript

func NewEditScriptParse

func NewEditScriptParse(editScript string) (*EditScript, error)

func (*EditScript) Apply

func (diff *EditScript) Apply(fromSeq []byte) []byte

func (*EditScript) String

func (diff *EditScript) String() string

type LinkToCoarse

type LinkToCoarse struct {
	// Diff, when "applied" to the porition of the reference sequence indicated
	// by this link, will yield the original sequence corresponding to this
	// link precisely. If Diff is empty, then the subsequence of the reference
	// sequence indicated here is equivalent to the corresponding piece of
	// the original sequence.
	Diff                   string
	CoarseSeqId            uint
	CoarseStart, CoarseEnd uint16
}

LinkToCoarse represents a component of a compressed original sequence that allows perfect reconstruction (i.e., decompression) of the original sequence.

func NewLinkToCoarse

func NewLinkToCoarse(coarseSeqId, coarseStart, coarseEnd uint,
	alignment [2][]byte) LinkToCoarse

func NewLinkToCoarseNoDiff

func NewLinkToCoarseNoDiff(
	coarseSeqId, coarseStart, coarseEnd uint) LinkToCoarse

func (LinkToCoarse) String

func (lk LinkToCoarse) String() string

type LinkToCompressed

type LinkToCompressed struct {
	OrgSeqId               uint32
	CoarseStart, CoarseEnd uint16
	Next                   *LinkToCompressed
}

LinkToCompressed represents a link from a reference sequence to a compressed original sequence. It serves as a bridge from a BLAST hit in the coarse database to the corresponding original sequence that is redundant to the specified residue range in the reference sequence.

func NewLinkToCompressed

func NewLinkToCompressed(
	orgSeqId uint32, coarseStart, coarseEnd uint16) *LinkToCompressed

func (LinkToCompressed) String

func (lk LinkToCompressed) String() string

type OriginalSeq

type OriginalSeq struct {
	// contains filtered or unexported fields
}

OriginalSeq embeds a sequence and serves as a typing mechanism to distguish reference sequences in the compressed database with original sequences from the input FASTA file.

func NewFastaOriginalSeq

func NewFastaOriginalSeq(id int, s seq.Sequence) *OriginalSeq

func NewOriginalSeq

func NewOriginalSeq(id int, name string, residues []byte) *OriginalSeq

func (OriginalSeq) FastaSeq

func (s OriginalSeq) FastaSeq() seq.Sequence

FastaSeq returns a new seq.Sequence from TuftsBCB/seq.

func (OriginalSeq) Len

func (seq OriginalSeq) Len() int

Len retuns the number of residues in this sequence.

func (*OriginalSeq) NewSubSequence

func (oseq *OriginalSeq) NewSubSequence(start, end uint) *OriginalSeq

func (OriginalSeq) String

func (seq OriginalSeq) String() string

String returns a string (fasta) representation of this sequence. If this sequence is a subsequence, then the range of the subsequence (with respect to the original sequence) is also printed.

type ReadOriginalSeq

type ReadOriginalSeq struct {
	Seq *OriginalSeq
	Err error
}

ReadOriginalSeq is the value sent over `chan ReadOriginalSeq` when a new sequence is read from a fasta file.

type SeedLoc

type SeedLoc struct {
	// Index into the coarse database sequence slice.
	SeqInd uint32

	// Index into the coarse sequence corresponding to `SeqInd`.
	ResInd uint16

	Next *SeedLoc
}

SeedLoc represents the information required to translate a seed to a slice of residues from the coarse database. Namely, the index of the sequence in the coarse database and the index of the residue where the seed starts in that sequence.

Every SeedLoc also contains a pointer to the next seed location. This design was chosen so that each SeedLoc is independently allocated (as opposed to using a slice, which incurs a lot of allocation overhead when expanding the slice, and has the potential for pinning memory).

func NewSeedLoc

func NewSeedLoc(seqInd uint32, resInd uint16) *SeedLoc

type Seeds

type Seeds struct {
	// Table of lists of seed locations. Its length is always equivalent
	// to (SeedAlphaSize)^(SeedSize).
	Locs []*SeedLoc

	SeedSize int
	// contains filtered or unexported fields
}

Seeds is a list of lists of seed locations. The index into the seeds table corresponds to a hash of particular K-mer. The list found at each row in the seed table corresponds to all locations in the coarse database in which the K-mer occurs.

func NewSeeds

func NewSeeds(seedSize, lowComplexityWindow int) Seeds

NewSeeds creates a new table of seed location lists. The table is initialized with enough memory to hold lists for all possible K-mers. Namely, the length of seeds is equivalent to 20^(K) where 20 is the number of amino acids (size of alphabet) and K is equivalent to the length of each K-mer.

func (*Seeds) Add

func (ss *Seeds) Add(coarseSeqIndex int, corSeq *CoarseSeq)

Add will create seed locations for all K-mers in corSeq and add them to the seeds table.

func (Seeds) Lookup

func (ss Seeds) Lookup(kmer []byte, mem *[][2]uint) [][2]uint

Lookup returns a list of all seed locations corresponding to a particular K-mer.

`mem` is a pointer to a slice of seed locations, where a seed location is a tuple of (sequence index, residue index). `mem` is used to prevent unnecessary allocation. A pointer to thise slice is returned.

func (*Seeds) MaybeWipe

func (ss *Seeds) MaybeWipe(seedTableSizeGB float64)

MaybeWipe completely wipes the seeds table if the memory of the seeds table exceeds seedTableSizeGB (which is the number of gigabytes).

func (Seeds) NumSeeds

func (ss Seeds) NumSeeds() int64

NumSeeds computes the number of seeds currently in the seeds table. Since the seeds table is typically big, this is an expensive operation.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL