biosimd

package
v0.0.0-...-d966d87
Published: Aug 18, 2020 License: Apache-2.0 Imports: 5 Imported by: 1

Documentation

Overview

Package biosimd provides access to SIMD-based implementations of several common .bam/.fa/etc.-specific operations on byte arrays which the compiler cannot be trusted to autovectorize within the next several years.

See base/simd/doc.go for more comments on the overall design.

Constants

const BytesPerWord = simd.BytesPerWord

BytesPerWord is the number of bytes in a machine word.

const Log2BytesPerWord = simd.Log2BytesPerWord

Log2BytesPerWord is log2(BytesPerWord). This is relevant for manual bit-shifting when we know that's a safe way to divide and the compiler does not (e.g. dividend is of signed int type).
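
As a small illustration of why the shift is only a safe substitute for division when the dividend is known to be non-negative, consider the sketch below. The constant values assume a 64-bit platform (8-byte words), and the names are local stand-ins rather than the package's own constants.

```go
package main

import "fmt"

// Assumed values for a 64-bit platform; stand-ins for the package constants.
const (
	bytesPerWord     = 8
	log2BytesPerWord = 3
)

// wordCount returns n / bytesPerWord using a shift.
// This is only equivalent to division when n >= 0.
func wordCount(n int) int {
	return n >> log2BytesPerWord
}

func main() {
	fmt.Println(wordCount(40), 40/bytesPerWord) // prints "5 5"

	// For a negative signed value the two disagree: the shift rounds
	// toward negative infinity, while / truncates toward zero.
	n := -1
	fmt.Println(n>>log2BytesPerWord, n/bytesPerWord) // prints "-1 0"
}
```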

Variables

var (
	// SeqASCIITable maps 4-bit seq[] values to their ASCII representations.
	// It's a common argument for UnpackAndReplaceSeq().
	SeqASCIITable = MakeNibbleLookupTable([16]byte{'=', 'A', 'C', 'M', 'G', 'R', 'S', 'V', 'T', 'W', 'Y', 'H', 'K', 'D', 'B', 'N'})
)

Functions

func ASCIITo2bit

func ASCIITo2bit(dst, src []byte)

ASCIITo2bit sets the bytes in dst[] as follows:

if pos is congruent to 0 mod 4, little-endian bits 0-1 of dst[pos / 4] :=
  0 if src[pos] == 'A'/'a'
  1 if src[pos] == 'C'/'c'
  2 if src[pos] == 'G'/'g'
  3 if src[pos] == 'T'/'t'
similarly, if pos is congruent to 1 mod 4, src[pos] controls bits 2-3 of
dst[pos / 4], etc.
trailing high bits of the last byte are set to zero.

It panics if len(dst) != (len(src) + 3) / 4.

WARNING: This does not verify that all input characters are in {'A', 'C', 'G', 'T', 'a', 'c', 'g', 't'}. Results are arbitrary if any input characters are invalid, though the function is still memory-safe in that event.
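
The bit layout described above can be sketched with a plain scalar loop. This is a hypothetical reference (`asciiTo2bitRef` is not part of the package) that makes no attempt at SIMD and, like the documented function, assumes every input byte is one of ACGTacgt.

```go
package main

import "fmt"

// asciiTo2bitRef is a hypothetical scalar reference for the mapping above.
// It assumes every byte of src is one of ACGTacgt.
func asciiTo2bitRef(dst, src []byte) {
	if len(dst) != (len(src)+3)/4 {
		panic("asciiTo2bitRef: len(dst) != (len(src) + 3) / 4")
	}
	for i := range dst {
		dst[i] = 0 // unused high bits of the last byte stay zero
	}
	for pos, c := range src {
		var code byte
		switch c {
		case 'A', 'a':
			code = 0
		case 'C', 'c':
			code = 1
		case 'G', 'g':
			code = 2
		case 'T', 't':
			code = 3
		}
		// Base pos occupies bits 2*(pos%4) and 2*(pos%4)+1 of dst[pos/4].
		dst[pos/4] |= code << (2 * uint(pos%4))
	}
}

func main() {
	src := []byte("ACGTt")
	dst := make([]byte, (len(src)+3)/4)
	asciiTo2bitRef(dst, src)
	fmt.Printf("%08b %08b\n", dst[0], dst[1]) // prints "11100100 00000011"
}
```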

func ASCIIToSeq8

func ASCIIToSeq8(dst, src []byte)

ASCIIToSeq8 sets dst[pos] as follows:

src[pos] == 'A'/'a': dst[pos] == 1
src[pos] == 'C'/'c': dst[pos] == 2
src[pos] == 'G'/'g': dst[pos] == 4
src[pos] == 'T'/'t': dst[pos] == 8
src[pos] == anything else: dst[pos] == 15

It panics if len(dst) != len(src).
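
A scalar sketch of this mapping, using a 256-entry lookup table, might look as follows. The names `seq8Table` and `asciiToSeq8Ref` are hypothetical and only illustrate the documented behaviour.

```go
package main

import "fmt"

// seq8Table is a hypothetical table mapping each ASCII byte to its
// .bam seq code: A=1, C=2, G=4, T=8, everything else 15.
var seq8Table = func() (t [256]byte) {
	for i := range t {
		t[i] = 15
	}
	t['A'], t['a'] = 1, 1
	t['C'], t['c'] = 2, 2
	t['G'], t['g'] = 4, 4
	t['T'], t['t'] = 8, 8
	return t
}()

// asciiToSeq8Ref is a scalar reference for the mapping described above.
func asciiToSeq8Ref(dst, src []byte) {
	if len(dst) != len(src) {
		panic("asciiToSeq8Ref: len(dst) != len(src)")
	}
	for i, c := range src {
		dst[i] = seq8Table[c]
	}
}

func main() {
	src := []byte("AcGtN-")
	dst := make([]byte, len(src))
	asciiToSeq8Ref(dst, src)
	fmt.Println(dst) // prints "[1 2 4 8 15 15]"
}
```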

func ASCIIToSeq8Inplace

func ASCIIToSeq8Inplace(main []byte)

ASCIIToSeq8Inplace converts the characters of main[pos] as follows:

'A'/'a' -> 1
'C'/'c' -> 2
'G'/'g' -> 4
'T'/'t' -> 8
anything else -> 15

func CleanASCIISeqInplace

func CleanASCIISeqInplace(ascii8 []byte)

CleanASCIISeqInplace capitalizes 'a'/'c'/'g'/'t', and replaces everything non-ACGT with 'N'.
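
The intended result can be checked against a simple scalar version such as the one below. `cleanASCIISeqRef` is a hypothetical helper for illustration, not the package's implementation.

```go
package main

import "fmt"

// cleanASCIISeqRef is a scalar reference: it upper-cases a/c/g/t and
// replaces every byte that is not ACGT (after capitalization) with 'N'.
func cleanASCIISeqRef(ascii8 []byte) {
	for i, c := range ascii8 {
		switch c {
		case 'A', 'C', 'G', 'T':
			// already clean
		case 'a', 'c', 'g', 't':
			ascii8[i] = c - ('a' - 'A')
		default:
			ascii8[i] = 'N'
		}
	}
}

func main() {
	s := []byte("acGTnX-")
	cleanASCIISeqRef(s)
	fmt.Println(string(s)) // prints "ACGTNNN"
}
```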

func CleanASCIISeqNoCapitalizeInplace

func CleanASCIISeqNoCapitalizeInplace(ascii8 []byte)

CleanASCIISeqNoCapitalizeInplace replaces everything non-ACGTacgt with 'N'.

func FillFastqRecordBodyFromNibbles

func FillFastqRecordBodyFromNibbles(dst, src []byte, nBase int, baseTablePtr, qualTablePtr *NibbleLookupTable)

FillFastqRecordBodyFromNibbles fills the body (defined as the last three lines) of a 4-line FASTQ record, given a packed 4-bit representation of the base+qual information and the decoding tables. (Windows line-breaks are not supported.)

  • len(dst) must be at least 2 * nBase + 4, but it's allowed to be larger.
  • len(src) must be at least (nBase + 1) >> 1, but it's allowed to be larger.
  • This is designed for read-length >= 32. It still produces the correct result for smaller lengths, but there is a fairly simple faster algorithm (using a pair of 256-element uint16 lookup tables and encoding/binary's binary.LittleEndian.PutUint16() function) for that case, which is being omitted for now due to irrelevance for our current use cases.

func IsNonACGTNPresent

func IsNonACGTNPresent(ascii8 []byte) bool

IsNonACGTNPresent returns true iff there is a non-capital-ACGTN character in the slice.

func IsNonACGTPresent

func IsNonACGTPresent(ascii8 []byte) bool

IsNonACGTPresent returns true iff there is a non-capital-ACGT character in the slice.
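
Note that "non-capital-ACGT" means lowercase bases also count as a hit. A scalar sketch of the check (with the hypothetical name `isNonACGTPresentRef`) makes that explicit:

```go
package main

import "fmt"

// isNonACGTPresentRef reports whether any byte is something other than
// an upper-case 'A', 'C', 'G', or 'T'.
func isNonACGTPresentRef(ascii8 []byte) bool {
	for _, c := range ascii8 {
		switch c {
		case 'A', 'C', 'G', 'T':
			continue // continues the enclosing for loop
		}
		return true
	}
	return false
}

func main() {
	fmt.Println(isNonACGTPresentRef([]byte("GATTACA"))) // prints "false"
	fmt.Println(isNonACGTPresentRef([]byte("GATtACA"))) // prints "true"
	fmt.Println(isNonACGTPresentRef([]byte("GATNACA"))) // prints "true"
}
```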

func PackSeq

func PackSeq(dst, src []byte)

PackSeq sets the bytes in dst[] as follows:

if pos is even, high 4 bits of dst[pos / 2] := src[pos]
if pos is odd, low 4 bits of dst[pos / 2] := src[pos]
if len(src) is odd, the low 4 bits of dst[len(src) / 2] are zero

It panics if len(dst) != (len(src) + 1) / 2.

This is the inverse of UnpackSeq().

WARNING: Actual values in dst[] bytes may be garbage if any src[] bytes are greater than 15; this function only guarantees that no buffer overflow will occur.
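
For reference, the packing rule can be written as a scalar loop. This sketch (`packSeqRef` is a hypothetical name) assumes every src[] byte is below 16, matching the warning above.

```go
package main

import "fmt"

// packSeqRef is a scalar reference for the packing rule above.
// It assumes every element of src is less than 16.
func packSeqRef(dst, src []byte) {
	if len(dst) != (len(src)+1)/2 {
		panic("packSeqRef: len(dst) != (len(src) + 1) / 2")
	}
	for i := range dst {
		b := src[2*i] << 4 // even position goes to the high nibble
		if 2*i+1 < len(src) {
			b |= src[2*i+1] // odd position goes to the low nibble
		}
		dst[i] = b
	}
}

func main() {
	src := []byte{1, 2, 4, 8, 15} // A, C, G, T, N as .bam seq codes
	dst := make([]byte, (len(src)+1)/2)
	packSeqRef(dst, src)
	fmt.Printf("% x\n", dst) // prints "12 48 f0"
}
```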

func PackSeqUnsafe

func PackSeqUnsafe(dst, src []byte)

PackSeqUnsafe sets the bytes in dst[] as follows:

if pos is even, high 4 bits of dst[pos / 2] := src[pos]
if pos is odd, low 4 bits of dst[pos / 2] := src[pos]
if len(src) is odd, the low 4 bits of dst[len(src) / 2] are zero

This is the inverse of UnpackSeqUnsafe().

WARNING: This is a function designed to be used in inner loops, which makes assumptions about length and capacity which aren't checked at runtime. Use the safe version of this function when that's a problem. Assumptions #3-4 are always satisfied when the last potentially-size-increasing operation on src[] is simd.{Re}makeUnsafe(), ResizeUnsafe(), or XcapUnsafe(), and the same is true for dst[].

1. len(dst) == (len(src) + 1) / 2.

2. All elements of src[] are less than 16.

3. Capacity of src is at least RoundUpPow2(len(src) + 1, bytesPerVec), and the same is true for dst.

4. The caller does not care if a few bytes past the end of dst[] are changed.
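
To make the capacity requirement in assumption #3 concrete, here is a sketch of the arithmetic. It assumes that RoundUpPow2(x, a) rounds x up to the next multiple of the power of two a, and that bytesPerVec is 16 (128-bit vectors); both are assumptions for illustration, and `roundUpPow2` and `makePadded` are hypothetical local helpers. In real code, the simd allocation functions named in the warning are the intended way to obtain such slices.

```go
package main

import "fmt"

// bytesPerVec is assumed to be 16 here (128-bit vectors).
const bytesPerVec = 16

// roundUpPow2 rounds val up to the next multiple of alignment,
// which must be a power of two. Hypothetical stand-in.
func roundUpPow2(val, alignment int) int {
	return (val + alignment - 1) &^ (alignment - 1)
}

// makePadded returns a zeroed slice of length n whose capacity is at
// least roundUpPow2(n+1, bytesPerVec).
func makePadded(n int) []byte {
	return make([]byte, n, roundUpPow2(n+1, bytesPerVec))
}

func main() {
	for _, n := range []int{0, 15, 16, 100} {
		b := makePadded(n)
		fmt.Println(len(b), cap(b))
	}
}
```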

func PackedSeqCount

func PackedSeqCount(seq4 []byte, tablePtr *NibbleLookupTable, startPos, endPos int) int

PackedSeqCount counts the .bam base codes at positions startPos..(endPos - 1) of seq4 that belong to the given set, where seq4 is in .bam packed 4-bit big-endian format.

The set must be represented as table[x] == 1 when code x is in the set, and table[x] == 0 when code x isn't.

WARNING: This function does not validate the table, startPos, or endPos. It may crash or return a garbage result on invalid input. (However, it won't corrupt memory.)
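
The set representation is easiest to see in a scalar sketch. Here a plain [16]byte stands in for NibbleLookupTable, and `packedSeqCountRef` is a hypothetical name; "big-endian" is taken to mean that even positions live in the high nibble, consistent with PackSeq above.

```go
package main

import "fmt"

// packedSeqCountRef counts positions in [startPos, endPos) whose 4-bit
// code x has table[x] == 1. A [16]byte stands in for NibbleLookupTable.
func packedSeqCountRef(seq4 []byte, table *[16]byte, startPos, endPos int) int {
	n := 0
	for pos := startPos; pos < endPos; pos++ {
		b := seq4[pos/2]
		code := b & 15 // odd position: low nibble
		if pos%2 == 0 {
			code = b >> 4 // even position: high nibble
		}
		n += int(table[code])
	}
	return n
}

func main() {
	// Codes 1, 2, 4, 8, 15, 1 (A, C, G, T, N, A).
	seq4 := []byte{0x12, 0x48, 0xF1}

	// Set containing C (code 2) and G (code 4).
	var gc [16]byte
	gc[2], gc[4] = 1, 1

	fmt.Println(packedSeqCountRef(seq4, &gc, 0, 6)) // prints "2"
	fmt.Println(packedSeqCountRef(seq4, &gc, 3, 6)) // prints "0"
}
```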

func PackedSeqCountTwo

func PackedSeqCountTwo(seq4 []byte, table1Ptr, table2Ptr *NibbleLookupTable, startPos, endPos int) (int, int)

PackedSeqCountTwo counts, for each of the two given sets, the .bam base codes at positions startPos..(endPos - 1) of seq4 that belong to that set, where seq4 is in .bam packed 4-bit big-endian format.

The sets must be represented as table[x] == 1 when code x is in the set, and table[x] == 0 when code x isn't.

WARNING: This function does not validate the tables, startPos, or endPos. It may crash or return garbage results on invalid input. (However, it won't corrupt memory.)

func ReverseComp2

func ReverseComp2(dst, src []byte)

ReverseComp2 saves the reverse-complement of src[] to dst[], assuming that they're encoded with one byte per base, ACGT=0123. It panics if len(dst) != len(src).
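
With the ACGT=0123 encoding, complementing a base is simply 3 - x, so a scalar reference is short. `reverseComp2Ref` is a hypothetical name used only for this sketch.

```go
package main

import "fmt"

// reverseComp2Ref writes the reverse-complement of src to dst, with
// one byte per base and A=0, C=1, G=2, T=3 (so complement(x) = 3 - x).
func reverseComp2Ref(dst, src []byte) {
	if len(dst) != len(src) {
		panic("reverseComp2Ref: len(dst) != len(src)")
	}
	n := len(src)
	for i, b := range src {
		dst[n-1-i] = 3 - b
	}
}

func main() {
	src := []byte{0, 0, 1, 3} // AACT
	dst := make([]byte, len(src))
	reverseComp2Ref(dst, src)
	fmt.Println(dst) // prints "[0 2 3 3]", i.e. AGTT
}
```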

func ReverseComp2Inplace

func ReverseComp2Inplace(acgt8 []byte)

ReverseComp2Inplace reverse-complements acgt8[], assuming that it's encoded with one byte per base, ACGT=0123.

func ReverseComp2Unsafe

func ReverseComp2Unsafe(dst, src []byte)

ReverseComp2Unsafe saves the reverse-complement of src[] to dst[], assuming that they're encoded with one byte per base, ACGT=0123.

WARNING: This is a function designed to be used in inner loops, which makes assumptions about length and capacity which aren't checked at runtime. Use the safe version of this function when that's a problem. Assumptions #2-3 are always satisfied when the last potentially-size-increasing operation on src[] is simd.{Re}makeUnsafe(), ResizeUnsafe(), or XcapUnsafe(), and the same is true of dst[].

1. len(src) == len(dst).

2. Capacity of src is at least RoundUpPow2(len(src) + 1, bytesPerVec), and the same is true of dst.

3. The caller does not care if a few bytes past the end of dst[] are changed.

func ReverseComp2UnsafeInplace

func ReverseComp2UnsafeInplace(acgt8 []byte)

ReverseComp2UnsafeInplace reverse-complements acgt8[], assuming that it's encoded with one byte per base, ACGT=0123.

WARNING: This is a function designed to be used in inner loops, which makes assumptions about length and capacity which aren't checked at runtime. Use the safe version of this function when that's a problem. These assumptions are always satisfied when the last potentially-size-increasing operation on acgt8[] is simd.{Re}makeUnsafe(), ResizeUnsafe(), or XcapUnsafe().

1. Capacity of acgt8[] is at least RoundUpPow2(len(acgt8) + 1, bytesPerVec).

2. The caller does not care if a few bytes past the end of acgt8[] are changed.

func ReverseComp4

func ReverseComp4(dst, src []byte)

ReverseComp4 saves the reverse-complement of src[] to dst[], assuming .bam seq-field encoding with one byte per base, each byte holding a 4-bit code. It panics if len(dst) != len(src).

WARNING: If a src[] value is larger than 15, it's possible for this to immediately crash, and it's also possible for this to return and fill dst[] with garbage. The only promise is that we don't scribble over arbitrary memory.
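
In the .bam seq encoding listed under SeqASCIITable (A=1, C=2, G=4, T=8, with ambiguity codes as bitwise ORs), complementing a code amounts to reversing its four bits. The scalar sketch below relies on that observation; `comp4` and `reverseComp4Ref` are hypothetical names, and this is not necessarily how the package implements it.

```go
package main

import "fmt"

// comp4[x] is x with its four bits reversed, which swaps A<->T and
// C<->G and maps each ambiguity code to its complement.
var comp4 = [16]byte{0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15}

// reverseComp4Ref writes the reverse-complement of src to dst, with
// one 4-bit .bam seq code per byte. Values above 15 cause a panic
// from the table index.
func reverseComp4Ref(dst, src []byte) {
	if len(dst) != len(src) {
		panic("reverseComp4Ref: len(dst) != len(src)")
	}
	n := len(src)
	for i, b := range src {
		dst[n-1-i] = comp4[b]
	}
}

func main() {
	src := []byte{1, 2, 2, 15, 3} // A, C, C, N, M
	dst := make([]byte, len(src))
	reverseComp4Ref(dst, src)
	fmt.Println(dst) // prints "[12 15 4 4 8]", i.e. K, N, G, G, T
}
```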

func ReverseComp4Inplace

func ReverseComp4Inplace(seq8 []byte)

ReverseComp4Inplace reverse-complements seq8[], assuming that it's using .bam seq-field encoding with one byte per base, each byte holding a 4-bit code.

WARNING: If a seq8[] value is larger than 15, it's possible for this to immediately crash, and it's also possible for this to return and fill seq8[] with garbage. The only promise is that we don't scribble over arbitrary memory.

func ReverseComp4Unsafe

func ReverseComp4Unsafe(dst, src []byte)

ReverseComp4Unsafe saves the reverse-complement of src[] to dst[], assuming .bam seq-field encoding with one byte per base, each byte holding a 4-bit code.

WARNING: This is a function designed to be used in inner loops, which makes assumptions about length and capacity which aren't checked at runtime. Use the safe version of this function when that's a problem. Assumptions #3-4 are always satisfied when the last potentially-size-increasing operation on src[] is simd.{Re}makeUnsafe(), ResizeUnsafe(), or XcapUnsafe(), and the same is true of dst[].

1. len(src) == len(dst).

2. All elements of src[] are less than 16.

3. Capacity of src is at least RoundUpPow2(len(src) + 1, bytesPerVec), and the same is true of dst.

4. The caller does not care if a few bytes past the end of dst[] are changed.

func ReverseComp4UnsafeInplace

func ReverseComp4UnsafeInplace(seq8 []byte)

ReverseComp4UnsafeInplace reverse-complements seq8[], assuming that it's using .bam seq-field encoding with one byte per base, each byte holding a 4-bit code.

WARNING: This is a function designed to be used in inner loops, which makes assumptions about length and capacity which aren't checked at runtime. Use the safe version of this function when that's a problem. Assumptions #2-3 are always satisfied when the last potentially-size-increasing operation on seq8[] is simd.{Re}makeUnsafe(), ResizeUnsafe(), or XcapUnsafe().

1. All elements of seq8[] are less than 16.

2. Capacity of seq8 is at least RoundUpPow2(len(seq8) + 1, bytesPerVec).

3. The caller does not care if a few bytes past the end of seq8[] are changed.

func ReverseComp8Inplace

func ReverseComp8Inplace(ascii8 []byte)

ReverseComp8Inplace reverse-complements ascii8[], assuming that it's using ASCII encoding. More precisely, it maps 'A'/'a' to 'T', 'C'/'c' to 'G', 'G'/'g' to 'C', 'T'/'t' to 'A', and everything else to 'N'.
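
A scalar sketch of the validating behaviour, with hypothetical names `complement8` and `reverseComp8Ref`, shows how invalid bytes end up as 'N':

```go
package main

import "fmt"

// complement8 maps an ASCII base to its upper-case complement and
// everything else to 'N'.
func complement8(c byte) byte {
	switch c {
	case 'A', 'a':
		return 'T'
	case 'C', 'c':
		return 'G'
	case 'G', 'g':
		return 'C'
	case 'T', 't':
		return 'A'
	}
	return 'N'
}

// reverseComp8Ref reverse-complements ascii8 in place.
func reverseComp8Ref(ascii8 []byte) {
	for i, j := 0, len(ascii8)-1; i <= j; i, j = i+1, j-1 {
		// When i == j both sides refer to the middle element,
		// which is simply complemented.
		ascii8[i], ascii8[j] = complement8(ascii8[j]), complement8(ascii8[i])
	}
}

func main() {
	s := []byte("ACgtNx")
	reverseComp8Ref(s)
	fmt.Println(string(s)) // prints "NNACGT"
}
```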

func ReverseComp8InplaceNoValidate

func ReverseComp8InplaceNoValidate(ascii8 []byte)

ReverseComp8InplaceNoValidate reverse-complements ascii8[], assuming that it's using ASCII encoding, and all values are in {0, '0', 'A', 'C', 'G', 'T', 'N', 'a', 'c', 'g', 't', 'n'}.

If the input assumption is satisfied, output is restricted to 'A'/'C'/'G'/'T'/'N'. Other bytes may be written if the input assumption is not satisfied.

This usually takes ~35% less time than the validating function.

func ReverseComp8NoValidate

func ReverseComp8NoValidate(dst, src []byte)

ReverseComp8NoValidate writes the reverse-complement of src[] to dst[], assuming src is using ASCII encoding, and all values are in {0, '0', 'A', 'C', 'G', 'T', 'N', 'a', 'c', 'g', 't', 'n'}.

If the input assumption is satisfied, output is restricted to 'A'/'C'/'G'/'T'/'N'. Other bytes may be written if the input assumption is not satisfied.

It panics if len(dst) != len(src).

func UnpackAndReplaceSeq

func UnpackAndReplaceSeq(dst, src []byte, tablePtr *NibbleLookupTable)

UnpackAndReplaceSeq sets the bytes in dst[] as follows:

if pos is even, dst[pos] := table[src[pos / 2] >> 4]
if pos is odd, dst[pos] := table[src[pos / 2] & 15]

It panics if len(src) != (len(dst) + 1) / 2.

Nothing bad happens if len(dst) is odd and some low bits in the last src[] byte are set, though it's generally good practice to ensure that case doesn't come up.
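
As an illustration, the scalar sketch below decodes packed .bam seq bytes to ASCII using the same 16 values as SeqASCIITable. A plain [16]byte stands in for NibbleLookupTable, and `unpackAndReplaceRef` is a hypothetical name.

```go
package main

import "fmt"

// seqASCII holds the same 16 values as SeqASCIITable; a plain array
// stands in for NibbleLookupTable in this sketch.
var seqASCII = [16]byte{'=', 'A', 'C', 'M', 'G', 'R', 'S', 'V', 'T', 'W', 'Y', 'H', 'K', 'D', 'B', 'N'}

// unpackAndReplaceRef is a scalar reference for the rule above.
func unpackAndReplaceRef(dst, src []byte, table *[16]byte) {
	if len(src) != (len(dst)+1)/2 {
		panic("unpackAndReplaceRef: len(src) != (len(dst) + 1) / 2")
	}
	for pos := range dst {
		b := src[pos/2]
		if pos%2 == 0 {
			dst[pos] = table[b>>4]
		} else {
			dst[pos] = table[b&15]
		}
	}
}

func main() {
	src := []byte{0x12, 0x48, 0xF0}
	dst := make([]byte, 5) // odd length: last low nibble is ignored
	unpackAndReplaceRef(dst, src, &seqASCII)
	fmt.Println(string(dst)) // prints "ACGTN"
}
```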

func UnpackAndReplaceSeqSubset

func UnpackAndReplaceSeqSubset(dst, src []byte, tablePtr *NibbleLookupTable, startPos, endPos int)

UnpackAndReplaceSeqSubset sets the bytes in dst[] as follows:

if srcPos is even, dst[srcPos-startPos] := table[src[srcPos / 2] >> 4]
if srcPos is odd, dst[srcPos-startPos] := table[src[srcPos / 2] & 15]

It panics if len(dst) != endPos - startPos, startPos < 0, or len(src) * 2 < endPos.
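
The subset variant differs only in that positions are indexed in src coordinates and shifted by startPos on output. A scalar sketch, again with a plain [16]byte standing in for NibbleLookupTable and a hypothetical function name:

```go
package main

import "fmt"

// unpackSubsetRef decodes positions [startPos, endPos) of the packed
// src into dst, applying the documented panic conditions.
func unpackSubsetRef(dst, src []byte, table *[16]byte, startPos, endPos int) {
	if len(dst) != endPos-startPos || startPos < 0 || len(src)*2 < endPos {
		panic("unpackSubsetRef: invalid lengths or positions")
	}
	for srcPos := startPos; srcPos < endPos; srcPos++ {
		b := src[srcPos/2]
		if srcPos%2 == 0 {
			dst[srcPos-startPos] = table[b>>4]
		} else {
			dst[srcPos-startPos] = table[b&15]
		}
	}
}

func main() {
	table := [16]byte{'=', 'A', 'C', 'M', 'G', 'R', 'S', 'V', 'T', 'W', 'Y', 'H', 'K', 'D', 'B', 'N'}
	src := []byte{0x12, 0x48, 0xF0} // A, C, G, T, N, =
	dst := make([]byte, 3)
	unpackSubsetRef(dst, src, &table, 1, 4)
	fmt.Println(string(dst)) // prints "CGT"
}
```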

func UnpackAndReplaceSeqUnsafe

func UnpackAndReplaceSeqUnsafe(dst, src []byte, tablePtr *NibbleLookupTable)

UnpackAndReplaceSeqUnsafe sets the bytes in dst[] as follows:

if pos is even, dst[pos] := table[src[pos / 2] >> 4]
if pos is odd, dst[pos] := table[src[pos / 2] & 15]

WARNING: This is a function designed to be used in inner loops, which makes assumptions about length and capacity which aren't checked at runtime. Use the safe version of this function when that's a problem. Assumptions #2-3 are always satisfied when the last potentially-size-increasing operation on src[] is simd.{Re}makeUnsafe(), ResizeUnsafe(), or XcapUnsafe(), and the same is true for dst[].

1. len(src) == (len(dst) + 1) / 2.

2. Capacity of src is at least RoundUpPow2(len(src) + 1, bytesPerVec), and the same is true for dst.

3. The caller does not care if a few bytes past the end of dst[] are changed.

func UnpackSeq

func UnpackSeq(dst, src []byte)

UnpackSeq sets the bytes in dst[] as follows:

if pos is even, dst[pos] := src[pos / 2] >> 4
if pos is odd, dst[pos] := src[pos / 2] & 15

It panics if len(src) != (len(dst) + 1) / 2.

Nothing bad happens if len(dst) is odd and some low bits in the last src[] byte are set, though it's generally good practice to ensure that case doesn't come up.

func UnpackSeqUnsafe

func UnpackSeqUnsafe(dst, src []byte)

UnpackSeqUnsafe sets the bytes in dst[] as follows:

if pos is even, dst[pos] := src[pos / 2] >> 4
if pos is odd, dst[pos] := src[pos / 2] & 15

WARNING: This is a function designed to be used in inner loops, which makes assumptions about length and capacity which aren't checked at runtime. Use the safe version of this function when that's a problem. Assumptions #2-3 are always satisfied when the last potentially-size-increasing operation on src[] is simd.{Re}makeUnsafe(), ResizeUnsafe(), or XcapUnsafe(), and the same is true for dst[].

1. len(src) == (len(dst) + 1) / 2.

2. Capacity of src is at least RoundUpPow2(len(src) + 1, bytesPerVec), and the same is true for dst.

3. The caller does not care if a few bytes past the end of dst[] are changed.

Types

type NibbleLookupTable

type NibbleLookupTable = simd.NibbleLookupTable

NibbleLookupTable is re-exported here to reduce base/simd import clutter.

func MakeNibbleLookupTable

func MakeNibbleLookupTable(table [16]byte) (t NibbleLookupTable)

MakeNibbleLookupTable is re-exported here to reduce base/simd import clutter.
