Refasta (temporary name)
This is a program to convert biology file formats from one into another.
Warning: this project is very alpha, and its interface will change frequently.
It's also abandoned, as my reasons for developing it have dried up.
This was born out of the complexity that arises from the abuse and misuse of
biology file formats, such as FASTA,
or from the complexity of the formats themselves, such as TNT.
Installation
If you have Go
installed on your system, you can run: go get github.com/yarbelk/refasta
Otherwise, look at the releases page.
TODO
- Read a Fasta file, output a fasta file
- Species Name and Gene Name schemas
- Read a Fasta File, output a TNT file
ccode and cgroup can be ignored
- Support blocks and cnames in TNT
- Support a single 'block' in TNT. This needs conditional use of xgroups
when the number of blocks == 1, and of blocks when it is greater (verify this)
- Question: What is the difference between xgroup and block?
- Support Outgroup definition in TNT (using outgroup command)
- In-depth handling of '-h' from the interface; the simple one-line usages
are not enough.
- Structure configuration in such a way that reproducible pipelines can be
easily set up, and the pipeline can be saved as a byproduct of a manual
run.
- Switch to using the cli library for the command-line interface: it
supports loading all arguments from the ENV or YAML files.
- Implement loading and saving of pipelines using cli.
- Document said usage
- Coherent Errors: All failure modes must have human-readable errors that
the bioinformatician can use to identify where the bad data is.
- Refactor out the sequence-specific stuff from tnt into sequence
- Guess the Species from the name. This is also very specific to one
kind of usage of the FASTA format: using it as an interchange format
between something and TNT. This should probably be a flag.
- Just use the Name
- Regexp rule
- Read a Fasta File, output a Nexus File
- Identify potentially misnamed species (species names off by
white space, special characters, or a couple of characters,
by some language distance metric)
- Support Interleaving of Fasta
- Support Interleaving of TNT
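
The misnamed-species item in the list above could be sketched roughly as
follows. This is not code from refasta; it is a minimal illustration that
assumes Levenshtein edit distance as the "language distance metric" and
assumes that white space and special characters are stripped before
comparing. The names normalize, levenshtein and similarNames, and the
threshold of 2 edits, are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize lowercases a name and drops every rune that is not a letter
// or a digit, so "Homo sapiens" and "Homo_sapiens" compare as equal.
func normalize(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance (insertions, deletions,
// substitutions) between a and b, keeping only two rows of the table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// similarNames returns every pair of distinct names whose normalized
// forms are within maxDist edits of each other (hypothetical helper).
func similarNames(names []string, maxDist int) [][2]string {
	var pairs [][2]string
	for i := 0; i < len(names); i++ {
		for j := i + 1; j < len(names); j++ {
			if names[i] == names[j] {
				continue
			}
			if levenshtein(normalize(names[i]), normalize(names[j])) <= maxDist {
				pairs = append(pairs, [2]string{names[i], names[j]})
			}
		}
	}
	return pairs
}

func main() {
	names := []string{"Homo sapiens", "Homo_sapiens", "Homo sapeins", "Pan troglodytes"}
	for _, p := range similarNames(names, 2) {
		fmt.Printf("%q may be the same species as %q\n", p[0], p[1])
	}
}
```

The pairwise comparison is quadratic in the number of names, which is
probably fine for the species counts in a typical alignment; a real
implementation would likely want the threshold to be configurable.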