Semblance
This is a work in progress. Expect breaking changes and bugs.
Semblance is a command-line interface for Ensembl's REST API.
Installation
Using go install (recommended)
This method will download and compile the latest tagged (hopefully stable) version.
You will need to install go, which should be available in the repositories of most *nix systems.
go install codeberg.org/infinanis/semblance@latest
This command installs to $HOME/go/bin/ by default, so make sure that directory is on your PATH.
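As a minimal sketch, assuming the default install location mentioned above and a POSIX-compatible shell, the directory could be added to PATH like this. To make the change permanent, the same line would go in your shell's startup file (for example ~/.profile).

```shell
# Append Go's default binary directory to PATH for the current session.
# Assumption: GOPATH is the default ($HOME/go); check with `go env GOPATH`.
export PATH="$PATH:$HOME/go/bin"
```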
From source
You will need make and go to build the package.
git clone https://codeberg.org/infinanis/semblance
cd semblance
make
sudo make install
Manual download
TBA
Quickstart guide
The CLI is driven by subcommands: it is organized into sections, each of which contains specific endpoints as described in Ensembl's API documentation.
Start by running the bare command to see available sections:
$ semblance
semblance
Usage:
semblance [lookup|mapping|sequence|ontology|taxonomy|overlap|compgen|info]
Subcommands:
lookup
mapping
sequence
ontology
taxonomy
overlap
compgen
info
Flags:
--version Displays the program version string.
-h --help Displays help with available flag, subcommand, and positional value parameters.
-o --output Redirect output to a file (Default: stdout).
Choose a section of interest and repeat the process above to see the subcommands and their parameters:
$ semblance sequence
sequence
Usage:
sequence [id|region]
Subcommands:
id Request multiple types of sequence by a stable identifier list.
region Returns the genomic sequence of the specified region of the given species. Supports feature masking and expand options.
Flags:
--version Displays the program version string.
-h --help Displays help with available flag, subcommand, and positional value parameters.
-o --output Redirect output to a file (Default: stdout).
$ semblance sequence id
id - Request multiple types of sequence by a stable identifier list.
Usage:
id [ids]
Positional Variables:
ids List of Ensembl stable IDs (Values should be separated with a comma) (Example: ENSG00000157764,ENSG00000248378) (Required)
Flags:
--version Displays the program version string.
-h --help Displays help with available flag, subcommand, and positional value parameters.
-db_type Restrict the search to a database other than the default. Useful if you need to use a DB other than core (Example: core)
-end Trim the end of the sequence by this many basepairs. Trimming is relative to reading direction and in the coordinate system of the stable identifier. Parameter can not be used in conjunction with expand_5prime or expand_3prime. (Example: 1000) (default: 0)
-expand_3prime Expand the sequence downstream of the sequence by this many basepairs. Only available when using genomic sequence type. (Example: 1000) (default: 0)
-expand_5prime Expand the sequence upstream of the sequence by this many basepairs. Only available when using genomic sequence type. (Example: 1000) (default: 0)
-format One of: (fasta); Format of the data
-mask One of: (hard,soft); Request the sequence masked for repeat sequences. Hard will mask all repeats as N's and soft will mask repeats as lowercased characters. Only available when using genomic sequence type.
-mask_feature Mask features on the sequence. If sequence is genomic, mask introns. If sequence is cDNA, mask UTRs. Incompatible with the 'mask' option
-object_type Filter by feature type (Example: gene)
-species Species name/alias (Example: homo_sapiens)
-start Trim the start of the sequence by this many basepairs. Trimming is relative to reading direction and in the coordinate system of the stable identifier. Parameter can not be used in conjunction with expand_5prime or expand_3prime. (Example: 1000) (default: 0)
-type One of: (genomic,cds,cdna,protein); Type of sequence. Defaults to genomic where applicable, i.e. not translations. cdna refers to the spliced transcript sequence with UTR; cds refers to the spliced transcript sequence without UTR.
-o --output Redirect output to a file (Default: stdout).
Required positional of subcommand id named ids not found at position 1
Each endpoint has optional parameters (flags) and required parameters (positional values). A parameter may have a default value (enforced by the API) and/or an example value, both of which are noted in the parameter's description. Many required parameters accept more than one value, in which case the values must be separated by commas.
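If the IDs live in a shell script as separate words, they need to be joined with commas first. The sketch below shows one way to do that; join_ids is a hypothetical helper name, not part of semblance, and the two IDs are the example values from the help text above.

```shell
# join_ids: hypothetical helper that joins its arguments with commas,
# producing the list format expected by multi-value positional parameters.
join_ids() {
    printf '%s\n' "$@" | paste -s -d, -
}

ids=$(join_ids ENSG00000157764 ENSG00000248378)
echo "$ids"   # prints ENSG00000157764,ENSG00000248378

# The joined list could then be passed as a single positional value:
#   semblance sequence id "$ids"
```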
Most endpoints return JSON output (use --yaml to output YAML instead). One notable exception is the sequence endpoint, which always produces FASTA files. Output is dumped to stdout by default, but you can redirect it to a file using --output (a plain shell redirection works fine as well).
If you're ever unsure about an argument, you can pass -h at any point in the command to get help.
Examples
Download the genomic sequence of the human BRCA1 gene
First we need to find the gene's stable ID. We can use lookup to search for a symbol:
$ semblance lookup symbol human brca1
{
  "brca1": {
    "seq_region_name": "17",
    "start": 43044295,
    "logic_name": "ensembl_havana_gene_homo_sapiens",
    "version": 25,
    "assembly_name": "GRCh38",
    "object_type": "Gene",
    "source": "ensembl_havana",
    "end": 43170245,
    "description": "BRCA1 DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1100]",
    "species": "human",
    "strand": -1,
    "id": "ENSG00000012048",
    "display_name": "BRCA1",
    "biotype": "protein_coding",
    "canonical_transcript": "ENST00000357654.9",
    "db_type": "core"
  }
}
We're interested in the id field: ENSG00000012048.
Now we can download the sequence:
$ semblance sequence id ENSG00000012048
>ENSG00000012048.25 chromosome:GRCh38:17:43044295:43170245:-1
AAAGCGTGGGAATTACAGATAAATTAAAACTGTGGAACCCCTTTCCTCGGCTGCCGCCAA
GGTGTTCGGTCCTTCCGAGGAAGCTAAGGCCGCGTTGGGGTGAGACCCTCACTTCATCCG
GTGAGTAGCACCGCGTCCGGCAGCCCCAGCCCCACACTCGCCCGCGCTATGGCCTCCGTC
TCCCAGCTTGCCTGCATCTACTCTGCCCTCATTCTGCAGGACTATGAGGTGACCTTTACG
GAGGATAAGATCAATGCCCTTATTAAAGCAGCCAGTGTAAATATTGAAACTTTTTGGCCT
GGCTTGTTTGCAAAGGTCCTGGCCAACGTCAACATTGGGAGCCACATCTGCAGTGTAGAG
GGGGGGAAAAAAACGTGACTGCGCGTCGTGAGCTCGCTGAGACGTTCTGGACGGGGGACA
GGCCGTGGGGTTTCTCAGATAACTGGGCCCCTGGGCTCAGGAGGCCTGCACCCTCTGCTC
TGGGTTAAGGTAGAAGAGCCCCGGGAAAGGGACAGGGGCCCAAGGGATGCTCCGGGGGAC
...
You can use other CLI utilities, such as jq, to parse JSON output easily:
$ semblance lookup symbol human brca1 | jq -r '.brca1.id'
ENSG00000012048
Get all available protein-coding transcripts for the human GAPDH gene
$ semblance sequence id -type cds $(semblance lookup symbol human gapdh | jq -r '.gapdh.id')
>ENST00000229239.10
ATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGGGCGCCTGGTCACCAGG
GCTGCTTTTAACTCTGGTAAAGTGGATATTGTTGCCATCAATGACCCCTTCATTGACCTC
...
>ENST00000396856.5
ATGGAAGAAATGCGAGATCCCTCCAAAATCAAGTGGGGCGATGCTGGCGCTGAGTACGTC
GTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAGGCTGGGGCTCATTTGCAGGGGGGA
...
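Because the result is a multi-record FASTA stream, standard text tools can summarize it. The following sketch lists the record IDs found in FASTA input; fasta_ids is a hypothetical helper name, and it assumes each record starts with a header line beginning with ">".

```shell
# fasta_ids: hypothetical helper that reads FASTA from stdin and prints one
# record ID per line (the header text up to the first space, without ">").
fasta_ids() {
    grep '^>' | sed 's/^>//; s/ .*//'
}

# Possible usage with the command above:
#   semblance sequence id -type cds ENSG... | fasta_ids
```

Piping the output through fasta_ids | wc -l would give the number of sequences returned.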
Check the definition of 'transcription factor complex' (Gene Ontology database)
$ semblance --yaml ontology name -simple "transcription factor complex"
- synonyms:
    - transcription factor complex
    - nuclear transcription factor complex
    - cytoplasmic transcription factor complex
  definition: A protein complex that is capable of associating with DNA by direct binding, or via other DNA-binding proteins or complexes, and regulating transcription.
  ontology: GO
  accession: GO:0005667
  subsets:
    - goslim_pir
  namespace: cellular_component
  name: transcription regulator complex
Map the first 100 bp of a protein-coding gene transcript to the genome and download the corresponding genomic sequence
You were given a transcript ID: ENST00000288602. Let's first check what kind of gene we're dealing with.
$ semblance --yaml lookup id ENST00000288602
ENST00000288602:
  seq_region_name: "7"
  start: 140734486
  Parent: ENSG00000157764
  assembly_name: GRCh38
  version: 11
  logic_name: havana_homo_sapiens
  is_canonical: 0
  object_type: Transcript
  source: havana
  end: 140924732
  strand: -1
  species: homo_sapiens
  id: ENST00000288602
  display_name: BRAF-201
  biotype: protein_coding
  db_type: core
We learn that it's a transcript coming from the human BRAF gene. We need to find the genomic coordinates:
$ semblance --yaml mapping cdna2gen ENST00000288602 1..100
mappings:
  - strand: -1
    rank: 0
    coord_system: chromosome
    assembly_name: GRCh38
    start: 140924633
    gap: 0
    seq_region_name: "7"
    end: 140924732
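The seq_region_name, start and end fields can be assembled into a name:start..end region string by hand, or with a small script. The sketch below does this with awk; region_from_mapping is a hypothetical helper name, and it assumes YAML output containing exactly one mapping, as in the output above.

```shell
# region_from_mapping: hypothetical helper that reads the YAML mapping output
# from stdin and prints "seq_region_name:start..end".
# Assumption: the input holds a single mapping; with several mappings only
# the last values seen would be used.
region_from_mapping() {
    awk '
        { sub(/^[ -]+/, "") }                      # drop indentation and list dashes
        $1 == "seq_region_name:" { gsub(/"/, "", $2); name = $2 }
        $1 == "start:"           { first = $2 }
        $1 == "end:"             { last = $2 }
        END { print name ":" first ".." last }
    '
}

# Possible usage:
#   semblance --yaml mapping cdna2gen ENST00000288602 1..100 | region_from_mapping
```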
We now know the coordinates of this region, and that it's located on chromosome 7. We can fetch the genomic sequence:
$ semblance sequence region 7:140924633..140924732 homo_sapiens
>chromosome:GRCh38:7:140924633:140924732:1
TCCATGTCCCCGTTGAACAGAGCCTGGCCCGGCTCCGCGCCGCCACCACCGCCACCGCTC
AGCGCCGCCATCTTATAACCGAGAGCCGGGGCCCGAGCGG
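Note that the mapping reported strand -1, while the header of the fetched sequence ends in :1, i.e. the forward strand. To read the sequence in the transcript's orientation it can be reverse-complemented. The sketch below is one way to do that with standard tools; revcomp_fasta is a hypothetical helper name, and it assumes a single-record FASTA input containing only A, C, G and T (upper or lower case).

```shell
# revcomp_fasta: hypothetical helper that reads a single FASTA record from
# stdin and prints the reverse complement of its sequence on one line.
revcomp_fasta() {
    grep -v '^>' |                     # drop the header line
        tr -d '\n' |                   # join the sequence lines
        tr 'ACGTacgt' 'TGCAtgca' |     # complement each base
        awk '{ for (i = length($0); i > 0; i--) printf "%s", substr($0, i, 1); print "" }'
}

# Possible usage:
#   semblance sequence region 7:140924633..140924732 homo_sapiens | revcomp_fasta
```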
License