semblance

command module

v0.0.2 (latest) Published: Aug 4, 2023 License: BSD-3-Clause Imports: 9 Imported by: 0

Repository

codeberg.org/infinanis/semblance

Semblance

This is a work-in-progress. Expect breaking changes and bugs.

Semblance is a command-line interface for Ensembl's REST API.

Installation

Using `go install` (recommended)

This method downloads and compiles the latest tagged (hopefully stable) version. You will need Go installed; it should be available in the package repositories of most *nix systems.

go install codeberg.org/infinanis/semblance@latest

This command installs the binary to $HOME/go/bin/ by default, so make sure that directory is in your PATH.

From source

You will need make and Go to build the package.

git clone https://codeberg.org/infinanis/semblance
cd semblance
make
sudo make install

Manual download

TBA

Quickstart guide

The CLI is driven by subcommands: it is organised into sections, each of which contains specific endpoints as described in Ensembl's API documentation.

Start by running the bare command to see available sections:

$ semblance
semblance

  Usage:
    semblance [lookup|mapping|sequence|ontology|taxonomy|overlap|compgen|info]

  Subcommands: 
    lookup
    mapping
    sequence
    ontology
    taxonomy
    overlap
    compgen
    info

  Flags: 
       --version   Displays the program version string.
    -h --help      Displays help with available flag, subcommand, and positional value parameters.
    -o --output    Redirect output to a file (Default: stdout).

Choose a section of interest and repeat the process above to see the subcommands and their parameters:

$ semblance sequence
sequence

  Usage:
    sequence [id|region]

  Subcommands: 
    id       Request multiple types of sequence by a stable identifier list.
    region   Returns the genomic sequence of the specified region of the given species. Supports feature masking and expand options.

  Flags: 
       --version   Displays the program version string.
    -h --help      Displays help with available flag, subcommand, and positional value parameters.
    -o --output    Redirect output to a file (Default: stdout).

$ semblance sequence id
id - Request multiple types of sequence by a stable identifier list.

  Usage:
    id [ids]

  Positional Variables: 
    ids   List of Ensembl stable IDs (Values should be separated with a comma) (Example: ENSG00000157764,ENSG00000248378) (Required)
  Flags: 
       --version   Displays the program version string.
    -h --help      Displays help with available flag, subcommand, and positional value parameters.
    -db_type           Restrict the search to a database other than the default. Useful if you need to use a DB other than core (Example: core)
    -end           Trim the end of the sequence by this many basepairs. Trimming is relative to reading direction and in the coordinate system of the stable identifier. Parameter can not be used in conjunction with expand_5prime or expand_3prime. (Example: 1000) (default: 0)
    -expand_3prime           Expand the sequence downstream of the sequence by this many basepairs. Only available when using genomic sequence type. (Example: 1000) (default: 0)
    -expand_5prime           Expand the sequence upstream of the sequence by this many basepairs. Only available when using genomic sequence type. (Example: 1000) (default: 0)
    -format           One of: (fasta); Format of the data
    -mask           One of: (hard,soft); Request the sequence masked for repeat sequences. Hard will mask all repeats as N's and soft will mask repeats as lowercased characters. Only available when using genomic sequence type.
    -mask_feature           Mask features on the sequence. If sequence is genomic, mask introns. If sequence is cDNA, mask UTRs. Incompatible with the 'mask' option
    -object_type           Filter by feature type (Example: gene)
    -species           Species name/alias (Example: homo_sapiens)
    -start           Trim the start of the sequence by this many basepairs. Trimming is relative to reading direction and in the coordinate system of the stable identifier. Parameter can not be used in conjunction with expand_5prime or expand_3prime. (Example: 1000) (default: 0)
    -type           One of: (genomic,cds,cdna,protein); Type of sequence. Defaults to genomic where applicable, i.e. not translations. cdna refers to the spliced transcript sequence with UTR; cds refers to the spliced transcript sequence without UTR.
    -o --output    Redirect output to a file (Default: stdout).

Required positional of subcommand id named ids not found at position 1

Each endpoint has optional parameters (flags) and required parameters (positional values). Some parameters have a default value (enforced by the API) and/or an example value, both of which are noted in the parameter's description. Many required parameters accept more than one value, in which case the values must be separated by commas.
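When the command line is assembled from a program rather than typed by hand, the comma-separated convention can be handled in one place. The following is a minimal Go sketch, not part of semblance itself: the helper name `sequenceIDArgs` is hypothetical, and placing flags before the positional value simply mirrors the `semblance sequence id -type cds ...` example in this README.

```go
package main

import (
	"fmt"
	"strings"
)

// sequenceIDArgs builds the argument list for `semblance sequence id`.
// Hypothetical helper: flags are placed before the positional value,
// as in the README's own `-type cds` example.
func sequenceIDArgs(seqType string, ids []string) []string {
	args := []string{"sequence", "id"}
	if seqType != "" {
		args = append(args, "-type", seqType)
	}
	// Multiple stable IDs go into a single comma-separated positional value.
	return append(args, strings.Join(ids, ","))
}

func main() {
	args := sequenceIDArgs("cds", []string{"ENSG00000157764", "ENSG00000248378"})
	// The slice could be handed to os/exec; here it is only printed.
	fmt.Println("semblance", strings.Join(args, " "))
}
```

Keeping the arguments as a slice avoids any shell quoting issues if the list is later passed to `exec.Command`.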

Most endpoints return JSON output (use --yaml to output YAML instead). One notable exception is the sequence endpoint, which always produces FASTA files. Output is dumped to stdout by default, but you can redirect it to a file using --output (a plain shell redirection works fine as well).

If you're ever unsure about an argument, you can pass -h at any point in the command to get help.

Examples

Download the genomic sequence of the human BRCA1 gene

First we need to find the gene's stable ID. We can use lookup to search for a symbol:

$ semblance lookup symbol human brca1
{
  "brca1": {
    "seq_region_name": "17",
    "start": 43044295,
    "logic_name": "ensembl_havana_gene_homo_sapiens",
    "version": 25,
    "assembly_name": "GRCh38",
    "object_type": "Gene",
    "source": "ensembl_havana",
    "end": 43170245,
    "description": "BRCA1 DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1100]",
    "species": "human",
    "strand": -1,
    "id": "ENSG00000012048",
    "display_name": "BRCA1",
    "biotype": "protein_coding",
    "canonical_transcript": "ENST00000357654.9",
    "db_type": "core"
  }
}

We're interested in the id field: ENSG00000012048.

Now we can download the sequence:

$ semblance sequence id ENSG00000012048
>ENSG00000012048.25 chromosome:GRCh38:17:43044295:43170245:-1
AAAGCGTGGGAATTACAGATAAATTAAAACTGTGGAACCCCTTTCCTCGGCTGCCGCCAA
GGTGTTCGGTCCTTCCGAGGAAGCTAAGGCCGCGTTGGGGTGAGACCCTCACTTCATCCG
GTGAGTAGCACCGCGTCCGGCAGCCCCAGCCCCACACTCGCCCGCGCTATGGCCTCCGTC
TCCCAGCTTGCCTGCATCTACTCTGCCCTCATTCTGCAGGACTATGAGGTGACCTTTACG
GAGGATAAGATCAATGCCCTTATTAAAGCAGCCAGTGTAAATATTGAAACTTTTTGGCCT
GGCTTGTTTGCAAAGGTCCTGGCCAACGTCAACATTGGGAGCCACATCTGCAGTGTAGAG
GGGGGGAAAAAAACGTGACTGCGCGTCGTGAGCTCGCTGAGACGTTCTGGACGGGGGACA
GGCCGTGGGGTTTCTCAGATAACTGGGCCCCTGGGCTCAGGAGGCCTGCACCCTCTGCTC
TGGGTTAAGGTAGAAGAGCCCCGGGAAAGGGACAGGGGCCCAAGGGATGCTCCGGGGGAC
...
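The FASTA header carries the location of the sequence. Judging from the output above, the description appears to follow the layout coord_system:assembly:region:start:end:strand; that layout is inferred from this example and is an assumption, not something the README states. Under that assumption, a small Go sketch for pulling the fields apart could look like this (the `location` type and `parseLocation` function are illustrative names):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// location mirrors the colon-separated description in the FASTA header.
// The field order is an assumption inferred from the example output.
type location struct {
	CoordSystem, Assembly, Region string
	Start, End, Strand            int
}

// parseLocation splits "chromosome:GRCh38:17:43044295:43170245:-1"
// into its six fields and converts the last three to integers.
func parseLocation(desc string) (location, error) {
	parts := strings.Split(desc, ":")
	if len(parts) != 6 {
		return location{}, fmt.Errorf("expected 6 colon-separated fields, got %d in %q", len(parts), desc)
	}
	var nums [3]int
	for i, p := range parts[3:] {
		n, err := strconv.Atoi(p)
		if err != nil {
			return location{}, fmt.Errorf("field %q is not an integer: %w", p, err)
		}
		nums[i] = n
	}
	return location{
		CoordSystem: parts[0], Assembly: parts[1], Region: parts[2],
		Start: nums[0], End: nums[1], Strand: nums[2],
	}, nil
}

func main() {
	header := ">ENSG00000012048.25 chromosome:GRCh38:17:43044295:43170245:-1"
	// The location is the last whitespace-separated field of the header.
	fields := strings.Fields(strings.TrimPrefix(header, ">"))
	loc, err := parseLocation(fields[len(fields)-1])
	if err != nil {
		panic(err)
	}
	fmt.Printf("region %s, %d..%d, strand %d, %d bp\n",
		loc.Region, loc.Start, loc.End, loc.Strand, loc.End-loc.Start+1)
}
```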

You can use other CLI utilities, such as jq, to parse the JSON output easily:

$ semblance lookup symbol human brca1 | jq -r '.brca1.id'
ENSG00000012048
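If the output is consumed from a Go program instead of a shell pipeline, the same extraction can be done with the standard library. This is a minimal sketch: the struct only covers a few of the fields visible in the lookup output shown earlier, and the names `lookupEntry` and `stableID` are illustrative, not part of semblance.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lookupEntry holds a subset of the fields seen in the lookup output.
type lookupEntry struct {
	ID          string `json:"id"`
	DisplayName string `json:"display_name"`
	ObjectType  string `json:"object_type"`
}

// stableID decodes lookup output (a JSON object keyed by the queried
// symbol) and returns the stable ID recorded for that symbol.
func stableID(data []byte, symbol string) (string, error) {
	var result map[string]lookupEntry
	if err := json.Unmarshal(data, &result); err != nil {
		return "", fmt.Errorf("decoding lookup output: %w", err)
	}
	entry, ok := result[symbol]
	if !ok {
		return "", fmt.Errorf("symbol %q not present in lookup output", symbol)
	}
	return entry.ID, nil
}

func main() {
	// Abbreviated copy of the lookup output for brca1.
	sample := []byte(`{"brca1": {"id": "ENSG00000012048", "display_name": "BRCA1", "object_type": "Gene"}}`)
	id, err := stableID(sample, "brca1")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Decoding into a map keyed by symbol matches the shape of the output above, where the queried symbol is the top-level key.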

Get the coding sequences of all available protein-coding transcripts of the human GAPDH gene

$ semblance sequence id -type cds $(semblance lookup symbol human gapdh | jq -r '.gapdh.id')
>ENST00000229239.10
ATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGGGCGCCTGGTCACCAGG
GCTGCTTTTAACTCTGGTAAAGTGGATATTGTTGCCATCAATGACCCCTTCATTGACCTC
...
>ENST00000396856.5
ATGGAAGAAATGCGAGATCCCTCCAAAATCAAGTGGGGCGATGCTGGCGCTGAGTACGTC
GTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAGGCTGGGGCTCATTTGCAGGGGGGA
...
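A request like this returns several FASTA records in one stream. To work with them individually, the stream can be split on header lines. The sketch below uses only the Go standard library and assumes plain FASTA as shown above (a `>` header line followed by sequence lines); `record` and `parseFASTA` are illustrative names.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// record is one FASTA entry: the header without '>' and the joined sequence.
type record struct {
	Header   string
	Sequence string
}

// parseFASTA splits FASTA text into records, joining wrapped sequence lines.
func parseFASTA(text string) ([]record, error) {
	var records []record
	var seq strings.Builder
	// flush stores the accumulated sequence in the most recent record.
	flush := func() {
		if len(records) > 0 {
			records[len(records)-1].Sequence = seq.String()
		}
		seq.Reset()
	}
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case line == "":
			continue // skip blank lines
		case strings.HasPrefix(line, ">"):
			flush()
			records = append(records, record{Header: line[1:]})
		case len(records) == 0:
			return nil, fmt.Errorf("sequence data before first header: %q", line)
		default:
			seq.WriteString(line)
		}
	}
	flush()
	return records, scanner.Err()
}

func main() {
	// Truncated excerpt of the GAPDH output above.
	sample := ">ENST00000229239.10\n" +
		"ATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGGGCGCCTGGTCACCAGG\n" +
		">ENST00000396856.5\n" +
		"ATGGAAGAAATGCGAGATCCCTCCAAAATCAAGTGGGGCGATGCTGGCGCTGAGTACGTC\n"
	records, err := parseFASTA(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range records {
		fmt.Printf("%s: %d bases\n", r.Header, len(r.Sequence))
	}
}
```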

Check the definition of 'transcription factor complex' (Gene Ontology database)

$ semblance --yaml ontology name -simple "transcription factor complex"
- synonyms:
    - transcription factor complex
    - nuclear transcription factor complex
    - cytoplasmic transcription factor complex
  definition: A protein complex that is capable of associating with DNA by direct binding, or via other DNA-binding proteins or complexes, and regulating transcription.
  ontology: GO
  accession: GO:0005667
  subsets:
    - goslim_pir
  namespace: cellular_component
  name: transcription regulator complex

Map the first 100 bp of a protein-coding gene transcript to the genome and download the corresponding genomic sequence

Suppose you were given a transcript ID: ENST00000288602. Let's first check what kind of gene we're dealing with.

$ semblance --yaml lookup id ENST00000288602
ENST00000288602:
  seq_region_name: "7"
  start: 140734486
  Parent: ENSG00000157764
  assembly_name: GRCh38
  version: 11
  logic_name: havana_homo_sapiens
  is_canonical: 0
  object_type: Transcript
  source: havana
  end: 140924732
  strand: -1
  species: homo_sapiens
  id: ENST00000288602
  display_name: BRAF-201
  biotype: protein_coding
  db_type: core

We learn that it's a transcript of the human BRAF gene. Next, we need to find the genomic coordinates of its first 100 bp:

$ semblance --yaml mapping cdna2gen ENST00000288602 1..100
mappings:
  - strand: -1
    rank: 0
    coord_system: chromosome
    assembly_name: GRCh38
    start: 140924633
    gap: 0
    seq_region_name: "7"
    end: 140924732
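The region argument for the next step has the form name:start..end, so it can be derived mechanically from a mapping result. The Go sketch below reads the JSON form of the output rather than YAML, since JSON needs only the standard library. It assumes the JSON output carries the same field names as the YAML shown above; that is an assumption, and `regionArgs` is an illustrative name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mapping covers the fields needed to build a region argument.
// Field names are assumed to match the YAML output shown above.
type mapping struct {
	SeqRegionName string `json:"seq_region_name"`
	Start         int    `json:"start"`
	End           int    `json:"end"`
	Strand        int    `json:"strand"`
}

type mappingResult struct {
	Mappings []mapping `json:"mappings"`
}

// regionArgs returns one "name:start..end" string per mapping.
func regionArgs(data []byte) ([]string, error) {
	var res mappingResult
	if err := json.Unmarshal(data, &res); err != nil {
		return nil, fmt.Errorf("decoding mapping output: %w", err)
	}
	regions := make([]string, 0, len(res.Mappings))
	for _, m := range res.Mappings {
		regions = append(regions, fmt.Sprintf("%s:%d..%d", m.SeqRegionName, m.Start, m.End))
	}
	return regions, nil
}

func main() {
	// JSON transliteration of the YAML mapping above (assumed shape).
	sample := []byte(`{"mappings": [{"strand": -1, "rank": 0, "coord_system": "chromosome",
		"assembly_name": "GRCh38", "start": 140924633, "gap": 0,
		"seq_region_name": "7", "end": 140924732}]}`)
	regions, err := regionArgs(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range regions {
		fmt.Println(r)
	}
}
```

A mapping that spans several exons would yield more than one entry, which is why the function returns a slice rather than a single string.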

We now know the coordinates of this region, and that it's located on chromosome 7. We can fetch the genomic sequence:

$ semblance sequence region 7:140924633..140924732 homo_sapiens
>chromosome:GRCh38:7:140924633:140924732:1
TCCATGTCCCCGTTGAACAGAGCCTGGCCCGGCTCCGCGCCGCCACCACCGCCACCGCTC
AGCGCCGCCATCTTATAACCGAGAGCCGGGGCCCGAGCGG
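
Note that the mapping step reported strand -1, while the header of the region output ends in :1. If that trailing field denotes the strand (an assumption based on the header layout), the sequence above is given on the forward strand, and reading it in transcript orientation would require the reverse complement. A minimal Go sketch of that conversion follows; it handles A, C, G, T and N in either case and leaves any other character unchanged.

```go
package main

import "fmt"

// complementBase returns the complementary base, preserving case.
// Characters other than A, C, G, T and N are returned unchanged.
func complementBase(b byte) byte {
	switch b {
	case 'A':
		return 'T'
	case 'T':
		return 'A'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	case 'a':
		return 't'
	case 't':
		return 'a'
	case 'c':
		return 'g'
	case 'g':
		return 'c'
	}
	return b
}

// reverseComplement reverses the sequence and complements every base.
func reverseComplement(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		out[len(seq)-1-i] = complementBase(seq[i])
	}
	return string(out)
}

func main() {
	// First line of the region output above.
	forward := "TCCATGTCCCCGTTGAACAGAGCCTGGCCCGGCTCCGCGCCGCCACCACCGCCACCGCTC"
	fmt.Println(reverseComplement(forward))
}
```

For a multi-line FASTA record, join the sequence lines before converting, since the reversal applies to the whole sequence rather than to each line.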