Exercise solutions

Using BioMart to Export Ensembl Data

 

Exercise 1 — Finding Genes by Protein Domain

As with all BioMart queries you must select the dataset, set your filters (input) and define your attributes (desired output). For this exercise:

Dataset: Ensembl genes in mouse

Filters: Genes on chromosome 9 with a SignalP cleavage site

Attributes: Ensembl gene and transcript IDs and Associated gene names

Go to the Ensembl homepage (http://www.ensembl.org) and click on BioMart at the top of the page.

Select Ensembl genes as your database and Mus musculus genes as the dataset.

Click on Filters on the left of the screen and expand REGION. Change the chromosome to 9.

Now expand PROTEIN DOMAINS, also under Filters, and select Limit to genes, choosing With Cleavage site (Signalp) from the drop-down and then Only. Clicking on Count should reveal that you have filtered the dataset down to 221 genes.

Click on Attributes. On the Features page, expand GENE and make sure that Ensembl Gene ID, Ensembl Transcript ID and Associated Gene Name are selected. Now click on Results. The first 10 results are displayed by default; to download your results, use the top panel to choose File and the file type you want, then click Go.

The output will be a table containing the Ensembl gene ID, Ensembl transcript ID and associated gene name of every gene on mouse chromosome 9 that encodes a protein with a SignalP cleavage site.
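The same dataset, filter and attribute choices can also be written as a BioMart XML query. The following is a minimal sketch, not the query BioMart itself generates: the internal names `mmusculus_gene_ensembl`, `chromosome_name`, `with_signalp`, `ensembl_gene_id`, `ensembl_transcript_id` and `external_gene_name` are assumptions that do not appear in the exercise and may differ between Ensembl releases, and `build_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes, unique_rows=False):
    """Return a BioMart-style XML query string for a single dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1" if unique_rows else "0",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        if isinstance(value, bool):
            # Boolean filters: excluded="0" corresponds to "Only",
            # excluded="1" to "Excluded".
            ET.SubElement(ds, "Filter", {"name": name, "excluded": "0" if value else "1"})
        else:
            ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names; check them against your Ensembl release.
query_xml = build_query(
    dataset="mmusculus_gene_ensembl",
    filters={"chromosome_name": "9", "with_signalp": True},
    attributes=["ensembl_gene_id", "ensembl_transcript_id", "external_gene_name"],
)
print(query_xml)
```

The structure mirrors the three steps of every BioMart query: one Dataset element, one Filter element per restriction and one Attribute element per output column.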

  

Exercise 2 — Export Homologues

Click New.

Dataset: Choose the Ensembl Genes database and then the Ciona savignyi genes (CSAV2.0) dataset.

Filters: Expand the GENE section by clicking on the + box and enter the gene list in the Input external references ID list box.

Attributes: Select the Homologues attributes page, then expand the Orthologues section by clicking on the + box and select Human Ensembl Gene ID.

Results: Click Unique Results only and expand the preview table to All.
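Unique Results only removes repeated rows from the output table. A rough equivalent for an already exported table might look like the sketch below; the IDs are hypothetical placeholders, not real BioMart output, and `unique_rows` is an illustrative helper.

```python
def unique_rows(rows):
    """Drop repeated rows, keeping the first occurrence and the original order."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Hypothetical placeholder IDs for illustration only.
rows = [
    ["CSAV_GENE_A", "HUMAN_GENE_1"],
    ["CSAV_GENE_A", "HUMAN_GENE_1"],  # exact duplicate of the row above
    ["CSAV_GENE_A", "HUMAN_GENE_2"],  # same gene, second orthologue: kept
    ["CSAV_GENE_B", ""],              # gene without a reported orthologue
]
deduplicated = unique_rows(rows)
print(len(rows), "rows before,", len(deduplicated), "rows after")
```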

 

Exercise 3 — Convert IDs

Click New.

Dataset: Choose the Ensembl Genes database and then the Homo sapiens genes (GRCh38.p10) dataset.

Filters: Expand the GENE section by clicking on the + box, select Input external references ID list - RefSeq protein ID(s) and enter the list of IDs in the text box (either comma separated or as a list).

HINT: You may have to scroll down the menu to see these.

Count: Shows 11 genes. Remember that one gene may have multiple splice variants coding for different proteins, which is why these 29 proteins do not correspond to 29 genes.

Attributes: Select the FEATURES attributes page. Expand the External section by clicking on the + box. Select HGNC symbol and RefSeq Protein ID from the External References section.

Results: Select View All rows as HTML or export all results to a file.
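The Count step reports fewer genes than proteins because several RefSeq proteins can map to the same gene. The sketch below groups an exported table by gene to make that many-to-one relationship visible. The column order and every ID are hypothetical placeholders rather than actual results, and `proteins_per_gene` is an illustrative helper.

```python
from collections import defaultdict


def proteins_per_gene(rows):
    """Map each gene symbol to the sorted list of protein IDs found in its rows."""
    grouped = defaultdict(set)
    for gene_symbol, protein_id in rows:
        if protein_id:  # ignore rows without a protein ID
            grouped[gene_symbol].add(protein_id)
    return {gene: sorted(ids) for gene, ids in grouped.items()}


# Hypothetical (HGNC symbol, RefSeq protein ID) pairs for illustration only.
rows = [
    ("GENE1", "NP_EXAMPLE_1"),
    ("GENE1", "NP_EXAMPLE_2"),
    ("GENE1", "NP_EXAMPLE_3"),
    ("GENE2", "NP_EXAMPLE_4"),
]
mapping = proteins_per_gene(rows)
n_proteins = sum(len(ids) for ids in mapping.values())
print(f"{n_proteins} proteins map to {len(mapping)} genes")
```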

 

Exercise 4 — Export Structural Variants

(a) Dataset: Choose Ensembl Variation and Homo sapiens Structural Variation (GRCh38.p10).

Filters: Expand Region and select Chromosome 1, Base pair start: 130408, Base pair end: 210597. Also expand General Structural Variant features and click on Limit to Variants from source: DGVa.

Count: Shows 85 out of 5,892,964 structural variants.

Attributes: Expand Structural Variation (SV) Information and click DGVa Study Accession and Source Name. Next, expand Structural Variant (SV) Location and choose Chromosome name. Then expand Supporting Structural Variant (SSV) Location and select Sequence region start (bp) and Sequence region end (bp).

Results: Click Unique Results only and expand the preview table to All.

 

(b) Dataset: Choose Ensembl Variation and Homo sapiens Short Variation (SNPs and indels) (GRCh38.p10).

Filters: In Filter by Variation name, enter rs566014072, rs754099015.

Attributes: Variant Name, Variant Alleles, Phenotype description and Associated gene with phenotype.

Click the Results button on the toolbar.

You can view this same information in the Ensembl browser. Click on one of the variation IDs (names) in the result table. The variation tab should open in the Ensembl browser. Click Phenotype Data.
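A query like the one in part (b) can also be submitted without the web form, because BioMart accepts an XML query through its martservice address. The sketch below only builds the request URL and does not contact the server. The dataset name `hsapiens_snp`, the filter `snp_filter` and the four attribute names are assumptions about BioMart's internal naming and should be verified before use; `martservice_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed internal dataset, filter and attribute names; verify before use.
QUERY_XML = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="0">
  <Dataset name="hsapiens_snp" interface="default">
    <Filter name="snp_filter" value="rs566014072,rs754099015"/>
    <Attribute name="refsnp_id"/>
    <Attribute name="allele"/>
    <Attribute name="phenotype_description"/>
    <Attribute name="associated_gene"/>
  </Dataset>
</Query>"""

MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def martservice_url(query_xml):
    """Return a GET URL that carries the XML in the 'query' parameter."""
    return MARTSERVICE + "?" + urlencode({"query": query_xml})


url = martservice_url(QUERY_XML)
print(url[:80] + "...")
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should return the same table as the Results page, provided the assumed names are valid for the current release.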

 

Exercise 5 — Find Genes Associated with Array Probes

(a) Click New.

Dataset: Choose the Ensembl Genes database, then the Homo sapiens genes (GRCh38.p10) dataset.

Filters: Expand the GENE section by clicking on the + box. Then select Input microarray probes/probesets ID list - Affy hg u133 plus 2 probeset ID(s) and enter the list of probeset IDs in the text box (either comma separated or as a list).

Count: Shows 27 genes match this list of probesets.

Attributes: Select the Features attributes page. Next, expand the GENE section by clicking on the + box and in addition to the default selected attributes, select Description. Now, expand the External section by clicking on the + box. Select HGNC symbol from the External References section and AFFY HG U133-PLUS-2 from the Microarray probes/probesets section.

Click the Results button on the toolbar. Select View All rows as HTML or export all results to a file. Tick the box Unique results only.

 

(b) Don’t change Dataset and Filters – simply click on Attributes.

Attributes: Select the Sequences attributes page. Expand the Sequences section by clicking on the + box. Select Flank (Transcript) and enter 2000 in the Upstream flank text box. Expand the Header information section by clicking on the + box. Select, in addition to the default selected attributes, Description and Associated Gene Name.

Note: Flank (Transcript) will give the flanks for all transcripts of a gene with multiple transcripts. Flank (Gene) will give the flanks for one possible transcript in a gene (the most 5’ coordinates for upstream flanking).

Click the Results button on the toolbar.
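The difference between Flank (Transcript) and Flank (Gene) described in the note comes down to which 5' end the 2000 bp window is measured from, and the direction of the window depends on the strand. The sketch below illustrates that coordinate arithmetic only; it is not BioMart's implementation. It assumes 1-based inclusive coordinates and strand values of +1 and -1, and the transcript coordinates are made up.

```python
def upstream_flank(start, end, strand, length=2000):
    """Return (flank_start, flank_end) for the region upstream of a feature.

    Coordinates are 1-based and inclusive; strand is +1 or -1.
    """
    if strand == 1:
        # Forward strand: upstream lies before the start coordinate.
        return max(1, start - length), start - 1
    if strand == -1:
        # Reverse strand: the 5' end is the larger coordinate.
        return end + 1, end + length
    raise ValueError("strand must be +1 or -1")


# Hypothetical transcripts of one forward-strand gene: (start, end, strand).
transcripts = [(10_500, 18_000, 1), (12_000, 18_000, 1)]

# Flank (Transcript): one upstream window per transcript.
per_transcript = [upstream_flank(*t) for t in transcripts]

# Flank (Gene): a single window measured from the most 5' coordinate.
gene_start = min(t[0] for t in transcripts)
gene_end = max(t[1] for t in transcripts)
per_gene = upstream_flank(gene_start, gene_end, 1)

print(per_transcript)
print(per_gene)
```

With two transcripts that start at different positions, the per-transcript option yields two different windows, while the per-gene option yields only the window belonging to the most 5' start.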

 

(c) You can leave the Dataset and Filters the same, and go directly to the Attributes section:

Attributes: Select the Homologues attributes page. Expand the Gene section by clicking on the + box, select Associated Gene Name and deselect Ensembl Transcript ID. Expand the Orthologues section by clicking on the + box. Select Mouse Ensembl Gene ID, Mouse Chromosome Name, Mouse Chr Start (bp) and Mouse Chr End (bp).

Results: Select View All rows as HTML or export all results to a file.

Your results should show that for most of the human genes at least one mouse orthologue has been identified.
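This last observation can be checked on an exported file by counting how many human genes have a non-empty mouse orthologue column. The sketch below assumes a tab-separated export with a header row; the column titles and all IDs are hypothetical placeholders, so substitute the titles from your own export.

```python
import csv
import io

# Hypothetical export; column titles and IDs are placeholders.
EXPORT = (
    "Human gene\tMouse orthologue\n"
    "HUMAN_GENE_1\tMOUSE_GENE_1\n"
    "HUMAN_GENE_1\tMOUSE_GENE_2\n"
    "HUMAN_GENE_2\t\n"
    "HUMAN_GENE_3\tMOUSE_GENE_3\n"
)


def orthologue_coverage(handle, gene_col, orthologue_col):
    """Return (genes with at least one orthologue, all genes) as two sets."""
    all_genes, with_orthologue = set(), set()
    for row in csv.DictReader(handle, delimiter="\t"):
        all_genes.add(row[gene_col])
        if row[orthologue_col]:  # an empty string means no orthologue reported
            with_orthologue.add(row[gene_col])
    return with_orthologue, all_genes


covered, total = orthologue_coverage(
    io.StringIO(EXPORT), "Human gene", "Mouse orthologue"
)
print(f"{len(covered)} of {len(total)} genes have a mouse orthologue")
```

For a real export, replace `io.StringIO(EXPORT)` with an open file handle.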