Exercise solutions

 

Using BioMart to Export Ensembl Data

 

Exercise 1 — Finding Genes by Protein Domain

As with all BioMart queries you must select the dataset, set your filters (input) and define your attributes (desired output). For this exercise:

Dataset: Ensembl genes in mouse

Filters: Signalp cleavage sites on chromosome 9

Attributes: Ensembl gene and transcript IDs and Associated gene names
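
The same dataset/filters/attributes triple can also be written as a BioMart XML query and sent to the web service instead of clicking through the interface. The sketch below only builds the query and the URL. The endpoint and the internal names (mmusculus_gene_ensembl, chromosome_name, with_signalp, ensembl_gene_id, ensembl_transcript_id, external_gene_name) are assumptions, not taken from this exercise; the XML button in BioMart shows the exact names for the current release.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed public BioMart web-service endpoint; check the current Ensembl help pages.
MART_URL = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset, filters, attributes):
    """Return BioMart query XML for a single dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        if isinstance(value, bool):
            # Boolean filters: excluded="0" corresponds to Only, "1" to Excluded.
            ET.SubElement(ds, "Filter", {"name": name, "excluded": "0" if value else "1"})
        else:
            ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Internal names below are assumptions; verify them with BioMart's XML button.
query_xml = build_query(
    dataset="mmusculus_gene_ensembl",
    filters={"chromosome_name": "9", "with_signalp": True},
    attributes=["ensembl_gene_id", "ensembl_transcript_id", "external_gene_name"],
)
url = MART_URL + "?" + urlencode({"query": query_xml})
print(url)
```

Opening the printed URL (for example with urllib.request.urlopen) should return the result table as tab-separated text, provided the assumed names match the release you are querying.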

Go to the Ensembl homepage (http://www.ensembl.org) and click on BioMart at the top of the page.

Step 1: Dataset: Select Ensembl Genes as your database, and then select Mouse genes as the dataset.

Step 2: Filters: click on Filters on the left of the screen

Expand REGION. Change the chromosome to 9.

Scroll down and expand the PROTEIN DOMAINS section, and select Limit to genes, choosing With Cleavage site (Signalp) from the drop-down and then Only. Clicking on Count should reveal that you have filtered the dataset down to 221 genes.

Step 3: Attributes: click on Attributes on the left of the screen

Select Sequences. Expand Sequences and select Peptide (it may already be selected).

Step 4: Results: Click the Results button at the top left of the page.  

The first 10 results are displayed by default; to download your results, click GO. Note that FASTA is the only download format offered here because we are downloading sequences; other formats are available when exporting tables.
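
Once the FASTA file is downloaded, it can be read back into a simple header-to-sequence mapping. This is a minimal sketch using only the standard library; the record names and sequences in the example are hypothetical placeholders, and the exact header layout depends on the header attributes selected in BioMart.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    parts = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them and join later.
            parts[header].append(line)
    return {h: "".join(chunks) for h, chunks in parts.items()}


# Hypothetical records standing in for a downloaded file.
example = """>GENE_1|TRANSCRIPT_1
MKTAYIAK
QRQISFVK
>GENE_2|TRANSCRIPT_2
MSHHWGYG
"""
records = read_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```

For a real download, pass the file contents (for example open(path).read()) in place of the example string.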

  

Exercise 2 — Export Homologues

Click the New button at the top left of the page.

Step 1: Dataset: Choose the Ensembl Genes database and then the Ciona savignyi genes dataset.

Step 2: Filters

Expand the GENE section and enter the gene list in the Input external references ID list box.

Note that the format of the IDs you enter must match the format selected in the drop-down menu above the box. You can use the Count button to check that your IDs have been accepted.
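
Before pasting a long list into the ID box, it can help to tidy it up first. The sketch below splits a pasted list on commas or whitespace and drops blanks and repeats. The IDs shown are hypothetical placeholders, since the exercise's gene list is not reproduced in these solutions.

```python
import re


def parse_id_list(raw):
    """Split pasted IDs on commas or whitespace, dropping blanks and repeats."""
    seen = {}
    for token in re.split(r"[,\s]+", raw.strip()):
        if token:
            # dict keys keep first-seen order, so the input order is preserved.
            seen.setdefault(token, None)
    return list(seen)


# Hypothetical IDs mixing commas, newlines and a repeated entry.
pasted = "ID_A, ID_B\nID_C\r\nID_A\n"
ids = parse_id_list(pasted)
print(len(ids), "unique IDs")
print("\n".join(ids))
```

The number of unique IDs gives a figure to compare against the Count button, although Count reports matching genes rather than accepted IDs, so the two need not be equal.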

Step 3: Attributes 

Select the Homologues option at the top of the Attributes page, expand the ORTHOLOGUES section, scroll down to find the Human Orthologues section and choose Human Ensembl Gene ID.

Step 4: Results 

Click Unique Results only and expand the preview table to All.
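
If you export the table without ticking Unique Results only, repeated rows can be removed afterwards. The sketch below assumes that the option simply drops rows that are identical across all selected columns; the row values are hypothetical placeholders.

```python
def unique_rows(rows):
    """Drop rows that repeat an earlier row exactly, keeping the first occurrence."""
    return list(dict.fromkeys(tuple(row) for row in rows))


# Hypothetical (C. savignyi gene, human orthologue) pairs with one repeated row.
rows = [
    ("CS_GENE_1", "HS_GENE_1"),
    ("CS_GENE_1", "HS_GENE_1"),
    ("CS_GENE_2", ""),
]
deduplicated = unique_rows(rows)
print(deduplicated)
```

A gene with no human orthologue keeps its row with an empty second column, so it still appears once in the output.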

 

Exercise 3 — Convert IDs

Click New.

Step 1: Dataset: Choose the Ensembl Genes database and then the Human genes dataset.

Step 2: Filters 

Expand the GENE section and select Input external references ID list. From the drop-down list choose RefSeq peptide ID(s) [e.g. NP_001001130] and enter the list of IDs in the text box (either comma separated or one ID per line).

Click the Count button; this shows 11 genes. Remember that one gene may have multiple splice variants/transcripts coding for different proteins, which is why these 29 proteins do not correspond to 29 genes.

Step 3: Attributes

Select the FEATURES attributes page. Expand the External section by clicking on the + box. Select HGNC symbol and RefSeq Protein ID from the External References section.

Step 4: Results 

Select View All rows as HTML or export all results to a file.
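
To see how several protein IDs collapse onto one gene, the exported table can be grouped by gene symbol. This is a sketch under assumptions: the column headers "HGNC symbol" and "RefSeq peptide ID" are guesses at the exported header names, and the rows are hypothetical placeholders rather than the exercise's real IDs.

```python
import csv
import io
from collections import defaultdict


def proteins_per_gene(tsv_text, gene_col, protein_col):
    """Group protein IDs by gene from tab-separated text with a header row."""
    groups = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row[protein_col]:
            groups[row[gene_col]].add(row[protein_col])
    return {gene: sorted(ids) for gene, ids in groups.items()}


# Hypothetical export: three proteins that map to two genes.
example = (
    "HGNC symbol\tRefSeq peptide ID\n"
    "GENE_X\tNP_X1\n"
    "GENE_X\tNP_X2\n"
    "GENE_Y\tNP_Y1\n"
)
grouped = proteins_per_gene(example, "HGNC symbol", "RefSeq peptide ID")
n_proteins = sum(len(ids) for ids in grouped.values())
print(len(grouped), "genes,", n_proteins, "proteins")
```

Run on the real export, the same counts should reproduce the difference between the number of genes reported by Count and the number of protein IDs you entered.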

 

Exercise 4 — Export Variants

(a) Click New.

Step 1: Dataset: Choose Ensembl Variation and Human Structural Variants.

Step 2: Filters

Expand REGION and select Chromosome 1, Base pair start: 130408, Base pair end: 210597. Also expand GENERAL STRUCTURAL VARIANT FILTERS and click on Limit to Variants from source: DGVa if this is not already selected.

Click on Count; this shows 87 out of 6,007,985 structural variants.

Step 3: Attributes

Click Study accession and Source Name. Ensure that Chromosome/scaffold name, position start and end are selected.

Step 4: Results

Click Unique Results only and expand the preview table to All.

 

(b) Click New.

Step 1: Dataset: Choose Ensembl Variation and Human Short Variation (SNPs and indels excluding flagged variants).

Step 2: Filters

Expand the GENERAL VARIANT FILTERS section, choose Filter by Variation name and enter: rs566014072, rs754099015

Step 3: Attributes

Expand the VARIANT ASSOCIATED INFORMATION section and choose Variant name and Variant alleles. Scroll down to the Phenotype annotation section and choose Phenotype description and Associated gene with phenotype. Uncheck the chromosome and start/end position attributes.

Step 4: Results

You can view this same information in the Ensembl browser. Click on one of the variation IDs (names) in the result table. The variation tab should open in the Ensembl browser. Click Phenotype Data.

 

Exercise 5 — Find Genes Associated with Array Probes

(a) Click New.

Step 1: Dataset: Choose the Ensembl Genes database, then the Human genes dataset.

Step 2: Filters 

Expand the GENE section and select Input microarray probes/probesets ID list. Choose AFFY HG U133 Plus 2 probe ID(s) [e.g. 1553551_s_at] from the drop down list above and enter the list of probeset IDs in the text box (either comma separated or as a list). Count shows that 27 genes match this list of probesets.

Step 3: Attributes 

Expand GENE and select Description (Gene and Transcript IDs are already selected). Scroll down and expand the EXTERNAL section. Find the External References section and choose HGNC symbol, then scroll down to the Microarray probes/probesets section and choose AFFY HG U133 Plus 2 probe.

Step 4: Results  

Select View All rows as HTML or export all results to a file. Tick the box Unique results only.

 

(b) Don’t change Dataset and Filters – simply click on Attributes.

Step 3: Attributes 

Select the Sequences option at the top of the attributes page.

Expand the SEQUENCES section. Select Flank (Transcript) and enter 2000 in the Upstream flank text box. Expand the HEADER INFORMATION section. Select, in addition to the default selected attributes, Gene description and Gene name.

Note: Flank (Transcript) will give the flanks for all transcripts of a gene with multiple transcripts. Flank (Gene) will give the flanks for one possible transcript in a gene (the most 5’ coordinates for upstream flanking).
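
The difference between the two flank options can be illustrated with coordinates. The sketch below assumes 1-based inclusive coordinates and that the upstream flank ends immediately before the feature's 5' end; the transcript positions are hypothetical, and this is an illustration of the note above rather than BioMart's actual implementation.

```python
def upstream_flank(start, end, strand, length):
    """Return (flank_start, flank_end) for the region upstream of a feature.

    Coordinates are 1-based and inclusive. No clamping is applied, so a
    feature close to the start of a chromosome can give positions below 1.
    """
    if strand == 1:
        # Forward strand: the 5' end is the start coordinate.
        return (start - length, start - 1)
    if strand == -1:
        # Reverse strand: the 5' end is the end coordinate.
        return (end + 1, end + length)
    raise ValueError("strand must be 1 or -1")


# Hypothetical forward-strand gene with two transcripts given as (start, end).
transcripts = {"TRANSCRIPT_A": (10000, 15000), "TRANSCRIPT_B": (10500, 15000)}
strand = 1

# Flank (Transcript): one flank per transcript.
per_transcript = {
    name: upstream_flank(s, e, strand, 2000) for name, (s, e) in transcripts.items()
}

# Flank (Gene): a single flank taken from the gene's most 5' coordinate.
gene_start = min(s for s, _ in transcripts.values())
gene_end = max(e for _, e in transcripts.values())
per_gene = upstream_flank(gene_start, gene_end, strand, 2000)

print(per_transcript)
print(per_gene)
```

In this example the gene flank coincides with the flank of TRANSCRIPT_A, the transcript that starts furthest 5', while TRANSCRIPT_B gets its own, different flank.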

Step 4: Results  

Download the FASTA file.

 

(c) Don’t change Dataset and Filters – simply click on Attributes.

Step 3: Attributes 

Select the Homologues option at the top of the attributes page.

Expand the GENE section, select Gene name and deselect Transcript stable ID. Expand the ORTHOLOGUES section.

Scroll down to find the Mouse Orthologues section. Select Mouse gene stable ID, Mouse chromosome/scaffold name, Mouse chromosome/scaffold start (bp) and Mouse chromosome/scaffold end (bp).

Step 4: Results

Select View All rows as HTML or export all results to a file.

Your results should show that for most of the human genes at least one mouse orthologue has been identified.
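
One way to check this on an exported file is to count how many human genes have a non-empty mouse orthologue column. This is a sketch with hypothetical rows; the header "Gene stable ID" is an assumed column name, and "Mouse gene stable ID" follows the attribute selected above.

```python
import csv
import io


def orthologue_coverage(tsv_text, gene_col, orthologue_col):
    """Return (genes with at least one orthologue, total genes) from TSV text."""
    all_genes = set()
    with_orthologue = set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        all_genes.add(row[gene_col])
        if row[orthologue_col]:
            with_orthologue.add(row[gene_col])
    return len(with_orthologue), len(all_genes)


# Hypothetical export: GENE_1 has two orthologues, GENE_3 has none.
example = (
    "Gene stable ID\tMouse gene stable ID\n"
    "GENE_1\tMOUSE_GENE_1A\n"
    "GENE_1\tMOUSE_GENE_1B\n"
    "GENE_2\tMOUSE_GENE_2\n"
    "GENE_3\t\n"
)
found, total = orthologue_coverage(example, "Gene stable ID", "Mouse gene stable ID")
print(found, "of", total, "genes have at least one mouse orthologue")
```

A gene with several orthologue rows is counted once, so the result is a per-gene figure rather than a row count.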