Exercise solutions


Exploring variants in Ensembl - solutions


Exercise 1 – Human population genetics and phenotype data  

First, find the variant tab for the variant rs11725853. Search Ensembl for rs11725853 directly from the homepage.

(a) Find the ‘Alleles’ header on the variant summary page.

G/A/C is shown as the possible nucleotides at this position, so there are two alternate alleles (A and C), and one reference allele (G).
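If you want to handle allele strings like this outside the browser, the same reading can be sketched in a few lines of Python. This is only an illustration: the function name split_alleles is made up for this example, and it assumes the reference allele is always listed first, as Ensembl does.

```python
def split_alleles(allele_string):
    """Split an Ensembl-style allele string into (reference, alternates).

    Assumption: the reference allele is the first one listed.
    """
    alleles = [a.strip() for a in allele_string.split("/") if a.strip()]
    if not alleles:
        raise ValueError("empty allele string")
    return alleles[0], alleles[1:]


reference, alternates = split_alleles("G/A/C")
print("Reference allele:", reference)
print("Alternate alleles:", ", ".join(alternates))
```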

Click on Phenotype data in the left-hand navigation panel.

The first table on this page has a column called ‘Associated allele’, where A is listed as the risk allele for Urinary albumin excretion rate in type 1 diabetes.

(b) Click on Population genetics in the left-hand navigation panel.

The 1000 Genomes Project Phase 3 is the first resource listed. The variant frequencies are summarised by super-population in the pie-charts, and also in the table below. The East Asian (EAS) population has the highest frequency of the risk allele A, at 35%. The table shows us that the most common genotype in this population is A|G, with ~45% of the population having this genotype.
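The link between the genotype table and the allele frequencies can be checked by hand: each individual carries two alleles, so every genotype contributes two allele observations. The sketch below shows the arithmetic in Python. The genotype counts are invented for illustration (they are not the 1000 Genomes figures), chosen only so that A|G is 45% of genotypes and A comes out at 35%.

```python
from collections import Counter


def allele_frequencies(genotype_counts):
    """Derive allele frequencies from counts of diploid genotypes.

    Genotypes are written like 'A|G'; each individual contributes two alleles.
    """
    allele_counts = Counter()
    for genotype, n_individuals in genotype_counts.items():
        for allele in genotype.split("|"):
            allele_counts[allele] += n_individuals
    total = sum(allele_counts.values())
    return {allele: count / total for allele, count in allele_counts.items()}


# Hypothetical counts for 1000 individuals (illustration only, not real data).
example_counts = {"G|G": 425, "A|G": 450, "A|A": 125}
frequencies = allele_frequencies(example_counts)
for allele in sorted(frequencies):
    print(allele, round(frequencies[allele], 3))
```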

(c) Click on Citations in the left-hand navigation panel.

This variant is described in the paper PMID:24595857, ‘Genome-wide association study of urinary albumin excretion rate in patients with type 1 diabetes’. Note that you can also find the PubMed ID on the Phenotype Data page. This may be an easier way to find relevant papers if the variant has many attributed phenotypes.

(d) The ancestral allele is G, as shown next to the allele information in the summary at the top of the page. To find out more information, click on Phylogenetic context in the left-hand navigation panel.

Click on the Select another alignment button in the blue box, select the 70 eutherian mammals EPO LOW COVERAGE alignment, and click on Go.

A region containing the SNP (highlighted in red and placed in the centre) and its flanking sequence is displayed. The G allele is conserved in the majority of this group.


Exercise 2 – Exploring a SNP in human

(a) Go to the Ensembl homepage (http://www.ensembl.org/).

Type rs1801133 in the Search box, then click Go.

Click on rs1801133.
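The same variant record can also be requested programmatically from the Ensembl REST service instead of the search box. The sketch below only builds the request URL and does not contact the server. It assumes the REST server at https://rest.ensembl.org and its variation/:species/:id endpoint; the helper name build_variation_url is made up for this example, so check the REST documentation before relying on it.

```python
from urllib.parse import quote, urlencode

# Assumption: the public Ensembl REST server and its variation endpoint.
REST_SERVER = "https://rest.ensembl.org"


def build_variation_url(variant_id, species="human"):
    """Build a REST URL asking for one variant record as JSON."""
    path = "/variation/{}/{}".format(quote(species, safe=""), quote(variant_id, safe=""))
    query = urlencode({"content-type": "application/json"}, safe="/")
    return "{}{}?{}".format(REST_SERVER, path, query)


print(build_variation_url("rs1801133"))
```

Opening the printed URL in a browser, or fetching it with urllib.request, should return the variant record as JSON if the service is available.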

(b) Click on Genes and Regulation in the side menu (or the Genes and Regulation icon). 

No, rs1801133 is a missense variant in four MTHFR transcripts. It is a downstream gene variant of ENST00000418034.

(c) In Ensembl, the alleles of rs1801133 are given as G/A because these are the alleles in the forward strand of the genome. In the literature and in dbSNP, the alleles are given as C/T because the MTHFR gene is located on the reverse strand. The alleles in the actual gene and transcript sequences are C/T.
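The strand conversion described here is a reverse complement of each allele. A minimal sketch in Python, with function names made up for this example and handling only the four unambiguous DNA bases:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


def flip_allele_string(allele_string):
    """Convert forward-strand alleles (e.g. 'G/A') to the reverse strand."""
    return "/".join(reverse_complement(a) for a in allele_string.split("/"))


# Forward-strand alleles of rs1801133 as shown in Ensembl.
print(flip_allele_string("G/A"))  # prints C/T
```

For single-base alleles the reversal has no effect, but it matters for insertions and deletions longer than one base.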

(d) Click on Population genetics in the side menu.

In all populations but two (from the 1000 Genomes and HapMap projects), the allele G is the major one. The two exceptions are CLM (Colombian in Medellín; 1000 Genomes) and HCB (Han Chinese in Beijing, China; HapMap).

(e) Click on Phenotype Data in the left-hand side menu.

The specific studies where the association was originally described are given in the Phenotype Data table. Links between rs1801133 and homocysteine levels were described in two papers. Click on the PubMed IDs pubmed:20031578 and pubmed:23824729 for more details.

(f) Click on Phylogenetic Context in the side menu.

Select Alignment: 8 primates EPO and click Go.

Gorilla, vervet, chimp, macaque, olive baboon and marmoset all have a G in this position. Please note that there is no variation database for gorilla, olive baboon, vervet or marmoset though.


Exercise 3 – Exploring a SNP in mouse

(a) Go to www.ensembl.org, type rs29522348 in the search box. Click on rs29522348 (Mouse Variation).

SNP rs29522348 is located at 17:73924993 (chromosome 17, position 73924993). In Ensembl, variant alleles are always given on the forward strand.

(b) Find the ‘HGVS names’ header in the summary information. Click on Show to reveal information about HGVS nomenclature.

This SNP has three HGVS names, one at the genomic DNA level (NC_000083.6:g.73924993C>T), one at the transcript level (ENSMUST00000024866.4:c.721G>A) and one at the protein level (ENSMUSP00000024866.4:p.Val241Ile).
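All three names follow the same pattern: a reference sequence identifier, a colon, a one-letter level prefix (g, c or p) and then the change itself. The sketch below splits a name into those parts. It is a simplification that only recognises the g., c. and p. prefixes seen here, not the full HGVS nomenclature, and the function name parse_hgvs is made up for this example.

```python
import re

# Simplified pattern: <reference sequence>:<level>.<change>
HGVS_PATTERN = re.compile(r"^(?P<reference>[^:]+):(?P<level>[gcp])\.(?P<change>.+)$")
LEVEL_NAMES = {"g": "genomic", "c": "coding transcript", "p": "protein"}


def parse_hgvs(name):
    """Split a simple HGVS name into reference, level and change."""
    match = HGVS_PATTERN.match(name)
    if match is None:
        raise ValueError("not a recognised HGVS name: " + name)
    return {
        "reference": match.group("reference"),
        "level": LEVEL_NAMES[match.group("level")],
        "change": match.group("change"),
    }


for hgvs_name in (
    "NC_000083.6:g.73924993C>T",
    "ENSMUST00000024866.4:c.721G>A",
    "ENSMUSP00000024866.4:p.Val241Ile",
):
    print(parse_hgvs(hgvs_name))
```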

(c) In Ensembl, the allele that is present in the reference genome assembly is always put first (C is the allele for the reference mouse genome, strain C57BL/6J). This is referred to as the ‘Reference’ or ‘Major’ allele.

(d) Click on Sample genotypes in the left-hand side menu. In the summary of genotypes by population, click on Show for PERLEGEN:MM_PANEL2, or search for the two strain names.

There are indeed differences between the genotypes reported in those two different strains. The genotype reported in NOD/LTJ is TT whereas in BALB/cByJ the genotype is CC.


Exercise 4 – VEP

Go to www.ensembl.org and click on the Tools link at the top of the page. Click on Variant Effect Predictor and enter the three variants below:

7 117530985 117530985 G/A

7 117531038 117531038 T/C

7 117531068 117531068 T/C

Note: Variant data can be entered in a variety of formats. See http://www.ensembl.org/info/docs/variation/vep/vep_formats.html for more details.
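The lines entered above use whitespace-separated fields: chromosome, start, end and alleles written as reference/alternate. The sketch below shows how such a line could be read in Python, for example to check a file before uploading it. The class and function names are made up for this example, and any further optional fields on a line are simply ignored.

```python
from typing import NamedTuple


class VepInputLine(NamedTuple):
    chromosome: str
    start: int
    end: int
    ref: str
    alt: str


def parse_vep_line(line):
    """Parse 'chromosome start end ref/alt' separated by any whitespace."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields: " + repr(line))
    chromosome, start, end, alleles = fields[:4]
    # Anything after the first '/' is kept together as the alternate allele(s).
    ref, alt = alleles.split("/", 1)
    return VepInputLine(chromosome, int(start), int(end), ref, alt)


for raw_line in (
    "7 117530985 117530985 G/A",
    "7 117531038 117531038 T/C",
    "7 117531068 117531068 T/C",
):
    print(parse_vep_line(raw_line))
```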

Click Run.

When your job is listed as Done, click View Results.

(a) You will get a table with the consequence terms from the Sequence Ontology project (http://www.sequenceontology.org/) (e.g. synonymous, missense, downstream, intronic, 5’ UTR, 3’ UTR) provided by VEP for the listed SNPs.

The variants with ‘missense’ as their consequence annotation are those which cause a change in the protein sequence. All these variants affect the CFTR gene (ENSG00000001626). Note that you can also upload the VEP results as a track and view them on Location pages in Ensembl: just click on the link in the ‘Location’ column.

(b) Scroll across the table and find the ‘Existing variant’ column.

You can see that all these variants match existing records in dbSNP (rs IDs), and some also match COSMIC (COSM IDs) records.

(c) SIFT and PolyPhen predictions are available for missense SNPs only.

The missense variant at coordinate 7 117531038 T/C has been predicted to be probably damaging (PolyPhen) and deleterious (SIFT). The other missense variant, at coordinate 7 117531068 T/C, is predicted to be benign (PolyPhen) and tolerated (SIFT).