Exercise solutions

Exploring variants in Ensembl - solutions


Exercise 1 – Human population genetics and phenotype data  

(a) Please note there is more than one way to get this answer. Either go to the Variation Table for the human TAGAP gene, and Filter variants to the 5’UTR, or search Ensembl for rs1738074 directly.

Once you’re in the Variation tab, click on the Genes and regulation link or icon.

This SNP is found in three transcripts (ENST00000326965, ENST00000338313, and ENST00000367066).

(b) Click on Population genetics at the left of the variation tab. (Or, click on Explore this variation at the left and click the Population genetics icon.)

In Yoruba (YRI), the least frequent genotype is CC, with a frequency of 5.6%.

(c) Click on Phylogenetic context.

The ancestral allele is T and it’s inferred from the alignment in primates.

Select the 53 eutherian mammals EPO LOW COVERAGE alignment and click on Go.

A region containing the SNP (highlighted in red and placed in the centre) and its flanking sequence are displayed. The T allele is conserved in all but three of the 53 eutherian mammals displayed.

(d) Click Phenotype Data at the left of the Variation page.

This variant is associated with multiple sclerosis and coeliac disease. Risk alleles are known for both diseases, and the corresponding P values are provided. The allele A is reported as associated with coeliac disease, yet the alleles listed by Ensembl are T/C. Ensembl reports alleles on the forward strand, which suggests that A (the complement of T) was reported on the reverse strand in the original paper. Similarly, one of the alleles reported for multiple sclerosis is G, the complement of C.


Exercise 2 – Exploring a SNP in human

(a) Go to the Ensembl homepage (http://www.ensembl.org/).

Type rs1801133 in the Search box, then click Go.

Click on rs1801133.
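The same variant record can also be requested programmatically. The sketch below only builds the request URL and does not contact any server. It assumes the Ensembl REST service at rest.ensembl.org and its variation/:species/:id endpoint; the function name variation_url and the choice of optional flags are illustrative, so check the REST documentation before relying on them.

```python
from urllib.parse import quote, urlencode

# Assumption: the public Ensembl REST server and its variation endpoint.
REST_SERVER = "https://rest.ensembl.org"


def variation_url(variant_id, species="human", **options):
    """Build a URL for the variation/:species/:id endpoint (hypothetical helper).

    Extra keyword arguments are passed through as query parameters,
    for example pops=1 to ask for population allele frequencies.
    """
    query = urlencode({"content-type": "application/json", **options}, safe="/")
    return f"{REST_SERVER}/variation/{quote(species)}/{quote(variant_id)}?{query}"


print(variation_url("rs1801133"))
print(variation_url("rs1801133", pops=1))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the variant record as JSON.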

(b) Click on Genes and Regulation in the side menu (or the Genes and Regulation icon). 

No, rs1801133 is a missense variant in four MTHFR transcripts. It is a downstream gene variant of ENST00000418034.

(c) In Ensembl, the alleles of rs1801133 are given as G/A because these are the alleles in the forward strand of the genome. In the literature and in dbSNP, the alleles are given as C/T because the MTHFR gene is located on the reverse strand. The alleles in the actual gene and transcript sequences are C/T.
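The strand conversion described here is simply a base-by-base complement. A minimal sketch, assuming alleles are written as a slash-separated string; the helper name complement_alleles is hypothetical and not part of any Ensembl tool:

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement_alleles(allele_string):
    """Convert a slash-separated allele string to the opposite strand.

    Each allele is reverse-complemented, so multi-base alleles are also
    handled; characters such as "-" (deletion) are left unchanged.
    """
    alleles = allele_string.split("/")
    return "/".join(allele.translate(COMPLEMENT)[::-1] for allele in alleles)


print(complement_alleles("G/A"))  # C/T, as reported for rs1801133 in the literature
print(complement_alleles("T/C"))  # A/G, the reverse-strand alleles of rs1738074
```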

(d) Click on Population genetics in the side menu.

In all populations but two (from the 1000 Genomes and HapMap projects), the allele G is the major one. The two exceptions are CLM (Colombian in Medellín; 1000 Genomes) and HCB (Han Chinese in Beijing, China; HapMap).

(e) Click on Phenotype Data in the left hand side menu.

The specific studies in which the association was originally described are given in the Phenotype Data table. Links between rs1801133 and homocysteine levels were described in two papers. Click on the PubMed IDs pubmed:20031578 and pubmed:23824729 for more details.

(f) Click on Phylogenetic Context in the side menu.

Select Alignment: 8 primates EPO and click Go.

Gorilla, vervet, chimp, macaque, olive baboon and marmoset all have a G in this position. Please note, however, that there is no variation database for gorilla, olive baboon, vervet or marmoset.


Exercise 3 – Exploring a SNP in mouse

(a) Go to www.ensembl.org, type rs29522348 in the search box. Click on rs29522348 (Mouse Variation).

SNP rs29522348 is located at 17:73924993. In Ensembl, its alleles are given as they appear on the forward strand.

(b) Click on HGVS names to reveal information about HGVS nomenclature.

This SNP has three HGVS names, one at the genomic DNA level (17:g.73924993C>T), one at the transcript level (ENSMUST00000024866.4:c.721G>A) and one at the protein level (ENSMUSP00000024866.4:p.Val241Ile).
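Each HGVS name has the same layout: a reference sequence, a colon, a one-letter level prefix (g, c or p) and the change itself. A minimal sketch of splitting a name into these parts; the pattern covers only the simple forms shown above, not the full HGVS nomenclature, and the function name split_hgvs is hypothetical:

```python
import re

# reference ":" level "." change, e.g. ENSMUST00000024866.4:c.721G>A
HGVS_PATTERN = re.compile(r"^(?P<reference>[^:]+):(?P<level>[gcp])\.(?P<change>.+)$")

LEVELS = {"g": "genomic DNA", "c": "coding DNA (transcript)", "p": "protein"}


def split_hgvs(name):
    """Return (reference, level description, change) for a simple HGVS name."""
    match = HGVS_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised g./c./p. HGVS name: {name!r}")
    return match["reference"], LEVELS[match["level"]], match["change"]


for name in ("ENSMUST00000024866.4:c.721G>A", "ENSMUSP00000024866.4:p.Val241Ile"):
    print(split_hgvs(name))
```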

(c) In Ensembl, the allele that is present in the reference genome assembly is always put first (C is the allele for the reference mouse genome, strain C57BL/6J). 

(d) Click on Sample genotypes in the left hand side menu. In the summary of genotypes by population, click on Show for PERLEGEN:MM_PANEL2, or search for the two strain names.

There are indeed differences between the genotypes reported in those two different strains. The genotype reported in NOD/LTJ is TT whereas in BALB/cByJ the genotype is CC.


Exercise 4 – VEP

Go to www.ensembl.org and click on the Tools link at the top of the page. Click on Variant Effect Predictor and enter the three variants as below:

7 117530985 117530985 G/A

7 117531038 117531038 T/C

7 117531068 117531068 T/C

Note: Variation data can be entered in a variety of formats. For more details, see http://www.ensembl.org/info/docs/variation/vep/vep_formats.html
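When there are many variants, the input lines can be generated rather than typed. A minimal sketch that writes variants in the same whitespace-separated layout used in this exercise (chromosome, start, end, alleles); the helper name to_vep_line is hypothetical, and the optional strand column is left out, as it is in the exercise:

```python
def to_vep_line(chrom, start, end, ref, alt):
    """Format one variant as 'chromosome start end ref/alt'."""
    if start > end:
        raise ValueError("start must not be greater than end for a substitution")
    return f"{chrom} {start} {end} {ref}/{alt}"


# The three variants from this exercise.
variants = [
    ("7", 117530985, 117530985, "G", "A"),
    ("7", 117531038, 117531038, "T", "C"),
    ("7", 117531068, 117531068, "T", "C"),
]

vep_input = "\n".join(to_vep_line(*variant) for variant in variants)
print(vep_input)
```

The printed text can be pasted directly into the VEP input box.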

Click Run.

When your job is listed as Done, click View Results.

You will get a table with the consequence terms from the Sequence Ontology project (http://www.sequenceontology.org/), e.g. synonymous, missense, downstream, intronic, 5’ UTR, 3’ UTR, provided by VEP for the listed SNPs. You can also upload the VEP results as a track and view them on Location pages in Ensembl. SIFT and PolyPhen predictions are available for missense SNPs only. Two of the entered positions have such predictions: the variant at coordinate 117531038 is predicted to be probably damaging/deleterious, and the variant at coordinate 117531068 is predicted to be benign/tolerated. All three variants have already been described and are known as rs1800078, rs1800077 and rs35516286 in dbSNP and other sources (databases, literature, etc.).