Advanced exercises

Advanced Access

Exercise 1 — Attaching URLs of large files

Larger files, such as BAM files generated by NGS, need to be attached by URL. There is a BAM file of human chromosome 20 RNASeq data online at: http://www.ebi.ac.uk/~emily/Workshops/BAM/

Here you can see a number of BAM files (.bam) with corresponding index files (.bam.bai). We’re interested in the files GRCh38.20.illumina.merged.1.bam and GRCh38.20.illumina.merged.1.bam.bai. These are the BAM file and its index file respectively. When you attach a BAM file to Ensembl, its index file must be in the same folder as the BAM file.
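The naming rule described above (index file = BAM file name plus .bai, in the same folder) can be sketched in a few lines of Python. This is only an illustration: the helper names bam_and_index_urls and url_exists are hypothetical, and url_exists needs network access, so it is defined but not called here.

```python
import urllib.request
from urllib.parse import urljoin

# Folder and file name taken from the exercise text above.
FOLDER = "http://www.ebi.ac.uk/~emily/Workshops/BAM/"
BAM_NAME = "GRCh38.20.illumina.merged.1.bam"


def bam_and_index_urls(folder, bam_name):
    """Return the BAM URL and the URL where its .bai index is expected."""
    bam_url = urljoin(folder, bam_name)
    return bam_url, bam_url + ".bai"


def url_exists(url):
    """Send a HEAD request and report whether the server answers 200.

    Hypothetical helper; requires network access, so it is not called here.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.status == 200


bam_url, index_url = bam_and_index_urls(FOLDER, BAM_NAME)
print(bam_url)
print(index_url)
```

The first printed URL is the one to paste when attaching the file; the second shows where the matching index is expected to sit.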

(a) Attach and view the BAM file of human chromosome 20 RNASeq data.

(b) Go to the region on chromosome 20 that contains gene CDH22. Configure the page to show your added track in the ‘unlimited’ style. What is the relationship between the number of RNASeq reads and the exons of CDH22?

(c) Zoom in on exon 1 of CDH22 so that you can see the individual sequences of the RNASeq reads.

(d) Remove the track from your region in detail view.

 

Exercise 2 — REST API endpoint queries

Complete the following exercises using single REST API endpoint queries.

(a) Get the sequence for the region from basepair 32889000 to 32891000 of human chromosome 13 in FASTA format. Hard-mask and soft-mask the sequence. How many repeat regions are there in this sequence?

(b) Get the Ensembl Gene ID for the human CCR5 (chemokine (C-C motif) receptor 5) gene.

(c) Has an orthologue for this gene been identified in chimpanzee?

(d) A famous variant in the human CCR5 gene is the delta 32 allele, a 32-basepair deletion at position 46373456-46373487 (rs333). Individuals carrying one copy of the delta 32 allele seem to be resistant to infection by HIV, the virus that causes AIDS, and individuals with two copies (delta 32 homozygotes, ~1% of Caucasians) are almost completely immune to infection by HIV. The delta 32 allele may have been selected for in European populations because it confers resistance to plague (Black Death) or smallpox.

The HGVS notation for this variant is 3:g.46373456_46373487del

What is the effect of this variant on the CCR5 protein?
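As a starting point, the four queries in this exercise could be sketched in Python roughly as follows. Treat it as a hedged outline rather than a model answer. It assumes the public server at https://rest.ensembl.org and the endpoints sequence/region, xrefs/symbol, homology/symbol and vep/:species/hgvs, so check the current REST documentation before relying on the paths or parameters. The helper names (build_url, fetch, fetch_json, count_masked_runs, gene_ids, consequence_terms) are made up for this illustration, and the HGVS string is written with the del suffix that HGVS uses for deletions. Nothing is fetched unless you call fetch or fetch_json yourself.

```python
import json
import re
import urllib.request
from urllib.parse import quote, urlencode

SERVER = "https://rest.ensembl.org"


def build_url(path, **params):
    """Join the server, an endpoint path and optional query parameters."""
    query = "?" + urlencode(params) if params else ""
    return SERVER + path + query


def fetch(url, content_type="application/json"):
    """Perform the GET request (needs network access; not called here)."""
    request = urllib.request.Request(url, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return response.read().decode()


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    return json.loads(fetch(url))


def count_masked_runs(fasta, mask="soft"):
    """Count runs of lower-case (soft) or N (hard) bases in a FASTA record.

    Adjacent repeats merge into one run, so this is a lower bound.
    """
    sequence = "".join(
        line.strip() for line in fasta.splitlines() if not line.startswith(">")
    )
    pattern = r"[a-z]+" if mask == "soft" else r"N+"
    return len(re.findall(pattern, sequence))


def gene_ids(xrefs):
    """Keep the IDs of entries typed as genes in an xrefs/symbol response."""
    return [entry["id"] for entry in xrefs if entry.get("type") == "gene"]


def consequence_terms(vep_records):
    """Collect the distinct consequence terms across all transcripts."""
    terms = set()
    for record in vep_records:
        for transcript in record.get("transcript_consequences", []):
            terms.update(transcript.get("consequence_terms", []))
    return sorted(terms)


# (a) region sequence, soft-masked; pass mask="hard" for hard-masking
url_a = build_url("/sequence/region/human/13:32889000..32891000", mask="soft")
# (b) cross-references for the gene symbol
url_b = build_url("/xrefs/symbol/human/CCR5")
# (c) orthologues restricted to chimpanzee (parameter names are assumptions)
url_c = build_url(
    "/homology/symbol/human/CCR5", target_species="chimpanzee", type="orthologues"
)
# (d) variant consequences from the HGVS notation
url_d = build_url("/vep/human/hgvs/" + quote("3:g.46373456_46373487del", safe=":"))

for url in (url_a, url_b, url_c, url_d):
    print(url)
```

With network access, something like count_masked_runs(fetch(url_a, "text/x-fasta")) or gene_ids(fetch_json(url_b)) would turn the responses into answers. Note that a run of N in a hard-masked sequence can also be a genuine assembly gap, so compare the hard-masked and soft-masked counts.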

 

Exercise 3 — Methylation data in human (synoptic exercise)

This exercise requires you to combine the knowledge you have gained about different aspects of Ensembl. It is designed to be challenging and to make you work out solutions yourself.

The human PDHA2 gene, which encodes a subunit of the pyruvate dehydrogenase complex, is exclusively expressed in spermatogenic cells. In the paper ‘Human testis-specific PDHA2 gene: Methylation status of a CpG island in the open reading frame correlates with transcriptional activity’ (Pinheiro et al., Mol Genet Metab. 2010 Apr;99(4):425-30), two CpG islands in the PDHA2 gene are reported, one encompassing the core promoter region and extending into the open reading frame, the other located exclusively in the coding region. The latter CpG island was shown to be methylated in somatic tissues but demethylated in testicular germ cells, and has therefore been proposed to play an important role in the tissue-specific expression of the PDHA2 gene.

(a) Find the PDHA2 gene for human and go to the Region in detail page. Zoom out so that 5 kb around the PDHA2 gene is shown.

(b) Turn on the CpG islands track. Two CpG islands are reported in the PDHA2 gene by Pinheiro et al. (2010). Do they appear in this track? If not, why not?

(c) Confirm the existence of the two CpG islands using the EMBOSS program CpGPlot (http://www.ebi.ac.uk/Tools/seqstats/emboss_cpgplot) on the sequence around the PDHA2 gene.

(d) Upload the CpG islands found by CpGPlot using Custom tracks. Use BED format, which in its simplest form just consists of the chromosome and the start and end coordinates, separated by spaces (as an optional fourth field, you can add a name/description). The genomic start and end coordinates of the CpG islands can be calculated from the genomic start coordinate of the sequence on which the CpGPlot program was run and the relative location of the CpG islands on this sequence as given by the CpGPlot output.

(e) Create a link that lets you share your new BED track with colleagues, displayed alongside the %GC track.

(f) Is there a regulatory feature at the 5’ end of the PDHA2 gene? What type? Which cell type(s) is it active in? What evidence supports the presence of this feature?

(g) Turn on the RNA-seq tracks for different tissues. Is there evidence that PDHA2 is expressed in one tissue more than others?

(h) How well conserved is the region of the PDHA2 gene amongst the 40 eutherian mammals? Are the CpG islands conserved?

(i) How many biological processes are associated with PDHA2 by Gene Ontology (GO) terms? Can you export the sequences of all human genes that are also associated with the first of these terms?

(j) Can you fetch the gene sequence for PDHA2 in FASTA using the Ensembl REST API?
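Two parts of this exercise reduce to small computations: the coordinate arithmetic in (d) and the REST query in (j). The Python sketch below illustrates both under stated assumptions:

- the CpGPlot positions are 1-based and inclusive, relative to a forward-strand sequence;
- BED starts are 0-based, as in the standard BED convention (drop the final subtraction if your tool expects 1-based starts);
- the REST endpoints lookup/symbol and sequence/id are available at https://rest.ensembl.org.

All function names and the numbers in the example call are hypothetical. They are not the coordinates of PDHA2 or of its CpG islands.

```python
import urllib.request

SERVER = "https://rest.ensembl.org"


def to_genomic(seq_start, rel_start, rel_end):
    """Convert 1-based positions on a submitted sequence to genomic positions.

    seq_start is the genomic coordinate of the first base of that sequence.
    """
    return seq_start + rel_start - 1, seq_start + rel_end - 1


def bed_line(chrom, genomic_start, genomic_end, name=None):
    """Format one space-separated BED line (0-based start, inclusive end)."""
    fields = [chrom, str(genomic_start - 1), str(genomic_end)]
    if name is not None:
        fields.append(name)
    return " ".join(fields)


def gene_lookup_url(species, symbol):
    """URL that returns gene details (including the stable ID) for a symbol."""
    return f"{SERVER}/lookup/symbol/{species}/{symbol}"


def gene_fasta_url(gene_id):
    """URL that returns the genomic sequence for a stable gene ID."""
    return f"{SERVER}/sequence/id/{gene_id}?type=genomic"


def fetch(url, content_type):
    """Perform the GET request (needs network access; not called here)."""
    request = urllib.request.Request(url, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return response.read().decode()


# Hypothetical example: the sequence starts at genomic position 1000 and
# CpGPlot reports an island at positions 48..334 of that sequence.
start, end = to_genomic(1000, 48, 334)
print(bed_line("4", start, end, "CpG_island_1"))

print(gene_lookup_url("human", "PDHA2"))
```

For (j), one possible route is to read the "id" field from the JSON returned by the lookup URL, then request gene_fasta_url(that_id) with the content type text/x-fasta.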