Exercise solutions

Advanced Access - solutions

Exercise 1 — Attaching URLs of large files

(a) Click on the Custom tracks button in any region in detail view in Ensembl.

A dialogue box labelled ‘Add a custom track’ will appear. Here we can name our data; for this exercise we will label it ‘Illumina reads’.

Paste in the URL of the BAM file itself (http://www.ebi.ac.uk/~emily/Workshops/BAM/GRCh38.20.illumina.merged.1.bam)

Since this is a file, the interface detects the “.BAM” file extension and automatically sets the format to BAM. Click on Add data and close the menu.

(b) Search for the CDH22 gene and click on the Location tab. Click on Configure this Page and add the custom track from the ‘Personal Data’ menu. Select the Unlimited track style. You can see that more RNASeq reads map to the exons of the gene than to the rest of the region.

(c) Zoom in to see the sequence itself, either by dragging out boxes in the view or by using the scale bar in the top right of the Region in detail image.

(d) Click on Configure this Page and turn off this track by selecting Off in the track style of the personal data track.

You can also remove the custom data by clicking on Manage your Data and then clicking on the trash can icon associated with this data.

 

Exercise 2 — REST API endpoint queries

Complete the following exercises using single REST API endpoint queries.

(a) Go to the REST API documentation page at http://rest.ensembl.org/documentation

Click on GET sequence/region/:species/:region to get the documentation for this command.

Use the documentation to construct a URL in the correct form and add the genomic co-ordinates from chromosome 13 to create the URL, i.e. http://rest.ensembl.org/sequence/region/human/13:32889000..32891000:1?content-type=fasta

This URL will give you the sequence.

To return a hard-masked or soft-masked version of the region sequence, add the optional mask parameter to the URL in the format given in the documentation: http://rest.ensembl.org/sequence/region/human/13:32889000..32891000:1?content-type=fasta;mask=hard

http://rest.ensembl.org/sequence/region/human/13:32889000..32891000:1?content-type=fasta;mask=soft

Hard mask will mask all repeats as N's and soft mask will mask repeats as lower-case characters. There are four separate repeat features in this region.
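
The URL construction above can be sketched in Python. This is a minimal illustration, not part of the original exercise: the function names region_sequence_url and count_masked_runs are hypothetical, and the server name and parameters are copied from the URLs in this answer. The second function counts masked stretches in a sequence string you have already downloaded, which is one way to check the number of repeat features by eye.

```python
import re

# Server name taken from the exercise text.
REST_SERVER = "http://rest.ensembl.org"


def region_sequence_url(species, region, mask=None):
    """Build a sequence/region URL; mask may be None, "hard" or "soft"."""
    url = f"{REST_SERVER}/sequence/region/{species}/{region}?content-type=fasta"
    if mask is not None:
        # The exercise URLs separate parameters with a semicolon.
        url += f";mask={mask}"
    return url


def count_masked_runs(sequence, mask="soft"):
    """Count contiguous masked stretches in a plain sequence string.

    Soft masking shows repeats as lower-case runs, hard masking as runs of N.
    Adjacent repeats merge into a single run, so the count is only a lower
    bound on the number of repeat features.
    """
    pattern = r"[acgtn]+" if mask == "soft" else r"N+"
    return len(re.findall(pattern, sequence))


print(region_sequence_url("human", "13:32889000..32891000:1", mask="soft"))
```

Pass the sequence lines of the FASTA output (without the header line) to count_masked_runs to count the masked stretches in the region.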

(b) Click on GET xrefs/symbol/:species/:symbol to get the documentation for this command.

Use the documentation to construct a URL in the correct form and add the gene name to create the URL, i.e. http://rest.ensembl.org/xrefs/symbol/homo_sapiens/CCR5?content-type=application/json

This URL will give you the Ensembl stable ID: ENSG00000160791.
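
As a rough sketch, the same lookup can be scripted. The helper names below are hypothetical, and the sample payload is an assumption about the shape of the JSON (a list of records with "type" and "id" fields), trimmed to the one record this exercise is about. Check the endpoint documentation for the full response.

```python
import json

# Server name taken from the exercise text.
REST_SERVER = "http://rest.ensembl.org"


def xrefs_symbol_url(species, symbol):
    """Build the xrefs/symbol URL used in this exercise."""
    return f"{REST_SERVER}/xrefs/symbol/{species}/{symbol}?content-type=application/json"


def gene_ids_from_xrefs(json_text):
    """Return the IDs of all records whose type is "gene"."""
    records = json.loads(json_text)
    return [record["id"] for record in records if record.get("type") == "gene"]


# Illustrative payload only (assumed shape, not fetched from the server).
sample = '[{"type": "gene", "id": "ENSG00000160791"}]'
print(xrefs_symbol_url("homo_sapiens", "CCR5"))
print(gene_ids_from_xrefs(sample))
```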

(c) Click on GET homology/id/:id or GET homology/symbol/:species/:symbol to get the documentation for the command (there are two separate queries that allow you to search by either ID or symbol).

Use the documentation to construct a URL in the correct form and add the Ensembl stable gene ID (from (b)) to create the URL. You can filter by the Latin name of chimpanzee (Pan troglodytes), i.e. http://rest.ensembl.org/homology/id/ENSG00000160791?content-type=application/json;target_species=pan_troglodytes

Or

Use the documentation to construct a URL in the correct form and add the associated name (CCR5) to create the URL. You can filter by the Latin name of chimpanzee (Pan troglodytes), i.e. http://rest.ensembl.org/homology/symbol/human/CCR5?content-type=application/json;target_species=pan_troglodytes

Either of these URLs will give you the human and chimpanzee orthologous pair.
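
The two query forms differ only in the path. A minimal sketch of building both is shown here; the function names are hypothetical, and the path and parameter layout is copied from the URLs in this answer rather than from any other source.

```python
# Server name taken from the exercise text.
REST_SERVER = "http://rest.ensembl.org"


def homology_by_id_url(gene_id, target_species):
    """Homology query keyed on an Ensembl stable gene ID."""
    return (
        f"{REST_SERVER}/homology/id/{gene_id}"
        f"?content-type=application/json;target_species={target_species}"
    )


def homology_by_symbol_url(species, symbol, target_species):
    """Homology query keyed on a species and gene symbol."""
    return (
        f"{REST_SERVER}/homology/symbol/{species}/{symbol}"
        f"?content-type=application/json;target_species={target_species}"
    )


print(homology_by_id_url("ENSG00000160791", "pan_troglodytes"))
print(homology_by_symbol_url("human", "CCR5", "pan_troglodytes"))
```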

(d) Click on GET vep/:species/hgvs/:hgvs_notation or GET vep/:species/id/:id to get the documentation for the command (there are two separate queries that allow you to search by either HGVS notation or variant ID).

Use the documentation to construct a URL in the correct form and add the HGVS notation to create the URL, i.e. http://rest.ensembl.org/vep/human/hgvs/3:g.46373456_46373487del?content-type=application/json

Or

Use the documentation to construct a URL in the correct form and add the variant ID to create the URL, i.e. http://rest.ensembl.org/vep/human/id/rs333?content-type=application/json

This variant is a frameshift variant in the CCR5 gene (listed as “most severe consequence”).
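
A short sketch of the same two queries in Python follows. The function names are hypothetical. The sample payload is an assumption, heavily trimmed for illustration: it keeps only a "most_severe_consequence" field, which is the value this answer refers to, and is not a real server response.

```python
import json

# Server name taken from the exercise text.
REST_SERVER = "http://rest.ensembl.org"


def vep_hgvs_url(species, hgvs_notation):
    """VEP query keyed on HGVS notation."""
    return f"{REST_SERVER}/vep/{species}/hgvs/{hgvs_notation}?content-type=application/json"


def vep_id_url(species, variant_id):
    """VEP query keyed on a variant ID."""
    return f"{REST_SERVER}/vep/{species}/id/{variant_id}?content-type=application/json"


def most_severe_consequences(json_text):
    """Collect the most severe consequence from each result record."""
    return [record.get("most_severe_consequence") for record in json.loads(json_text)]


# Illustrative payload only (assumed shape, not fetched from the server).
sample = '[{"most_severe_consequence": "frameshift_variant"}]'
print(vep_id_url("human", "rs333"))
print(most_severe_consequences(sample))
```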

 

Exercise 3 — Methylation data in humans (synoptic exercise)

(a) Go to the Ensembl homepage (www.ensembl.org).

Select Search: Human and type PDHA2 in the ‘for’ text box. Click Go.

Click on 4:95840019-95841474:1.

Zoom out so that a region of approximately 5 kb around the PDHA2 gene is shown.

You may want to turn off all tracks that you added to the display in the previous exercises as follows:

Click Configure this page in the side menu.

Click Reset configuration.

Click SAVE and close.

(b) Click Configure this page in the side menu.

Type cpg in the Find a track box.

Select CpG islands.

Click SAVE and close.

No CpG islands are shown. Ensembl requires a minimum length of 400 bp for a CpG island to be included in the human database, so the reason could be that the CpG islands in the PDHA2 gene are shorter than 400 bp. However, there is a %GC track, which shows that the 5’ part of the PDHA2 gene and the region directly upstream of the gene have a high %GC (the red line in the %GC track indicates 50% GC). It is difficult, if not impossible, to distinguish individual CpG islands in this track, though.
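
To make the 50% GC line concrete, here is a minimal sketch of a windowed %GC calculation on a plain sequence string. The function names and the window size are assumptions chosen for illustration; this is not how Ensembl computes its %GC track, and it does not detect CpG islands.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a sequence (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def windows_above_threshold(sequence, window=100, threshold=50.0):
    """List non-overlapping windows whose %GC exceeds the threshold.

    Each hit is (start, end, percent), with a 0-based start and an
    exclusive end relative to the sequence string.
    """
    hits = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        percent = gc_percent(chunk)
        if percent > threshold:
            hits.append((start, start + len(chunk), percent))
    return hits


# Toy sequence: only the first window of four bases is GC-rich.
print(windows_above_threshold("GGCCATAT", window=4))
```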

(c) Click Export data in the side menu.

Click Next>.

Click on Text.

Select and copy the sequence.

Go to http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html.

Paste the sequence into the text box. Click Run.

CpGPlot does confirm the existence of two CpG islands in the PDHA2 gene region of lengths 200 and 263 bp, respectively. So, it is indeed because of their length being less than 400 bp that these CpG islands are not present in the Ensembl database.

(d) The genomic coordinates of your CpG islands are the start coordinates of your region of interest (found at the top of your exported FASTA) plus the coordinates of the islands within that region (from EMBOSS). In my case this is:

First island: start = 95839291 + 734 = 95840025

end = 95839291 + 933 = 95840224

Second island: start = 95839291 + 1058 = 95840349

end = 95839291 + 1320 = 95840611

This gives coordinates for my CpG islands in BED format as:

chr4 95840025 95840224 cpg_island_1

chr4 95840349 95840611 cpg_island_2
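
The arithmetic above can be reproduced with a few lines of Python. This sketch simply mirrors the addition shown in this answer (region start plus the island offsets reported by EMBOSS). The names REGION_START, ISLANDS and to_bed_lines are hypothetical, and your own region start will differ if you exported a different region.

```python
# Start of the exported region, read from the top of the exported FASTA.
REGION_START = 95839291

# (name, start offset, end offset) of each island within the exported
# region, as reported by EMBOSS in this worked example.
ISLANDS = [
    ("cpg_island_1", 734, 933),
    ("cpg_island_2", 1058, 1320),
]


def to_bed_lines(chrom, region_start, islands):
    """Convert region-relative island offsets to genomic BED-style lines."""
    lines = []
    for name, start_offset, end_offset in islands:
        start = region_start + start_offset
        end = region_start + end_offset
        lines.append(f"{chrom} {start} {end} {name}")
    return lines


for line in to_bed_lines("chr4", REGION_START, ISLANDS):
    print(line)
```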

Click Custom tracks in the side menu.

Type CpG islands in the Name for this upload (optional) box.

Select Data format: BED.

Copy the BED formatted data into the Paste file box. Click Add data.

Click on Go to nearest region with data: 4:95790125-95890125.

The two CpG islands should now be shown on the Region in detail page. They should coincide with the regions of high %GC.

Zoom in on the two CpG islands.

To display the names of the CpG islands:

Hover over the CpG islands track name.

Hover over the icon of the cog-wheel.

Select Labels.

(e) Drag your CpG islands track so that it is next to the %GC track.

Click Share this page in the side menu.

Select the link and copy.

Paste into your internet browser to view.

(f) There is a cyan CTCF binding site 5’ of PDHA2.

Click on the feature then the ID ENSR00002011748 to get to the regulatory tab.

The CTCF binding site is active in A549, DND-41, GM12878, H1ESC, HMEC, HSMM, HUVEC, HeLa-S3, K562 and NHEK cells.

Click on Details by cell type, then Select cells and choose A549, DND-41, GM12878, H1ESC, HMEC, HSMM, HUVEC, HeLa-S3, K562 and NHEK and close, then Select evidence and choose ALL ON and close.

CTCF-binding is found in many of the cell types. Rad21 binding is also seen in H1ESC cells.

(g) Click on Configure this page, then select RNASeq models. Turn on the BAM files for all the tissues in Coverage only.

You will see histograms of RNA-seq coverage for each of the tissues. The largest number is for the merged reads. Among the tissue-specific reads, testes have a higher peak than all the other tissues. There are also wider peaks in the testes track that cover the whole gene, whereas other tissues only have a peak at the 3’ end of PDHA2.

(h) Click on Configure this page, then select Comparative genomics. Turn on the tracks for the Constrained elements for 40 eutherian mammals and Conservation score for 40 eutherian mammals.

The region of the gene itself has high GERP scores, indicated by constrained elements over most of the gene. There is no apparent difference in the conservation score between the CpG islands and their flanking regions.

(i) Click on the Gene tab, Gene: PDHA2, and select GO: Biological process.

There are seven terms in the table, the first being GO:0005975, carbohydrate metabolic process.

To export the list use BioMart.

Click on Search BioMart in the table. This will take you to a BioMart results table with the gene and transcript IDs, GO terms and gene position.

Click on Attributes.

Choose Sequences.

Expand SEQUENCES and select Unspliced (Gene).

Expand Header information and deselect Ensembl Transcript ID.

Click Results.

You can export these results if you wish.

(j) Go to the REST API documentation page at http://beta.rest.ensembl.org/documentation.

Click on GET sequence/id/:id to get the documentation for this command.

You will need the stable ID of PDHA2. Go to the browser page to find that it is ENSG00000163114.

Use the documentation to construct a URL in the correct form, i.e. http://beta.rest.ensembl.org/sequence/id/:id?format=fasta

Add the ID to the URL to create: http://beta.rest.ensembl.org/sequence/id/ENSG00000163114?format=fasta

This URL will give you the sequence.
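
As a final sketch, the URL for this step can be built from the stable ID, and the FASTA text it returns can be split into a header and a sequence. The function names are hypothetical. The server name and the format parameter are copied as written in this answer, so adjust them if your documentation page shows a different form.

```python
# Server name as written in this exercise.
REST_SERVER = "http://beta.rest.ensembl.org"


def sequence_id_url(stable_id):
    """Build the sequence/id URL in the form used in this exercise."""
    return f"{REST_SERVER}/sequence/id/{stable_id}?format=fasta"


def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    # Drop the leading ">" and join the wrapped sequence lines.
    return lines[0][1:], "".join(lines[1:])


print(sequence_id_url("ENSG00000163114"))
```

Pass the text returned by the URL to parse_fasta to get the header and the unwrapped sequence.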