2Can Support Portal - Protein and Proteomic Analysis
A Proteomics Example - Introduction
The following tools and methodologies will be used to demonstrate a proteomics analysis:
This article has been contributed by Sandra Orchard and Henning Hermjakob.
The Starting Point - a sequence
The starting point of any sort of proteomic analysis will usually be a sequence (peptide or protein).
The EBI services page provides access to a range of tools available at the EBI for proteomic analysis. These include the InterProScan tool, which we will use to assign family membership and identify functional domains.
For more information on InterProScan see the specific InterProScan 2can tutorial.
Looking at the results from InterProScan
InterPro is an integrated documentation resource for protein families, domains and sites. InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated diagnostic tool.
InterPro unifies:
- PROSITE regular expressions and profiles
- Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, Gene3D and SUPERFAMILY hidden Markov models (HMMs)
- PRINTS, provider of fingerprints (groups of aligned, un-weighted motifs)
- ProDom, which uses cluster analysis to group sequences
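To make the idea of a signature concrete, here is a minimal sketch of how a PROSITE-style regular expression can be scanned against a sequence. It is not how InterProScan itself is implemented. The pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern (PS00001); the function names and the test sequence are hypothetical, and only the most common pattern elements are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' into a Python regex.

    Handles plain residues, [allowed] sets, {excluded} sets, the wildcard x,
    repeat counts such as x(2) or x(2,4), and the < and > terminus anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:
            element, low, high = m.group(1), m.group(2), m.group(3)
            repeat = "{" + low + ("," + high if high else "") + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # literal residue or [allowed set]
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return (1-based start, matched text) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical peptide used purely for illustration.
sites = find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTCNVSW")
print(sites)
```

Profiles and hidden Markov models score every position of a match rather than accepting or rejecting it outright, which is one reason InterPro combines several signature methods.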
Signatures describing the same protein family, domain, repeat or site are grouped into unique InterPro entries. Each combined InterPro entry has a unique accession number, an abstract describing the features of proteins associated with the entry, literature references and links to the relevant member database(s). All UniProtKB protein sequences that match a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the InterPro graphical views. The graphical views, which can be sorted by UniProtKB accession number, structure or taxonomy, show the position of the signatures on the protein; mousing over a signature brings up a pop-up box giving the accession, name and position.
InterPro graphically represents the location of a protein domain and information pertaining to the origin of that domain and the proteins that contain it. Families are also defined and may contain several InterPro domains which are often, but not always, in the same order. Through the InterPro Domain Architecture view, the composition and order of the different domains within a family are clearly displayed for easy comparison, as well as for simple navigation between the entries for individual domains.
InterPro and InterProScan are accessible for interactive use via the EBI web server; they are also distributed as stand-alone copies by anonymous FTP.
InterPro entries are linked to one another through PARENT/CHILD and CONTAINS/FOUND IN relationships. PARENT/CHILD relationships indicate superfamily/family/subfamily relationships, as well as domain hierarchies, where sequences can be subdivided into more specific sub-sets. CONTAINS/FOUND IN relationships apply to domains, repeats and sites within families, and are used to describe the composition of protein sequences.
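The two kinds of relationship can be pictured as a small data structure. The sketch below is only an illustration of the concept, not InterPro's actual schema. IPR000719 and IPR008271 are the entries used later in this tutorial; the child entry `IPR_HYPOTHETICAL` and all class and function names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Entry:
    accession: str
    name: str
    # Relationship fields are excluded from repr to avoid circular printing.
    parent: Optional["Entry"] = field(default=None, repr=False)
    children: List["Entry"] = field(default_factory=list, repr=False)
    contains: List["Entry"] = field(default_factory=list, repr=False)
    found_in: List["Entry"] = field(default_factory=list, repr=False)

def add_child(parent, child):
    """PARENT/CHILD: the child is a more specific subset of the parent."""
    child.parent = parent
    parent.children.append(child)

def add_contains(outer, inner):
    """CONTAINS/FOUND IN: the inner feature lies within the outer entry."""
    outer.contains.append(inner)
    inner.found_in.append(outer)

def lineage(entry):
    """Accessions from the most general ancestor down to the given entry."""
    path = []
    while entry is not None:
        path.append(entry.accession)
        entry = entry.parent
    return path[::-1]

kinase = Entry("IPR000719", "Protein kinase domain")
active_site = Entry("IPR008271", "Serine/threonine-protein kinase, active site")
subgroup = Entry("IPR_HYPOTHETICAL", "hypothetical kinase sub-group")  # invented for illustration

add_contains(kinase, active_site)
add_child(kinase, subgroup)

print(lineage(subgroup))
print([e.accession for e in active_site.found_in])
```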
A few questions on the results from InterProScan
Q1. What domains/sites does this protein contain?
- Mouse over the signature to bring up a pop-up box on the scroll bar, giving the accession, name and position.
- Click on IPR000719 to see what information you can gain about this domain.
Q2. What GO terms can you assign to your protein on the basis of InterPro signature recognition?
- Choose one GO term and copy/paste the GO ID into the text search box at the top of the InterPro entry page to list all the InterPro entries associated with this GO term.
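The search in this step amounts to a reverse lookup from a GO ID to the InterPro entries mapped to it. A rough sketch of that lookup is shown here, assuming a mapping file laid out in the `InterPro:ACCESSION name > GO:term name ; GO:ID` style. The sample lines are illustrative only and should not be read as the current official mappings; the function name is hypothetical.

```python
from collections import defaultdict

# Illustrative sample lines only; check the real mapping file for actual content.
SAMPLE = """\
!comment lines start with an exclamation mark
InterPro:IPR000719 Protein kinase domain > GO:protein kinase activity ; GO:0004672
InterPro:IPR000719 Protein kinase domain > GO:ATP binding ; GO:0005524
InterPro:IPR008271 Serine/threonine-protein kinase, active site > GO:protein kinase activity ; GO:0004672
"""

def go_to_interpro(text):
    """Build a dictionary from GO ID to the set of InterPro accessions mapped to it."""
    index = defaultdict(set)
    for line in text.splitlines():
        if not line.startswith("InterPro:"):
            continue  # skip comments and blank lines
        left, _, right = line.partition(" > ")
        accession = left.split()[0].split(":", 1)[1]
        go_id = right.rsplit(" ; ", 1)[1].strip()
        index[go_id].add(accession)
    return index

index = go_to_interpro(SAMPLE)
print(sorted(index["GO:0004672"]))
```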
Q3. Returning to the IPR000719 entry page, what relationships does this entry have with other InterPro entries?
- View the PARENT/CHILD tree.
- Look at the 'Overlapping InterPro Entries' for IPR000719, and find the data for its overlap with IPR008271, the entry for the active site contained within this domain.
Q4. How many proteins that contain the IPR000719 kinase domain also contain the IPR008271 serine/threonine active site? Are all the amino acids from this active site contained within the IPR000719 domain?
- At the top of the page are a number of different views possible for the proteins within this entry. Follow the link to ‘of known structure’ under ‘Detailed’ view. Look at the second protein in the list, DCAK1_HUMAN (O15075), and scroll down to look at the domains contained within its various isoforms (yellow background).
Q5. Why are isoforms -3 and -4 significantly different from -1 and -2?
Looking back at DCAK1_HUMAN (O15075), this has a PDB structure (green striped bar) for its doublecortin domain. Note the classification of this domain in CATH (pink striped bar) and in SCOP (black striped bar). The kinase domain has a structural homology domain predicted by ModBase (yellow striped bar).
- Click on the hyperlink MB_O15075 to look at the predicted structure
- Have a look at the structure of the doublecortin domain using AstexViewer®, by clicking on the symbol adjacent to the first CATH domain (3.30.200.20.1). Remaining in line view, scroll along the sequence to residue 20 (Y - tyrosine) and click on ‘Y’. This will zoom in and identify the residue on the structure.
Q6. Is this residue buried or on the external surface of the protein?
- Click on the adjacent residue, but this time clicking on the structure itself.
Q7. What residue is the tyrosine adjacent to?
- Now zoom out again, and have a look at the structure using the ribbon view by clicking on 'cartoon'.
Q8. What is the predominant topology of this protein, alpha helix or beta sheet?
(Note: you can hold down the left mouse button while moving the mouse to rotate the structure to any angle, in order to get a better view of it.)
- Returning to the InterProScan page, click on IPR003533
Q9. What additional information can you gain about this protein? Are isoforms -3 and -4 of DCAK1_HUMAN likely to differ functionally from -1 and -2? (Hint: read the abstract.)
- Move down to the taxonomy wheel and open the other fruit fly sequences which contain this domain.
Q10. Is this domain only present in kinases?
Homology to related proteins is another powerful source of information about a particular protein.
Return to the EBI services page
Alignments
BLAST (Basic Local Alignment Search Tool) finds regions of sequence similarity and gives clues about the structure, function and evolution of your novel sequence. WU-BLAST 2.0 and NCBI BLAST2 are distinct software packages. Although they share a common lineage for some portions of their code, they work differently, obtain different results and offer different features. You can also check for vector contamination with BLAST2 EVEC.
FASTA can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. You can also conduct sequence similarity searching against complete proteome or genome databases using the FASTA programs.
MPsrch implements the Smith and Waterman algorithm. It can identify hits in cases where BLAST and FASTA fail, and it also reports fewer false-positive hits.
For this exercise, we will use MPsrch.
For more information on MPsrch see the specific MPsrch 2can tutorial.
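For readers unfamiliar with the Smith and Waterman algorithm, the following is a minimal sketch of the local alignment score it computes. It makes simplifying assumptions that MPsrch does not necessarily share: a flat match/mismatch score instead of an amino acid substitution matrix, and a single linear gap penalty. The function name and default scores are chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Each cell holds the best score of any local alignment ending at that
    pair of residues; a cell never drops below zero, which is what makes
    the alignment local rather than global.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Toy sequences: the only shared region is the three-residue stretch CDE.
print(smith_waterman_score("ACDEFG", "XXCDEYY"))
```

Because every cell of the matrix is filled in, the method is exhaustive, whereas BLAST and FASTA use heuristics to skip most of the search space. That is the usual explanation for Smith and Waterman searches being slower but more sensitive.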

Looking at the results from MPsrch
The scores should indicate a 100% match to Q9VCL7_DROME.
- Click on 'Show Alignments' to display aligned sequences
The UniProt Consortium comprises the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource. The UniProt Consortium aims to support biological research by maintaining a high-quality database that serves as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community.
All data stored in UniProt can be downloaded from the Download Centre at http://www.uniprot.org/downloads.
- Click on the hyperlinked Q9VCL7_DROME to open the UniProt entry.
This is a UniProt/TrEMBL entry, i.e. it has been translated from the nucleotide sequence, and only automatic annotation and additional cross-referencing have been added.
Note:
- Gene Name = ORF name. This will probably change when the protein is characterised.
- Tax ID is as supplied by the NCBI.
- Click on the taxonomy link and open NEWT.
The NEWT database is a compilation of the information within the NCBI Taxonomy database together with proteins found in the Swiss-Prot and TrEMBL sections of UniProtKB. It is maintained by the Swiss-Prot group in Switzerland. For each species, NEWT displays the following taxonomy data: Swiss-Prot scientific name, Swiss-Prot common name, Swiss-Prot synonym, lineage, and the number of protein sequence entries in Swiss-Prot and TrEMBL.
The NEWT data is available from the European Bioinformatics Institute.
Cross-references
Nucleotide Database – original submission data. The identical underlying information is stored in EMBL, GenBank and DDBJ, but with slightly different views.
Ensembl/FlyBase – give a gene-centric view. FlyBase often contains additional literature references which may be of use.
HSSP – Swiss Homology Model. For proteins lacking a PDB entry, gives most similar UniProtKB entry with a 3D structure.
IntAct – Molecular Interaction Database
IntAct is a freely available, open source database system and also provides analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.
- Click on the hyperlinked accession number in the IntAct cross-reference line.
The protein interacts with an annotated protein of known function; further details of the interaction could tell you more about your protein.
Q1. Which proteins does your sequence interact with?
IntAct stores an interaction in the context of the experimental method which described this interaction. If an interaction has been described several times using different methodologies, this will increase the 'Number of Interactions' and also our confidence in the veracity of the interaction.
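The way repeated observations build confidence can be sketched as a simple tally. The records below are entirely hypothetical (P1, P2 and P3 are placeholder proteins, not IntAct data), and the function is an illustration of the counting idea rather than IntAct's own scoring.

```python
from collections import defaultdict

# Hypothetical evidence records: (interactor A, interactor B, detection method).
records = [
    ("P1", "P2", "two hybrid"),
    ("P2", "P1", "coimmunoprecipitation"),
    ("P1", "P3", "two hybrid"),
]

def summarise(records):
    """Map each unordered protein pair to (number of reports, number of distinct methods)."""
    evidence = defaultdict(list)
    for a, b, method in records:
        # frozenset makes (P1, P2) and (P2, P1) count as the same interaction.
        evidence[frozenset((a, b))].append(method)
    return {pair: (len(methods), len(set(methods))) for pair, methods in evidence.items()}

summary = summarise(records)
for pair, (reports, methods) in summary.items():
    print(sorted(pair), reports, methods)
```

A pair seen with two independent methods is generally more convincing than a pair seen twice with the same method, which is why the sketch tracks both numbers.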
- Click on Experimental detail to see details of experimental methodology
- Check the box to select a protein and then click 'Graph' to see graphical view.

The experiment is a high-throughput yeast two-hybrid screen; however, the author considers this to be high-confidence data.
- Centre the graph by clicking on ARF2_DROME.
Q2. Which processes does GO annotation suggest this protein is involved in?
- Highlight 'Add a network' and click on ARF2_DROME
We will return to IntAct later, but we will continue adding information to our existing Drosophila sequence.
- Return to UniProt entry
- Scroll to top of entry and click on 'Extended View'
This adds automatic annotation (in green). In this case it adds a Similarity Statement and keywords.
Q3. What additional information has this added to our protein of interest?
Searching UniProtKB
Our protein is a kinase and a single Y2H experiment suggests it may interact with an ARF family protein.
Q1. Are Arf family members regulated by phosphorylation?
Search in UniProt for Arf family members
Looking at the results from UniProtKB
- Find all entries with species=Drosophila melanogaster

- Select all entries containing the word 'arf'
- Select Dataset manager
- Use pull down menus and Venn diagram to select set where 'species=Drosophila melanogaster' and 'entry contains text=arf'

You should end up with a set of approximately 30 proteins
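The Venn diagram selection is a set intersection. Here is a small sketch of the same logic, using invented entries (EX1 to EX4 are placeholders, not real UniProtKB records) and a hypothetical helper function.

```python
# Hypothetical entries standing in for UniProtKB search results.
entries = [
    {"id": "EX1", "species": "Drosophila melanogaster", "text": "Arf family GTPase"},
    {"id": "EX2", "species": "Drosophila melanogaster", "text": "protein kinase"},
    {"id": "EX3", "species": "Homo sapiens", "text": "ARF GTPase-activating protein"},
    {"id": "EX4", "species": "Drosophila melanogaster", "text": "Arf-like protein"},
]

def select(entries, predicate):
    """Return the set of entry IDs for which the predicate is true."""
    return {e["id"] for e in entries if predicate(e)}

fly = select(entries, lambda e: e["species"] == "Drosophila melanogaster")
arf = select(entries, lambda e: "arf" in e["text"].lower())

# The overlapping region of the Venn diagram is the intersection of the two sets.
both = fly & arf
print(sorted(both))
```

Note that a plain text match on 'arf' will also pick up any entry that merely mentions the word, so the resulting set is worth checking by eye.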
- Click on query to view set.
- Open Q9NGC3 (CEG1A_DROME) - this is a fully annotated UniProt/Swiss-Prot entry.
The entry contains several merged TrEMBL entries which have been curated at the sequence level: errors in the sequence have been seen and corrected (Comments - Caution). The cross-references to the nucleotide entries have been tagged to show that there is an error.
Comments - alternative products. The entry contains several merged TrEMBL entries, some of which have been identified as splice isoforms. Most of these were identified via the genome project, so they have been tagged as not experimentally verified.
- Press [Display all isoform sequences in FASTA format]
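If you save the isoform sequences, they can be read with a few lines of code. The sketch below assumes headers of the form `>sp|ACCESSION|NAME description`; the accession X00000, the entry name and the sequences are invented for illustration, and the function name is hypothetical.

```python
# Invented isoform records; real output will contain the actual isoform sequences.
SAMPLE = """\
>sp|X00000-1|EXAMPLE_DROME Isoform A (hypothetical)
MKTLLV
AAGQ
>sp|X00000-2|EXAMPLE_DROME Isoform B (hypothetical)
MKTLLVQ
"""

def parse_fasta(text):
    """Return a dictionary from accession (or first header word) to sequence."""
    chunks = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            token = line[1:].split()[0]
            # Use the middle field of 'db|ACCESSION|NAME' when it is present.
            key = token.split("|")[1] if token.count("|") >= 2 else token
            chunks[key] = []
        elif key is not None:
            chunks[key].append(line)  # sequences may wrap over several lines
    return {k: "".join(v) for k, v in chunks.items()}

isoforms = parse_fasta(SAMPLE)
for accession, sequence in isoforms.items():
    print(accession, len(sequence))
```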
Comments - interaction. These are derived from IntAct and are usually chosen by Interaction Detection Method. For example, X-ray crystallography (no. interactors=2) gives high confidence of a binary interaction, so such an interaction would be exported to this line.
Additional annotation is added by curators, who read journal articles and add information to the entry.
Compare our TrEMBL entry to Q8N568 (DCAK2_HUMAN), which shows 38% similarity by alignment. Could annotation be transferred from human to fly?
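A percentage like the one quoted above is derived from a pairwise alignment. As a rough illustration, the sketch below computes percent identity, which is a simpler and stricter measure than similarity because it does not count conservative substitutions. The definition used here (identical columns divided by columns with no gap) is one of several in common use, and the aligned strings are invented.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of ungapped alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# Invented alignment: six ungapped columns, four of them identical.
print(round(percent_identity("MKT-LLV", "MRTALIV"), 1))
```

At moderate levels of similarity, transferring annotation between species deserves caution, so it helps to know exactly which measure a reported percentage refers to.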
- Return to EBI services page, select UniProt and type Q8N568 into Text Search
- Look at Q00987 (MDM2_HUMAN) for an example of a fully annotated protein with plenty of information known about it.
IntAct
We have used IntAct from within UniProt to obtain information on individual proteins. We will now use the Advanced Search to find more complex datasets.
- Return to EBI services page
- Launch IntAct
- Click on Advanced Search (currently in 'News')
Use Advanced search to look for interactions associated with leukaemia
- Under 'Topic' select 'Disease' on the pull down menu
- Type 'Leuk*' in the free-text box below (this by-passes the difference between UK and US spellings).
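The effect of the trailing wildcard can be tried out with Python's standard `fnmatch` module. This is only an analogy for how such a pattern behaves, not a description of IntAct's search engine; the list of terms and the helper function are invented for the example.

```python
from fnmatch import fnmatchcase

def matches(term, pattern):
    """Case-insensitive wildcard match, where * stands for any run of characters."""
    return fnmatchcase(term.lower(), pattern.lower())

# Invented annotation terms.
terms = ["Leukaemia", "leukemia", "Leukocyte adhesion", "lymphoma"]
hits = [t for t in terms if matches(t, "Leuk*")]
print(hits)
```

Both spellings are found, but so is any other term beginning with 'leuk', so a wildcard search may need manual filtering afterwards.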
Search for any experiments where the interaction was identified using fluorescent resonance energy transfer (FRET) technology.
- Check 'Experiment'
- Scroll down to Interaction Detection and select 'fluorescent resonance' in the pull-down menu
- Click on 'wigelsworth-2004-2' to confirm result.
Note that mutations and binding sites on the interacting molecules are listed under 'Sequence Features'
Now try to repeat some or all of these searches on the following sequences:
Protein X
Protein Y
Protein Z