Protein Identifier Cross-Reference

News
March 2013: New EBI Website!

On 4 March, the EMBL-EBI website was relaunched to give all our visitors a warmer welcome. We've been working behind the scenes to give you a more consistent experience, because a big part of our mission is to make molecular data available, and accessible, to everyone.

July 2011: New features and databases

We have implemented searching by protein fragments using a homology search (BLAST). In addition, queries can be mapped against subsections of Ensembl Genomes corresponding to various taxa. Support for mapping against gene IDs present in Ensembl and UniProt has also been added.

January 2009: Addition of SEGUIDs

We have implemented the SEGUID algorithm to generate sequence-based unique identifiers, as described in Babnigg, 2006.
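
As commonly described, a SEGUID is the SHA-1 digest of the upper-cased sequence, Base64-encoded with the trailing padding removed. The sketch below follows that description. It is not PICR's own code; the function name and the example sequence are illustrative only.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a SEGUID-style checksum for a sequence (illustrative sketch)."""
    # Normalise case so "mkv" and "MKV" yield the same identifier.
    normalised = sequence.upper().encode("ascii")
    # SHA-1 produces a 20-byte digest.
    digest = hashlib.sha1(normalised).digest()
    # Base64-encode the digest and drop the trailing "=" padding.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Made-up sequence fragment, used only to show the call.
print(seguid("MKVLAAGIVGL"))
```

Because the identifier depends only on the sequence, two database entries with identical sequences receive the same SEGUID regardless of their accessions.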

January 2009: New databases and bugfixes

We have added the KIPO database and several Ensembl genome databases to the mapping algorithm. We have also improved the PICR codebase to make it more robust. PICR has been the victim of its own success in recent months and has had problems coping with the demands put on the web application. These problems should now be resolved.

PICR is simple to use and has only a few options that need to be set.

Main Search Page Options

The main search form is divided into four main sections:

  • Input Data
  • Input Parameters
  • Mapping Databases
  • Output Parameters

PICR can map protein identifiers, protein sequences, or BLAST sequences, so set the data type selector in the Input Data section accordingly. You can paste a list of protein identifiers (one per line), protein sequences in FASTA format, or BLAST sequences, also in FASTA format. Alternatively, you can upload a file containing this data by clicking the browse button and selecting the appropriate file. Please note that at most 100 protein accessions or sequences can be mapped at one time, at most 25 BLAST sequences can be processed at one time, and the maximum size of an uploaded file is 2 MB.
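
If you have more identifiers than a single query accepts, one approach is to split the list into batches before pasting or uploading. The sketch below assumes the limits stated above (100 accessions or sequences, 25 BLAST sequences); the helper name and the placeholder accessions are hypothetical.

```python
from typing import Iterator, List

MAX_IDENTIFIERS_PER_QUERY = 100     # accessions or sequences per query, from the text above
MAX_BLAST_SEQUENCES_PER_QUERY = 25  # BLAST sequences per query, from the text above


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical input: 230 placeholder accessions, not real database entries.
accessions = [f"ACC{n:05d}" for n in range(230)]
for batch in batches(accessions, MAX_IDENTIFIERS_PER_QUERY):
    # Each batch can be pasted into the form one identifier per line.
    print(len(batch))
```

Each batch can then be submitted as a separate search.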

Limit By Species

The Input Parameters section can be used to refine your search. By default, PICR does not restrict mappings based on taxonomic information. If you want to obtain mappings for a specific organism, select it from the pull-down list. If the organism you want is not in the list, type a partial name in the space provided to query the NEWT taxonomy through the Ontology Lookup Service (OLS); a list of matching organisms should appear. A value selected this way overrides the choice made in the pull-down list.

Select the databases you wish to map to in the Mapping Databases section. You can map to any number of databases. Note that a single choice can refer to more than one database. For example, selecting Ensembl will attempt to map to all species-specific Ensembl releases; the same is true of Vega, Trome and RefSeq. Selecting SwissProt and TREMBL will also include the splice variant databases of each source database. Selecting the Ensembl Genomes database brings up a list of taxon-specific databases to search. If your output type is Simple HTML, "UniProt 'best guess'" will be available as a mapping database. "UniProt 'best guess'" returns the identifier of the longest matching UniProt entry from the following subsections of UniProt, in order of precedence: Swiss-Prot, TrEMBL, Swiss-Prot varsplic, and TrEMBL varsplic. For more information on the difference between Swiss-Prot and TrEMBL, see the UniProt FAQs. For more information on the varsplic databases, see this article.
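
The "best guess" rule can be read in more than one way. The sketch below assumes one interpretation: take the highest-precedence subsection that has any match, then return the longest entry within it. The `Match` type, the function name, and the accessions are hypothetical and do not reflect PICR's actual implementation.

```python
from typing import List, NamedTuple, Optional

# Precedence order taken from the description above.
PRECEDENCE = ["Swiss-Prot", "TrEMBL", "Swiss-Prot varsplic", "TrEMBL varsplic"]


class Match(NamedTuple):
    accession: str   # placeholder identifier
    subsection: str  # one of the names in PRECEDENCE
    length: int      # sequence length of the matching entry


def best_guess(matches: List[Match]) -> Optional[str]:
    """Pick one accession: subsection precedence first, then longest entry."""
    for subsection in PRECEDENCE:
        candidates = [m for m in matches if m.subsection == subsection]
        if candidates:
            return max(candidates, key=lambda m: m.length).accession
    return None  # no UniProt match in any subsection


# Made-up matches for illustration only.
matches = [
    Match("TR_ENTRY_1", "TrEMBL", 520),
    Match("SP_ENTRY_1", "Swiss-Prot", 480),
    Match("SP_ENTRY_2", "Swiss-Prot", 495),
]
print(best_guess(matches))  # SP_ENTRY_2 under this interpretation
```

Under this reading, a shorter Swiss-Prot entry is preferred over a longer TrEMBL entry; if PICR instead weighs length before subsection, the result would differ.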

BLAST options

When you select the BLAST mapping option the BLAST options panel appears. In this panel you can select which BLAST database to use and whether to filter results by identity. Note that when filtering by species in the input parameters, returned BLAST results will be filtered by species as well.

Advanced BLAST options

You can fine-tune the BLAST query by clicking on Show advanced BLAST options and filling in any desired options. NB: the defaults should be fine for most searches and should only be changed if you know what you are doing.

Executing A Search

Once all search parameters have been selected, select the desired output format and click on the search button.

  • Simple HTML will return a simple HTML table.
  • Detailed HTML will return a more detailed HTML table.
  • CSV will return a comma-separated value file containing the same information as the simple HTML view.
  • XLS will return an Excel-formatted file containing the same information as the simple HTML view.

Searches will try to collate information from multiple databases and may involve SOAP queries to the NCBI. While your search is being executed, a progress bar will be displayed and refreshed every 2 seconds. Once your search is done, the appropriate result page will be shown.

Selecting BLAST accessions

If you are searching by BLAST sequences, an intermediate page will come up allowing you to select which accessions to submit to the cross-reference search. For each BLAST fragment that was submitted, a list of the top results is presented in order of identity. Choose the accession that best matches the submitted data and click Proceed to Mapping.
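
The ordering on this page, combined with the optional identity filter, can be pictured as a filter followed by a sort. The sketch below is only an illustration: the hit tuples, the threshold, and the `limit` parameter are hypothetical, and PICR's actual cut-offs are not stated here.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (accession, percent identity)


def top_hits(hits: List[Hit], min_identity: float = 0.0, limit: int = 5) -> List[Hit]:
    """Keep hits at or above `min_identity`, highest identity first."""
    kept = [hit for hit in hits if hit[1] >= min_identity]
    # sorted() is stable, so hits with equal identity keep their input order.
    return sorted(kept, key=lambda hit: hit[1], reverse=True)[:limit]


# Made-up hits for one submitted fragment.
hits = [("HIT_A", 92.5), ("HIT_B", 100.0), ("HIT_C", 71.0)]
for accession, identity in top_hits(hits, min_identity=90.0):
    print(accession, identity)
```

The first entry in the returned list is the closest match by identity, which is usually the one to select before proceeding to mapping.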

The program returns to the in progress page to perform cross-reference mapping.

Understanding The Results

Simple HTML view

The table is organized such that each row is a submitted accession or sequence and each column represents a selected mapping database. An empty cell means that no mappings could be found to the corresponding database for the search parameters you entered.

By default, PICR only returns mappings to active database entries, though many more might be available. PICR queries the UniProt Archive (UniParc), which is a historical archive of all known protein entries from over 60 protein sequence databases. Entries that are deleted or made obsolete in the source databases are never deleted from UniParc; they are marked as inactive instead. PICR can include these inactive mappings in the results if the Return only active mappings box is unchecked in the search options. Inactive mappings are shown in red in both HTML result views but cannot be distinguished from active mappings in the CSV view.

Entries that can map to an active SwissProt or TREMBL entry may also have additional mappings, which will be shown in blue. These mappings are obtained from the UniProt Knowledgebase and, while valid, might not have 100% sequence identity to the submitted accession.

Once a search has been done, results can be saved in CSV format or another search can be started.

A dialog box will be shown prompting you to save or open your file.

If the submitted accession or sequence is not present in the UniProt Archive, it cannot be mapped at this time.

Detailed HTML view

The detailed HTML view contains additional information not shown in the simple HTML view. Mappings are made on the basis of 100% sequence identity. As such, one protein accession (P29375 in this example) can map to more than one protein sequence. Each sequence will have a UPI (UniParc Protein Identifier) as well as multiple cross-references. Each cross-reference will contain:

  • the source database
  • the versioned accession
  • whether the cross-reference is active or deleted
  • the NEWT taxonomy ID (if available)
  • the corresponding NCBI GI number (if available)
  • the date the entry was added to UniParc
  • the date the entry was last seen or deleted
The same color-coding applies as described above.
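
If you copy cross-references from the detailed view into your own scripts, the fields listed above can be held in a small record type. The sketch below is one possible layout; the class, the field names, and the sample values are hypothetical and are not a format that PICR exports.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CrossReference:
    database: str               # source database
    accession: str              # versioned accession
    active: bool                # False if the entry was deleted at the source
    taxonomy_id: Optional[int]  # NEWT taxonomy ID, if available
    gi_number: Optional[int]    # NCBI GI number, if available
    added: date                 # date the entry was added to UniParc
    last_seen: date             # date the entry was last seen or deleted


def only_active(refs: List[CrossReference]) -> List[CrossReference]:
    """Mimic the 'Return only active mappings' option on a list of records."""
    return [ref for ref in refs if ref.active]


# Placeholder records; none of these values come from a real PICR result.
refs = [
    CrossReference("DB_ONE", "ACC0001.2", True, 9606, None, date(2005, 1, 10), date(2009, 1, 5)),
    CrossReference("DB_TWO", "ACC0002.1", False, None, None, date(2004, 6, 1), date(2006, 3, 20)),
]
print([ref.accession for ref in only_active(refs)])
```

Keeping the active flag on each record preserves the distinction that the red color-coding conveys in the HTML views.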
