Protein Identifier Cross-Reference

Implementation

Our aim in starting this project was to build a solution that would meet the following requirements:
  • map sequences, protein identifiers and BLAST fragments
  • accept identifiers from multiple sources in one request
  • map identifiers to multiple data sources in one request
  • support both interactive and programmatic mapping
  • limit mappings to specific taxon identifiers or span all species
  • handle identifiers that have been deleted from source databases but still appear in result sets and the scientific literature

System architecture

PICR is built using a classic 3-tier application model, as illustrated in Figure 1.

Architecture Diagram

The data layer is built around the UniProt Archive (UniParc). The logic layer uses an API written in Java to implement the mapping algorithm described below and returns JAXB-annotated data model objects to the presentation layer. The presentation layer uses Servlets and JavaServer Pages (JSP) within an Apache Struts application. AJAX is used wherever possible to make the application more responsive and provide a better browsing experience. The presentation layer also provides a SOAP service implemented with JAX-WS and a REST API. To improve performance, database connections are pooled at the data layer using the Oracle DBCP API, and results are cached where possible using the OpenSymphony Cache API. Logging is done using Log4J and real-time error reporting uses the JavaMail API.

Data Model

The data model for PICR is very simple and revolves around two objects: UPEntry and CrossReference. The XML schema of these objects is shown in Figure 2.

PICR Schema

A UPEntry represents an entry in the UniParc database and contains a protein sequence and its CRC64, a timestamp and two collections of CrossReference objects: one based on sequence identity and obtained from the Xref table of UniParc, and one based on data from UniProtKB. The meaning of each collection is elaborated on in the explanation of the mapping algorithm, below. A CrossReference object contains the description of the source database it originates from, the accession and version of the entry, a status flag indicating whether the entry is active (i.e. still available in the source database release files) or inactive (i.e. deleted from the source database), and the date the entry was first loaded into UniParc. It also carries additional information such as the NEWT taxonomy id (if available), the corresponding NCBI GI (if available) and the date the entry was last loaded (if still active) or the date the entry was deleted (if inactive).
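The two objects can be pictured with a minimal sketch. It is written in Python for brevity and is not the JAXB-annotated Java model itself; the field names are assumptions derived from the description above, and the sample values are invented placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CrossReference:
    database: str                      # description of the source database
    accession: str
    version: Optional[int]
    active: bool                       # False once deleted from the source database
    date_added: date                   # first loaded into UniParc
    date_last: Optional[date] = None   # last loaded (active) or deleted (inactive)
    taxon_id: Optional[int] = None     # NEWT taxonomy id, if available
    gi: Optional[int] = None           # NCBI GI, if available


@dataclass
class UPEntry:
    upi: str
    sequence: str
    crc64: str
    timestamp: date
    # Cross-references based on 100% sequence identity (Xref table).
    identical_cross_references: List[CrossReference] = field(default_factory=list)
    # Cross-references derived from UniProtKB annotation.
    logical_cross_references: List[CrossReference] = field(default_factory=list)


# Placeholder values only; none of these correspond to real records.
entry = UPEntry(upi="UPI0000000001", sequence="MKTAYIAK",
                crc64="0000000000000000", timestamp=date(2000, 1, 1))
entry.identical_cross_references.append(
    CrossReference(database="DB_A", accession="ACC1", version=1,
                   active=True, date_added=date(2000, 1, 1), taxon_id=9606))
```

Keeping the two collections as separate fields mirrors the distinction, explained in the mapping algorithm, between identity-based and annotation-based mappings.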

Mapping algorithm

UniParc is the central data warehouse for PICR, though it can be complemented by external sources on occasion. The central tenet of UniParc is that each version of each sequence from each source database is archived. Source databases are polled daily and updates are loaded into UniParc as soon as they become available. As such, UniParc is the largest and most comprehensive historical sequence archive available. At the time of writing, it contains 14.97 million distinct sequences loaded from 4260 releases obtained from 67 distinct sources. This corresponds to 33.8 million non-unique protein identifiers and 29.5 million unique protein identifiers. The disparity in the numbers is due to the nature of UniParc: as protein entries are updated, an identifier may be assigned to a different protein sequence if the sequence associated with it has changed. Protein sequences are stored in the Protein table and assigned a unique UniParc Protein Identifier (UPI) that remains invariant for the life of the protein sequence. As each source database is loaded into UniParc, if a protein sequence is already present, the source database identifier is created or updated in the Xref table. If the protein sequence is novel, a new Protein entry is created with an associated Xref entry.
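The loading rule, and the reason non-unique identifiers outnumber unique ones, can be illustrated with a small in-memory sketch. The `Archive` class, the UPI numbering scheme and the sample data are hypothetical stand-ins for the Protein and Xref tables, not the actual UniParc loader.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Archive:
    # sequence -> UPI (stands in for the Protein table)
    proteins: Dict[str, str] = field(default_factory=dict)
    # (UPI, database, accession) -> number of loads (stands in for the Xref table)
    xrefs: Dict[Tuple[str, str, str], int] = field(default_factory=dict)

    def load(self, database: str, accession: str, sequence: str) -> str:
        upi = self.proteins.get(sequence)
        if upi is None:
            # Novel sequence: create a Protein entry with a fresh UPI.
            upi = f"UPI{len(self.proteins) + 1:010X}"
            self.proteins[sequence] = upi
        # Create the Xref row, or update it if it already exists.
        key = (upi, database, accession)
        self.xrefs[key] = self.xrefs.get(key, 0) + 1
        return upi


archive = Archive()
first = archive.load("DB_A", "ACC1", "MKTAYIAK")
same = archive.load("DB_B", "XYZ9", "MKTAYIAK")     # same sequence, other source
changed = archive.load("DB_A", "ACC1", "MKTAYIAR")  # ACC1 updated with a new sequence

xref_rows = len(archive.xrefs)                                   # non-unique identifiers
unique_accessions = len({acc for _, _, acc in archive.xrefs})    # unique identifiers
```

In this toy run the accession ACC1 ends up attached to two UPIs, so there are three Xref rows but only two distinct accessions, which is the same kind of disparity as the one described above.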

The complete mapping algorithm is illustrated in Figure 3 and has two phases.

Mapping Algorithm

The first phase finds the Protein entries that correspond to the input data, be it sequences or accessions. The second phase gathers all known cross-references for each entry that fit the search criteria.

Mapping by Sequence

Once a sequence is submitted for mapping, a CRC64 is computed for that sequence and used to query the Protein table of UniParc. Mappings are done on the basis of 100% sequence identity over the whole sequence. Subsequence matches are not considered valid mappings, as they will not generate identical CRC64 values. If no entries are found, the sequence cannot be mapped. If multiple entries are found (CRC64 collisions are infrequent but do occur given the size of UniParc), the sequences are retrieved from UniParc and only the matching one is kept. A UPEntry object is created and its UPI, sequence and timestamp fields are populated. The UPI of the correctly identified sequence is then used to retrieve the Xref entries associated with that sequence, based on the search criteria. These criteria include the databases to map to, whether to retrieve all mappings (including inactive or deleted cross-references) or only active ones, and whether to limit mappings to a selected species. The entries obtained from the Xref table are used to create CrossReference objects, which are added to the IdenticalCrossReference collection of the UPEntry object as they are all based on 100% sequence identity.
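The lookup can be sketched as follows. This is an illustration in Python rather than the Java implementation. The CRC-64 variant (reflected ISO 3309 polynomial, zero initial value) is an assumption, and the dictionaries `protein_by_crc` and `xrefs_by_upi`, the `Xref` tuple and the sample records are hypothetical stand-ins for the Protein and Xref tables.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple

POLY = 0xD800000000000000  # reflected ISO 3309 polynomial (assumed variant)


def _make_table() -> List[int]:
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc64(sequence: str) -> int:
    """Table-driven CRC-64 over the ASCII bytes of the sequence."""
    crc = 0
    for byte in sequence.encode("ascii"):
        crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
    return crc


class Xref(NamedTuple):
    database: str
    accession: str
    active: bool
    taxon_id: Optional[int]


def map_sequence(sequence: str,
                 protein_by_crc: Dict[int, List[Tuple[str, str]]],
                 xrefs_by_upi: Dict[str, List[Xref]],
                 databases: Set[str],
                 active_only: bool = True,
                 taxon_id: Optional[int] = None):
    """Return (UPI, matching xrefs), or None if the sequence is unknown."""
    candidates = protein_by_crc.get(crc64(sequence), [])
    # Several entries may share a CRC64; keep only the identical sequence.
    upi = next((u for u, s in candidates if s == sequence), None)
    if upi is None:
        return None
    hits = [x for x in xrefs_by_upi.get(upi, [])
            if x.database in databases
            and (x.active or not active_only)
            and (taxon_id is None or x.taxon_id == taxon_id)]
    return upi, hits


# Toy tables; the second candidate simulates a CRC64 collision.
proteins = {crc64("MKTAYIAK"): [("UPI0000000002", "OTHERSEQ"),
                                ("UPI0000000001", "MKTAYIAK")]}
xrefs = {"UPI0000000001": [Xref("DB_A", "ACC1", True, 9606),
                           Xref("DB_A", "ACC0", False, 9606),
                           Xref("DB_B", "XYZ9", True, 10090)]}
result = map_sequence("MKTAYIAK", proteins, xrefs, {"DB_A"})
```

Because the CRC64 is only used as an index key, the full-sequence comparison is what guarantees 100% identity; a truncated query such as `"MKTAYIA"` returns `None` in this sketch.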

If the submitted sequence happens to have an active UniProtKB (SwissProt or TREMBL) cross-reference, additional data is looked up in a separate table in the UniParc schema. The Load_DR_UniProt table contains additional information extracted from the current UniProtKB release files, including secondary identifiers, UniProtKB IDs (e.g. JAD1A_HUMAN for the protein whose accession is P29375) and cross-references maintained by UniProtKB to data sources available in UniParc. These human-annotated (SwissProt) and automatically derived (TREMBL) cross-references can provide added value because the mappings they provide, while valid, might be to sequences that differ from the main UniProtKB sequence (such as splice variants, sequencing errors or natural variations). Such mappings would not normally be available via UniParc unless the exact variant sequence was queried. However, since they may not represent the exact sequence, it was decided to keep them separate from those obtained based on sequence identity. As such, CrossReference objects created from those records are stored in the LogicalCrossReference collection of the UPEntry.

To ensure maximal mapping coverage, if the user does not specify SWISSPROT or TREMBL as mapping databases, they are added for internal use. Adding those databases may reveal an active SwissProt or TREMBL (SP/TR) record, which is then used to query the Load_DR_UniProt table with the databases that the user submitted. The internal SP/TR cross-references are pruned from the UPEntry object before it is returned to the user.
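The add-then-prune behaviour can be sketched as below. The function names, the tuple layout and the sample records are hypothetical; only the database names SWISSPROT and TREMBL and the overall flow come from the text.

```python
from typing import Dict, Iterable, List, Set, Tuple

Xref = Tuple[str, str, bool]  # (database, accession, active)
INTERNAL: Set[str] = {"SWISSPROT", "TREMBL"}


def effective_databases(requested: Iterable[str]) -> Set[str]:
    """Databases actually queried: the user's selection plus SP/TR."""
    return set(requested) | INTERNAL


def collect(identical: List[Xref],
            load_dr_uniprot: Dict[str, List[Xref]],
            requested: Iterable[str]) -> Tuple[List[Xref], List[Xref]]:
    """Return (identical xrefs after pruning, logical xrefs)."""
    wanted = set(requested)
    logical: List[Xref] = []
    for database, accession, active in identical:
        # An active SP/TR record unlocks the UniProtKB-derived mappings.
        if database in INTERNAL and active:
            logical.extend(x for x in load_dr_uniprot.get(accession, [])
                           if x[0] in wanted)
    # Drop SP/TR xrefs that were only added internally.
    pruned = [x for x in identical if x[0] in wanted]
    return pruned, logical


requested = {"DB_B"}
# Result of querying the Xref table with effective_databases(requested).
identical = [("SWISSPROT", "ACC1", True), ("DB_B", "XYZ9", True)]
load_dr_uniprot = {"ACC1": [("DB_B", "XYZ7", True), ("DB_C", "QRS3", True)]}
pruned, logical = collect(identical, load_dr_uniprot, requested)
```

Here the user asked only for `DB_B`, yet the internally added SwissProt record yields one extra logical cross-reference (`XYZ7`) while the SwissProt record itself is absent from the returned identical collection.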

Mapping by accession

Mapping by protein identifier uses logic similar to that described above, but with a different starting point. If a protein accession is submitted, the Load_DR_UniProt and Xref tables are queried to obtain all pertinent UPIs. A UPEntry is created for each UPI and the relevant fields are populated from data gathered in the Protein table. The CrossReference collections of each UPEntry are then populated using the mechanisms described above.
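A minimal sketch of the first step, collecting UPIs from both tables, is shown below. The tables are modelled as hypothetical lists of (UPI, accession) rows, and the function name is invented for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (UPI, accession)


def upis_for_accession(accession: str,
                       xref_table: List[Row],
                       load_dr_uniprot_table: List[Row]) -> List[str]:
    """Union of the UPIs that either table associates with the accession."""
    upis = {upi for upi, acc in xref_table if acc == accession}
    upis |= {upi for upi, acc in load_dr_uniprot_table if acc == accession}
    return sorted(upis)


# Placeholder rows: ACC1 has pointed at two different sequences over time.
xref_table = [("UPI0000000001", "ACC1"),
              ("UPI0000000002", "ACC1"),
              ("UPI0000000001", "XYZ9")]
load_dr_uniprot_table = [("UPI0000000003", "ACC1")]

found = upis_for_accession("ACC1", xref_table, load_dr_uniprot_table)
```

A single accession can therefore yield several UPEntry objects, one per UPI, each of which is then filled in exactly as in mapping by sequence.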

Mapping by BLAST fragment

Mapping by BLAST fragment adds another step to the algorithm. The submitted amino acid sequence, along with the BLAST options, is forwarded to a server running NCBI BLAST. The server returns the top results for each fragment in the form of a multimap. When running interactively, the user is given the top 5 results along with their identity values and a brief description, and chooses which resulting accession to query against UniParc. In programmatic mode, the accession with the best (lowest) E-value is queried against UniParc. From this point, the algorithm proceeds identically to mapping by accession.
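The selection step can be sketched as follows, assuming the multimap is a mapping from each fragment to its list of hits and that hits are ranked by ascending E-value. The `Hit` type, the function names and the sample values are hypothetical; no call to a BLAST server is shown.

```python
from typing import Dict, List, NamedTuple, Optional


class Hit(NamedTuple):
    accession: str
    identity: float      # percent identity reported for the hit
    evalue: float
    description: str


def interactive_choices(hits: List[Hit], limit: int = 5) -> List[Hit]:
    """Hits offered to the user, best E-value first."""
    return sorted(hits, key=lambda h: h.evalue)[:limit]


def programmatic_choice(hits: List[Hit]) -> Optional[str]:
    """Accession of the hit with the lowest E-value, or None if no hits."""
    if not hits:
        return None
    return min(hits, key=lambda h: h.evalue).accession


# Placeholder multimap: fragment -> hits returned by the BLAST server.
results: Dict[str, List[Hit]] = {
    "MKTAYIAK": [Hit("XYZ9", 87.5, 1e-3, "placeholder hit"),
                 Hit("ACC1", 100.0, 1e-9, "placeholder hit"),
                 Hit("QRS3", 75.0, 0.5, "placeholder hit")],
    "GGGGGGGG": [],
}
chosen = {frag: programmatic_choice(hits) for frag, hits in results.items()}
```

The chosen accession for each fragment is then handed to the accession-mapping step; fragments without hits map to `None` in this sketch.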