SIFTS Methods

Figure 1

We have adopted the NCBI taxonomic identifiers as a standard way of representing the taxonomy information for all of the PDB entries within the PDBe database. Ideally, every PDB entry would record the organism from which each component of the structure derives, but in the legacy archive the situation is far from ideal: many entries simply have no such record, whilst those records that are present have historically been prone to typographical or spelling errors. For entries with no taxonomy information, manual searches of the PDB file or accompanying literature were performed. For all entries, we have put in place mechanisms that automatically check the user-supplied taxonomy information against the NCBI database, using the standard NCBI taxonomy identifier that we assign to each PDB entry. This allows us to correct spelling mistakes in legacy PDB files and to identify PDB entries where the taxonomy information is simply incorrect. Furthermore, by using a stable, curated taxonomy identifier throughout the database, we gain access to the wealth of annotation information in the NCBI database or the UniProt Taxonomy database, such as synonyms and hierarchical relationships between different taxonomic nodes. Figure 1 shows the database schema for the taxonomy data mart in the PDBe database.
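The automatic check of user-supplied organism names could look roughly like the sketch below. It is only an illustration: the small name table, the function name `resolve_tax_id` and the similarity cutoff are assumptions made for this example, and the fuzzy matching from Python's `difflib` stands in for whatever matching the PDBe mechanisms actually use.

```python
import difflib

# Hypothetical excerpt of an NCBI-style name table: lower-cased name -> tax_id.
NCBI_NAMES = {
    "homo sapiens": 9606,
    "mus musculus": 10090,
    "escherichia coli": 562,
    "saccharomyces cerevisiae": 4932,
}


def resolve_tax_id(supplied_name, names=NCBI_NAMES, cutoff=0.85):
    """Return (tax_id, matched_name, status) for a user-supplied organism name.

    status is "exact" for a direct hit, "corrected" when a close spelling
    was found, and "unresolved" when the entry needs manual curation.
    The cutoff value is an assumption chosen for this sketch.
    """
    # Normalise case and whitespace before looking the name up.
    key = " ".join(supplied_name.lower().split())
    if key in names:
        return names[key], key, "exact"
    # Fall back to the closest known spelling, if one is similar enough.
    close = difflib.get_close_matches(key, list(names), n=1, cutoff=cutoff)
    if close:
        return names[close[0]], close[0], "corrected"
    return None, None, "unresolved"
```

A misspelling such as "Eschericia coli" resolves to the same identifier as the correct name, while a name with no close match is reported as unresolved so that it can be examined manually.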

The cleaned-up taxonomic information for every macromolecular structure is available in the XML files from the FTP archive.

We have used sequence identity and taxonomy as the characteristics on which to link protein sequence data (from UniProt) and protein structure data (from PDBe).

Since the sequence of a structure in the PDBe may be either the native protein sequence or that of an engineered mutant or other variant, the criterion for sequence identity during the automatic procedure was 95% or higher agreement between the sequence of a protein structure and the corresponding sequence in UniProt. If no match was found, this criterion was relaxed to 90% during manual annotation. For entries not represented in the UniProt archive, new UniProt entries were created based on the information given in the PDB entry.
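The two thresholds can be illustrated with a short sketch. The 95% and 90% values come from the text above; everything else is an assumption made for this example, in particular the definition of identity (identical residues divided by the number of columns where both sequences have a residue) and the names `percent_identity` and `classify_match`.

```python
AUTO_THRESHOLD = 95.0    # automatic procedure (value from the text)
MANUAL_THRESHOLD = 90.0  # relaxed value used during manual annotation (from the text)


def percent_identity(aligned_pdb, aligned_uniprot):
    """Percentage of identical residues over columns without a gap ('-').

    Both arguments are rows of an existing pairwise alignment, so they
    must have the same length. This identity definition is an assumption.
    """
    if len(aligned_pdb) != len(aligned_uniprot):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_pdb, aligned_uniprot):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def classify_match(identity):
    """Decide how a candidate UniProt match would be handled."""
    if identity >= AUTO_THRESHOLD:
        return "accept_automatically"
    if identity >= MANUAL_THRESHOLD:
        return "refer_to_manual_annotation"
    return "no_match"
```

For example, a 20-residue alignment with a single engineered point mutation has 95% identity and would be accepted automatically under these assumptions.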

Because protein structure is more conserved across evolutionary time than protein sequence, and the structural differences between proteins with high sequence identity are small, the taxonomy rule for accepting a UniProt cross-reference was relaxed: the taxonomy IDs of the two entries, PDBe and UniProt, must either be the same or have a common parent within one or two levels up the taxonomic tree. Using this rule, we have also cleaned up the UniProt cross-references for every entry in the PDB. Figure 2 shows the database schema for the UniProt cross-reference data mart in the PDBe database.
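One possible reading of the relaxed taxonomy rule is sketched below. The parent table is a small illustrative excerpt that may not match the current NCBI tree, the function names are hypothetical, and the interpretation (each identifier may climb at most two levels, and the two sets of nodes must share a member) is an assumption rather than the documented PDBe rule.

```python
# Illustrative child -> parent links (tax_id -> parent tax_id); an assumption
# for this sketch, not a faithful copy of the NCBI taxonomy.
PARENT = {
    511145: 83333,  # a K-12 substrain -> E. coli K-12
    83333: 562,     # E. coli K-12 -> Escherichia coli
    562: 561,       # Escherichia coli -> Escherichia
    561: 543,
    9606: 9605,     # Homo sapiens -> Homo
    9605: 207598,
}


def nodes_within(tax_id, parent, max_levels):
    """Return tax_id together with its ancestors up to max_levels steps up."""
    nodes = {tax_id}
    current = tax_id
    for _ in range(max_levels):
        current = parent.get(current)
        if current is None:
            break  # reached the top of the known tree
        nodes.add(current)
    return nodes


def taxonomy_compatible(pdb_tax_id, uniprot_tax_id, parent=PARENT, max_levels=2):
    """True if the IDs are equal or share a node within max_levels levels."""
    pdb_nodes = nodes_within(pdb_tax_id, parent, max_levels)
    uniprot_nodes = nodes_within(uniprot_tax_id, parent, max_levels)
    return bool(pdb_nodes & uniprot_nodes)
```

Under this reading, a strain-level identifier in the PDB entry is compatible with a species-level identifier in UniProt, whereas identifiers from unrelated organisms are rejected.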

Figure 2

The clean-up of the UniProt cross-references has allowed us to link the macromolecular structure information to other important data resources, such as:

  • GOA database, which provides assignments of gene products to the Gene Ontology (GO) resource
  • InterPro database, which provides information on protein families, domains and functional sites
  • Pfam database, which is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
  • IntEnz database, which provides up-to-date information on Enzyme Nomenclature
  • SCOP database of protein families based on protein structure
  • CATH database, which is a hierarchical classification of protein domain structures
  • PubMed citation database

    Residue level mapping

    After completing the clean-up of the archive, it was possible to map accurately the sequences from PDB entries on to the corresponding UniProt entries. The main difficulty in determining this mapping is that many structures in the PDB have regions of unobserved residues in the middle of continuous polypeptide chains. This discontinuity in the sequence of the structure arises because it is often impossible to reliably construct a model for poorly defined regions of structure, such as flexible loops. Such gaps in the sequence are not taken into account by traditional sequence alignment algorithms, leading to incorrect alignments for regions flanking the unobserved regions.

    To circumvent this problem, we modified the standard alignment protocol and developed software to use sequences of connected segments of a polypeptide chain from the PDB entry, corresponding to the observed regions of a protein structure. The separate alignments for these segments were then merged to assemble the complete alignment between the sequence of the observed residues from the PDB entry and the complete sequence of the protein that was used in the experiment. This latter sequence is shown in the SEQRES record in the PDB entry and does not have gaps reflecting unobserved residues. A similar procedure was carried out to obtain alignments between the sequences of observed residues and the corresponding UniProt entry. These two composite alignments were then merged to give the complete residue-level mapping between the sequence of the complete polypeptide from the experiment and its UniProt counterpart. This procedure also allows us to extract annotations from the PDB and UniProt entries to explain any differences that were detected between the two sequences, such as variants, isoforms, modified residues or engineered mutations. Unobserved residues and N- or C-terminal tags for the polypeptide chains in the PDB entry are also annotated. Regions from the UniProt entry that do not form part of the studied polypeptide and are not included in the PDB entry are clearly marked.
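The merging of per-segment alignments described above can be sketched as follows. This is a simplified illustration and not the PDBe software: each segment alignment is reduced to an ungapped block `(observed_start, target_start, length)` with 1-based positions, and the function names, as well as the rule used to fill unobserved positions, are assumptions made for the example.

```python
def merge_segment_alignments(blocks):
    """Merge per-segment alignments into one observed -> target mapping.

    blocks: list of (observed_start, target_start, length) tuples, one per
    aligned stretch of a connected segment (ungapped here for simplicity).
    """
    mapping = {}
    for observed_start, target_start, length in blocks:
        for k in range(length):
            mapping[observed_start + k] = target_start + k
    return mapping


def compose(observed_to_seqres, observed_to_uniprot):
    """Combine the two composite alignments into SEQRES -> UniProt."""
    return {
        observed_to_seqres[i]: observed_to_uniprot[i]
        for i in observed_to_seqres
        if i in observed_to_uniprot
    }


def fill_unobserved(seqres_to_uniprot):
    """Fill unobserved SEQRES positions between consistent flanking residues.

    Assumption for this sketch: a gap is filled only when the UniProt
    positions on both sides differ by exactly the SEQRES gap length.
    """
    filled = dict(seqres_to_uniprot)
    positions = sorted(seqres_to_uniprot)
    for left, right in zip(positions, positions[1:]):
        gap = right - left
        if gap > 1 and seqres_to_uniprot[right] - seqres_to_uniprot[left] == gap:
            for k in range(1, gap):
                filled[left + k] = seqres_to_uniprot[left] + k
    return filled


# Toy chain: a 2-residue tag, observed residues at SEQRES 3-6 and 9-12,
# and an unobserved loop at SEQRES 7-8. SEQRES 3 matches UniProt residue 21.
obs_to_seqres = merge_segment_alignments([(1, 3, 4), (5, 9, 4)])
obs_to_uniprot = merge_segment_alignments([(1, 21, 4), (5, 27, 4)])
seqres_to_uniprot = compose(obs_to_seqres, obs_to_uniprot)
complete = fill_unobserved(seqres_to_uniprot)
```

In the toy chain, the tag residues at SEQRES positions 1 and 2 receive no UniProt counterpart, while the unobserved loop is placed between its flanking observed segments.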

    The program also copes with the more complex situation in chimeric structures, where sequences from two or more UniProt entries are involved. In this case the correct boundaries are manually confirmed and stored in a temporary table in the database. The program uses this information to identify the correct alignments for each segment of the polypeptide chain.
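How manually confirmed boundaries might be used for a chimeric chain is sketched below. The table layout, the placeholder identifiers ("xxxx", "ACC_A", "ACC_B") and the function names are all hypothetical; the source does not describe the structure of the temporary table.

```python
# Hypothetical boundary rows:
# (pdb_id, chain_id, seqres_start, seqres_end, uniprot_accession)
BOUNDARIES = [
    ("xxxx", "A", 1, 120, "ACC_A"),
    ("xxxx", "A", 121, 250, "ACC_B"),
]


def accession_for_residue(pdb_id, chain_id, seqres_pos, boundaries=BOUNDARIES):
    """Return the accession whose confirmed range contains seqres_pos."""
    for row_pdb, row_chain, start, end, accession in boundaries:
        if (row_pdb, row_chain) == (pdb_id, chain_id) and start <= seqres_pos <= end:
            return accession
    return None  # residue is outside every confirmed range


def split_segment(pdb_id, chain_id, seg_start, seg_end, boundaries=BOUNDARIES):
    """Split one observed segment at the confirmed boundaries.

    Returns (start, end, accession) pieces so that each piece can be
    aligned against the correct UniProt entry.
    """
    pieces = []
    for row_pdb, row_chain, start, end, accession in boundaries:
        if (row_pdb, row_chain) != (pdb_id, chain_id):
            continue
        lo, hi = max(seg_start, start), min(seg_end, end)
        if lo <= hi:
            pieces.append((lo, hi, accession))
    return pieces
```

A segment that spans the fusion point is thereby divided into two pieces, each aligned only to the entry it derives from.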

    Data update mechanism

    Both the PDBe and UniProt groups have developed relational databases to store their data. The databases are implemented in Oracle and are used as the primary archival system for the data. This has allowed us to use various mechanisms provided by Oracle to exchange information between the two databases without exporting the data into flat files. Figure 3 shows how the data are exchanged between the databases.

    Figure 3

    When new PDB entries are deposited, the source taxonomy is validated against the NCBI taxonomy database and the tax_id is determined. The DBREF data are extracted and sent to the UniProt group, who validate those which have UniProt cross-references and determine the UniProt reference for those proteins with only GenBank or EMBL cross-references. During validation, the UniProt group can directly access data in the PDBe Production database via views, which greatly facilitates the process. If no existing UniProt entry matches the sequence, a new TrEMBL entry is created for that sequence. The validated taxonomy and DBREF data are stored in the PDBe Production database.

    Using the validated DBREF, the residue-level mapping is carried out as described above and the validated taxonomy, DBREF and mappings are loaded into the PDBe Production database, which contains the rest of the data for the PDB entries. A series of views in the PDBe Production database is made available to the UniProt group, who can then automatically access the structural information they need for inclusion in UniProt entries.

    Figure 4

    Because both the PDBe and UniProt data are stored in Oracle databases connected by database links, each database can readily keep track of changes in the other and update its own data accordingly. For example, the CRC64 of the UniProt sequence at the time the residue-level mapping is carried out is held in the MSD (now PDBe) database. If the CRC64 of a sequence in the UniProt database changes, it is simple to determine that the sequence needs to be re-mapped in MSD. Similarly, it is possible to keep track of when a UniProt accession code becomes secondary, or when the protein ID or the taxonomy of a particular UniProt entry changes. This is critical at a time when UniProt is demerging many entries, so that each accession code becomes associated with a single tax_id.
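The checksum comparison can be illustrated with a minimal sketch. It makes two loud assumptions: `zlib.crc32` stands in for CRC64, which is not available in the Python standard library, and plain dictionaries stand in for the Oracle tables and database links; the accessions "ACC_A" and "ACC_B" are placeholders.

```python
import zlib


def checksum(sequence):
    """Stand-in checksum (CRC32, not the CRC64 used in the real databases)."""
    return zlib.crc32(sequence.encode("ascii"))


def sequences_to_remap(stored_checksums, current_sequences):
    """Return accessions whose current sequence no longer matches the stored checksum.

    stored_checksums: accession -> checksum recorded when the mapping was made.
    current_sequences: accession -> sequence as it is now.
    Accessions with no stored checksum are also reported, since they have
    never been mapped.
    """
    stale = []
    for accession, sequence in current_sequences.items():
        if stored_checksums.get(accession) != checksum(sequence):
            stale.append(accession)
    return sorted(stale)


# Checksums recorded at mapping time, followed by a later sequence update.
stored = {"ACC_A": checksum("MKTAYIAK"), "ACC_B": checksum("GSHMLEDP")}
current = {"ACC_A": "MKTAYIAK", "ACC_B": "GSHMLEDPV"}
stale_accessions = sequences_to_remap(stored, current)
```

Only the entry whose sequence changed is flagged, so the residue-level mapping needs to be recomputed for that entry alone.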

    Primary developers: Sameer Velankar, Harry Boutselakis, Phil McNeil, Dimitris Dimitropoulos, Antonio Suarez (PDBe group) and Virginie Mittard, Daniel Barrell, Julius Jacobsen (Sequence database group).
    Last modified: Fri January 23 13:02:10 BST 2009
    TEMBLOR-European Community Contract No. QLRI-CT-2001-00015