SIFTS Methods
|
| Figure 1 |
|
We have adopted the NCBI taxonomic identifiers as a
standard way of representing the taxonomy information for all of the
PDB entries within the PDBe database. Ideally, every PDB entry
should record the organism from which each component of the
structure derives, but in the legacy archive the situation
is far from ideal: many entries have no such record, whilst
those records that are present have historically been prone to
typographical or spelling errors. For entries with no taxonomy
information, manual searches of the PDB file or accompanying
literature were performed. For all entries, we have put in place
mechanisms that automatically check the user-supplied taxonomy
information against the NCBI database, using the standard NCBI
taxonomy identifier that we assign to each PDB entry. This allows us
to correct spelling mistakes in legacy PDB files and to identify PDB
entries where the taxonomy information is simply
incorrect. Furthermore, by using a stable, curated taxonomy identifier
throughout the database, we gain access to the wealth of annotation
in the NCBI database or UniProt Taxonomy database, such as synonyms and
hierarchical relationships between different taxonomic nodes. Figure 1
shows the database schema for the taxonomy data mart in the PDBe
database.
The cleaned-up taxonomic information for every macromolecular structure
is available in the XML files from the FTP archive.
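As an illustration only, the kind of automatic check described above could be sketched as follows. This is not the PDBe implementation: the small name table, the function name resolve_tax_id and the similarity cutoff are assumptions made for the example, and a real check would run against the full NCBI taxonomy names, including synonyms.

```python
import difflib

# Hypothetical excerpt of a name -> taxonomy identifier table.
# The numbers are the NCBI taxonomy identifiers for these three organisms.
NAME_TO_TAX_ID = {
    "homo sapiens": 9606,
    "mus musculus": 10090,
    "escherichia coli": 562,
}


def resolve_tax_id(supplied_name, table=NAME_TO_TAX_ID, cutoff=0.85):
    """Return (tax_id, matched_name, status) for a user-supplied organism name.

    status is "exact", "corrected" (a close spelling was found) or
    "unresolved" (the entry would need manual curation).
    """
    # Normalise case and whitespace before looking the name up.
    key = " ".join(supplied_name.lower().split())
    if key in table:
        return table[key], key, "exact"
    # Fall back to approximate matching to catch typographical errors.
    close = difflib.get_close_matches(key, list(table), n=1, cutoff=cutoff)
    if close:
        return table[close[0]], close[0], "corrected"
    return None, None, "unresolved"
```

A misspelt name such as "Homo sapeins" would be reported as "corrected" and assigned the identifier for Homo sapiens, whereas a name with no close match is left "unresolved" for manual inspection.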
|
|
We have used sequence identity and taxonomy as the characteristics
on which to link protein sequence data (from UniProt) and protein
structure data (from PDBe).
Since the sequence of a structure in PDBe may represent either
the native protein sequence or that of an engineered mutant or other
variant, the criterion for assessing sequence identity during the
automatic procedure was that there should be 95% or higher agreement
between the sequence of a protein structure and the corresponding
sequence in UniProt. If no match was found, this criterion was
relaxed to 90% during manual annotation. For entries that were not
represented in the UniProt archive, new UniProt entries were created
based on the information given in the PDB entry.
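The two thresholds can be illustrated with a minimal sketch. It assumes that the two sequences have already been aligned (equal length, with "-" marking gaps) and that identity is counted over the columns where both sequences have a residue. That definition of identity, and the function names, are assumptions for this example rather than a description of the software actually used.

```python
def percent_identity(aligned_pdb, aligned_uniprot):
    """Percentage of identical residues over the aligned, non-gap columns."""
    if len(aligned_pdb) != len(aligned_uniprot):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_pdb, aligned_uniprot):
        if a == "-" or b == "-":
            continue  # gap columns are ignored in this sketch
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def classify_match(identity, automatic=95.0, manual=90.0):
    """Apply the 95% automatic and 90% manual-annotation thresholds."""
    if identity >= automatic:
        return "automatic"
    if identity >= manual:
        return "manual"
    return "rejected"
```

Under these assumptions, a structure whose sequence differs from the UniProt sequence at one position in twenty would be accepted automatically, while one that differs at one position in ten would only be accepted during manual annotation.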
Because protein structure is more conserved across evolutionary time than
protein sequence, and the structural differences between proteins with
high sequence identity are small, the taxonomy rule for accepting a
UniProt cross-reference was relaxed: the taxonomy IDs of the two entries,
PDBe and UniProt, must either be the same or have a common parent within
one or two levels up the taxonomic tree.
Using the above rule, we have also cleaned up the UniProt cross-references
for every entry in the PDB. Figure 2 shows the database schema for the
UniProt cross-reference data mart in the PDBe database.
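One possible reading of the relaxed taxonomy rule is sketched below: each taxonomy ID is expanded to itself plus its ancestors up to two levels, and the two entries are considered compatible if these sets overlap. The child-to-parent table uses made-up identifiers, and both the function names and this interpretation of "common parent" are assumptions for illustration.

```python
# Hypothetical child -> parent table with made-up identifiers.
# 101 and 102 share the parent 10; 201 sits in a different branch.
TOY_PARENT = {101: 10, 102: 10, 10: 1, 1: 0, 201: 20, 20: 2, 2: 0}


def lineage(tax_id, parent, max_levels):
    """Return tax_id followed by at most max_levels of its ancestors."""
    ids = [tax_id]
    current = tax_id
    for _ in range(max_levels):
        current = parent.get(current)
        if current is None:
            break  # reached the top of the table
        ids.append(current)
    return ids


def taxonomy_compatible(pdb_tax_id, uniprot_tax_id, parent, max_levels=2):
    """True if the IDs are equal or share a node within max_levels upwards."""
    pdb_nodes = set(lineage(pdb_tax_id, parent, max_levels))
    uniprot_nodes = lineage(uniprot_tax_id, parent, max_levels)
    return not pdb_nodes.isdisjoint(uniprot_nodes)
```

With the toy table, 101 and 102 are compatible because they share the parent 10, whereas 101 and 201 only meet three levels up and are therefore rejected at the default setting.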
|
|
| Figure 2 |
|
The clean-up of the UniProt cross-references has allowed us to link the
macromolecular structure information to other important data resources, such as:
GOA database, which provides assignments of gene products to the Gene Ontology (GO) resource.
InterPro database, which provides information on protein families, domains and functional sites.
Pfam database, a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
IntEnz database, which provides up-to-date information on enzyme nomenclature.
SCOP database of protein families based on protein structure.
CATH database, a hierarchical classification of protein domain structures.
PubMed citation database.
|
|
Residue level mapping
|
|
After completing the clean-up of the archive, it was possible to map accurately
the sequences from PDB entries on to corresponding UniProt entries. The main
difficulty in determining this mapping is that many structures in the PDB
have regions of unobserved residues in the middle of continuous polypeptide
chains. This discontinuity in the sequence of the structure arises because
it is often impossible to reliably construct a model for poorly defined
regions of structure, such as flexible loops. Such gaps in the sequence are
not taken into account by traditional sequence alignment algorithms, leading
to incorrect alignments for regions flanking the unobserved regions.
To circumvent this problem we modified the standard alignment protocol and
developed software to use sequences of connected segments of a polypeptide
chain from the PDB entry, corresponding to the observed regions of a protein
structure. The separate alignments for these segments were then merged
together to assemble the complete alignment between the sequence of the
observed residues from the PDB entry and the complete sequence of the protein
that was used in the experiment. This latter sequence is shown in the SEQRES
record in the PDB entry and does not have gaps reflecting unobserved residues.
A similar procedure was carried out to obtain alignments between the sequences
of observed residues and the corresponding UniProt entry. These two composite
alignments were then merged to give the complete residue-level mapping
between the sequence of the complete polypeptide from the experiment and its
UniProt counterpart. This complex procedure also allows us to extract
annotations from the PDB and UniProt entries to explain any differences that
were detected between the two sequences, such as variants, isoforms, modified
residues or engineered mutations. Unobserved residues and N- or C-terminal
tags for the polypeptide chains in the PDB entry are also annotated. Regions
from the UniProt entry that do not form part of the studied polypeptide and
are not included in the PDB entry are clearly marked.
The program also copes with the more complex situation in chimeric structures,
where sequences from two or more UniProt entries are involved. In this case
the correct boundaries are manually confirmed and this information is stored
in a temporary table in the database. The program uses this information to
identify the correct alignments for each segment of the polypeptide chain.
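The segment-wise approach described in this section can be sketched in a simplified form. Here each observed segment is placed on the full sequence by exact substring search, which is a stand-in for the real alignment step and would misplace short segments in repetitive sequences. The example sequences and the function names are invented for illustration.

```python
def place_segments(segments, full_sequence):
    """Locate each observed segment on full_sequence, in chain order.

    Returns one 0-based index in full_sequence per observed residue.
    Exact matching replaces a real alignment in this sketch.
    """
    positions = []
    search_from = 0
    for segment in segments:
        start = full_sequence.find(segment, search_from)
        if start < 0:
            raise ValueError(f"segment {segment!r} could not be placed")
        positions.extend(range(start, start + len(segment)))
        # The next segment must lie after the one just placed.
        search_from = start + len(segment)
    return positions


def merge_mappings(observed_to_seqres, observed_to_uniprot):
    """Combine the two per-residue placements into SEQRES -> UniProt indices."""
    return dict(zip(observed_to_seqres, observed_to_uniprot))


# Invented example: a six-residue tag, then a chain with an unobserved loop.
SEQRES = "HHHHHH" + "ACDEFGHIKLMNPQRSTVWY"
UNIPROT = "MKT" + "ACDEFGHIKLMNPQRSTVWY" + "AA"
OBSERVED = ["ACDEFG", "MNPQRSTVWY"]  # residues HIKL were not modelled

mapping = merge_mappings(
    place_segments(OBSERVED, SEQRES),
    place_segments(OBSERVED, UNIPROT),
)
# SEQRES positions with no observed residue: the tag and the missing loop.
unmapped = [i for i in range(len(SEQRES)) if i not in mapping]
```

In this sketch the SEQRES positions left in unmapped correspond to the tag and the unobserved loop, which the procedure described above would go on to annotate rather than leave empty.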
|
|
|
Data update mechanism
|
|
|
Both the PDBe and UniProt groups have developed relational databases to
store their data. The databases are implemented in Oracle and are used
as the primary archival system for the data. This has allowed us to
use various mechanisms provided by Oracle to exchange information
between the two databases without exporting the data into flat files.
Figure 3 shows how the data are exchanged between the databases.
|
|
Figure 3
|
|
When new PDB entries are deposited, the source taxonomy is
validated against the NCBI taxonomy database and the tax_id is
determined. The DBREF data are extracted and sent to the UniProt
group who validate those which have UniProt cross-references and
determine the UniProt reference for those proteins with only
GenBank or EMBL cross-references. During the validation process,
the UniProt group can directly access data in the PDBe Production
database, via views, which greatly facilitates the validation
process. If an existing UniProt entry cannot be found which
matches the sequence, then a new TrEMBL entry is created for
that sequence. The validated taxonomy and DBREF data are stored
in the PDBe Production database.
Using the validated DBREF, the residue-level mapping is carried out as
described above and the validated taxonomy, DBREF and mappings are
loaded into the PDBe Production database, which contains the rest of
the data for the PDB entries.
A series of views in the PDBe Production database are made available
to the UniProt group, who can then automatically access the structural
information they need for inclusion in UniProt entries.
|
|
|
Figure 4
|
|
Because both the PDBe and UniProt databases are stored in Oracle
databases connected by database links, each database can readily
keep track of changes in the other and update its own data
accordingly. For example, the CRC64 of the UniProt sequence at the
time the residue-level mapping is carried out is held in the MSD
database. If the CRC64 of a sequence in the UniProt database
changes, it is simple to determine that the sequence needs to be
re-mapped in MSD. Similarly, it is possible to keep track of when
a UniProt accession code becomes secondary, or when the
protein ID or the taxonomy of a particular UniProt entry changes. This
is critical at a time when UniProt is demerging many entries, so
that each accession code becomes associated with a single
tax_id.
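The checksum comparison can be illustrated with a short sketch. It uses CRC-32 from the Python standard library as a stand-in checksum, not the CRC64 used by UniProt, and the accession labels and function names are hypothetical. It shows only the comparison logic, not the database links or views described above.

```python
import zlib


def sequence_checksum(sequence):
    """Stand-in checksum (CRC-32); UniProt itself records a CRC64."""
    return zlib.crc32(sequence.encode("ascii"))


def accessions_to_remap(stored_checksums, current_sequences):
    """Return the accessions whose sequence changed since the last mapping.

    stored_checksums:  accession -> checksum recorded when the mapping was made
    current_sequences: accession -> sequence as it is now in the sequence database
    """
    stale = []
    for accession, checksum in stored_checksums.items():
        sequence = current_sequences.get(accession)
        # A missing accession is flagged too, since it can no longer be verified.
        if sequence is None or sequence_checksum(sequence) != checksum:
            stale.append(accession)
    return sorted(stale)


# Hypothetical accessions and sequences, for illustration only.
stored = {
    "ACC_A": sequence_checksum("ACDEFGHIKL"),
    "ACC_B": sequence_checksum("MNPQRSTVWY"),
}
current = {"ACC_A": "ACDEFGHIKL", "ACC_B": "MNPQRSTVWA"}
stale_accessions = accessions_to_remap(stored, current)
```

Only the accession whose sequence has changed is returned, so the residue-level mapping would be repeated for that entry alone.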
|
Primary developers: Sameer Velankar, Harry Boutselakis, Phil McNeil, Dimitris Dimitropoulos, Antonio Suarez (PDBe group) and Virginie Mittard, Daniel Barrell, Julius Jacobsen (Sequence database group).
Last modified: Fri January 23 13:02:10 BST 2009