PDBe and others recognised long ago the limitations of the PDB flat file format and the need for an extensible framework for information related to macromolecular structure.
After taking into account the advances in information management and database technologies over the last decade, PDBe adopted the pragmatic approach of using relational databases in order to support its operations.
The initial step was to develop an internal database that would help with the processing of new PDB entries. This database, the "deposition database", is designed following normalisation principles in order to enforce data consistency. After loading, the "consistent" data are exported back to PDB flat files and introduced into the wwPDB repository.
The next step was to use relational database technology to offer web services that give external users a toolset for searching and using the PDBe work.
The "deposition database" is not well suited to this second purpose. A focus on a "normalised" design always comes at the expense of simplicity, ease of use and performance. This is often solved by transforming the main archive database into another "data warehouse" database that de-normalises, aggregates and simplifies it. This is exactly the role of the MSDSD (PDBe search database).
It soon became obvious that this database could also serve users who would like to access it directly, or even obtain a replica copy, and use it as an alternative to PDB flat files. In that way they can use all the tools and technologies available for relational databases and exploit the power and flexibility of relational database technology and SQL.
The Deposition database itself is used to
For more demanding users of the MSDSD database we have several options for using relational operations directly on MSDSD. The idea is that these users may take advantage of the power and flexibility of database technology in order to use the MSDSD in novel ways, and also build on it or extend it independently. The choice of which option to use will depend on the needs and resources such as:

To obtain a license, please fill in an application form and post three copies to:
Dr Melford John
Database administrator
Macromolecular Structure Database
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom
This is the most advanced remote replication option that we offer. It is available to registered users who fill in and post a free-of-charge MSDSD license document.
It uses one of the most advanced and powerful commercial relational database servers and is the option that we recommend for the more serious users of MSDSD and our collaborators. Additionally, since we also use it at MSD, we are able to offer more support and advice.
For the Oracle replication option we also offer frequent (weekly) increments for users that wish to follow closely the evolution of our local master MSDSD and of the PDB.
The disadvantages of this option are that users will need an Oracle server license, some database administration support and adequate hardware infrastructure.
Typically a user of this replication will download and install the latest full release (full transformation) of MSDSD using the full installation instructions. Such full releases take place on a sparse (yearly) basis, and this is the time of MSDSD reconciliation, since all PDB entries are refreshed and creeping inconsistencies are resolved.
Between releases (full transformations), the user may run the automatic synchronisation script (typically scheduled in a crontab), which downloads and applies the increments for the new PDB entries that are released every week.
To keep the increments manageable, corrections in reference data do not propagate back to the affected old entries, so the full set of MSDSD relational constraints is guaranteed only immediately after a full release.
The MSDSD and the incremental updates are organised in sections ("marts"), so users are free to install and increment just the marts that they are interested in. There is also the option to specify which tables of a mart to install, so users may in general replicate just a few individual tables.

For more information you may contact the PDBe group

This is a consistent and enriched library of ligands, small molecules and monomers that are referred to by each residue and atom. There is complete and consistent reference information for any small molecule or amino acid (for example CPM), including detailed information about its atoms and bonds, their standard nomenclature and ordering, and important characteristics such as aromaticity and stereochemistry. Any atom or residue in an actual structure that does not follow a reference to an atom or ligand of this dictionary is simply unidentified and requires cleanup.
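
The cleanup rule described above can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module and two toy tables that borrow the CHEM_COMP and RESIDUE names and a few column names from the queries on this page; the cut-down schema and every row except the ATP identifier 794 are invented for illustration and are not the real MSDSD layout.

```python
import sqlite3

# Toy stand-ins for CHEM_COMP and RESIDUE. Only the columns needed here are
# created; the rows are invented (the ID 794 for ATP is quoted on this page).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CHEM_COMP (
        CHEM_COMP_ID   INTEGER PRIMARY KEY,
        CHEM_COMP_CODE TEXT
    );
    CREATE TABLE RESIDUE (
        RESIDUE_ID   INTEGER PRIMARY KEY,
        PDB_CODE     TEXT,
        CHEM_COMP_ID INTEGER REFERENCES CHEM_COMP (CHEM_COMP_ID)
    );
    INSERT INTO CHEM_COMP VALUES (1, 'ALA'), (794, 'ATP');
    INSERT INTO RESIDUE VALUES (10, 'ALA', 1), (11, 'ATP', 794), (12, 'XYZ', NULL);
""")

# A residue that does not resolve to a dictionary entry is "unidentified".
unidentified = conn.execute("""
    SELECT r.RESIDUE_ID, r.PDB_CODE
    FROM RESIDUE r
    LEFT JOIN CHEM_COMP c ON c.CHEM_COMP_ID = r.CHEM_COMP_ID
    WHERE c.CHEM_COMP_ID IS NULL
    ORDER BY r.RESIDUE_ID
""").fetchall()

print(unidentified)  # [(12, 'XYZ')]
```

The LEFT JOIN also catches residues whose CHEM_COMP_ID points at a missing dictionary row, which a plain `IS NULL` test on RESIDUE alone would not.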
This is where the largest and most important volume of information is held. This section is organised in 3 different interrelated hierarchies that support different points of view:

a) The sequence point of view (denoted with blue arrows). The information in this hierarchy is about the sequence and chemistry of the protein and does not relate to the 3-D folding of this sequence. A molecule corresponds to the sequence of a chain, but it is possible to have more than one chain in the PDB asymmetric unit that are slightly different foldings of the same molecule as these were observed in the experiment. The atom is again the abstract notion of a chemical atom that ignores alternative configurations or different NMR models. These are useful in relationships where the actual coordinates are not of interest, like the source organism of the molecule.

b) The PDB asymmetric point of view (denoted with green and green-orange arrows). This is the view of the observed structure as it is available in the PDB entry. The asymmetric chains are also reused in assemblies but are marked with a special non-symmetric-valid flag, which specifies that they are also valid regardless of the assembly to which they belong. This information is more useful when different chain structures are needed regardless of whether they are actually the same molecule and whether they have any interactions between them.

c) The assembly point of view, which corresponds to the actual quaternary biological entity. This represents what should be considered the actual complete structure and is useful when the actual inter-chain and ligand interactions are significant. For example, the assembly in entry 1b01 above forms a barrel-like sheet in the middle of the structure that includes strands from different chains and becomes apparent after the assembly transformation of chains.

As an example, entry 1b01 has 5 chains in the asymmetric unit (A,B,C,D,E). These chains form 3 assemblies: assembly 1 with chains (A,A1,A2,B,B1,B2), assembly 2 with chains (C,C1,C2,D,D1,D2) and assembly 3 with chains (E,E1,E2,E3,E4,E5). Chains A and B from assembly 1, C and D from assembly 2 and E from assembly 3 are also marked as non-symmetric-valid, and they may be used to extract the original PDB asymmetric unit.

Additionally, all bound molecules and water groups are defined in separate chains, named after and associated with the protein chains with which they have the strongest interaction. During the process of assembly formation, bound molecules and waters may be replicated several times, as long as they have some form of interaction with the assembly.
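
The difference between the asymmetric and the assembly points of view can be sketched with a toy CHAIN table. The example below uses Python's built-in sqlite3 module and loads only the protein chains of entry 1dn0, with values copied from the CHAIN listing on this page. The reduced column set is an assumption made for brevity, and the flag described above is taken to be the NON_ASSEMBLY_VALID column used in the queries on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CHAIN (
        CHAIN_ID           INTEGER PRIMARY KEY,
        ASSEMBLY_SERIAL    INTEGER,
        CHAIN_CODE         TEXT,
        NON_ASSEMBLY_VALID TEXT
    )
""")
# Protein chains of 1dn0 as listed on this page (solvent chains left out).
conn.executemany(
    "INSERT INTO CHAIN VALUES (?, ?, ?, ?)",
    [
        (38473, 1, "A", "Y"),
        (38474, 1, "B", "Y"),
        (38475, 2, "C", "Y"),
        (38476, 2, "C1", "N"),
        (38477, 2, "D", "Y"),
        (38478, 2, "D1", "N"),
    ],
)


def chain_codes(where, params=()):
    """Return the chain codes matching a WHERE clause, sorted by code."""
    sql = f"SELECT CHAIN_CODE FROM CHAIN WHERE {where} ORDER BY CHAIN_CODE"
    return [row[0] for row in conn.execute(sql, params)]


# Asymmetric view: the chains flagged as valid regardless of assembly.
asymmetric_unit = chain_codes("NON_ASSEMBLY_VALID = 'Y'")
# Assembly view: every chain of assembly 2, including the generated copies.
assembly_2 = chain_codes("ASSEMBLY_SERIAL = ?", (2,))

print(asymmetric_unit)  # ['A', 'B', 'C', 'D']
print(assembly_2)       # ['C', 'C1', 'D', 'D1']
```

Chains C1 and D1 exist only in the assembly view: they are the transformed copies needed to complete assembly 2 and are not part of the original asymmetric unit.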
This is a section of the database that keeps detailed information about secondary structure, from common elements like sheets and helices up to more extended formations like bulges, hairpins and motifs. For each entry there may be one or more sets of secondary structure information from different sources. Since the secondary structure is not always available in PDB entries and its source or accuracy is not consistent, the secondary structure of all entries has been re-derived directly from the coordinates of the structure using DOSS, a secondary structure prediction program based on DSSP (W. Kabsch and C. Sander (1983) Biopolymers 22:2577-2637) and Promotif [Gail Hutchinson and Janet Thornton 1996], in order to provide a consistent platform for comparisons and analysis of secondary structure. The starting point for deriving the secondary structure information is not the PDB asymmetric unit but the actual quaternary structure (the assembly), so that secondary structure elements spanning more than one chain in the assembly can be identified. For example, in entry 1b01 there is a barrel-like sheet in the middle of the structure that includes strands from 3-D transformed chains that originate from a single chain of the asymmetric unit.
Information about the active sites of the macromolecule, and the way that ligands and drugs bind to a protein. Again, since the related information that is sometimes available in the PDB entries is not consistent and trustworthy, site information is calculated internally in PDBe [Golovin, A., Dimitropoulos, D., Oldfield, T., Rachedi, A. and Henrick, K. (2005) PROTEINS: Structure, Function, and Bioinformatics 58(1): 190-9.] (http://www.ebi.ac.uk/msd-srv/msdsite/index.jsp). The active sites of a protein chain are determined from the contacts of the chain with a ligand. Contacts are defined in many ways, based on different types of bonds and interactions, taking into account the distances and angles of the atoms as well as other characteristics of the ligands and residues, such as planes. An active site can be defined not only for a particular atom, but also for a plane of a molecule.
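
As a rough illustration of contact-based site detection, the sketch below keeps every chain atom that lies within a distance cutoff of a ligand atom. This is a deliberately simplified stand-in and not the PDBe procedure: the 4.0 angstrom cutoff, the function name and the coordinates are all invented for the example, and the angle, bond-type and plane criteria mentioned above are ignored.

```python
import math


def contacts(chain_atoms, ligand_atoms, cutoff=4.0):
    """Return (chain_atom, ligand_atom) label pairs within `cutoff` angstroms.

    Each atom is a (label, (x, y, z)) tuple. The default cutoff is an
    arbitrary illustrative value, not one taken from PDBe.
    """
    found = []
    for chain_label, chain_xyz in chain_atoms:
        for ligand_label, ligand_xyz in ligand_atoms:
            if math.dist(chain_xyz, ligand_xyz) <= cutoff:
                found.append((chain_label, ligand_label))
    return found


# Invented coordinates: two atoms close to the ligand, one far away.
chain_atoms = [
    ("HIS 57 NE2", (0.0, 0.0, 0.0)),
    ("SER 195 OG", (3.0, 0.0, 0.0)),
    ("GLY 10 CA", (20.0, 0.0, 0.0)),
]
ligand_atoms = [("O1", (1.5, 2.0, 0.0))]

site = contacts(chain_atoms, ligand_atoms)
print(site)  # [('HIS 57 NE2', 'O1'), ('SER 195 OG', 'O1')]
```

Both retained atoms are 2.5 angstroms from the ligand oxygen, while the glycine atom is well outside the cutoff and is excluded from the site.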
A lot of work has also been done to provide complete and consistent cross-references to external databases such as Swiss-prot, SCOP, CATH, EC Enzyme, Gene ontology, Medline and the NCBI taxonomy database [Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R. and Henrick, K. (2005) Nucleic Acids Res. 33 (Database Issue)]. The cross-references are established at the most suitable level of detail (for example on a residue-by-residue basis for Swiss-prot, since the same chain may be referenced by two different Swiss-prot entries) but are also often aggregated to facilitate data analysis at a higher level. For more details on the broader context of this effort you may refer to the eFamily web-site.
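
The idea of keeping cross-references per residue and aggregating them when needed can be sketched as follows. The table RESIDUE_XREF, its columns and the accession placeholders are hypothetical names invented for this example; they are not taken from the MSDSD schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical residue-level mapping: one row per residue and external entry.
conn.execute("""
    CREATE TABLE RESIDUE_XREF (
        CHAIN_CODE     TEXT,
        RESIDUE_SERIAL INTEGER,
        EXTERNAL_AC    TEXT
    )
""")
# Chain A is covered by two different external entries (placeholder names).
rows = [("A", serial, "AC_ONE") for serial in range(1, 4)]
rows += [("A", serial, "AC_TWO") for serial in range(4, 6)]
conn.executemany("INSERT INTO RESIDUE_XREF VALUES (?, ?, ?)", rows)

# Aggregate the residue-level rows into one range per chain and entry.
summary = conn.execute("""
    SELECT CHAIN_CODE, EXTERNAL_AC,
           MIN(RESIDUE_SERIAL), MAX(RESIDUE_SERIAL), COUNT(*)
    FROM RESIDUE_XREF
    GROUP BY CHAIN_CODE, EXTERNAL_AC
    ORDER BY MIN(RESIDUE_SERIAL)
""").fetchall()

print(summary)  # [('A', 'AC_ONE', 1, 3, 3), ('A', 'AC_TWO', 4, 5, 2)]
```

Storing the mapping per residue is what allows a single chain to point at two external entries, and the GROUP BY recovers the coarser chain-level view on demand.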
* SELECT CHEM_COMP_ID,CHEM_COMP_CODE,CODE_3_LETTER,FORMULA,NUM_ATOMS_ALL,FORMAL_CHARGE,STEREO_SMILES,NAME
  FROM CHEM_COMP
  WHERE CHEM_COMP_CODE='ATP';
* SELECT CHEM_ATOM_ID,NAME,ELEMENT_SYMBOL,CHARGE,CHIRALITY,DEFAULT_MODEL_X,DEFAULT_MODEL_Y,DEFAULT_MODEL_Z
  FROM CHEM_ATOM
  WHERE CHEM_COMP_CODE='ATP' /* or CHEM_COMP_ID=794 */
  /* optionally: AND ELEMENT_SYMBOL!='H' for non-hydrogen */
  ORDER BY ORDERING;
* SELECT CHEM_BOND_ID,CHEM_ATOM_1_NAME,CHEM_ATOM_2_NAME,CHEM_BOND_TYPE,EXTENDED_TYPE,STEREOCHEM
  FROM CHEM_BOND
  WHERE CHEM_COMP_CODE='ATP' /* or CHEM_COMP_ID=794 */
  ORDER BY CHEM_ATOM_1_ORDERING,CHEM_ATOM_2_ORDERING;
* SELECT ASSEMBLY_ID,ASSEMBLY_SERIAL,ASSEMBLY_TYPE,ASSEMBLY_CLASS,ASSEMBLY_FORM,ASSEMBLY_TITLE,NUM_CHAINS,SCORE
  FROM ASSEMBLY
  WHERE ACCESSION_CODE='1dn0';

ASSEMBLY_ID ASSEMBLY_SERIAL ASSEMBLY_TYPE ASSEMBLY_CLASS ASSEMBLY_FORM NUM_CHAINS SCORE
-15203 0 0
17668 1 DIMERIC HE [AB] 2 -8
17669 2 TETRAMERIC HE [A2B2] 4 -10

Note: Assemblies with assembly serial 0 are not real biological assemblies. They serve for legacy purposes as placeholders for
SELECT CHAIN_ID,ASSEMBLY_SERIAL,CHAIN_CODE,CHAIN_TYPE,ASSOCIATED_CHAIN_CODE,NON_ASSEMBLY_VALID,
  PDB_CODE,CHAIN_CODE_1_LETTER,CHAIN_INCR_1_LETTER,
  NUM_RESIDUES,MOLECULE_CODE,CHAINMOL_SERIAL,MOLECULE_NAME
FROM CHAIN
WHERE ACCESSION_CODE='1dn0'
ORDER BY ASSEMBLY_SERIAL,DECODE(CHAIN_TYPE,'C',1,'B',2,3),CHAIN_CODE;

-93483 0 AW W Y A 0 1 3 5 Solvent
38473 1 A C Y A A 0 215 1 2 IGM-KAPPA COLD AGGLUTININ (LIGHT CHAIN)
38474 1 B C Y B B 0 232 2 2 IGM-KAPPA COLD AGGLUTININ (HEAVY CHAIN)
26194066 1 AW W A N D 0 191 3 1 Solvent
26193928 1 AW1 W B N C 0 239 3 2 Solvent
38475 2 C C Y C C 0 215 1 3 IGM-KAPPA COLD AGGLUTININ (LIGHT CHAIN)
38476 2 C1 C N C A 0 215 1 1 IGM-KAPPA COLD AGGLUTININ (LIGHT CHAIN)
38477 2 D C Y D D 0 232 2 3 IGM-KAPPA COLD AGGLUTININ (HEAVY CHAIN)
38478 2 D1 C N D B 0 232 2 1 IGM-KAPPA COLD AGGLUTININ (HEAVY CHAIN)
26193936 2 AW W C N E 0 332 3 3 Solvent
26193942 2 AW1 W D N F 0 418 3 4 Solvent
SELECT CHAIN_CODE,CHAIN_TYPE,ASSOCIATED_CHAIN_CODE,PDB_CODE,CHAIN_CODE_1_LETTER,
  NUM_RESIDUES,MOLECULE_CODE,MOLECULE_NAME
FROM CHAIN
WHERE ASSEMBLY_ID=17669
ORDER BY CHAIN_CODE;

AW W C E 332 3 Solvent
AW1 W D F 418 3 Solvent
C C C C 215 1 IGM-KAPPA COLD AGGLUTININ (LIGHT CHAIN)
C1 C C A 215 1 IGM-KAPPA COLD AGGLUTININ (LIGHT CHAIN)
D C D D 232 2 IGM-KAPPA COLD AGGLUTININ (HEAVY CHAIN)
D1 C D B 232 2 IGM-KAPPA COLD AGGLUTININ (HEAVY CHAIN)
SELECT CHAIN_ID,CHAIN_CODE,CHAIN_TYPE,PDB_CODE,CHAIN_CODE_1_LETTER,NUM_RESIDUES,MOLECULE_CODE,MOLECULE_NAME
FROM CHAIN
WHERE ENTRY_ID=15203 /* or ACCESSION_CODE='1dn0' */ AND NON_ASSEMBLY_VALID='Y'
ORDER BY CHAIN_CODE;

38473 A C A A 215 1 IGM-KAPPA COLD AGGLUTININ (LIGHT CHAIN)
-93483 AW W A 1 3 Solvent
38474 B C B B 232 2 IGM-KAPPA COLD AGGLUTININ (HEAVY CHAIN)
38475 C C C C 215 1 IGM-KAPPA COLD AGGLUTININ (LIGHT CHAIN)
38477 D C D D 232 2 IGM-KAPPA COLD AGGLUTININ (HEAVY CHAIN)
SELECT CHAIN_ID,CHAIN_CODE,CHAIN_TYPE,PDB_CODE,CHAIN_CODE_1_LETTER,NUM_RESIDUES,MOLECULE_CODE,MOLECULE_NAME
FROM CHAIN
WHERE ENTRY_ID=15203 /* or ACCESSION_CODE='1dn0' */ AND CHAINMOL_SERIAL=1
ORDER BY CHAIN_CODE;

26194066 AW W A 191 3 Solvent
38473 A C A A 215 1 IGM-KAPPA COLD AGGLUTININ (LIGHT CHAIN)
38474 A C A B 232 2 IGM-KAPPA COLD AGGLUTININ (HEAVY CHAIN)
SELECT RESIDUE_ID,CHAIN_CODE,SERIAL,CHEM_COMP_ID,CHEM_COMP_CODE,CODE_3_LETTER,
  PDB_SEQ,PDB_INSERT_CODE,PDB_CODE
FROM RESIDUE
WHERE ACCESSION_CODE='1dn0' AND ASSEMBLY_SERIAL=1 AND CHAIN_CODE='A';

SELECT RESIDUE_ID,ACCESSION_CODE,CHAIN_CODE,SERIAL,CHEM_COMP_ID,CHEM_COMP_CODE,CODE_3_LETTER,
  PDB_SEQ,PDB_INSERT_CODE,PDB_CODE
FROM RESIDUE
WHERE PDB_CODE!=CODE_3_LETTER AND NON_ASSEMBLY_VALID='Y';

SELECT RESIDUE_ID,ACCESSION_CODE,CHAIN_CODE,SERIAL,CHEM_COMP_ID,CHEM_COMP_CODE,CODE_3_LETTER,
  PDB_SEQ,PDB_INSERT_CODE,PDB_CODE
FROM RESIDUE
WHERE CHEM_COMP_ID IS NULL;

SELECT ATOM_DATA_ID,MODEL_SERIAL,CHAIN_CODE,CHAIN_CODE_1_LETTER,CHAIN_PDB_CODE,
  RESIDUE_SERIAL,RESIDUE_PDB_SEQ,RESIDUE_PDB_INSERT_CODE,
  CHEM_COMP_CODE,CODE_3_LETTER,
  CHEM_ATOM_NAME,CHEM_ATOM_NAME_PDB_LS,
  ORIG_X,ORIG_Y,ORIG_Z,ALT_CODE,OCCUPANCY
FROM ATOM_DATA
WHERE ACCESSION_CODE='1dn0' AND NON_ASSEMBLY_VALID='Y'
ORDER BY MODEL_SERIAL,CHAIN_CODE,RESIDUE_SERIAL,CHEM_ATOM_ORDERING;

SELECT ATOM_DATA_ID,MODEL_SERIAL,CHAIN_CODE,CHAIN_CODE_1_LETTER,CHAIN_PDB_CODE,
  RESIDUE_SERIAL,RESIDUE_PDB_SEQ,RESIDUE_PDB_INSERT_CODE,
  CHEM_COMP_CODE,CODE_3_LETTER,
  CHEM_ATOM_NAME,CHEM_ATOM_NAME_PDB_LS,
  X,Y,Z,ALT_CODE,OCCUPANCY
FROM ATOM_DATA
WHERE ASSEMBLY_ID=17668
ORDER BY MODEL_SERIAL,CHAIN_CODE,RESIDUE_SERIAL,CHEM_ATOM_ORDERING;

SELECT ATOM_DATA_ID,CHAIN_CODE,CHAIN_CODE_1_LETTER,CHAIN_PDB_CODE,
  RESIDUE_SERIAL,RESIDUE_PDB_SEQ,RESIDUE_PDB_INSERT_CODE,
  CHEM_COMP_CODE,CODE_3_LETTER,
  CHEM_ATOM_NAME,CHEM_ATOM_NAME_PDB_LS,
  X,Y,Z,ALT_CODE,OCCUPANCY
FROM ATOM_DATA
WHERE ACCESSION_CODE='1olg' AND MODEL_SERIAL=1
ORDER BY CHAIN_CODE,RESIDUE_SERIAL,CHEM_ATOM_ORDERING;

/* MySQL syntax: builds fixed-width PDB-style ATOM lines for assembly 1, model 1 of entry 1olg */
SELECT CONCAT(
RPAD("ATOM", 6, " "),
LPAD(SERIAL, 5, " "),
" ",
LPAD(CHEM_ATOM_NAME, 4, " "),
IF(ALT_CODE IS NULL, " ", ALT_CODE),
CODE_3_LETTER,
" ",
IF(CHAIN_PDB_CODE IS NULL, " ", CHAIN_PDB_CODE),
LPAD(RESIDUE_SERIAL, 4, " "),
IF(RESIDUE_PDB_INSERT_CODE IS NULL, " ", RESIDUE_PDB_INSERT_CODE),
REPEAT(" ", 3),
LPAD(X, 8, " "),
LPAD(Y, 8, " "),
LPAD(Z, 8, " "),
LPAD(OCCUPANCY, 6, " "),
REPEAT(" ", 6),
REPEAT(" ", 6),
RPAD(CHAIN_CODE, 4, " ")
) AS atom_lines
FROM ATOM_DATA WHERE (ACCESSION_CODE = "1olg") AND
(ASSEMBLY_SERIAL = 1) AND (MODEL_SERIAL = 1)
ORDER BY CHAIN_CODE, RESIDUE_SERIAL, CHEM_ATOM_ORDERING, SERIAL;
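
For users who fetch the raw columns and prefer to format them on the client side, the same fixed-width layout can be sketched in Python. The function below mirrors the field widths of the query above; the function and parameter names are invented for this example, and unlike the LPAD calls it rounds coordinates to three decimals and occupancy to two, which is an added assumption about the desired output. The sample values are made up.

```python
def atom_line(serial, atom_name, alt_code, res_name, chain_pdb_code,
              res_seq, insert_code, x, y, z, occupancy, chain_code):
    """Format one PDB-style ATOM line using the widths of the SQL query."""
    return (
        f"{'ATOM':<6}"             # record name, left-aligned in 6 columns
        f"{serial:>5}"             # atom serial number
        " "
        f"{atom_name:>4}"          # atom name, right-aligned as in the query
        f"{alt_code or ' '}"       # alternate location, blank when missing
        f"{res_name}"              # three-letter residue code
        " "
        f"{chain_pdb_code or ' '}" # one-letter PDB chain code
        f"{res_seq:>4}"            # residue number
        f"{insert_code or ' '}"    # insertion code, blank when missing
        "   "
        f"{x:>8.3f}{y:>8.3f}{z:>8.3f}"
        f"{occupancy:>6.2f}"
        f"{'':12}"                 # two blank 6-column fields, as in the query
        f"{chain_code:<4}"         # MSDSD chain code in the trailing field
    )


line = atom_line(1, "N", None, "LYS", "A", 319, None,
                 10.5, -3.25, 7.0, 1.0, "A")
print(repr(line))
print(len(line))  # 76
```

Each field occupies a fixed slice of the line, so downstream tools can read, for example, the x coordinate from `line[30:38]` and the MSDSD chain code from `line[72:76]`.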
SELECT ACCESSION_CODE,CHAIN_CODE,MODEL_SERIAL,HELIX_ID,HELIX_SERIAL,NUM_RESIDUES, LINEARITY,BEG_RESIDUE_SERIAL,BEG_CHEM_COMP_CODE,END_RESIDUE_SERIAL,END_CHEM_COMP_CODE FROM HELIX WHERE ACCESSION_CODE='1dn0' AND NON_ASSEMBLY_VALID='Y'
* SELECT ACCESSION_CODE,CHAIN_CODE,RES_SEQ FROM CHAIN_ALL_SEQ WHERE ACCESSION_CODE='1dn0';
* SELECT ACCESSION_CODE,CHAIN_CODE,RES_SEQ FROM CHAIN_OBS_SEQ WHERE ACCESSION_CODE='1dn0';