The PDBe search database



History

The Macromolecular Structure Group (MSD) is the European project for the collection, management and distribution of data about macromolecular structures. PDBe aims to serve as an alternative, complementary and extensible database, derived in part from the Protein Data Bank (PDB) and operating under the wwPDB international collaboration.

PDBe and others recognised long ago the limitations of the PDB flat file format and the need for an extensible framework for information related to macromolecular structures.

After taking into account the advances in information management and database technologies over the last decade, PDBe adopted the pragmatic approach of using relational databases in order to support its operations.

The initial step was to develop an internal database that would help with the processing of new PDB entries. This database, the "deposition database", is designed following normalisation principles in order to enforce data consistency. After loading, the "consistent" data are exported back to PDB flat files and introduced into the wwPDB repository.

The next step was to use relational database technology to offer web services that would give external users a toolset for searching and using the PDBe work.

The "deposition database" is not well suited to this purpose. A focus on a "normalised" design always comes at the expense of simplicity, ease of use and performance. This is often solved by transforming the main archive database into another "data warehouse" database that de-normalises, aggregates and simplifies it. This is exactly what the MSDSD (PDBe search database) is.
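
The transformation from a normalised schema to a de-normalised "data warehouse" table can be sketched as follows. This is only an illustration, assuming an in-memory SQLite database as a stand-in; the tables `entry`, `chain` and `entry_search` and all their columns are hypothetical and are not taken from the real deposition database or the MSDSD schema.

```python
import sqlite3


def build_search_table(conn):
    """Flatten two normalised tables into one de-normalised search table.

    All table and column names are hypothetical, chosen for illustration.
    """
    conn.executescript("""
        -- Normalised design: one row per entry, one row per chain.
        CREATE TABLE entry (entry_id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE chain (
            chain_id      INTEGER PRIMARY KEY,
            entry_id      TEXT REFERENCES entry(entry_id),
            chain_code    TEXT,
            residue_count INTEGER
        );
        INSERT INTO entry VALUES ('1abc', 'Example protein');
        INSERT INTO chain VALUES (1, '1abc', 'A', 120), (2, '1abc', 'B', 80);

        -- De-normalised design: join and aggregate once, then query cheaply.
        CREATE TABLE entry_search AS
        SELECT e.entry_id,
               e.title,
               COUNT(c.chain_id)    AS chain_count,
               SUM(c.residue_count) AS total_residues
        FROM entry e
        JOIN chain c ON c.entry_id = e.entry_id
        GROUP BY e.entry_id, e.title;
    """)


conn = sqlite3.connect(":memory:")
build_search_table(conn)
print(conn.execute(
    "SELECT entry_id, chain_count, total_residues FROM entry_search"
).fetchone())
```

A search against `entry_search` needs no join, which is the trade-off described above: redundancy and a periodic rebuild in exchange for simpler and faster queries.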

It soon became obvious that this database could also serve users who would like to access it directly, or even obtain a replica copy, and use it as an alternative to PDB flat files. In that way they could use all the tools and technologies available for relational databases and exploit the power and flexibility of SQL.

The MSDSD

The MSDSD is a rigid relational database that is a reorganisation of the internal PDBe Deposition database.

The Deposition database itself is used to

The idea behind the PDBe search database is to provide a fast and easy to use public relational database for

How to access and use MSDSD

We expect that the majority of users will use MSDSD indirectly, through the online search services available from our website. All these services depend on and use the production MSDSD database, which we maintain and update on a regular (weekly) basis. Most of them also depend on several other internal optimisation structures and components that are not part of the MSDSD core. For this reason, we do not always intend to offer them as a package that one could download and run locally.

For more demanding users we offer several options for using relational operations directly on MSDSD. The idea is that these users can take advantage of the power and flexibility of database technology to use MSDSD in novel ways, and also build on it or extend it independently. The choice of option will depend on needs and resources such as:

Below is a summary of the 4 options we support for using relational operations and SQL directly, with their strong (green), not-so-strong (orange) and weak (red) points.


MSD-API and MSD-mine

These are online services that offer direct SQL execution over the web. Both services impose limits on what an individual user may do and on the resources (database CPU time, temporary disk space, etc.) they may use. This is done to avoid over-demanding requests that would degrade the availability and performance of our local databases. Users who need more than these services offer will have to replicate a local copy of MSDSD using one of the other options described below.
In brief, MSD-mine is a web application for interactive exploration of MSDSD. It allows users to build arbitrary queries over MSDSD interactively, and these queries can then also be used for interactive data analysis and data mining. Its main aim is to familiarise users with the MSDSD data.
The MSD-API web service enables developers to query the MSDSD directly from their own application programs in their favourite environment (such as Java, C/C++ or Perl), using technologies like SOAP and WSDL. It is based on Distributed Computing and Grid concepts. The MSD-API offers the full power and flexibility of ad-hoc SQL, but it requires programming and SQL skills and is available only to registered users.
For more information and details follow the corresponding links given above.
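
To give a feel for what sending ad-hoc SQL through a SOAP service involves, the sketch below wraps an SQL string in a SOAP 1.1 envelope. Only the envelope namespace is standard. The service namespace `urn:example:msd-api`, the operation name `executeQuery`, the `sql` element and the table name are placeholders invented for illustration; the real operation names and message layout are defined by the MSD-API WSDL and may look quite different.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; the real one is given in the service WSDL.
SERVICE_NS = "urn:example:msd-api"


def build_query_envelope(sql):
    """Wrap an SQL string in a SOAP 1.1 envelope.

    The operation name 'executeQuery' and its 'sql' child element are
    assumptions for illustration, not the documented MSD-API interface.
    """
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}executeQuery")
    ET.SubElement(operation, f"{{{SERVICE_NS}}}sql").text = sql
    return ET.tostring(envelope, encoding="unicode")


# 'some_table' is a placeholder, not a real MSDSD table name.
print(build_query_envelope("SELECT COUNT(*) FROM some_table"))
```

In practice a SOAP client library generated from the WSDL would build and send such messages, so application code rarely constructs the envelope by hand.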

Replication on Oracle

MSDSD is free for academic research and can be downloaded from our FTP site.

To obtain a license, please fill in an application form and post three copies to:

Dr Melford John
Database administrator
Macromolecular Structure Database
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom

This is the most advanced remote replication option that we offer. It is available to registered users who fill in and post the MSDSD license document, which is free of charge.

It uses one of the most advanced and powerful commercial relational database servers and is the option that we recommend for the more serious users of MSDSD and for our collaborators. Additionally, since we also use Oracle at MSD, we are able to offer more support and advice. For the Oracle replication option we also offer frequent (weekly) increments for users who wish to follow closely the evolution of our local master MSDSD and of the PDB. The disadvantages of this option are that users need an Oracle server license, some database administration support and adequate hardware infrastructure.
Typically a user of this replication will download and install the latest full release (full transformation) of MSDSD using the full installation instructions. Such full releases take place on an infrequent (yearly) basis. A full release is also the point at which MSDSD is reconciled, since all PDB entries are refreshed and creeping inconsistencies are resolved.
Between full releases, the user may run the automatic synchronisation script (typically set up in a crontab), which downloads and applies the increments for the new PDB entries that are released every week.
To keep the increments manageable, corrections in reference data are not propagated back to the older entries they affect. As a result, the full set of MSDSD relational constraints is guaranteed to hold only immediately after a full release.
The MSDSD and the incremental updates are organised in sections ("marts"), so users are free to install and increment just the marts that they are interested in. There is also the option to specify which tables of a mart to install, so users may in general replicate just a few individual tables.
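
The selection logic described above (apply only the increments released since the last synchronisation, and only for the marts of interest) can be sketched as below. This is not the actual synchronisation script: the dictionary layout, the mart names `mart_a` and `mart_b`, and the dates are all made up for illustration.

```python
from datetime import date


def pending_increments(available, last_applied, wanted_marts):
    """Return increments newer than last_applied for the chosen marts.

    The result is ordered oldest first, so increments are applied in
    release order. The record layout is hypothetical.
    """
    selected = [
        inc for inc in available
        if inc["released"] > last_applied and inc["mart"] in wanted_marts
    ]
    return sorted(selected, key=lambda inc: inc["released"])


# Hypothetical listing of increments found on the server.
available = [
    {"mart": "mart_a", "released": date(2024, 1, 10)},
    {"mart": "mart_b", "released": date(2024, 1, 10)},
    {"mart": "mart_a", "released": date(2024, 1, 3)},
    {"mart": "mart_a", "released": date(2023, 12, 27)},
]

for inc in pending_increments(available, date(2023, 12, 31), {"mart_a"}):
    print(inc["mart"], inc["released"].isoformat())
```

Recording the date of the last applied increment locally is what lets a weekly cron job run repeatedly without applying the same increment twice.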


For more information you may contact the PDBe group

Replication on mySQL

This is the alternative, open source database replication that we offer. We chose to support mySQL over similar alternatives because, at the time, it seemed to be the easiest to install and get started with. It also has the fewest platform dependencies and requires almost no system administrator involvement to set up. The idea is to offer something that a researcher who is not an IT expert, and who has no dedicated resources or support, can install and try with minimum effort.
It should be easy to replicate even on a normal desktop workstation with a fair amount of disk space. It also does not bind the user community to a commercial software database vendor.
The disadvantages of this option are that mySQL may not always have the sophistication and speed of a commercial database (for complex queries), that we do not offer frequent incrementals, and that we do not use it much ourselves, so we cannot offer as much support and advice.
It is also available to registered users who fill in and post the MSDSD mySQL license document (in 3 copies), which is free of charge.
Typically a user of this replication will download the mySQL data files of the tables they are interested in directly from our FTP server and install them following the mySQL installation instructions. The tables are available in compressed MyISAM format without any pre-built indexes.

For more information you may contact the PDBe group

MSDSD and flat files (PDB, mmCif, XML)

A frequently asked question about MSDSD is why the database is not available in XML and other flat file formats (XML, mmCIF, cleaned-up PDB). The reason is that we feel XML and other flat files have to be based on rigid and systematic standardisation in order to be useful. This work is done as part of the wwPDB collaboration and we would advise users to refer to the wwPDB

MSDSD conventions

MSDSD (with some exceptions) follows a standard set of conventions in its design and architecture. Some understanding of these conventions will help anyone interested in learning the MSDSD schema, regardless of how they choose to access it (replication, API). For a more systematic study you will also need to consult the MSDSD reference documentation.

Sections of the PDBe search database

The PDBe search database is organised in interrelated sections. Some of these sections are at the centre of the database, while others may be decoupled and ignored by those who are not interested in them.
 

MSDSD frequently asked SQL