
Protein DataBank in Europe Overview

What is the PDBe, and what do we do?

PDBe is short for "Protein Data Bank in Europe".

Introduction

The aim of the PDBe group is to receive experimental data from scientists (via the deposition pages), curate these data and add them to our internal Oracle database, and then return information to the community. The group is principally concerned with data on macromolecular structures involved in the metabolism of living organisms, that is, the atomic coordinates of proteins, nucleic acids and the molecules that bind to them. We also maintain links to protein sequence information, textual information from scientific publications and a number of derived properties that augment the macromolecular structure information. The macromolecular coordinate data are collectively known as the "Protein Data Bank" (PDB), although we provide far more by offering a searchable database of these data and links to other information.

The PDBe group works in a number of areas: Deposition, Curation, Database generation and maintenance, Search Systems and other Projects. What the user sees is a series of services that allow them to use this work.

Deposition

Deposition is the name given to the process by which scientists submit experimental data (currently from X-ray crystallography, Nuclear Magnetic Resonance (NMR) or Electron Microscopy (EM)) to a central site that stores the data. We are one of three deposition sites internationally for macromolecular data; the others are the RCSB and the PDBj. Data deposited with any of the three sites are exchanged weekly, so that all sites hold all the public data within a week of deposition. The different experimental techniques have their own deposition services. Two of these (for X-ray and EM) are held at the PDBe, alongside similar services at the RCSB and PDBj. Experimental data generated by NMR can be deposited at the Biological Magnetic Resonance Data Bank (BMRB). The two PDBe deposition services are:

  1. Autodep : Autodep is designed for the deposition of molecular coordinate data generated by protein crystallography.
  2. EMDB : EMDB is designed for the deposition of data collected by electron microscopy.

Curation

Curation is part of the process of depositing structure data through the deposition service. During curation, one of our staff checks and studies the experimental details and confirms that the deposited details are suitable for entry into the PDB. The curator works with the depositor to ensure a smooth and coherent submission of the experiment. It should be remembered that our aim is to return information to the community, so it is important that the data we hold are useful to the community.

Database generation and maintenance

A significant part of the work carried out at the PDBe is associated with the design and maintenance of databases (DBs). Macromolecular structure, our principal concern, is highly complex and requires a large and highly complex database. The Oxford English Dictionary (OED) defines a database as "a large collection of data organised especially for rapid search and retrieval of information", and we like to think that our DBs reflect this definition very closely. The critical words in this definition are:

  1. Organised : Relationships and hierarchy are created within the DB data to optimize its use and reflect the data.
  2. Rapid : We assume that the rate of data deposition is going to increase and that the size of the PDB will therefore grow significantly. It is therefore imperative to organize the data so that we can continue to retrieve information from the database rapidly now and in the future.
  3. Retrieval : The database has been organised not for our benefit but so that a user can retrieve the requested information. There is a subtle but distinct difference between making it easy for us and making it easy for the end user.
  4. Information : We are not simply returning data, which is easy; the aim of our services is to provide information to the user, something that educates the user.

We have created a number of databases with specific roles in managing and holding data associated with the PDB. The work carried out by the PDBe group therefore involves DB design, implementation, the addition of links to other DBs, and loading data into these databases. It is necessary not only to add new depositions, but also to include the legacy data (coordinates generated many years ago); the last 10% of the data takes 90% of the time. We have based our databases on Oracle, since it is a highly regarded and widely used software platform suitable for holding very large amounts of data.

The search systems

The search systems have been designed to return information from the search database. Notice the distinction: we collect data from scientists, but we return information to scientists. This is one of the critical aspects of the service we have built and continue to work on. There are six different search engines:

  1. PDBepro : the advanced general search interface
  2. PDBelite : the text-based general search interface
  3. PDBechem : the search interface for the residue dictionary
  4. PDBesite : the search interface to the active site database
  5. PDBetarget : the search interface to the target database
  6. PDBefold : Secondary structure alignment search

The first two search systems have been designed to be general and are aimed at the novice user. For this reason they were designed from the start to be completely dictionary based, so adding a new search key is a trivial matter of adding new dictionary terms. Our aim is to be able to search all the databases using these generic search systems, so that the links between structure, sequence, text and function data can be used to build extremely complex queries. Search systems 3-6 are specific to the databases they act on and allow detailed analysis of the data contained within the respective database. These search systems are designed to be expert systems.

Projects

Data model design

One important issue is that we are not experts in all the different types of biological data, nor can we hold data on all of them. This means that we must be able to "talk" to other databases to provide a truly integrated system. This in turn requires a language that the other databases can understand, and so we need to design a data exchange format. We are collaborating with a number of groups on the design of ways to transport information between databases. We are looking into how to define protein structure using XML, and working with UniProtKB/Swiss-Prot on how to represent sequence details.

Application Programming Interface

A project is being undertaken to develop an Application Programming Interface (API) to the EBI-PDBe database. This consists of a series of functions that will allow external third-party software to access the EBI-PDBe database independently. It is built around a SOAP-XML messaging system, where the latest definition of SOAP is as follows:

SOAP is a lightweight protocol intended for exchanging structured information in a decentralised, distributed environment. SOAP uses XML technologies to define an extensible messaging framework, which provides a message construct that can be exchanged over a variety of underlying protocols. The framework has been designed to be independent of any particular programming model and other implementation specific semantics.

The SOAP system we have generated allows users to write their own applications in a number of different computer languages (C, C++, Java, Fortran, Perl) and make calls to the data within our database. This allows remote and direct use of the data by users in their own specific applications.
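To illustrate the kind of message a SOAP-based API exchanges, the sketch below builds a minimal SOAP 1.2 request envelope with Python's standard library. The operation name `getEntrySummary`, the parameter `entryId`, its value and the `urn:example:structure-api` namespace are hypothetical placeholders chosen for illustration; they are not taken from the actual PDBe API.

```python
import xml.etree.ElementTree as ET

# Namespace of the SOAP 1.2 envelope.
SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"
# Hypothetical namespace for the service's own operations (an assumption).
API_NS = "urn:example:structure-api"


def build_request(operation, **params):
    """Wrap one operation call and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{API_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{API_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and entry identifier, for illustration only.
request_xml = build_request("getEntrySummary", entryId="1abc")
print(request_xml)
```

Because the envelope is plain XML, a client in any of the languages listed above can construct the same message and send it over whatever transport the service accepts.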

Data harvesting

Data harvesting is the process of recording all the data necessary to reproduce an experiment, and in the case of biological data this includes the source of the material, all the processing steps, and up to the validation of the final experiment that generates the structure coordinates. This is known as "cradle to grave" data recording. The idea of data harvesting is to remove the recording of basic details of the process from the scientist so that we gain a systematic and complete set of information.

Validation

All experiments require some sort of validation and the generation of coordinate data is no exception. We are coordinating a number of procedures that provide data on the quality of the information that we store within the PDBe database.

Application development

There are a number of on-going projects to develop various algorithms and services. These include the deposition systems (in particular EMdep), and methods such as structure alignment for proteins, ligands and active sites.

Our web Services

Our home page is http://www.ebi.ac.uk/pdbe/, but you are probably aware of this as you have already reached this page. We have seven key links among those on our home page.

  1. Searches : There are 10 services accessible from this page, where we can give you back the information from our databases.
  2. Depositions : This service allows macromolecular crystallographers and electron microscopists to deposit their data via the Autodep and EMDB facilities.
  3. Documentation : Supporting information.
  4. Projects : Provides links to our main projects at the PDBe.
  5. Funding : Who gave us the money to do this work.
  6. News & Events : What is happening such as workshops, conferences and new services.
  7. Resources: Other information such as available software from the EBI.

We provide a number of services arranged around our central database resources. These start with the deposition of data generated by experiment and end with the return of information through our search systems. Our search systems have been designed for two categories of user: the "expert user" and the "novice scientist".

Databases - a user's view

A database is defined as "a large collection of data organised especially for rapid search and retrieval" whereas a databank is "a collection of data" (OED). At the PDBe we have created databases that are organised so as to recognize the intrinsic relationships and hierarchy of macromolecular coordinate data. It is fundamental to the concept of the database that the data held are self-consistent, readily searchable and clean. There are a number of different databases held at the EBI and a detailed service view can be found here (link to Dimitris pages); this description is an overview of what the user can "see".

The deposition database

The deposition database is organised to maximize the integrity of the data. The deposition database's primary aim is to define a clean and highly organised data collection. It is necessary that we confirm that all the details of the PDB are correct and sensible and can be searched within the search database. For instance :

  1. There are 74 different entries within the PDB for the expression vector "Escherichia Coli", many of which are spelling mistakes. We have created a mapping to a single definition for this data, but also retain the legacy information so that we can return the original entry information if requested.
  2. There is no standard nomenclature for "ligands" within the PDB, which makes comparison and analysis of these data difficult. We therefore keep a single dictionary entry for each residue within the PDB and rename all identical ligands (as judged by mathematical graph and stereochemistry) to the same residue name and atom names.
  3. The deposited atomic positions are produced by experimental techniques, the principal one being X-ray crystallography. The deposited coordinates are therefore a correct solution to the crystallographic experiment and as such consist of a single asymmetric unit. Our aim is to return the structure information as molecules in their biological context; for example, we generate the tetrameric structure of haemoglobin from the crystallographic dimer using the program PQS. Therefore, although we provide a download link for the original deposited PDB files (from the Atlas pages), all the information presented and visualised is based on the biological context of the data.
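The first point in the list above, mapping many spellings onto one definition while keeping the legacy value, can be sketched as follows. This is a minimal illustration only: the variant spellings, the `CANONICAL` table and the function name are assumptions made for the example, not the actual PDBe mapping.

```python
# Hypothetical mapping from legacy spellings (lower-cased, single-spaced)
# to one canonical name. The variants listed are invented examples.
CANONICAL = {
    "escherichia coli": "Escherichia coli",
    "e. coli": "Escherichia coli",
    "e.coli": "Escherichia coli",
    "eschericia coli": "Escherichia coli",  # example misspelling
}


def normalise_source(legacy_value):
    """Return (canonical, legacy) so the original entry text is never lost."""
    key = " ".join(legacy_value.lower().split())  # collapse case and spacing
    canonical = CANONICAL.get(key)  # None means: unknown, needs a curator
    return canonical, legacy_value


for raw in ["ESCHERICHIA  COLI", "Eschericia coli", "Bacillus subtilis"]:
    print(normalise_source(raw))
```

Returning both values means a search can match on the canonical name while an entry page can still show exactly what the depositor wrote.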

The deposition database is highly normalised: it contains many relationships and dictionary definitions, and much of the data is stored only once, in a dictionary. For example, consider the residue alanine; we store a single dictionary definition for alanine, and store only the separate coordinates, temperature factors and occupancy values for each individual alanine. This database has 143 "foreign keys", which is Oracle-speak for these relationships. The primary aim of this database is therefore to enforce internal consistency on the data and make the data "clean". Getting data into the deposition database is consequently a difficult step, particularly with "legacy" or older structures, where we find some interesting protein properties.
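The alanine example can be made concrete with a small sketch. It uses SQLite from Python's standard library as a stand-in for Oracle, and the table and column names are hypothetical; the point is only to show one dictionary row, many coordinate rows, and a foreign key that rejects inconsistent data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE residue_dictionary (      -- one row per residue definition
    code TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE atom_site (               -- one row per observed atom
    id           INTEGER PRIMARY KEY,
    residue_code TEXT NOT NULL REFERENCES residue_dictionary(code),
    atom_name    TEXT NOT NULL,
    x REAL, y REAL, z REAL,
    b_factor     REAL,
    occupancy    REAL
);
""")
db.execute("INSERT INTO residue_dictionary VALUES ('ALA', 'alanine')")
db.execute(
    "INSERT INTO atom_site (residue_code, atom_name, x, y, z, b_factor, occupancy) "
    "VALUES ('ALA', 'CA', 1.0, 2.0, 3.0, 15.2, 1.0)"
)


def insert_is_rejected(residue_code):
    """Try to store an atom for a residue code; report whether the DB refused."""
    try:
        db.execute(
            "INSERT INTO atom_site (residue_code, atom_name) VALUES (?, 'CA')",
            (residue_code,),
        )
        return False
    except sqlite3.IntegrityError:
        return True


print(insert_is_rejected("XYZ"))  # no dictionary entry for this code
print(insert_is_rejected("ALA"))  # dictionary entry exists
```

The rejected insert is the "difficult step" described above in miniature: data that does not agree with the dictionaries cannot enter the database until it has been cleaned.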

Chemistry dictionary (PDBechem)

The chemistry dictionary database holds the single residue definitions for the PDB. Each residue has a single entry within this database, although separate and different entries exist for different stereoisomers of a residue. This database contains many properties for each residue, such as SMILES strings, mathematical graphs, bit strings and the like, and can be searched and analysed directly via the PDBechem search interface. The database provides the residue definitions for all other databases at the EBI.

Search database

The aim of this database is to provide a searchable database for the community; the PDBelite and PDBepro interfaces interact with this database directly. The search database is derived from the deposition database by a process of database "transformation", and is a "de-normalised" representation of the same data. This is Oracle-speak for the process of rebuilding each alanine residue so that there is a complete copy for each occurrence of the residue. This makes the database larger, less complex and much more efficient for its fundamental role of returning information as fast as possible.
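A minimal sketch of such a transformation is shown below, again with SQLite standing in for Oracle. The table names, column names and entry identifiers are hypothetical. The dictionary columns are copied onto every occurrence of the residue, so a search needs no join.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Normalised side: one definition, many occurrences that point to it.
CREATE TABLE residue_dictionary (code TEXT PRIMARY KEY, name TEXT, formula TEXT);
CREATE TABLE residue_instance (
    id INTEGER PRIMARY KEY, entry TEXT, code TEXT, seq_num INTEGER
);
INSERT INTO residue_dictionary VALUES ('ALA', 'alanine', 'C3 H7 N O2');
INSERT INTO residue_instance (entry, code, seq_num) VALUES
    ('entry_a', 'ALA', 5), ('entry_a', 'ALA', 9), ('entry_b', 'ALA', 2);

-- "Transformation": build a flat table with a complete copy per occurrence.
CREATE TABLE search_residue AS
    SELECT i.entry, i.seq_num, i.code, d.name, d.formula
    FROM residue_instance AS i
    JOIN residue_dictionary AS d ON d.code = i.code;
""")

# The search side can now answer by name without touching the dictionary.
rows = db.execute(
    "SELECT entry, seq_num, name FROM search_residue "
    "WHERE name = 'alanine' ORDER BY entry, seq_num"
).fetchall()
print(rows)
```

The flat table repeats the name and formula three times, which is the trade described above: more storage in exchange for simpler and faster queries.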

Active site database

The active site database is our first derived information database. It is generated by analyzing the environment around all the ligands found within the deposition database. This makes it possible, for example, to find all occurrences of zinc finger active sites, or the statistical distribution of the residue types that interact with ATP. The search is available using the PDBesite interface from the search page. This is an extremely powerful tool since it provides a window onto one of the main roles of macromolecular structure: function and, in particular, binding to substrates. It should be noted that this database does not contain entries for apo-proteins, that is, sites that are similar to one containing a ligand but do not themselves contain a ligand. This database represents the first of many pieces of derived information that will be extracted from the PDB data.
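One simple way to "analyze the environment around a ligand" is to collect every residue that has an atom within some distance of any ligand atom. The sketch below does only that; the 4.0 Angstrom cutoff, the residue labels and the coordinates are assumptions for illustration, and the criteria actually used to build the active site database may differ.

```python
import math


def ligand_environment(ligand_atoms, protein_atoms, cutoff=4.0):
    """Return sorted residue labels with any atom within `cutoff` of the ligand.

    ligand_atoms:  list of (x, y, z) tuples
    protein_atoms: list of (residue_label, (x, y, z)) tuples
    """
    near = set()
    for residue_label, xyz in protein_atoms:
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            near.add(residue_label)
    return sorted(near)


# Invented coordinates: a single-atom ligand at the origin.
ligand = [(0.0, 0.0, 0.0)]
protein = [
    ("HIS 10", (2.0, 0.0, 0.0)),     # 2.0 from the ligand
    ("CYS 12", (0.0, 3.0, 0.0)),     # 3.0 from the ligand
    ("GLY 40", (10.0, 10.0, 10.0)),  # far away
]
site = ligand_environment(ligand, protein)
print(site)
```

Storing the resulting residue sets per ligand is what allows questions such as "which residue types surround ATP" to be answered statistically across many entries.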

Target database

The Target Information database stores Structural Genomics (SG) targets, and there is a dedicated interface that allows a user to search these data. The targets include public domain SG targets and PDB entries, including pre-release sequences. The targets are mainly protein sequences, so the results of a given search query contain related links to structural data (PDB). Targets can be tracked with this server either by sequence similarity between the user's sequence(s) and SG targets, or by direct detailed search (i.e. by searching for targets by their status, protein name, organism source, etc.).

Search systems

A database is of no use without a means to search and retrieve information from it. We have a number of web search interfaces at the current time, although our aim is to integrate these as much as possible to provide the ability to search on a number of different pieces of information in a connected way.

A search system can be as simple as, say, the PDBefold interface, where a Perl script is used to submit calculations to our computer farm and return structure alignment information. It can also be as complex as the PDBepro search system, which provides the user with a graphical view of the database content and allows the combination of search terms (including structure and sequence alignment) using logical statements such as AND, OR and NOT. The multiple interfaces have been created to provide access for both the expert user and the novice scientist. By definition, a search system that provides access to all aspects of biological data (structure, sequence, active site, published abstracts) must be accessible to and understood by the novice scientist, since biological science is a highly diverse and huge field.
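The logical combination of search terms can be pictured as set operations on the hits returned by each term. The sketch below is illustrative only: the term names and entry identifiers are invented, and it does not describe how PDBepro evaluates queries internally.

```python
# Invented hit lists: each search term maps to the set of matching entries.
hits = {
    "resolution_better_than_2A": {"e1", "e2", "e3"},
    "contains_ATP": {"e2", "e3", "e4"},
    "method_NMR": {"e4", "e5"},
}
all_entries = set().union(*hits.values())


def q_and(a, b):
    """Entries matching both terms."""
    return a & b


def q_or(a, b):
    """Entries matching either term."""
    return a | b


def q_not(a, universe):
    """Entries in the universe that do not match the term."""
    return universe - a


# (resolution better than 2 A AND contains ATP) AND NOT solved by NMR
result = q_and(
    q_and(hits["resolution_better_than_2A"], hits["contains_ATP"]),
    q_not(hits["method_NMR"], all_entries),
)
print(sorted(result))
```

Nesting these three functions gives the same logical hierarchy that a graphical query builder lets the user draw.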

The design of these search systems has concentrated on some primary aims; we have not yet succeeded in all of these - but we are working on it.

  1. A user must receive an answer before they have forgotten the question. How many times have you gone from one room in a house to get something (like a set of keys), only to enter a second room and find you have completely forgotten why you were there? This is because we appear to have a number of different memory states; the short-term memory lasts only between 6 and 12 seconds. If you have not done the intended task, or placed the thought into medium-term memory (by being reminded of the question), within this time then you simply forget the "question". A database must therefore return the answer within about 5 seconds (we actually stipulate 3 seconds for our database), or it becomes necessary to remind the user what question they asked. Allowing the user to ask questions and get answers within this time lets them browse using just their short-term memory, and defines the term "interactive".
  2. No more than 6 items of information can be presented to the user at any one time or the user gets "information overload". The problem here is what the user perceives as a piece of information. To an expert in protein sequence analysis and alignment, a single item would be the alignment of many sequences. To a non-expert in sequence analysis this would appear as many 1000's of single characters on a screen. To a structure expert, a single protein 3D-stick model would represent a single item of information, to anyone else that would appear as a mess of lines and balls of colour. To solve this problem we have moved into the field of visualisation.
  3. The system must work on whatever machine the user has, with no effort required to install software or make changes to their system. We have a design aim to support as many different machines, operating systems and browsers as possible. We also believe that users should not have to install any applications to use our web site, particularly as large companies and institutes often do not allow non-IT staff to make changes to their computers. This is of course a problem, as the many different browsers behave differently, and older versions in particular can be very limiting. Only PDBepro requires a version of Java (1.41) that is likely to require installation on older systems, although we hope to reduce this restriction at some stage. The other search and display systems are either simple HTML or use Java version 1.1, which is generally available by default.
  4. The search system must be robust and extensible. We have tried to design the search system so that it works across multiple web nodes and computers and continues to work all the time. We also have to assume that the amount of data we have to store and search will increase rapidly.

Different users

We can define different classes of users of our site:

  1. The expert user : this is the scientist who is looking at data of which they already have a good understanding. For example, the structural biologist looking at protein structure coordinates. We can assume that they understand the current terminology and are able to use the detailed search and visualization systems, provided these conform to the systems they are used to. Expert users usually want to get at the detail.
  2. The novice scientist : this is the scientist who is looking at data that is not the principal focus of their work. For example, the structural biologist looking at protein sequence alignment. We can assume that this scientist can pick up ideas and concepts quickly as long as they are presented and explained in general scientific terms. We could also include the university student working towards a degree or doctorate in structural biology or a related subject.
  3. The non-scientist : this is a large range of people, from school pupils to any member of the public browsing the internet. This very large group is very difficult to cater for, as structural biology requires an understanding of atoms (from physics), chemical reactions (chemistry), biological systems (biology), and a great deal of mathematics (3D objects, vector algebra, etc.). One of our aims is to bring this subject to a wider audience.

We are therefore concentrating on the expert users and the novice scientists at the moment with different search systems aimed at these two groups of users.

Expert systems

We define an expert system as a resource that is available to scientists that allows them to study data and return information within the field of their expertise. These systems are unlikely to be informative to scientists within other fields of science as they require detailed knowledge to both define a query and also understand the answer.

The expert systems available at the PDBe consist of:

PDBesite : the active site database. A comprehensive interface that provides possibilities to research many aspects of active sites in proteins/nucleic acids, both in terms of the ligands and also the environment about these ligands. This will be of most interest to structural biologists, as the emphasis is on the local 3D environment about ligands.

PDBetarget : the target database.

PDBefold : the fold database. This service provides the ability to identify similar folds among the currently published protein structures. The interface allows submission of a set of coordinates, which is then screened using the program SSM (ref) against the PDB or representative subsets of the PDB. This search system is of most interest to structural biologists who wish to study similarity within protein structures and families of proteins by structure.

PDBechem : The small molecule database. The small molecule database consists of those "residues" that have been submitted to the PDB. For example, it contains all the amino acids, nucleic acids, metals, substrates and any other distinct group of atoms that is part of or associated with a PDB submission. This database is also used as the fundamental dictionary for all our other databases, and we provide a direct search interface for chemists to study these data. This interface will be of most use to chemists.

PQS : The protein quaternary structure search. Submissions to the PDB are solutions to experimental data; they are not necessarily biologically active molecules. This search is based on likely biologically active molecules generated from the submitted experimental data, and will be of most use to structural biologists.

NMR representative database : This provides an interface that returns a representative coordinate structure from the multiple structures that constitute the experimental solution to an NMR experiment.

Non-expert scientific searches

The novice scientist or general search systems provide access to our multiple databases that can be used by scientists from a number of branches of biological sciences. They also combine the many types of information (structure, sequence, publication abstract, active site, ligands) into a single search system. This provides an extremely powerful resource that is more than the sum of its parts.

PDBelite : The PDBelite search system provides a simple text based search facility that provides searches based on structure, sequence, published abstracts and textual information from PDB submissions. This search system has no browser specific limitations as it is entirely HTML based.

PDBepro : PDBepro was written due to the inherent limitations of a text based search system and the complexity of designing a text based interface that provides for all possible logical combinations between multiple search targets. The interface consists of a Java (1.41 or higher version) graphical user interface where queries are constructed using logical combinations of query blocks. This represents our most advanced search system, and with a little practice, provides enormous potential to return complex queries.

Visualisation

Pictures paint a thousand words. This is certainly the case when one is trying to study macromolecular structure and sequence to understand their function. An interactive graphics system is therefore critical. Visualisation is a large and established field of science concerned with presenting information to users so that they can understand what they see. This is different from "rendering", which simply means presenting the data graphically. It is important not only to view the raw data (such as coordinates and sequence entries, by rendering them) but also to add and correlate as much information as possible without overloading the user with too much detail. This can be achieved using visualisation techniques. We are attempting to provide not only systems to view the results of queries, but also graphical means to create queries. Our aim is to provide facilities for browsing the database information so that there is no difference between the query and the results.

Query system visualisation

Our first attempt is the PDBepro query generator, which provides the user with a graphical drawing package for constructing logical queries. This gives users a natural way to build the logical hierarchy of a query, and we aim to extend these ideas. PDBepro and PDBelite are both dictionary-based dynamic SQL engines. Since the query fragments are based on external dictionaries, the system can be readily extended and can even provide a front end to applications such as structure alignment or sequence alignment. Other query makers in the pipeline include 2D sketchers to generate active site queries.
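The idea of a dictionary-based dynamic SQL engine can be sketched in a few lines. Each search key is an entry in a dictionary that names the column and comparison to use, so adding a search key means adding a dictionary entry rather than writing new code. The keys, column names and operators below are hypothetical and are not the dictionaries used by PDBepro or PDBelite.

```python
# Hypothetical search dictionary: key -> (column, SQL comparison operator).
SEARCH_KEYS = {
    "resolution": ("entry.resolution", "<="),
    "organism": ("source.organism", "="),
    "title": ("entry.title", "LIKE"),
}


def build_where(terms):
    """Turn [(key, value), ...] into a parameterised WHERE fragment.

    Values are returned separately as bind parameters, never pasted into
    the SQL text. An unknown key raises KeyError.
    """
    clauses, params = [], []
    for key, value in terms:
        column, operator = SEARCH_KEYS[key]
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    return " AND ".join(clauses), params


where, params = build_where([("resolution", 2.0), ("organism", "Homo sapiens")])
print(where)
print(params)
```

Only the column names and operators from the dictionary ever reach the SQL text, which keeps user input out of the statement itself.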

Result visualisation

There are many graphical programs available that can render coordinate and sequence information. So why create a new one?

Our aims with the viewer:

  1. Provide a system that does not require installation. Many users do not have the access privileges to install software, or have no wish to install new software.
  2. Provide a lightweight system that will work on as many different computers and browsers as possible (given that our search system is web based).
  3. Present sequence, structure, ligand, active site and derived data in a single connected view.
  4. Highlight similarity and differences between multiple structures.

We have used the AstexViewer(TM) as the basis of the viewer and added facilities to show sequence, graphs and a number of different annotations. This viewer is written in Java 1.1 and will therefore run under most legacy and current browsers without any effort from the user, and will run satisfactorily even on the very oldest Pentium machines. Some of the facilities available now are:

  1. Combination and linking of different views and data. For example, using the AstexViewer@PDBe-EBI we present structure, sequence, active site and graph views in a single, interactive system. Picking in any view, or "brushing" a view, will highlight and interact with every other view of the data. For example, picking in the sequence will highlight the picked data point on the structure and on any graph.
  2. Reduced representations. We are working on methods of highlighting and representing the structure and sequence data in a simplified and stylised way.
  3. Maintenance of view context. Structure and sequence views of macromolecular data are extensive and complex. When a user moves about views of this information, it is easy to get lost within the large amount of information. We have therefore generated flying views of structure and sequence, so that movement around the views maintains a sense of where the user was and where they are going.
  4. Specific view context. Structure and sequence data has different levels of organization; atoms, residues, secondary structure, active sites, ligand, chains, molecules, and groups of molecules. We are trying to present views of these levels of context within this information. Thus we can define a level of context as "residue" based when zoomed in to show a small number of residues, or Assembly based when zoomed out to show a complete molecule.
  5. Similarity and difference. An important aspect of understanding macromolecular information is to view similarities and differences between data. For example, it is possible to align structures, sequences, ligands and active sites to allow interpretation of multiple molecules and their relationships. We have tried to provide the ability to align using the different view contexts and then highlight the similarities and differences by colour, shape etc.

Application development

We have developed a number of different algorithms to solve specific problems required for our services.

  1. Secondary structure matching (SSM) was developed for the structure alignment of proteins by graph matching of secondary structure elements. This alignment algorithm can be used as a direct resource (link) or can be used as part of our search system (link).

Glossary

Yet another PDB?

The PDB exists, so why create a database? This is an obvious question, and the answer lies in what we are trying to do with the data. First, it is difficult to search a set of files stored on a disk: each file needs to be opened, read and closed to extract a piece of information. Second, much of the information within the PDB is inconsistent: ligands have a multitude of names, their atoms are named inconsistently, and there are spelling errors in some of the textual information that result in missing hits. Third, the data content of the entries falls into 3 different categories depending on the experiment used to solve them. X-ray structures are asymmetric units (the repeating unit that can tessellate to generate the crystal), NMR structures consist of multiple overlapping structures, and EM structures tend to be of lower quality at the moment.

Our aim is to present the structures in a biological context, where the molecules returned as hits are those that are believed to exist within biological systems. We also want a self-consistent view of the data, so we have standardised all spellings and atom/residue naming, and correctly mapped the structure data to the sequence data. We also want a system that is searchable and can return information, not just data. The PDBe is therefore based on an Oracle system that recognizes the internal hierarchy and relationships of the data; it is fast, it is clean, and it can return correlated information. In fact we need only return to the OED definitions of a database and a databank to understand the difference between the PDBe database and the PDB:

Database : a (large) collection of data organised especially for rapid search and retrieval of information

Databank : a (large) collection of data

Our aim is therefore not to replace the PDB, which will remain the central and up-to-date archive of the experimental data of macromolecular structure. Rather, we have produced a usable system for the rapid extraction of information. A detailed discussion of the internal working of our database can be found here.

Groups

A Group of Assemblies is defined as a family of structures/sequence that can be aligned together. Since alignment is based on the chain structure/sequence of the molecules it is possible to have multiple groups with the same membership by combination of the different chains within a set of molecules.

Assembly

The Assembly is our name for the biological reconstruction of a molecule and is often different from the flat PDB file. Experimental procedures that generate atomic coordinates produce correct experimental representations of the molecule. For X-ray this is the asymmetric unit, and for NMR this means multiple solutions. The program PQS is run on these PDB files and identifies a single compact molecule, or a small number of ambiguous ones, that are sensible estimates of the biological molecule. These can be matched against the physico-chemical properties of the molecule, such as the sedimentation coefficient. Although we provide a link to download the original PDB flat file from our Atlas pages, all of our search, site and visualisation services are based on the assembly construction and therefore the biological context.
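The geometric step behind building an assembly from an asymmetric unit is the application of a symmetry operator (a rotation plus a translation) to the deposited coordinates. The sketch below shows only that step, using a two-fold rotation about the z axis and invented coordinates. It is not a description of how PQS chooses which symmetry copies belong to the biological molecule.

```python
def apply_operator(coords, rotation, translation):
    """Apply x' = R.x + t to every (x, y, z) coordinate."""
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z
            + translation[i]
            for i in range(3)
        ))
    return moved


# Two-fold rotation about the z axis: (x, y, z) -> (-x, -y, z).
TWO_FOLD_Z = [[-1, 0, 0],
              [0, -1, 0],
              [0, 0, 1]]

# Invented coordinates standing in for one asymmetric unit.
asymmetric_unit = [(1.0, 2.0, 3.0), (2.0, 0.5, 1.0)]
symmetry_mate = apply_operator(asymmetric_unit, TWO_FOLD_Z, (0.0, 0.0, 0.0))

# A "dimer" here is simply the original copy plus its symmetry mate.
assembly = asymmetric_unit + symmetry_mate
print(assembly)
```

In a real case the operators come from the crystal's space group and include translations; the hard part, deciding which copies form a plausible biological molecule, is what a program such as PQS addresses.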

Chain

We define a chain in a macromolecule as a polymer of residues that does not have connectivity breaks, and contains at least 3 residues.
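This definition translates directly into a small routine: walk along the residues, start a new chain at every connectivity break, and keep only runs of at least 3 residues. In the sketch below the `linked` flag (whether a residue is bonded to the one before it) is assumed to have been computed already; how connectivity is determined, and the residue names used, are assumptions for illustration.

```python
def split_chains(residues, min_length=3):
    """Split residues into chains at connectivity breaks.

    residues: list of (name, linked) pairs in file order, where `linked`
              is True if the residue is bonded to the previous one.
    Returns the unbroken runs that contain at least `min_length` residues.
    """
    runs, current = [], []
    for name, linked in residues:
        if current and not linked:
            runs.append(current)  # a break ends the current run
            current = []
        current.append(name)
    if current:
        runs.append(current)
    return [run for run in runs if len(run) >= min_length]


residues = [
    ("ALA", False), ("GLY", True), ("SER", True),  # unbroken run of 3
    ("LYS", False), ("VAL", True),                 # run of 2: too short
    ("HOH", False),                                # isolated water
]
chains = split_chains(residues)
print(chains)
```

Runs that fail the length test are not discarded from the data; under the definitions in this glossary they are simply residues that do not form a chain.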

Residue

A residue is a monomer unit that can form a polymer chain, or a discrete group of atoms that forms a chemical entity. Examples are alanine, cytosine, water and adenosine triphosphate.

Atom

The basic "indivisible" unit which has a single coordinate in 3D.
