![]() |
Protein Data Bank in Europe Overview
What is the PDBe, and what do we do?

PDBe is short for "Protein Data Bank in Europe"; the group was formerly known as the Macromolecular Structure Database (MSD).

Introduction

The aim of the PDBe group is to receive experimental data from scientists (via the deposition pages), curate these data and add them to our internal Oracle database, and then return information to the community. The PDBe group is principally concerned with the macromolecular structures involved in the metabolism of living organisms, that is, the atomic coordinates of proteins, nucleic acids and the molecules that bind to them. We also maintain links to protein sequence information, textual information from scientific publications and a number of derived properties that augment the macromolecular structure information. The macromolecular coordinate data are collectively known as the "Protein Data Bank" (PDB), although we offer far more than the coordinates by providing a searchable database of them and links to other information.

The PDBe group works in a number of areas: Deposition, Curation, Database generation and maintenance, Search Systems and other Projects. What the user sees is a series of services that allow them to make use of this work.

Deposition

Deposition is the name given to the process by which scientists submit experimental data (currently from X-ray crystallography, Nuclear Magnetic Resonance (NMR) or Electron Microscopy (EM)) to a central site that stores the data. We are one of three deposition sites internationally for macromolecular data; the others are the RCSB and the PDBj. Data deposited with any of the three sites are exchanged weekly, so that all sites hold all the public data within a week of deposition.

The three experimental techniques have their own deposition services. Two of these (for X-ray and EM) are held at the PDBe, alongside similar services at the RCSB and PDBj. Experimental data generated by NMR can be deposited at the BioMagResBank (BMRB) :
Curation

Curation is part of the process of depositing structure data through the deposition service. One of our staff checks and studies the experimental details and confirms that the deposited details are suitable for entry into the PDB. The curator works with the depositor of the data to ensure a smooth and coherent submission of the experiment. It should be remembered that our aim is to return information to the community, so it is important that the data we hold are useful to the community.
Database generation and maintenance

A significant part of the work carried out at the PDBe is associated with the design and maintenance of databases (DBs). Macromolecular structure, our principal concern, is highly complex and requires a large and highly complex database. The Oxford English Dictionary (OED) defines a database as "a large collection of data organised especially for rapid search and retrieval of information", and we like to think that our DBs reflect this definition very closely. The critical words in this definition of a DB are :
We have created a number of databases with specific roles in managing and holding data associated with the PDB. The work carried out by the PDBe group therefore involves DB design, implementation, the addition of links to other DBs and the addition of data to these databases. It is necessary not only to add new depositions to these databases, but also to include the legacy data (coordinates generated many years ago), with the last 10% of the data taking 90% of the time. We have based our databases on Oracle, since this is a highly regarded and ubiquitous software platform suitable for holding very large amounts of data.

The search systems

The search systems have been designed to return information from the search database. Notice the distinction: we collect data from scientists, but we return information to scientists. This is one of the critical aspects of the service we have built and continue to work on. There are six different search engines :
The first two search systems have been designed to be general and are aimed at the novice user. Because of this they were initially designed to be completely dictionary based, so adding a new search key is a trivial matter of adding a new dictionary term. Our aim is to be able to search all the databases using these generic search systems, linking structure, sequence, text and function data to produce extremely complex queries. Search systems 3-6 are specific to the databases they act on and allow detailed analysis of the data contained within the respective database. These search systems are designed to be expert systems.

Projects

Data model design

One important issue is that we are not experts in, and cannot hold data on, all the different types of biological data. This means that we must be able to "talk" to other databases to provide a truly integrated system. We therefore need to talk in a language that the other databases can understand, and so we need to design a data exchange format. We are collaborating with a number of groups on ways of transporting information between databases. We are looking into how we define protein structure using XML, and working with UniProtKB/Swiss-Prot on how we understand sequence details.

Application Programming Interface

A project is being undertaken to develop an Application Programming Interface (API) to the EBI-PDBe database. This consists of a series of functions that will allow external third-party software to access the EBI-PDBe database independently. It is based around a SOAP-XML messaging system, where SOAP's latest definition is as follows: SOAP is a lightweight protocol intended for exchanging structured information in a decentralised, distributed environment. SOAP uses XML technologies to define an extensible messaging framework, which provides a message construct that can be exchanged over a variety of underlying protocols.
The framework has been designed to be independent of any particular programming model and other implementation specific semantics. The SOAP system we have generated allows users to write their own applications in a number of different computer languages (C, C++, Java, Fortran, Perl) and make calls to the data within our database. This allows users to work with the data remotely and directly from their own specific applications.

Data harvesting

Data harvesting is the process of recording all the data necessary to reproduce an experiment. In the case of biological data this includes the source of the material, all the processing steps, and everything up to the validation of the final experiment that generates the structure coordinates. This is known as "cradle to grave" data recording. The idea of data harvesting is to take the recording of basic details of the process away from the scientist, so that we gain a systematic and complete set of information.

Validation

All experiments require some sort of validation, and the generation of coordinate data is no exception. We are coordinating a number of procedures that provide data on the quality of the information that we store within the PDBe database.

Application development

There are a number of ongoing projects to develop various algorithms and services. These include the deposition systems (in particular EMdep), and methods such as structure alignment for proteins, ligands and active sites.

Our web services

Our home page is http://www.ebi.ac.uk/pdbe/, but you are probably aware of this as you have already reached this page. We have 7 key links amongst those on our home page.
We provide a number of services arranged around our central database resources. They start with the deposition of data generated by experiment and end with the return of information through our search systems. Our search systems have been designed for two categories of user: the "expert user" and the "novice scientist".

Databases - a user's view

A database is defined as "a large collection of data organised especially for rapid search and retrieval", whereas a databank is "a collection of data" (OED). At the PDBe we have created databases that are organised to recognize the intrinsic relationships and hierarchy of macromolecular coordinate data. It is fundamental to the concept of the database that the data held are self-consistent, readily searchable and clean. There are a number of different databases held at the EBI, and a detailed service view can be found here; this description is an overview of what the user can "see".

The deposition database

The deposition database is organised to maximize the integrity of the data. Its primary aim is to define a clean and highly organised data collection. We need to confirm that all the details of the PDB are correct and sensible and can be searched within the search database. For instance :
The deposition database is highly normalised: it contains many relationships and dictionary definitions, and much of the data is stored only once, in the dictionary. For example, consider the residue alanine; we store a single dictionary definition for alanine, and store separately only the coordinates, temperature factors and occupancy values for each individual alanine. This database has 143 "foreign keys" - Oracle-speak for these relationships. The primary aim of this database is therefore to enforce internal consistency on the data and make the data "clean". Getting data into the deposition database is consequently a difficult step, particularly with "legacy" or older structures, where we find some interesting protein properties.

Chemistry dictionary (PDBechem)

The chemistry dictionary database holds the single residue definitions for the PDB. Each residue has a single entry within this database, although separate entries exist for the different stereoisomers of a residue. The database contains many properties for each residue, such as SMILES strings, mathematical graphs, bit strings and the like, and can be searched and analysed directly via the PDBechem search interface. It provides the residue definitions for all other databases at the EBI.

Search database

The aim of this database is to provide a searchable database for the community; the PDBelite and PDBepro interfaces interact with it directly. The search database is derived from the deposition database by a process of database "transformation", and is a "de-normalised" representation of the same data. This is Oracle-speak for rebuilding each alanine residue so that there is a complete copy for every occurrence of the residue. This makes the database larger, less complex and much more efficient for its fundamental role of returning information as fast as possible.
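The contrast between the normalised deposition database and the de-normalised search database can be illustrated with a small Python sketch. It is not the PDBe schema: the names `RESIDUE_DICTIONARY`, `ResidueOccurrence` and `denormalise`, the fields, and all the values are assumptions made up for illustration. The normalised form stores each residue definition once and refers to it by key; the de-normalised form copies the definition into every occurrence.

```python
from dataclasses import dataclass

# Hypothetical residue dictionary: each residue type is defined exactly once.
RESIDUE_DICTIONARY = {
    "ALA": {"residue_name": "alanine", "formula": "C3H7NO2"},
    "GLY": {"residue_name": "glycine", "formula": "C2H5NO2"},
}


@dataclass
class ResidueOccurrence:
    """Normalised row: per-occurrence values plus a key into the dictionary."""
    entry_id: str
    chain_id: str
    position: int
    residue_code: str      # acts like a foreign key into RESIDUE_DICTIONARY
    mean_b_factor: float   # illustrative stand-in for temperature factors
    occupancy: float


def denormalise(occurrences, dictionary):
    """Copy the dictionary definition into every occurrence (search-style rows)."""
    rows = []
    for occ in occurrences:
        definition = dictionary[occ.residue_code]  # KeyError signals a broken key
        rows.append({**vars(occ), **definition})
    return rows


# Made-up occurrences, for illustration only.
OCCURRENCES = [
    ResidueOccurrence("entry1", "A", 1, "ALA", 21.5, 1.0),
    ResidueOccurrence("entry1", "A", 2, "GLY", 19.0, 1.0),
    ResidueOccurrence("entry2", "B", 7, "ALA", 35.2, 0.5),
]

SEARCH_ROWS = denormalise(OCCURRENCES, RESIDUE_DICTIONARY)
```

Each element of `SEARCH_ROWS` now carries its own copy of the alanine or glycine definition, so a search never has to follow a key, at the cost of storing the definition repeatedly.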
Active site database

The active site database is our first derived information database. It is generated by analyzing the environment around all the ligands found within the deposition database. This makes it possible to find, for example, all occurrences of zinc finger active sites, or the statistical distribution of each residue type interacting with ATP. The search is available using the PDBesite interface from the search page. This is an extremely powerful tool, since it provides a window onto one of the main roles of macromolecular structure: function and, in particular, binding to substrates. It should be noted that this database does not contain entries for apo-proteins, that is, active sites that are similar to one containing a ligand but that do not themselves contain a ligand. This database is the first of many pieces of derived information that will be extracted from the PDB data.

Target database

The Target Information database stores Structural Genomics (SG) targets, and there is a dedicated interface that allows a user to search this data. The targets include public domain SG and PDB sequences, including pre-release sequences. The targets are mainly protein sequences, so a given search query returns related links to structural data (PDB). Targets can be tracked with this server either by sequence similarity between the user's sequence(s) and SG targets, or by direct detailed search (i.e. by searching for targets by their status, protein name, organism source, etc.).

Search systems

A database is of no use without a means to search and retrieve information from it. We currently have a number of web search interfaces, although our aim is to integrate these as much as possible so that a number of different pieces of information can be searched in a connected way.
A search system can be as simple as the PDBefold interface, where a Perl script is used to submit calculations to our computer farm and return structure alignment information. It can also be as complex as the PDBepro search system, which provides the user with a graphical view of the database content and allows search terms (including structure and sequence alignment) to be combined using logical statements such as AND, OR and NOT. The multiple interfaces have been created to provide access for both the expert user and the novice scientist. By definition, a search system that provides access to all aspects of biological data (structure, sequence, active site, published abstracts) must be accessible to and understood by the novice scientist, since biological science is a huge and highly diverse field. The design of these search systems has concentrated on some primary aims; we have not yet succeeded in all of these - but we are working on it.
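The logical combination of search terms described above for PDBepro can be sketched in Python as follows. This is an illustration under stated assumptions, not the PDBepro implementation: the dictionary keys, the `entries` table and its columns, the helper names (`term`, `AND`, `OR`, `NOT`, `search`, `demo_connection`) and the sample rows are all hypothetical, and SQLite stands in for Oracle so that the sketch is self-contained.

```python
import sqlite3

# Hypothetical dictionary: user-facing search keys mapped to SQL fragments.
SEARCH_DICTIONARY = {
    "method": "method = ?",
    "organism": "organism = ?",
    "max_resolution": "resolution <= ?",
}


def term(key, value):
    """Look up one search key and pair its SQL fragment with its parameter."""
    return SEARCH_DICTIONARY[key], [value]


def _combine(operator, parts):
    """Join (fragment, params) pairs with a logical operator."""
    sql = f" {operator} ".join(f"({fragment})" for fragment, _ in parts)
    params = [p for _, part_params in parts for p in part_params]
    return sql, params


def AND(*parts):
    return _combine("AND", parts)


def OR(*parts):
    return _combine("OR", parts)


def NOT(part):
    fragment, params = part
    return f"NOT ({fragment})", list(params)


def search(conn, query):
    """Run a combined query and return the matching entry identifiers."""
    where, params = query
    sql = f"SELECT entry_id FROM entries WHERE {where} ORDER BY entry_id"
    return [row[0] for row in conn.execute(sql, params)]


def demo_connection():
    """Build an in-memory table of made-up entries, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries "
        "(entry_id TEXT, method TEXT, organism TEXT, resolution REAL)"
    )
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?, ?)",
        [
            ("e1", "X-ray", "Homo sapiens", 1.8),
            ("e2", "NMR", "Homo sapiens", None),
            ("e3", "X-ray", "Escherichia coli", 2.5),
            ("e4", "EM", "Homo sapiens", 8.0),
        ],
    )
    return conn
```

For example, `search(demo_connection(), AND(term("organism", "Homo sapiens"), NOT(term("method", "NMR"))))` selects the human entries that were not solved by NMR. Because every search key comes from the dictionary, supporting a new key in this sketch only requires a new dictionary entry.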
Different users

We can define different classes of users of our site:
We are therefore concentrating on the expert users and the novice scientists at the moment, with different search systems aimed at these two groups of users.

Expert systems

We define an expert system as a resource that allows scientists to study data and return information within the field of their expertise. These systems are unlikely to be informative to scientists in other fields, as they require detailed knowledge both to define a query and to understand the answer. The expert systems available at the PDBe consist of:

PDBesite : the active site database. A comprehensive interface for researching many aspects of active sites in proteins and nucleic acids, in terms of both the ligands and the environment about these ligands. This will be of most interest to structural biologists, as the emphasis is on the local 3D environment about ligands.

PDBetarget : the target database.

PDBefold : the fold database. This service identifies similar folds within the currently published protein structures. The interface allows submission of a set of coordinates, which is then screened using the program SSM (ref) against the PDB or representative subsets of the PDB. This search system is of most interest to structural biologists who wish to study similarity within protein structures and families of proteins by structure.

PDBechem : the small molecule database. The small molecule database consists of those "residues" that have been submitted to the PDB. For example, it contains all the amino acids, nucleic acids, metals, substrates and any other distinct group of atoms that is part of, or associated with, a PDB submission. This database is also used as the fundamental dictionary for all our other databases, and we provide a direct search interface for studying its data. This interface will be of most use to chemists.

PQS : the protein quaternary structure search.
Submissions to the PDB are solutions to experimental data; they are not necessarily biologically active molecules. This search is based on the creation of likely biologically active molecules from the submitted experimental data and will be of most use to structural biologists.

NMR representative database : this provides an interface that returns a representative coordinate structure from the multiple structures that constitute the experimental solution to an NMR experiment.

Non-expert scientific searches

The novice scientist or general search systems provide access to our multiple databases and can be used by scientists from a number of branches of the biological sciences. They also combine the many types of information (structure, sequence, publication abstract, active site, ligands) into a single search system. This provides an extremely powerful resource that is more than the sum of its parts.

PDBelite : the PDBelite search system provides a simple text-based search facility covering structure, sequence, published abstracts and textual information from PDB submissions. This search system has no browser-specific limitations, as it is entirely HTML based.

PDBepro : PDBepro was written because of the inherent limitations of a text-based search system and the complexity of designing a text-based interface that provides for all possible logical combinations between multiple search targets. The interface consists of a Java (1.41 or higher version) graphical user interface in which queries are constructed using logical combinations of query blocks. This is our most advanced search system and, with a little practice, offers enormous potential for building complex queries.

Visualisation

Pictures paint a thousand words. This is certainly the case when one is trying to study macromolecular structure and sequence to understand their function. An interactive graphics system is therefore critical.
Visualisation is a large and established field of science concerned with presenting information to users so that they can understand what they see. This is different from "rendering" data, which means simply presenting the data graphically. It is important not only to view the raw data (such as coordinates and sequence entries) by rendering it, but also to add and correlate as much information as possible without overloading the user with too much detail. This can be achieved using visualisation techniques. We are attempting to provide not only systems to view the results of queries, but also graphical means to create queries. Our aim is to provide facilities for browsing the database information so that there is no difference between the query and the results.

Query system visualisation

Our first attempt is the PDBepro query generator, which provides the user with a graphical drawing package for constructing logical queries. This system offers a natural way for users to build a logical hierarchy into a query, although we aim to extend these ideas. PDBepro and PDBelite are both dictionary-based dynamic SQL engines. Since the query fragments are based on external dictionaries, the system can be readily extended and can even provide a front end to applications such as structure alignment or sequence alignment. Other query makers in the pipeline include 2D sketchers to generate active site queries.

Result visualisation

There are many graphical programs available that can render coordinate and sequence information. So why create a new one? Our aims with the viewer:
We have used the AstexViewer(TM) as the basis of the viewer and added facilities to show sequence, graphs and a number of different annotations. This viewer is written in Java 1.1 and will therefore run under most legacy and current browsers without any effort from the user, and will run satisfactorily even on the very oldest Pentium machines. Some of the facilities that are available now consist of :
Application development

We have developed a number of different algorithms to solve specific problems required for our services.
Glossary

Yet another PDB?

The PDB exists, so why create a database? This is an obvious question, but the answer lies in what we are trying to do with the data. First, it is difficult to search a set of files stored on a disk: each file needs to be opened, read and closed to extract a piece of information. Second, much of the information within the PDB is inconsistent: ligands have a multitude of names, their atoms are named inconsistently, and there are spelling errors in some of the textual information, resulting in missing hits. Third, the data content of the entries falls into 3 different categories, depending on the experiment used to solve them. X-ray structures are asymmetric units (the repeating unit that can tessellate to generate the crystal), NMR structures consist of multiple overlapping structures, and EM structures tend to be of lower quality at the moment.

Our aim is to present the structures in a biological context, where the molecules returned as hits are those that are believed to exist within biological systems. We also want a self-consistent view of the data, so we have standardised all spellings and atom/residue naming, and correctly mapped the structure data to the sequence data. Finally, we want a system that is searchable and can return information, not just data. The PDBe is therefore based on an Oracle system that recognizes the internal hierarchy and relationships of the data, is fast, is clean, and can return correlated information. In fact we need only return to the OED definitions of a database and a databank to understand the difference between the PDBe database and the PDB :

Database : a (large) collection of data organised especially for rapid search and retrieval of information
Databank : a (large) collection of data

Our aim is therefore not to replace the PDB, which will remain the central and up-to-date archive of the experimental data of macromolecular structure. We have produced a usable system for the rapid extraction of information.
A detailed discussion of the internal working of our database can be found here.

Groups

A Group of Assemblies is defined as a family of structures/sequences that can be aligned together. Since alignment is based on the chain structure/sequence of the molecules, it is possible to have multiple groups with the same membership through combinations of the different chains within a set of molecules.

Assembly

The Assembly is our name for the biological reconstruction of a molecule and is often different from the flat PDB file. Experimental procedures that generate atomic coordinates produce correct experimental representations of the molecule. For X-ray this is the asymmetric unit, and for NMR this means multiple solutions. The program PQS is run on these PDB files and identifies a single compact molecule, or a small number of ambiguous ones, that are sensible guesstimates of the biological molecules. These can be matched against the physico-chemical properties of the molecules, such as the sedimentation coefficient. Although we provide a link to download the original PDB flat file from our Atlas pages, all of our search, site and visualisation services are based on the assembly construction and therefore on the biological context.

Chain

We define a chain in a macromolecule as a polymer of residues that does not have connectivity breaks and contains at least 3 residues.

Residue

A residue is a monomer unit that can form a polymer chain, or a discrete group of atoms that forms a chemical entity. Examples are alanine, cytosine, water and adenosine triphosphate.

Atom

The basic "indivisible" unit, which has a single coordinate in 3D. |
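The glossary definition of a chain (a polymer of residues with no connectivity breaks and at least 3 residues) can be sketched in Python as follows. This is an illustration, not the PDBe implementation: the function name `split_into_chains` and the input format are assumptions, and the sketch presumes that a flag saying whether each residue is bonded to the previous one has already been worked out by some other step.

```python
MIN_CHAIN_LENGTH = 3  # from the glossary: a chain contains at least 3 residues


def split_into_chains(residues):
    """Split an ordered residue list into chains at connectivity breaks.

    `residues` is a list of (residue_name, linked_to_previous) pairs, where
    `linked_to_previous` is a hypothetical, precomputed flag saying whether
    the residue is bonded to the one before it.
    """
    segments = []
    current = []
    for name, linked_to_previous in residues:
        # A break in connectivity closes the segment built so far.
        if current and not linked_to_previous:
            segments.append(current)
            current = []
        current.append(name)
    if current:
        segments.append(current)
    # Segments shorter than the minimum length do not count as chains.
    return [seg for seg in segments if len(seg) >= MIN_CHAIN_LENGTH]


# Made-up residue sequence with two connectivity breaks.
EXAMPLE = [
    ("ALA", False), ("GLY", True), ("SER", True),
    ("LYS", False), ("VAL", True),
    ("THR", False), ("PRO", True), ("LEU", True), ("ASP", True),
]

CHAINS = split_into_chains(EXAMPLE)
```

In this example the two-residue segment LYS-VAL is discarded because it falls below the three-residue minimum, leaving two chains.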