
Thornton Group Research: Computational Structural Biology

Research: computational structural biology

The goal of our research is to understand more about how biology works at the molecular level: how enzymes perform catalysis, how these molecules recognise one another and their cognate ligands, and how proteins and organisms have evolved to create life. We develop and use novel computational methods to analyse the available data, which we gather either from the literature or by mining the data resources, to answer specific questions. Much of our research is collaborative, involving either experimentalists or other computational biologists.


Enzyme structure and function

Enzyme activity is essential for almost all aspects of life. With completely sequenced genomes, the full complement of enzymes in an organism can be defined, and 3D structures have been determined for many enzyme families. Traditionally each enzyme has been studied individually, but as more enzymes are characterised it is now timely to revisit the molecular basis of catalysis, by comparing different enzymes and their mechanisms, and to consider how complex pathways and networks may have evolved. Therefore, to understand more about how enzymes work, how they evolve and how to predict their function from structure, we have performed a series of analyses (see figures 1-7).

The first step was to create the Catalytic Site Atlas (Porter et al 2004), which describes the residues involved in catalysis, as identified by structural and biochemical experiments. These data have been painstakingly extracted from the literature and the current set includes almost 500 enzymes, which have been used to create a database of protein catalytic domains (George et al 2004). A similar resource based on metal binding sites is under construction. From these data, an automated pipeline to generate 3D templates for catalytic residues from the CSA has been constructed and the specificity and sensitivity of over 100 templates have been assessed. The templates provide a powerful approach to function recognition from structure (see below). However, it is also important to consider the flexibility of enzymes and their active sites. Therefore a study of conformational change in enzymes during the catalytic cycle was performed and revealed surprisingly small movements for most enzyme active sites on binding the substrate or product, although of course there are exceptions (Gutteridge & Thornton 2004, 2005). This work has implications for the "induced fit" model of enzyme mechanisms.
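
As a minimal sketch of how the sensitivity and specificity of a single template might be assessed: given the structures a template matched and the structures known to contain the catalytic site, both measures follow from the confusion counts. The function name, the identifiers and the example sets are hypothetical and are not taken from the Catalytic Site Atlas pipeline.

```python
def template_performance(matched, true_members, all_structures):
    """Return (sensitivity, specificity) for one 3D template.

    matched        -- identifiers the template search reported as hits
    true_members   -- identifiers known to contain the catalytic site
    all_structures -- every identifier that was searched
    """
    searched = set(all_structures)
    matched = set(matched) & searched
    positives = set(true_members) & searched
    negatives = searched - positives

    true_pos = len(matched & positives)
    false_pos = len(matched & negatives)
    false_neg = len(positives - matched)
    true_neg = len(negatives - matched)

    # Guard against empty classes rather than dividing by zero.
    sensitivity = true_pos / (true_pos + false_neg) if positives else float("nan")
    specificity = true_neg / (true_neg + false_pos) if negatives else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    searched = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
    known = ["s1", "s2", "s3", "s4"]
    hits = ["s1", "s2", "s3", "s5"]
    sens, spec = template_performance(hits, known, searched)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A template with high sensitivity but low specificity finds most true sites at the cost of many false hits, so both numbers are worth reporting together.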

The ligand diversity of protein families in E. coli and all organisms shows that some families have few members and tend to conserve their substrate "type", whilst others are very diverse and bind molecules from across the metabolite spectrum. An in silico cross-docking approach was used to explore enzyme specificity and promiscuity. Since many metabolites are very similar, this study tested the hypothesis that enzymes recognise and bind only their cognate ligand, against the alternative that they are less specific and bind many related ligands (see figures 1-7). The results suggested that many enzymes are promiscuous in their binding and that a preference for the cognate substrate was only observed when both enzyme and substrate specificity were considered (Macchiarulo et al 2004).


Principles of protein structure

The group has studied various aspects of protein structure to understand more about the relationship between sequence, structure and function. Protein-protein interactions drive much of biology and, with the larger Protein Data Bank (PDB), it has been possible to extend our analyses of dimeric proteins to consider the higher-order multimers. A detailed review of protein-protein interactions, studying obligate multimeric proteins in crystals, including trimers, tetramers and hexamers, has been published. This study has endorsed the correlation between molecular weight of the protomer and the surface accessibility of the complex. It has also confirmed the importance of hydrophobic effects as the driving force stabilising these complexes (Ponstingl et al 2005). There remains a challenge to define the biological multimer from the crystallographic data (see figure 8), and it is clear that although 90% accuracy can be achieved, solution data are needed for some proteins.

Four other studies have focussed on different aspects of protein structure: protein surfaces and solubility, alpha-helical membrane proteins, structural repeats and intron/exon sites. We became interested in the solubility of proteins and how this relates to the surface distribution of polar groups. We found that protein surfaces are liberally sprinkled with hydrogen-bonding polar groups and that there are few hydrophobic patches on the surface (except in interfaces). Furthermore, during evolution, sequences change in ways that conserve surface patch polarity (Shanahan & Thornton 2004). As part of a project to try to predict the structure of an alpha-helical membrane-bound protein, we made a study of all such structures in the PDB, calculating residue propensities for the different regions of the membrane (Eyre et al 2004). We also attempted to derive an algorithm to detect structural repeats automatically. These are interesting from an evolutionary perspective and the first step in such an analysis is to derive a good dataset. In practice this proved more difficult than expected, involving many of the same methods used for detecting repeats in sequences (Murray et al 2004). Lastly, in dealing with large eukaryotic genomes one important aspect is correct identification and splicing of intron/exon junctions. We are currently analysing where the exon boundaries occur in a structure, to explore whether they correlate with secondary structure, functional modules or domain structure.
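
The residue propensities mentioned above can be illustrated with a small sketch. It assumes the common definition of propensity as the fraction of a region made up of a residue type, divided by the fraction of the whole dataset made up of that residue type. The region names and sequences are hypothetical, and this is not necessarily the exact normalisation used in Eyre et al 2004.

```python
from collections import Counter


def residue_propensities(residues_by_region):
    """Compute the propensity of each residue type for each region.

    residues_by_region maps a region name to a string (or list) of
    one-letter residue codes observed in that region.
    Propensity = (fraction of the region made up of the residue)
               / (fraction of the whole dataset made up of the residue).
    Values above 1 indicate enrichment in that region.
    """
    region_counts = {r: Counter(seq) for r, seq in residues_by_region.items()}
    total_counts = Counter()
    for counts in region_counts.values():
        total_counts.update(counts)
    grand_total = sum(total_counts.values())

    propensities = {}
    for region, counts in region_counts.items():
        region_total = sum(counts.values())
        if region_total == 0:
            continue  # skip empty regions rather than divide by zero
        propensities[region] = {
            aa: (counts[aa] / region_total) / (total_counts[aa] / grand_total)
            for aa in total_counts
        }
    return propensities


if __name__ == "__main__":
    # Hypothetical toy data: residues observed in two membrane regions.
    observed = {"hydrophobic_core": "LLLK", "headgroup": "KKLW"}
    for region, table in residue_propensities(observed).items():
        print(region, {aa: round(p, 2) for aa, p in sorted(table.items())})
```

With real data the counts would be pooled over all residues of all structures in the dataset, so that rare residue types do not produce unstable ratios.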

We continue to be interested in the evolution of proteins, and recently have been investigating whether we can identify differences between evolutionary changes observed in orthologues and paralogues. This has developed into exploring methods to identify orthologues from sequences, which remains a difficult problem, and then using structural data to rationalise those sites in a sequence that appear to be under the strongest selective pressure in recent family expansions. This work is in progress.


Functional annotation of proteins through structural data

A major focus worldwide is to provide new methods to improve the annotation of protein sequences, especially those for which little functional information is available (Jones & Thornton 2004). In our group we are developing novel tools that use 3D structural data to help in functional annotation. During the last year, much attention has focussed on template approaches, which search a new structure for function-associated templates. These templates are either derived from the Catalytic Site Atlas for enzyme active sites, or extracted automatically from the PDB for ligand-associated sites. One goal is to recognise DNA-binding proteins, which are often observed to include generic DNA-binding motifs (e.g. HTH). By combining structural data with electrostatic potentials, we have improved the recognition rate for DNA-binding proteins (Shanahan et al 2004). In addition, we have recently developed a novel approach to identify DNA-binding proteins from structural data using a combination of Hidden Markov Models and 3D motifs. These models are useful for identifying distant relatives, and the structural motifs also transcend evolutionary families, to identify DNA-binding proteins that use the same motif but are not evolutionarily related.

However almost 10% of the current ‘unknown’ structures are recalcitrant to any of the current methods and therefore we are developing ‘ab initio’ approaches, which aim to identify binding sites, using geometrical features and conservation scores. We then compare these sites (their shape and electrostatic properties) with others in the PDB and also compare the shape of ligands against a binding site using various methods, including graph matching and spherical harmonics. These methods will use a list of all possible ligands, derived from the metabolome. These approaches have the potential to predict the cognate ligands or "ligand cluster" for a structure from first principles.
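
One simple way to obtain a per-residue conservation score of the kind mentioned above is a normalised Shannon entropy over the columns of a multiple sequence alignment. The sketch below assumes that definition. The group's own scoring scheme is not described here and may differ, and the toy alignment is hypothetical.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column between 0 (variable) and 1 (invariant).

    alignment -- list of equal-length aligned sequences; '-' marks a gap.
    The score is 1 - H / log2(20), where H is the Shannon entropy of the
    residue frequencies in the column (gaps are ignored).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    max_entropy = math.log2(20)  # entropy of 20 equally likely residues
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:
            scores.append(0.0)  # an all-gap column carries no signal
            continue
        counts = Counter(residues)
        n = len(residues)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


if __name__ == "__main__":
    # Hypothetical toy alignment of four sequences.
    toy = ["ACD", "ACE", "AGF", "A-G"]
    print([round(s, 3) for s in column_conservation(toy)])
```

Scores like these could then be mapped onto surface residues and combined with geometric features such as cleft size to rank candidate binding sites.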

All these tools are being used to help predict function from three-dimensional structure, as part of our involvement in European and U.S. structural genomics projects. The methods seek to improve both the annotation of structures with functional information and the annotation of sequences with information derived from structures. Over 200 protein structures, determined by the Midwest Center for Structural Genomics, have been annotated, with varying degrees of reliability (see for example Savchenko et al 2004; Sanishvili et al 2004).


Genome annotation

Continuing my long-standing collaboration with Professor Orengo at UCL, we have used the CATH database to provide structure-based domain annotations of many genomes. Using the Gene3D data resource (Buchan et al 2002), we have analysed protein family expansion patterns. As bacterial species increase their gene complement, individual protein families reveal different patterns of expansion (Ranea et al 2004). Protein families involved in replication (e.g. ribosomal proteins) maintain a constant size in all species; metabolic enzymes grow linearly as the proteome grows; and proteins involved in regulation and signal transduction expand quadratically. These expansion patterns can be used to estimate the optimal (~4500) and maximal (~10,000) genome size for single-celled bacteria, in agreement with experimental data (Ranea et al 2005). In parallel we have been exploring how the enzyme complement of a species changes as the proteome expands. This reveals differences between the types of enzymes (as defined by their E.C. reaction numbers). We are also involved with a project to develop GRID technologies to improve "distributed functional annotation" of genomes, using structural data as an exemplar. This is in collaboration with Ewan Birney at EBI, and bioinformaticians and computer scientists at UCL (Jones, Orengo & Sorenson) and Imperial College (Sternberg & Darlington).
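
The constant, linear and quadratic expansion patterns described above correspond to exponents of roughly 0, 1 and 2 in a power law relating family size to proteome size. As a hedged sketch, the exponent can be estimated by least squares on log-transformed values. The function name and the synthetic numbers are hypothetical, and this is not necessarily the fitting procedure used in Ranea et al 2004.

```python
import math


def growth_exponent(proteome_sizes, family_sizes):
    """Estimate a and b in family_size ~ a * proteome_size**b.

    Uses ordinary least squares on log-transformed values, so every
    input must be strictly positive. Returns (a, b).
    """
    if len(proteome_sizes) != len(family_sizes) or len(proteome_sizes) < 2:
        raise ValueError("need two or more paired observations")
    xs = [math.log(n) for n in proteome_sizes]
    ys = [math.log(f) for f in family_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("proteome sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # intercept, back-transformed
    return a, b


if __name__ == "__main__":
    # Synthetic proteome sizes and family sizes, for illustration only.
    sizes = [1000, 2000, 4000, 8000]
    examples = {
        "constant": [50 for _ in sizes],
        "linear": [0.1 * n for n in sizes],
        "quadratic": [1e-5 * n * n for n in sizes],
    }
    for label, family in examples.items():
        _, exponent = growth_exponent(sizes, family)
        print(f"{label}: exponent ~ {exponent:.2f}")
```

Real family counts are noisy and include zeros for small genomes, so in practice a fit of this kind would need a policy for zero counts and some estimate of the uncertainty on the exponent.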


Functional genomics analysis of ageing

As part of a large consortium based at UCL, we have developed a robust approach for processing expression data for flies, worms and mice and for analysing the results. The protocols and data are stored in AgeBase, which is used by the experimental partners in the consortium. These methods have already been used to compare expression patterns in wild-type and mutant worms, which suggest that the detoxification system is involved in longevity assurance (McElwee et al 2004). Currently we are developing new methods to extract biological knowledge from the expression data through analysis of new data for ageing-related mutants and calorie-restricted organisms in different species. We have also used these tools to consider the differential expression of genes in different tissues and how this relates to their evolutionary lineage.


Web tool and resource development

As part of our research we have developed various web tools and data resources, which are made available online. PDBsum, a data resource that provides a graphical summary of the contents of a PDB entry, has been completely revised and released at the EBI. It is heavily used worldwide. An important recent addition is that a summary file can now be generated automatically for any uploaded set of coordinates (Laskowski et al 2005). The Catalytic Site Atlas has been released and active development continues. The ProFunc (Protein Function from Structure) pipeline (Laskowski 2005) has been further developed to perform a set of computational analyses on new structures, to help to assign their function. These include sequence comparisons, structure comparisons, template searching, domain annotations, genome context and new methods developed in the laboratory: DNA-motif templates that incorporate electrostatic effects, catalytic templates and "reverse" templates.

For a complete list of current Thornton Group web services, click here


Figure 1. The structure of phosphoglycerate kinase with a bound ATP analog. Different analyses can be performed on the structure as illustrated in figures 2-7.


Figure 2. Induced fit motions can be large, as shown by the two loop conformations, but a survey of many structures, shown in the histogram, demonstrates that induced fit is often small.


Figure 3. Structural templates based on catalytic residues can be used to identify convergent evolution in active sites; the templates of three catalytic triads are shown.


Figure 4. Spherical harmonics provide a method for mathematically describing the shape of ligands and binding sites; the spherical harmonic approximation of the protein crambin is shown.


Figure 5. The binding sites for a particular ligand can be analysed to find points of commonality. For example, the common features of adenine binding sites, as derived from many unrelated structures, are shown (pink, aromatic; yellow, hydrophobic; red, hydrogen bond acceptor; cyan, hydrophilic).


Figure 6. Metabolomics gives us a pathway view of enzymes; the similarity scores for each reactant-product pair in the TCA pathway are shown.



Figure 7. The selectivity of ligand binding by enzymes is investigated using docking. The figure shows the docking score of the cognate ligand compared to other random ligands for a selection of enzymes.



Figure 8. Inferring protein quaternary structure from a crystal lattice. The atomic coordinates of a protein derived from a crystal structure are stored in the PDB as a non-redundant representation of the protein crystal and generally do not represent the physiologically relevant protein assembly. This figure illustrates that with no additional data it is often a complex problem to infer this assembly just from the crystal. Even if the subunit multiplicity of the protein is known from experiment, the precise subunit arrangement is often ambiguous. Automated methods to infer quaternary structure from the crystal have been developed and are successful for approximately 90% of structures.


Publications: 2004

Berezin, C., Glaser, F., Rosenberg, J., Paz, I., Pupko, T., Fariselli, P., Casadio, R., & Ben-Tal, N. (2004) ConSeq: the identification of functionally and structurally important residues in protein sequences. Bioinformatics. 20(8), 1322-4. PMID 14871869

Bray, J.E., Marsden, R.L., Rison, S.C., Savchenko, A., Edwards, A.M., Thornton, J.M. & Orengo, C.A. (2004) A practical and robust sequence search strategy for structural genomics target selection. Bioinformatics. 20(14), 2288-95. PMID 15201178

Eyre, T.A., Partridge, L. & Thornton, J.M. (2004) Computational analysis of α-helical membrane protein structure: implications for the prediction of 3D structural models. Protein Eng. Des. Sel. 17(8), 613-624. PMID 15388845

George, R.A., Spriggs, R.V., Thornton, J.M., Al-Lazikani, B. & Swindells, M.B. (2004) SCOPEC: a database of protein catalytic domains. Bioinformatics. 20, I130-I136. PMID 15262791

Gutteridge, A. & Thornton, J.M. (2004) Conformational change in substrate binding, catalysis and product release: an open and shut case? FEBS Lett. 567(1), 67-73. Review. PMID 15165895

Jones, S. & Thornton, J.M. (2004) Searching for functional sites in protein structures. Curr. Opin. Chem. Biol. 8(1), 3-8. PMID 15036149

Macchiarulo, A., Nobeli, I., & Thornton, J.M. (2004) Ligand selectivity and competition between enzymes in silico. Nat Biotechnol. 22, 1039-45. PMID 15286657

McElwee, J.J., Schuster, E., Blanc, E., Thomas, J.H. & Gems, D. (2004) Shared transcriptional signature in Caenorhabditis elegans Dauer larvae and long-lived daf-2 mutants implicates detoxification system in longevity assurance. J. Biol. Chem. 279(43), 44533-43. PMID 15308663

Melamed, D., Mark-Danieli, M., Kenan-Eichler, M., Kraus, O., Castiel, A., Laham, N., Pupko, T., Glaser, F., Ben-Tal, N., Bacharach, E. (2004) The conserved carboxy terminus of the capsid domain of human immunodeficiency virus type 1 gag protein is important for virion assembly and release. J Virol. 78(18):9675-88.

Murray, K.B., Taylor, W.R. & Thornton, J.M. (2004) Toward the detection and validation of repeats in protein structure. Proteins. 57, 365-380. PMID 15340924

Nobeli, I. & Thornton, J.M. (2004) The Metabolome. In ‘The Encyclopaedia of Computational Chemistry’. von Ragué Schleyer, P. (Editor-in-Chief), John Wiley & Sons, UK. ISBN: 0-471-96588-X.

Porter, C.T., Bartlett, G.J., Thornton, J.M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res. 32, D129-133. PMID 14681376

Ranea, J.A., Buchan, D.W., Thornton, J.M. & Orengo, C.A. (2004) Evolution of protein superfamilies and bacterial genome size. J. Mol. Biol. 336, 871-887. PMID 15095866

Sanishvili, R., Wu, R., Kim, D.E., Watson, J.D., Collart, F. & Joachimiak, A. (2004) Crystal structure of Bacillus subtilis YckF: structural and functional evolution. J. Struct. Biol. 148(1), 98-109. PMID 15363790

Savchenko, A., Skarina, T., Evdokimova, E., Watson, J.D., Laskowski, R., Arrowsmith, C.H., Edwards, A.M., Joachimiak, A. & Zhang, R.G. (2004) X-ray crystal structure of CutA from Thermotoga maritima at 1.4 A resolution. Proteins. 54(1), 162-5. PMID 14705033

Shanahan, H.P. & Thornton, J.M. (2004) An examination of the conservation of surface patch polarity for proteins. Bioinformatics. 20(14), 2197-204. PMID 15073014

Shanahan, H.P., Garcia, M.A., Jones, S. & Thornton, J.M. (2004) Identifying DNA-binding proteins using structural motifs and the electrostatic potential. Nucleic Acids Res. 32(16), 4732-41. PMID 15356290

Vogel, C., Bashton, M., Kerrison, N.D., Chothia, C., Teichmann, S.A. (2004) Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol. 14, 208-16. PMID 15093836

Vogel, C., Berzuini, C., Bashton, M., Gough, J., Teichmann, S.A. (2004) Supra-domains: evolutionary units larger than single protein domains. J Mol Biol. 336, 809-23. PMID 15095989


References: other

Buchan, D.W.A., Shepherd, A.J., Lee, D., Pearl, F., Rison, S.C.G., Thornton, J.M. & Orengo, C.A. (2002) Gene3D: Structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res. 12, 503-514. PMID 11875040

Glaser, F., Rosenberg, Y., Kessel, A., Pupko, T. & Ben-Tal N. (2005) ConSurf-HSSP: A Database of Functional Regions in Proteins. Proteins; In press

Gutteridge, A. & Thornton, J.M. (2005) Conformational Changes Observed in Enzyme Crystal Structures Upon Substrate Binding. J. Mol. Biol. 346(1), 21-8. PMID 15663924

Laskowski, R.A. (2005) Determining function from structure. In 'Structural Proteomics', Sundstrom, M., Norin, M., & Edwards, A. (eds.), CRC Press, in press

Laskowski, R.A., Chistyakov, V.V. & Thornton, J.M. (2005) PDBsum more: new summaries and analyses of the known 3D structures of proteins and nucleic acids. Nucleic Acids Res. 33 (Database issue), D266-8. PMID 15608193

Massingham, T. & Goldman, N. (2005) Detecting amino acid sites under positive selection and purifying selection. Genetics. 2005 Jan 16 (ePub ahead of print). PMID 15654091

Ponstingl, H., Kabir, T., Gorse, D. & Thornton, J.M. (2005) Morphological aspects of oligomeric protein structures. Prog. Biophys. Mol. Bio., in press

Ranea, J.A.G., Grant, A., Thornton, J.M. & Orengo, C.A. (2005) Microeconomic principles explain an optimal genome size in bacteria. Trends Genet. 21(1), 21-5. PMID 15680509

Contact

We would like to encourage laboratories wishing to discuss any collaborations to contact us. For information, comments and/or suggestions please contact us.
