
Cheminformatics and Metabolism Team - Research

Introduction

Having started with compound-by-compound isolation and manual structure elucidation, my group at the Max Planck Institute for Chemical Ecology in Jena developed high-throughput methods for the automated structure elucidation of unknown metabolites, as well as data warehouse approaches for storing and managing dereplication data for known compounds. At the Cologne University Bioinformatics Center (CUBIC), we became interested in the simulation of metabolic networks, motivated by our increasing knowledge of the structural identity and function of metabolites in organisms. All of our software developments have been published as open-source software. At least two of these projects have attracted significant international attention, leading to collaborations worldwide. The projects described below are those most relevant to my application at the EBI. For a complete itemized overview of our work, please refer to my publication list.

Automated structure elucidation of metabolites to alleviate the lack of data in Systems Biology.

The understanding and simulation of metabolic networks is currently hindered by a significant lack of information on the structural identity and physical properties of biochemical metabolites in the organisms under investigation. Methods developed by our group provide a means of quickly determining the structure of metabolites by stochastic screening of large candidate spaces based on spectroscopic methods [12,16,7,4]¹. Our SENECA system is based on a stochastic structure generator which is guided by a spectroscopy-based scoring function.
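
The idea of a stochastic generator guided by a scoring function can be illustrated with a minimal sketch. It is not the SENECA implementation: the fragment table, the toy spectrum predictor and the annealing-style acceptance rule are all assumptions made for illustration, and candidates are simply lists of fragment names with the same length as the observed peak list.

```python
import math
import random

# Hypothetical fragment library: each fragment type contributes one peak (ppm).
# The values are illustrative only, not reference data.
FRAGMENT_SHIFTS = {"CH3": 15.0, "CH2": 30.0, "CH": 45.0, "C-O": 65.0, "C=O": 200.0}

def predict_spectrum(candidate):
    """Toy predictor: one peak per fragment, sorted by shift."""
    return sorted(FRAGMENT_SHIFTS[f] for f in candidate)

def score(candidate, observed):
    """Lower is better: summed absolute deviation between sorted peak lists."""
    predicted = predict_spectrum(candidate)
    return sum(abs(p - o) for p, o in zip(predicted, sorted(observed)))

def mutate(candidate, rng):
    """Replace one randomly chosen fragment by a random fragment type."""
    changed = list(candidate)
    changed[rng.randrange(len(changed))] = rng.choice(sorted(FRAGMENT_SHIFTS))
    return changed

def stochastic_search(start, observed, steps=2000, temperature=50.0,
                      cooling=0.995, seed=0):
    """Annealing-style search (an assumed strategy) that returns the best candidate seen."""
    rng = random.Random(seed)
    current, current_score = list(start), score(start, observed)
    best, best_score = current, current_score
    for _ in range(steps):
        proposal = mutate(current, rng)
        proposal_score = score(proposal, observed)
        delta = proposal_score - current_score
        # Always accept improvements; accept worse candidates with a
        # probability that shrinks as the temperature is lowered.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = proposal, proposal_score
            if current_score < best_score:
                best, best_score = current, current_score
        temperature *= cooling
    return best, best_score
```

Calling `stochastic_search(["CH", "CH", "CH", "CH"], [15.0, 30.0, 65.0, 200.0])` returns the best-scoring fragment list found together with its score. In a realistic setting the candidate would be a full molecular graph and the predictor a trained spectrum prediction engine.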

In order to perform this scoring we need precise and fast methods for the prediction of mass and NMR spectra. Here, we employ machine learning methods such as support vector machines to correlate graph-based molecular descriptors with database knowledge [26,25]. The resulting prediction engines are then used as judges in our SENECA scoring function or elsewhere.

In this context, a concerted effort is required to create a database of biological metabolites and their spectral and physicochemical properties for Systems Biology. Such a database will, for example, serve as a source of dereplication² data as well as of training data for our spectrum prediction engines. In this data warehouse, data from diverse sources (various analytical and spectroscopic instruments, NMR, MS, LC, GC) will need to be integrated and combined with existing knowledge from Systems Biology databases. Two past and current projects will help us to establish, or contribute to, such a repository: our open access, open submission, open source database NMRShiftDB for organic structures and their NMR data [17,15], and our current efforts to create a standard markup language, CMLSpect, for the representation of spectroscopic data within the framework of the Chemical Markup Language (CML), developed together with Cambridge University, UK [22].
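
Dereplication against such a repository can be sketched as a tolerance-based peak lookup. The snippet below is a minimal illustration under stated assumptions: the database layout (compound name mapped to a peak list), the compound names, the peak values, the tolerance and the threshold are all hypothetical and do not reflect the NMRShiftDB schema or its search algorithm.

```python
def match_fraction(observed, reference, tolerance=0.5):
    """Fraction of reference peaks that have an observed peak within tolerance."""
    if not reference:
        return 0.0
    hits = sum(1 for r in reference
               if any(abs(r - o) <= tolerance for o in observed))
    return hits / len(reference)

def dereplicate(observed, database, tolerance=0.5, threshold=0.9):
    """Return (name, score) pairs for known compounds above threshold, best first."""
    ranked = [(name, match_fraction(observed, peaks, tolerance))
              for name, peaks in database.items()]
    hits = [entry for entry in ranked if entry[1] >= threshold]
    return sorted(hits, key=lambda entry: -entry[1])

# Hypothetical reference entries with made-up peak positions (ppm).
database = {
    "compound_A": [18.0, 58.0],
    "compound_B": [20.0, 130.0, 170.0],
}
print(dereplicate([18.2, 57.7], database))
```

The score used here only checks that every reference peak is found in the measurement; a production system would also penalize unexplained observed peaks and take intensities or multiplicities into account.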

The Chemistry Development Kit (CDK)

Over the past six years, my group has founded and developed the Chemistry Development Kit (CDK), now the leading open source library for structural chemo- and bioinformatics [20,14]. Today, the toolkit is developed by us and more than 20 contributors in academia and industry worldwide. The CDK covers a wide range of functionality needed for virtual compound screening, property prediction and many other tasks of molecular informatics. In addition to its virtues for developing open systems in structural bioinformatics, its value for teaching cannot be overestimated. With 90,000 non-commenting source statements (NCSS) in over 9,000 methods in 900 classes, the CDK provides a basis for studying hands-on examples of the standard algorithms used for handling and modifying molecular structures and for calculating their properties, written in a modern object-oriented language and using commonly accepted design patterns.

The amount of novel basic research required to develop the CDK goes well beyond what one would expect from what looks like a pure infrastructure project. Questions such as how to perceive aromaticity, how to fingerprint structures or how to define pharmacophore queries are often researched and published in the context of the CDK for the first time.
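
To make the fingerprinting question concrete, the following minimal sketch hashes all short atom paths of a molecular graph into a set of bit positions and compares two molecules by their Tanimoto coefficient. It is written in Python for brevity and is not the CDK code: the graph encoding (a list of element symbols plus a list of bonds), the CRC32 hash, the bit length and the neglect of bond orders are simplifying assumptions.

```python
import zlib

def paths(atoms, bonds, max_len=3):
    """Yield atom-symbol strings for all simple paths with up to max_len atoms."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def walk(path):
        yield "-".join(atoms[i] for i in path)
        if len(path) < max_len:
            for nxt in sorted(neighbours[path[-1]]):
                if nxt not in path:
                    yield from walk(path + [nxt])

    for start in range(len(atoms)):
        yield from walk([start])

def fingerprint(atoms, bonds, n_bits=1024, max_len=3):
    """Set of bit positions obtained by hashing every path string."""
    return {zlib.crc32(p.encode()) % n_bits for p in paths(atoms, bonds, max_len)}

def tanimoto(fp_a, fp_b):
    """Shared bits divided by the bits set in either fingerprint."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0

# Heavy-atom graphs of ethanol and 1-propanol (hydrogens omitted).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
print(tanimoto(fingerprint(*ethanol), fingerprint(*propanol)))
```

Because every path of a substructure also occurs in any structure that contains it, the bits of the smaller molecule are a subset of the bits of the larger one. This property is what makes such fingerprints usable as a fast pre-filter before an exact substructure search.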

The Bioclipse Workbench for Molecular Informatics.

Figure 1: Version 1.0rc1 of Bioclipse (http://www.bioclipse.net) showing a protein structure as well as some CDK-calculated descriptors.

In collaboration with Jarl Wikberg's group at the University of Uppsala, Sweden, we have founded the Bioclipse project (http://www.bioclipse.net) to build a plug-in based, rich client desktop workbench for Molecular Informatics [21]. Bioclipse won the JAX conference audience award for an important European contribution to the development of Eclipse in 2006. In November 2007, we will receive the special jury prize at the 4th edition of the Trophées du Libre.

While Uppsala is currently extending Bioclipse's functionality towards proteochemometrics, my group is working on plugins for spectrum handling and database editing, and on extending Bioclipse's Systems Biology capabilities. The integration of a Systems Biology Markup Language (SBML) editor and of metabolomics simulations will be the next step. Together with our workflow effort sketched in the next section, Bioclipse will be the first state-of-the-art, user-friendly open desktop application for performing Systems Biology simulations.

Automated Workflows

Figure 2: An example of a simple CDK/Taverna workflow for scanning large compound libraries for structures containing certain desired structural scaffolds

Many problems in molecular informatics call for flexible means to wire together existing technologies for providing new functionality. In a current PhD project, we therefore investigate using the open source workflow engine Taverna (http://taverna.sf.net) to integrate CDK functionality with other bioinformatics tools and Bioclipse.

CDK/Taverna workflows will allow a user to define and run a complex series of operations on large data sets, such as the growing volume of data being generated in modern automated laboratories. In addition to the flexibility of their setup, these workflows can be saved for future re-use and shared with a broad community of users across the web. Through the existing CDK interface to open statistical packages such as R, analyses can then be performed that consider numeric, text, categorical, binary and fingerprint data simultaneously.
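
The notion of a workflow that is defined once, stored and re-run on new data can be sketched independently of any particular engine. The example below is a hedged illustration and does not use Taverna's workflow format or the CDK API: the step registry, the step names, the JSON layout and the record fields are all hypothetical, and the scaffold check is reduced to a lookup in a precomputed fragment list.

```python
import json

# Hypothetical step registry; in a real setting each step would wrap a toolkit call.
STEPS = {}

def step(name):
    """Decorator that registers a processing step under a workflow-visible name."""
    def register(func):
        STEPS[name] = func
        return func
    return register

@step("has_fragment")
def has_fragment(records, fragment):
    """Keep records whose precomputed fragment list contains `fragment`."""
    return (r for r in records if fragment in r["fragments"])

@step("max_mass")
def max_mass(records, limit):
    """Keep records whose molecular mass does not exceed `limit`."""
    return (r for r in records if r["mass"] <= limit)

def run_workflow(definition, records):
    """Apply each step of a JSON workflow definition in order."""
    stream = iter(records)
    for entry in json.loads(definition):
        stream = STEPS[entry["step"]](stream, **entry["params"])
    return list(stream)

# A workflow is plain data, so it can be saved to a file and shared.
workflow = json.dumps([
    {"step": "has_fragment", "params": {"fragment": "indole"}},
    {"step": "max_mass", "params": {"limit": 300.0}},
])
```

Because the steps are chained as generators, records are streamed through the pipeline one at a time, which matters when the compound library does not fit into memory.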

In addition to our CDK/Taverna integration, we are involved in a project with Michael Berthold's group at the University of Konstanz, which created the Konstanz Information Miner (KNIME), where the CDK is the only freely available set of chemistry-enabled nodes.

Future Directions at EBI


Research Group

All of the research projects described above will be continued at EBI and constitute a solid base for the rapid formation of a strong research group.

In addition, a plethora of fascinating data-analysis projects can be envisioned to increase our knowledge about biological systems. We are interested in questions such as: Given the genetic and biochemical equipment of an organism, what kind of metabolism is it capable of? We are further interested in research on the semantic representation of molecular data and in how machine reasoning based on molecular ontologies might be used to discover new knowledge.
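
One simple way to phrase the question of metabolic capability computationally is as a reachability problem: starting from a set of available nutrients, repeatedly apply every reaction whose substrates are all present until no new metabolite appears. The sketch below illustrates this under stated assumptions; it is not a method prescribed by the projects above, and the reaction list with its single-letter metabolite names is hypothetical.

```python
def reachable_metabolites(reactions, seeds):
    """Iteratively fire every reaction whose substrates are all available."""
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available

# Hypothetical reaction set, e.g. derived from the enzymes annotated in a genome.
reactions = [
    (("A", "B"), ("C",)),   # A + B -> C
    (("C",), ("D",)),       # C -> D
    (("E",), ("F",)),       # E -> F, never fires without E
]
print(sorted(reachable_metabolites(reactions, {"A", "B"})))
```

The approach ignores stoichiometry, kinetics and reversibility, so it gives an upper bound on what the network can produce rather than a simulation of its behaviour.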

To alleviate the abysmal lack of chemical data in some crucial areas of Systems Biology, we are interested in text, or better, publication mining techniques. In an ongoing collaboration with the Center for Molecular Informatics at Cambridge University, we will aim at creating an automated workflow for the extraction of molecular structures and data from the printed literature – past and present.

¹ Numbers in square brackets refer to my publication list.

² The fast identification of known metabolites based on their spectroscopic fingerprint.