Cheminformatics and Metabolism Team - Research
Introduction
Having started with compound-by-compound isolation and manual structure elucidation, my group at the Max-Planck-Institute of Chemical Ecology in Jena developed both high-throughput methods for the automated structure elucidation of unknown metabolites and data-warehouse approaches for storing and managing dereplication data for known compounds. At Cologne University Bioinformatics Center (CUBIC), we became interested in the simulation of metabolic networks, an interest arising from our increasing knowledge of the structural identity and function of metabolites in organisms. All of our software developments have been published as open-source software. At least two of these projects have attracted significant international attention, leading to interesting collaborations worldwide. The following projects are those of interest and relevance for my application at the EBI. For a complete itemized overview of our work, please refer to my publication list.
Automated structure elucidation of metabolites to alleviate the lack of data in Systems Biology.
The understanding and simulation of metabolic networks is currently hindered by a significant lack of information on the structural identity and physical properties of biochemical metabolites in the organisms under investigation. Methods developed by our group provide means to quickly determine the structure of metabolites by stochastic screening of large candidate spaces based on spectroscopic methods [12,16,7,4]1. Our so-called SENECA system is based on a stochastic structure generator which is guided by a spectroscopy-based scoring function. In order to perform this scoring, we need precise and fast methods for the prediction of mass and NMR spectra. Here, we employ machine learning methods such as support vector machines to correlate graph-based molecular descriptors with database knowledge [26,25].
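The idea of a stochastic generator guided by a spectrum-based scoring function can be illustrated with a minimal sketch. This is a generic simulated-annealing loop over toy data, not the SENECA implementation: the fragment table `TOY_SHIFTS`, its shift values, and the functions `predict_spectrum`, `score`, `mutate` and `anneal` are all hypothetical names introduced here for illustration. A lookup table stands in for a trained spectrum predictor.

```python
import math
import random

# Hypothetical fragment -> predicted 13C shift (ppm) table. The values are
# illustrative only and stand in for a trained spectrum prediction engine.
TOY_SHIFTS = {"CH3": 15.0, "CH2": 30.0, "CH-O": 70.0, "C=C": 125.0, "C=O": 175.0}


def predict_spectrum(candidate):
    """Return the sorted toy shifts predicted for a candidate (a list of fragments)."""
    return sorted(TOY_SHIFTS[f] for f in candidate)


def score(candidate, observed):
    """Higher is better: negative summed deviation between predicted and observed peaks."""
    predicted = predict_spectrum(candidate)
    return -sum(abs(p - o) for p, o in zip(predicted, sorted(observed)))


def mutate(candidate, rng):
    """Propose a neighbouring candidate by replacing one randomly chosen fragment."""
    neighbour = list(candidate)
    neighbour[rng.randrange(len(neighbour))] = rng.choice(list(TOY_SHIFTS))
    return neighbour


def anneal(observed, steps=5000, t_start=50.0, t_end=0.1, seed=0):
    """Search the candidate space for the fragment set that best explains `observed`."""
    rng = random.Random(seed)
    fragments = list(TOY_SHIFTS)
    current = [rng.choice(fragments) for _ in observed]
    current_score = score(current, observed)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        proposal = mutate(current, rng)
        proposal_score = score(proposal, observed)
        delta = proposal_score - current_score
        # Metropolis criterion: always accept improvements, sometimes accept worse moves.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, proposal_score
            if current_score > best_score:
                best, best_score = current, current_score
    return sorted(best), best_score


if __name__ == "__main__":
    structure, fit = anneal([15.0, 30.0, 30.0, 175.0])
    print(structure, fit)
```

In a real system the candidates would be complete molecular graphs and the scoring function would call a spectrum predictor; the sketch only shows how the score steers a random walk through candidate space.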
The resulting prediction engines are then used as judges in our SENECA scoring function or elsewhere. In this context, a concerted effort is required to create a database of biological metabolites and their spectral and physicochemical properties for Systems Biology. It will serve, for example, as a source of dereplication 2 data as well as of training data for our spectrum prediction engines. In this data warehouse, data from diverse sources (various analytical and spectroscopic instruments, NMR, MS, LC, GC) will need to be integrated and combined with existing knowledge from systems biology databases. Two past and current projects will aid us in instantiating, or contributing to the development of, such a repository: our open-access, open-submission, open-source database NMRShiftDB for organic structures and their NMR data [17,15], and our current efforts to create a standard markup language, CMLSpect, for the representation of spectroscopic data in the framework of the Chemical Markup Language CML, developed together with Cambridge University, UK [22].
The Chemistry Development Kit (CDK)
In the past six years my group founded and developed the Chemistry Development Kit (CDK), now the leading open-source library for structural chemo- and bioinformatics [20,14]. Today, the toolkit is developed by us and more than 20 contributors in academia and industry worldwide. The CDK covers a wide range of functionality needed for performing virtual compound screening, property prediction and many other tasks of molecular informatics. In addition to its virtues for developing open systems in structural bioinformatics, its value for teaching cannot be overestimated.
With 90,000 non-commenting code statements (NCSS) in over 9000 methods in 900 classes, the CDK provides a basis for studying hands-on examples of the standard algorithms used in handling and modifying molecular structures and in calculating their properties, written in a modern object-oriented language using commonly accepted design patterns. The amount of novel basic research performed while developing the CDK goes significantly beyond what might be expected from what looks like a pure infrastructure project. Questions of how to perceive aromaticity, perform fingerprinting of structures or define pharmacophore queries are often researched and published in the context of the CDK for the first time.
The Bioclipse Workbench for Molecular Informatics.
Automated Workflows
CDK/Taverna workflows will allow a user to define and run a complex series of operations on large data sets, such as the growing volume of data being generated in modern automated laboratories. In addition to the flexibility in their setup, these workflows can be saved for future re-use and shared with a broad community of users across the web. Through the existing CDK interface with open statistical packages such as R, analyses can then be performed which consider numeric, text, categorical, binary and fingerprint data simultaneously. In addition to our CDK/Taverna integration, we are involved in a project with Michael Berthold's group at the University of Konstanz, which created the Konstanz Information Miner (KNIME), where the CDK is the only freely available set of chemistry-enabled nodes.
Future Directions at EBI