Small molecule metabolism in biological systems
The understanding and simulation of metabolic networks is currently hindered by a significant lack of information on the structural identity and physical properties of biochemical metabolites in the organisms under investigation. Developing computational methods and tools to decipher this missing information forms the heart of our research. The Steinbeck group's research is dedicated to the elucidation of metabolomes, Computer-Assisted Structure Elucidation (CASE), the reconstruction of metabolic networks and algorithm development in chem- and bioinformatics.
Generating and Curating High-Quality Metabolic Models using Chemical Structure
Creating a high-quality genome-scale metabolic model reconstruction requires meticulous manual curation and can take several years to finish. Consequently, many automated pipelines and curation tools have emerged to assist in the process. Despite the tools available, there remains a disconnect: automatically produced draft reconstructions still require extensive curation. We have developed a flexible desktop application (Metingear) and library (MDK) that support the development of new and existing models utilising the chemical structure of metabolites. The chemical structure can be used for unambiguous metabolite identification, which is important when comparing and merging existing models.
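The idea of identifying metabolites by structure rather than by name when merging models can be sketched as follows. This is a minimal illustration only and not the Metingear or MDK API: the `Metabolite` class, the `merge_by_structure` function and the placeholder structure keys are all hypothetical, and a real pipeline would derive the key from the chemical structure itself (for example a canonical identifier such as an InChIKey).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metabolite:
    name: str
    # Hypothetical canonical structure identifier (placeholder strings here).
    structure_key: str


def merge_by_structure(model_a, model_b):
    """Merge two metabolite lists; entries sharing a structure key are one compound.

    Returns a dict mapping each structure key to the list of names seen for it.
    """
    merged = {}
    for met in list(model_a) + list(model_b):
        names = merged.setdefault(met.structure_key, [])
        if met.name not in names:
            names.append(met.name)
    return merged


# Two toy models that name the same compound differently.
model_a = [Metabolite("D-glucose", "KEY-GLC"), Metabolite("pyruvate", "KEY-PYR")]
model_b = [Metabolite("glc-D", "KEY-GLC"), Metabolite("L-lactate", "KEY-LAC")]

merged = merge_by_structure(model_a, model_b)
for key, names in merged.items():
    print(key, names)
```

Matching on names alone would count four metabolites here; matching on the structure key recognises "D-glucose" and "glc-D" as the same compound and yields three.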
Structure Identification in Mass Spectrometry-based Metabolomics
Part of our research comprises the development and implementation of methods to analyze tandem mass spectrometry (MSn) data in metabolomics. In recent years, tandem and accurate-mass MS have become the techniques of choice to study the metabolome, with various instruments and methods available to cover the whole metabolome landscape. However, the chemical diversity of the metabolome and a lack of accepted reporting standards make the analysis inherently challenging and time-consuming. Typical mass spectrometry-based studies generate complex data in which the signals of interest are obscured by systematic and random noise. Proper data preprocessing and subsequent peak detection and extraction are essential for compound identification.
We work on a modular workflow-based MS data analysis system to facilitate efficient compound identification and to further open standards and the free exchange of data and methods in the field of metabolomics. Choosing a non-commercial, open-source workflow environment keeps the system freely accessible and gives the data analyst the advantage of having a variety of analytical pre- and post-processing methods available from the community. Ongoing efforts include the development of the MassCascade library and the KNIME plugin, adoption of open standards from the Metabolomics Standards Initiative, and implementation of robust methods for peak identification that go beyond simple mass and spectral similarity queries.
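As a rough illustration of peak detection in the presence of noise, the sketch below flags local maxima that rise above a robust noise threshold. It is a generic textbook-style approach and not the MassCascade implementation; the function name `detect_peaks`, the factor `k` and the toy intensity trace are assumptions made for this example.

```python
from statistics import median


def detect_peaks(intensities, k=5.0):
    """Return indices of local maxima above median + k * MAD.

    The median and the median absolute deviation (MAD) serve as robust
    estimates of the baseline and the noise level.
    """
    baseline = median(intensities)
    mad = median([abs(x - baseline) for x in intensities])
    threshold = baseline + k * mad
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        # A peak must exceed the threshold and be a local maximum.
        if y > threshold and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append(i)
    return peaks


# Toy trace: low-level noise with two clear signals.
signal = [1, 2, 1, 2, 1, 50, 120, 40, 2, 1, 2, 1, 80, 3, 1, 2]
peaks = detect_peaks(signal)
print(peaks)
```

For this trace the baseline is 2 and the MAD is 1, so the threshold is 7 and the detected apexes are at indices 6 and 12. Real data would additionally need smoothing, baseline correction and handling of overlapping peaks.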
Species-specific metabolome inference
In metabolomics and lipidomics experiments, only a minor fraction of the large number of small molecules detected can be identified. One of the reasons for this is the lack of adequate databases that contain extensive metabolome data (small molecules, reactions, association to enzymes, biological containers, etc.) in a species-specific way. To address this problem, we explore different ways of producing such resources: chemical unification of existing metabolism databases; text mining of small molecules, proteins, tissues/cell types, and organisms; and chemical enumeration of generic reactions and lipids. Through these approaches we provide species-specific molecule catalogues that aim to improve the chances of metabolomics researchers identifying the small molecules they detect.
Metabolism data integration, through a novel merge method, shows that merging metabolism resources significantly increases the size of the metabolite catalogue. This is complemented by a text mining pipeline that analyzes PubMed abstracts and produces several thousand additional metabolites and relations between tissues and small molecules. Results retrieved only through text mining are biased towards exogenous small molecules.
When generic reactions from the previous sets are enumerated, the number of small molecules generated grows exponentially and only a few paths lead to known metabolites. To narrow down the results, we explore methods that rely on thermodynamic feasibility, catalogue lookups, and reaction similarity.
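The enumeration-then-filter idea can be sketched with a deliberately tiny toy. Molecules are reduced to fatty acyl chains written as (carbons, double bonds), and the two generic rules, the desaturation constraint and the catalogue are all hypothetical choices for illustration; they are not the group's reaction sets or filters. Only the catalogue lookup is shown, not thermodynamic feasibility or reaction similarity.

```python
def elongate(chain):
    """Generic rule: add a two-carbon unit."""
    carbons, double_bonds = chain
    return (carbons + 2, double_bonds)


def desaturate(chain):
    """Generic rule: add one double bond (toy limit: one per three carbons)."""
    carbons, double_bonds = chain
    if (double_bonds + 1) * 3 <= carbons:
        return (carbons, double_bonds + 1)
    return None  # rule not applicable


def enumerate_products(seeds, rules, depth):
    """Breadth-first application of generic rules for a fixed number of rounds."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        next_frontier = set()
        for chain in frontier:
            for rule in rules:
                product = rule(chain)
                if product is not None and product not in seen:
                    next_frontier.add(product)
        seen |= next_frontier
        frontier = next_frontier
    return seen


# Hypothetical catalogue of chains already known for the species.
catalogue = {(16, 0), (18, 0), (18, 1), (18, 2)}

generated = enumerate_products({(16, 0)}, [elongate, desaturate], depth=2)
known = generated & catalogue
print(len(generated), sorted(known))
```

Even in this toy, two rounds from a single seed give six candidates, of which three are in the catalogue. With realistic rule sets and full structures the candidate set grows far faster, which is why filtering is needed.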
Polyketide structure prediction
Polyketides are complex small molecules, mostly of high molecular weight, produced mainly by secondary metabolism in bacteria and fungi. Huge modular enzymes called polyketide synthases (PKS) assemble polyketides through a series of elongation steps in which malonyl-CoA derivatives are added (but only a C2 unit is incorporated, due to decarboxylation), in a way similar to fatty acid synthesis. Well-known examples of polyketides are erythromycin and the tetracyclines. Polyketides have found applications as antibiotics, anti-tumour agents, antifungals, insecticides and growth factors, among others.
Working on trans-AT polyketides, in close collaboration with Prof. Piel at the University of Bonn, we have implemented an algorithm for the recognition of fine-grained ketosynthase domain variants. It allows structural hypotheses for a polyketide to be produced starting from the sequence of the polyketide synthase that assembles it.
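The carbon bookkeeping of the elongation steps can be written down as a small sketch. The function `chain_carbons` and the constants are names chosen for this example; the calculation only counts carbons and ignores reduction, cyclisation and tailoring steps. Each extender unit loses one carbon as CO2, so a malonyl unit (3 carbons) contributes 2, and a methylmalonyl unit (4 carbons) contributes 2 backbone carbons plus a methyl branch.

```python
# Carbon counts of the acyl groups (without the CoA carrier).
ACETYL = 2
PROPIONYL = 3
MALONYL = 3
METHYLMALONYL = 4


def chain_carbons(starter_carbons, extender_carbons):
    """Count carbons in the assembled chain.

    Every extender loses one carbon as CO2 during condensation.
    """
    total = starter_carbons
    for carbons in extender_carbons:
        total += carbons - 1
    return total


# Fatty acid analogy: acetyl starter plus seven malonyl units gives a C16 chain.
c16 = chain_carbons(ACETYL, [MALONYL] * 7)

# Erythromycin aglycone precursor: propionyl starter plus six methylmalonyl units.
c21 = chain_carbons(PROPIONYL, [METHYLMALONYL] * 6)

print(c16, c21)
```

The second case matches the 21 carbons of 6-deoxyerythronolide B, the macrolide core from which erythromycin is derived.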
Computer-Assisted Structure Elucidation (CASE)
Computer-Assisted Structure Elucidation (CASE) methods developed by our group provide a means to determine the structure of metabolites by stochastic screening of large candidate spaces based on spectroscopic methods. Our SENECA system is based on a stochastic structure generator, which is guided by a spectroscopy-based scoring function. Simulated Annealing and Evolutionary Algorithms are at the core of the structure generation process and allow the structural space of isomers to be explored.
In order to perform the scoring, we need precise and fast methods for the prediction of mass and NMR spectra. Here, we employ machine learning methods such as support vector machines to correlate graph-based molecular descriptors with database knowledge. The resulting prediction engines are then used as judges in our SENECA scoring function or elsewhere. Further, to effectively narrow down our search space during the structure determination process, we employ Natural Product (NP) likeness as a filter (for more details, see below). The stochastic CASE tool is valuable in the structure elucidation of newly isolated Natural Products or unknown metabolites, information that is crucial for further metabolomic approaches.
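The interplay of a stochastic generator and a scoring function can be illustrated with a generic simulated annealing loop. This is a sketch under strong simplifying assumptions and not the SENECA code: candidates are plain bit tuples standing in for structures, `score` is a hypothetical mismatch count standing in for a spectroscopy-based scoring function, and `mutate` flips one position instead of rearranging bonds. The cooling schedule and parameter values are likewise arbitrary.

```python
import math
import random


def score(candidate, observed):
    """Hypothetical scoring function: number of mismatching positions (lower is better)."""
    return sum(1 for a, b in zip(candidate, observed) if a != b)


def mutate(candidate, rng):
    """Stochastic move: flip one randomly chosen position."""
    i = rng.randrange(len(candidate))
    changed = list(candidate)
    changed[i] = 1 - changed[i]
    return tuple(changed)


def anneal(start, observed, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Simulated annealing with geometric cooling; returns the best candidate seen."""
    rng = random.Random(seed)
    current, current_score = start, score(start, observed)
    best, best_score = current, current_score
    for step in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = mutate(current, rng)
        candidate_score = score(candidate, observed)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


observed = (1, 0, 1, 1, 0, 0, 1, 0)
start = (0,) * len(observed)
best, best_score = anneal(start, observed)
print(best, best_score)
```

Accepting occasional worse candidates at high temperature is what lets the search escape local optima; as the temperature falls, the loop behaves increasingly like a greedy search around the best-scoring region.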