
Reconstruction Of Gene Networks By Supervised Learning

A short presentation by Lev A. Soinov.

Introduction

Microarray experiments are generating datasets that can help in reconstructing gene networks. One of the most important problems in network reconstruction is finding, for each gene in the network, which genes can affect it and how. We use a supervised learning approach to address this question by building decision-tree-related classifiers, which predict gene expression from the expression data of other genes.

We present algorithms that work for continuous expression levels and do not require a priori discretization. The obtained classifiers can be presented as simple rules defining gene interrelations. In most cases the extracted rules confirm the existing knowledge, while hitherto unknown relationships can be treated as new hypotheses.

Towards reconstruction of gene networks from expression data by supervised learning.
L.A. Soinov§, M.A. Krestyaninova and A. Brazma. Genome Biology, 2003, 4(1): R6.
§ corresponding author

Towards gene network reconstruction, step by step:

  • Metabolic networks
  • Protein networks
  • Gene networks represent relationships between genes, based on observations of how the expression level of each gene affects the expression levels of the others

  

Unlike Boolean models, we do not discretize the data a priori; instead, we use the actual measurements for our predictions:

  • Given a gene g, we predict its state from expression measurements of other genes
  • The gene g is called the predicted gene
  • The genes with which we make the prediction are called the explaining genes
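The setup above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `expression` dictionary, its values, and the helper name `make_examples` are hypothetical, and labelling the predicted gene as 'up' when it lies above its own mean across samples is an assumption made here to obtain class labels. The explaining genes keep their continuous values.

```python
from statistics import mean

def make_examples(expression, predicted_gene, lag=0):
    """Build (features, label) pairs for one predicted gene.

    expression: dict mapping gene name -> list of measurements, one per sample.
    lag: 0 pairs each label with the same sample; 1 pairs it with the
         explaining genes' values at the previous time point.
    """
    target = expression[predicted_gene]
    baseline = mean(target)  # assumed labelling rule: compare with the gene's own mean
    explaining = sorted(g for g in expression if g != predicted_gene)
    examples = []
    for t in range(lag, len(target)):
        # Explaining genes stay continuous; only the predicted gene gets a label.
        features = {g: expression[g][t - lag] for g in explaining}
        label = "up" if target[t] > baseline else "down"
        examples.append((features, label))
    return examples

# Hypothetical toy values, not taken from any real dataset.
expression = {
    "CLN1": [0.9, 0.4, -0.6, -0.8],
    "CLN2": [1.1, 0.3, -0.5, -0.9],
    "CLB2": [-0.7, -0.2, 0.8, 1.0],
}
examples = make_examples(expression, "CLB2")
```

Each data sample becomes one training example; with `lag=1` the first sample has no predecessor, so one example fewer is produced.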

  

  • A supervised learning approach allows the genes that affect the target gene to be identified directly from the classifier
  • No arbitrary discretization thresholds are assumed
  • Each data sample is treated as an example
  • Classifiers given in the form of decision trees/tables/rules are easy to interpret
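To illustrate how a threshold can be learned from the data rather than fixed in advance, here is a sketch of a one-level decision tree (a "decision stump") on a single explaining gene. It is a simplified stand-in for the decision-tree-related classifiers mentioned above, not the algorithm used in the paper; the function name `best_threshold` and the toy values are hypothetical.

```python
def best_threshold(values, labels):
    """Find the single cut on a continuous feature that best separates labels.

    Returns (threshold, label_if_above, accuracy), or None if all values
    are identical. Candidate cuts are the midpoints between consecutive
    distinct sorted values.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for cut in candidates:
        for above in ("up", "down"):
            below = "down" if above == "up" else "up"
            correct = sum(
                1 for v, y in pairs if (above if v > cut else below) == y
            )
            accuracy = correct / len(pairs)
            if best is None or accuracy > best[2]:
                best = (cut, above, accuracy)
    return best

# Hypothetical measurements of one explaining gene and the states
# of the predicted gene in the same four samples.
values = [1.1, 0.3, -0.5, -0.9]
labels = ["down", "down", "up", "up"]
cut, label_above, accuracy = best_threshold(values, labels)
```

The learned cut can be read directly as a rule: in this toy example the predicted gene is 'down' whenever the explaining gene is above the cut and 'up' otherwise.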

  

To test our approach we chose a small group of yeast genes: the cyclin genes CLN1-3 and CLB1-6, together with CDC28, MBP1, CDC53, CDC34, SKP1, SWI4-6, HCT1, CDC20, SIC1 and MCM1. These genes are involved in cell-cycle regulation and their interactions are well described. The same set of genes (with the addition of BCK2 and the exclusion of CLB3 and CLB4) was used by Chen et al. [Mol Biol Cell 2000], who presented a mathematical model of the cell-cycle events. Choosing such genes made it possible to compare our results with existing knowledge.

We used the cdc15, cdc28 and alpha-factor microarray datasets from Spellman et al. [Mol Biol Cell 1998] and Cho et al. [Mol Cell 1998], obtained for S. cerevisiae cell cultures, each synchronised by a different method. We chose the cdc15 experiment as the training dataset because it has the largest number of data points and therefore provides the largest number of instances. The accuracy of the classifiers built on the cdc15 training set was estimated in three different ways: by 10-fold stratified cross-validation, and by using the cdc28 and alpha-factor datasets as test sets.

To make verification of the classification results through literature searches more straightforward, we introduce a representation of classifiers in the form of simple rules. The following language is used for these rules: '+A' means that gene A is 'upregulated'; '-A' means that gene A is 'downregulated'; '⇔' is used for simultaneous events; and '⇒' is used to distinguish between events that are separated in time.
For instance:

  • +A+B ⇔ -C means that C is 'downregulated' when A and B are 'upregulated'
  • +B ⇒ +A means that A is 'upregulated' if B was 'upregulated' (for example, at the previous time point of a time series)
  • ↑A↓B ⇔ ↓C means that a positive change in the expression level of A, together with a simultaneous negative change in the expression of B, coincides with a negative change in the expression of C
  • ↑B ⇒ ↓A means that a positive change in the expression level of B precedes a negative change in the expression of A

This representation decomposes decision trees of complex structure into simple, compact relations, each of which can be compared independently with existing knowledge. Only classifiers that had high accuracy by all three estimates were selected for constructing “final” decision rules (see the left table below).

The three datasets selected for our experiments do not contain all possible information about gene interactions, and information about some interactions is probably not present in all of them. Our classifier selection procedure is therefore rather conservative, and not all rules that are present in the data were extracted. We accept this in order to minimize the chance of extracting 'strong' but misleading dependencies, that is, to avoid false positives. Combining our approach with follow-up validation against other experimental data could help to confirm the “questionable” rules (see the right table below). These rules have clear biological explanations in the literature, but they failed one or two of the accuracy tests.
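The rule notation and the conservative selection step can be sketched as follows. Everything here is illustrative: the helper names `format_rule` and `split_rules`, the accuracy figures, and the cut-off `min_accuracy=0.75` are hypothetical, since the source does not state what counts as "high accuracy". The sketch also ignores the literature support that the "questionable" rules have in the text.

```python
def format_rule(conditions, outcome, delayed=False):
    """Render a rule in the +A/-A notation.

    conditions, outcome: lists of (gene, state) with state "up" or "down".
    delayed: False joins the sides with '⇔' (simultaneous events),
             True joins them with '⇒' (conditions precede the outcome).
    """
    sign = {"up": "+", "down": "-"}
    left = "".join(sign[s] + g for g, s in conditions)
    right = "".join(sign[s] + g for g, s in outcome)
    return f"{left} {'⇒' if delayed else '⇔'} {right}"

def split_rules(candidates, min_accuracy):
    """Sort rules by how many of their accuracy estimates reach the cut-off."""
    final, questionable, rejected = [], [], []
    for rule, accuracies in candidates:
        passed = sum(1 for a in accuracies.values() if a >= min_accuracy)
        if passed == len(accuracies):
            final.append(rule)         # high accuracy by every estimate
        elif passed > 0:
            questionable.append(rule)  # failed some, but not all, estimates
        else:
            rejected.append(rule)
    return final, questionable, rejected

# Hypothetical accuracy estimates for two placeholder rules.
candidates = [
    (format_rule([("A", "up"), ("B", "up")], [("C", "down")]),
     {"cross-validation": 0.90, "cdc28": 0.85, "alpha-factor": 0.80}),
    (format_rule([("B", "up")], [("A", "up")], delayed=True),
     {"cross-validation": 0.90, "cdc28": 0.60, "alpha-factor": 0.80}),
]
final, questionable, rejected = split_rules(candidates, min_accuracy=0.75)
```

Requiring every estimate to pass keeps the "final" list short, which mirrors the trade-off described above: fewer extracted rules in exchange for fewer false positives.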

  

Connecting genes according to the extracted rules gives the network presented below (see the right picture), which is simply a graphical representation of the dependencies between gene-expression levels contained in the extracted decision rules. Every node in this graph represents a gene, and every edge indicates the relation between genes defined by the corresponding decision rule. Of course, G2 and G4 may not be direct regulators of G3 (see the left picture below), but there is evidence in the data that their transcription levels are connected, and the decision tree describes the exact form of that connection. We do not discretize the microarray data a priori, which would add noise, and we do not base our predictions on the unjustified assumption that regulatory genes can be in exactly two, three, or some other fixed number of states. Instead, we obtain for each gene in the network a specific set of thresholds sufficient for switching it on or off. An advantage of network reconstruction using this approach is that, given accurate classifiers, one can reproduce the architecture and logic of the network consistently with the data. Moreover, the classifiers can easily be improved by adding new expression profiles (classification examples) to the dataset.
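Assembling the graph from the rules is mechanical, as the following sketch suggests. The function name `build_network` and the second rule (involving a gene called G1) are hypothetical; only the G2, G4 and G3 relation is taken from the text. An edge here records a dependency between expression levels, not necessarily direct regulation.

```python
def build_network(rules):
    """Turn decision rules into a directed graph.

    rules: list of (explaining_genes, predicted_gene) pairs.
    Returns a dict mapping each gene to the set of genes it points to.
    """
    graph = {}
    for explaining_genes, predicted_gene in rules:
        # Make sure the predicted gene appears as a node even without outgoing edges.
        graph.setdefault(predicted_gene, set())
        for gene in explaining_genes:
            graph.setdefault(gene, set()).add(predicted_gene)
    return graph

rules = [
    (["G2", "G4"], "G3"),  # G2 and G4 together explain G3
    (["G3"], "G1"),        # hypothetical second rule
]
network = build_network(rules)
```

Adding a newly extracted rule only adds nodes and edges, so the network can grow as further expression profiles improve the classifiers.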

  

Although here we apply our approach to a relatively small subset of genes, it seems likely that it can be applied to larger gene sets. Time-course data are not the only type of data to which our approach is applicable. It is possible to explore various cases where potential dependencies between different experimental samples might occur.

Anyone interested in applications of supervised learning to microarray data analysis is referred to our online paper [http://genomebiology.com/2003/4/1/R6].
