
Microarray Databases


ArrayExpress

Microarrays are already producing massive amounts of data. These data, like genome sequence data, can help us gain insight into the underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analysed by different software programs. The EBI is currently establishing ArrayExpress, a public repository for microarray gene expression data, analogous to EMBL-bank for DNA sequence data. In many respects gene expression databases are inherently more complex than sequence databases (this does not mean that developing, maintaining and curating the sequence databases is any less challenging).

Conceptually, a gene expression database can be regarded as consisting of three parts: the gene expression data matrix, the gene annotation and the sample annotation (see picture below).
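The three-part structure described above can be sketched in a few lines of Python. This is only an illustration and not the ArrayExpress schema: the class name ExpressionDataset, its field names and all values are invented for the example. It assumes that rows of the matrix correspond to genes and columns to samples.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ExpressionDataset:
    """Hypothetical container for the three conceptual parts (not a real schema)."""
    matrix: List[List[float]]                # matrix[i][j]: level of gene i in sample j
    gene_annotation: List[Dict[str, str]]    # one record per matrix row
    sample_annotation: List[Dict[str, str]]  # one record per matrix column

    def __post_init__(self) -> None:
        # The matrix is only interpretable if every row and every column is annotated.
        if len(self.matrix) != len(self.gene_annotation):
            raise ValueError("one gene annotation record is needed per row")
        for row in self.matrix:
            if len(row) != len(self.sample_annotation):
                raise ValueError("one sample annotation record is needed per column")

    def profile(self, gene_index: int) -> List[Tuple[str, float]]:
        """Return the expression profile of one gene, labelled by sample name."""
        names = [sample["name"] for sample in self.sample_annotation]
        return list(zip(names, self.matrix[gene_index]))


# Invented values, for illustration only.
dataset = ExpressionDataset(
    matrix=[[1.0, 2.5], [0.3, 0.2]],
    gene_annotation=[{"name": "geneA"}, {"name": "geneB"}],
    sample_annotation=[
        {"name": "control", "cell_type": "hypothetical cell line"},
        {"name": "treated", "cell_type": "hypothetical cell line"},
    ],
)
print(dataset.profile(0))
```

Keeping the two annotation lists aligned with the matrix dimensions is the design point: a number in the matrix has no meaning unless both its row and its column can be looked up.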



Gene expression data have meaning only in the context of the particular biological sample and the exact conditions under which the sample was taken. For instance, if we are interested in finding out how different cell types react to treatment with various chemical compounds, we must record unambiguous information about the cell types and compounds used in the experiments. The EBI is participating in an effort to develop ontologies for sample annotation, analogous to the gene ontology used for gene description.

Gene annotation can be handled to some extent by links to sequence databases. Unfortunately, the relationships between genes in the gene expression matrix and the features (spots) on the array are complicated and many-to-many: one gene can relate to several features on the array, for example. It is therefore necessary to provide a full and detailed description of each feature on the array. The lack of standards in gene naming is another difficulty. A table relating each array feature present in the database to the list of all synonymous names of the respective gene is an essential part of a gene expression database.
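A minimal sketch of such a feature-to-gene mapping with a synonym table follows. All identifiers (spot_001, geneA, GA1 and so on) and the function names are hypothetical and chosen only to show the many-to-many relationship; a real database would hold these as curated tables.

```python
from collections import defaultdict

# Hypothetical (feature, gene) pairs. geneA is measured by two features,
# and spot_003 is associated with two genes, so the relation is many-to-many.
FEATURE_GENE = [
    ("spot_001", "geneA"),
    ("spot_002", "geneA"),
    ("spot_003", "geneB"),
    ("spot_003", "geneC"),
]

# Hypothetical synonym table: canonical gene id -> all names in use.
SYNONYMS = {
    "geneA": ["geneA", "GA1", "alpha"],
    "geneB": ["geneB", "GB7"],
    "geneC": ["geneC"],
}


def build_indexes(pairs):
    """Index the pairs in both directions."""
    features_by_gene = defaultdict(set)
    genes_by_feature = defaultdict(set)
    for feature, gene in pairs:
        features_by_gene[gene].add(feature)
        genes_by_feature[feature].add(gene)
    return features_by_gene, genes_by_feature


def features_for_name(name, synonyms, features_by_gene):
    """Find all array features for a gene given any of its synonymous names."""
    wanted = name.lower()
    found = set()
    for gene, names in synonyms.items():
        if wanted in (n.lower() for n in names):
            found |= features_by_gene.get(gene, set())
    return sorted(found)


features_by_gene, genes_by_feature = build_indexes(FEATURE_GENE)
print(features_for_name("GA1", SYNONYMS, features_by_gene))
print(sorted(genes_by_feature["spot_003"]))
```

Looking up by synonym rather than by a single canonical name is what lets a query succeed regardless of which naming convention the user happens to know.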

Microarray technology is still developing rapidly, so it is natural that there are currently no established standards for microarray experiments or for how the raw data should be processed. There are also no standard measurement units for gene expression levels. In the absence of such standards, information about exactly how the gene expression data matrix was obtained must be kept in the database if the data are to be properly interpreted later.

ArrayExpress stores all of this information. The details to be recorded are specified as the Minimum Information About a Microarray Experiment (MIAME), defined by the Microarray Gene Expression Database (MGED) consortium. MGED is a grass-roots movement that was founded at a meeting at the EBI in 1999, is supported by most of the important players in the microarray community, and has evolved far beyond the EBI.

Another repository for gene expression data, GEO, is being developed at NCBI in the US. DDBJ in Japan also has plans for one. All three groups face similar problems and are involved in MGED to some degree. A common data exchange format, MAGE-ML, is being developed in collaboration between MGED (with active participation of the EBI) and some major microarray companies.


Gene expression data analysis and Expression Profiler

Capturing and storing microarray data is not an end in itself. The amount of data from even a single microarray experiment is so large that software tools have to be used to make any sense of it. Clustering and class prediction are typical methods currently used in gene expression data analysis (see Microarray Data Analysis). One of the popular gene expression data analysis tools is Expression Profiler, developed at the EBI. The Microarray Informatics Team at the EBI is actively working in many areas of microarray data analysis using this and other tools.

An example of such research is an approach to reverse engineering of gene regulatory networks. It is based on the hypothesis that genes with similar expression profiles (i.e., similar rows in the gene expression matrix) should also have similar regulation mechanisms, as there must be a reason why their expression is similar under a variety of conditions. Therefore, if we cluster the genes by similarity of their expression profiles and take the sets of promoter sequences from the genes in each cluster, some of these sets may contain a 'signal': a specific sequence pattern, such as a particular substring, that is relevant to the regulation of these genes (Vilo et al. 2000).
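The idea in the paragraph above can be illustrated with a toy sketch. It is not the method of Vilo et al. and is not how Expression Profiler works. It assumes Pearson correlation as the similarity measure, a simple single-linkage grouping with an arbitrary threshold of 0.9, and exact substrings of fixed length as the 'signal'. All profiles and promoter sequences are invented.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def cluster_by_correlation(profiles, threshold=0.9):
    """Single linkage: genes fall in one cluster if linked by correlations >= threshold."""
    parent = {gene: gene for gene in profiles}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in combinations(profiles, 2):
        if pearson(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for gene in profiles:
        clusters.setdefault(find(gene), []).append(gene)
    return [sorted(members) for members in clusters.values()]


def shared_substrings(sequences, k):
    """Return the substrings of length k that occur in every sequence."""
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in sequences]
    return set.intersection(*kmer_sets) if kmer_sets else set()


# Invented expression profiles (rows of the matrix) and promoter sequences.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 3],
}
promoters = {
    "g1": "TTACGTGAAC",
    "g2": "GGACGTGTTT",
    "g3": "AAAATTTT",
    "g4": "CCCCGGGG",
}

clusters = cluster_by_correlation(profiles)
for members in clusters:
    signal = shared_substrings([promoters[g] for g in members], k=5)
    print(members, sorted(signal))
```

A substring shared by every promoter in a cluster is only a candidate signal. In real data one would also ask whether it is more frequent in the cluster than in promoters of unrelated genes, which this sketch does not attempt.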






Reference:

This article has been contributed by Alvis Brazma, Helen Parkinson, Thomas Schlitt and Mohammadreza Shojatalab.


