Paradigm Shifts in the Approaches for Gene Annotation.
This special issue of "Briefings in Bioinformatics" reports on the proceedings of the recently concluded symposium on "Genome Based Gene Structure Determination", held at the EMBL European Bioinformatics Institute (EBI) on June 1-2, 2000. This symposium grew from a previous workshop, entitled "DNA Sequence Based Gene Prediction", held at the EBI in February 1999. While the previous workshop focussed largely on sequence based methods for gene identification in short to medium length DNA regions, this year's symposium addressed gene annotation in genomic sequences as large as complete chromosomes. The clear transition in gene annotation approaches to accommodate complete genomes may be seen in developments such as the creation of automated genome annotation pipelines and the emergence of approaches for comparative genome annotation. Individual gene structural elements are identified more reliably when they are considered in the context of one another, which has motivated the development of genome based approaches. The meeting was particularly timely considering that the announcement of the completion of the first human genome sequence draft is imminent and that more than 50 other eukaryotic genome projects are currently underway.
Introduction to the Scope of the Symposium
Annotation of genes is an integral part of every genome-sequencing project. It involves the determination of gene structures such as coding regions, promoters, and transcription regulatory elements. However, the annotation of each new genome is a challenging task since:
- Regions within a genome can differ in features such as gene density and GC content;
The symposium addressed the state of the art in the methodologies and informatics necessary to annotate large genomes.
Basic Methodologies for Identifying Gene Structural Elements
The major immediate interests of the genome projects are in the identification of protein coding regions. However, a complete description of gene structure necessitates identification of the associated sites which signal the different processes in the gene to protein pathway. Such sites include promoters, transcription start and end points, poly-adenylation sites, splice sites, and translation start and stop sites. In addition, regulatory regions form an important functional component of gene structure. Indeed, gene regulation may utilise alternatives in promoters, splice sites and translation start sites. Accurate identification of coding regions is aided by the identification of such sites, and vice versa3. Identification of regulatory sites is more accurate when they are viewed in the context of other surrounding elements. For example, identification of promoters can be aided by first modelling the organisation of promoters and transcription factor binding sites around a gene. Comparative sequence analysis can then help to refine these models, which may then be used to predict promoters in new genomic sequences4.
Different methods have been developed over the last decade to identify genes. The current methods for identification of coding regions in genes are of three types: signal based; content based, such as codon usage; and similarity based. Different statistical and mathematical techniques have been successfully used to identify the structural elements (exons, introns, promoters, splice sites, etc.) of genes. The methods for structural element prediction include decision tree approaches5, discriminant analysis6 and other statistical approaches7,8 such as hidden Markov models. Often dynamic programming techniques are used to arrive at an optimal gene assembly of the predicted exons. Combining ab initio gene finding methods with database matches and other experimental information clearly improves the performance and makes the algorithms more tolerant towards errors and uncertainties in sequences8,9. In a similar manner, combining results from different gene finding tools invariably leads to a higher accuracy of gene prediction10,11. Genome annotation projects routinely use more than one gene prediction program to identify ‘consistent’ exons. For example, gene annotation on human chromosome 21 was carried out using MZEF, GenScan, and Grail.
Comparison of syntenic regions from closely related species (e.g. mouse and human), or more generally genome sequence comparison, is emerging as a trend for accurately identifying genes7,12,13 (and other structural elements such as regulatory regions14), where putative genes are often indicated by evolutionarily conserved fragments in the syntenic regions. Methods based on genome comparisons for predicting regulatory regions in bacterial genomes have given better results than other methods14. The rationale behind these methods is that the gene order and functional regions are conserved in closely related species. Integration of such techniques with the ab initio methods gives better results through lower false positive rates, even when the ab initio methods use poor rules. Fast and efficient sequence comparison algorithms and tools that enable the analysis of large genomic sequences need to be developed, and the PipMaker tool is an interesting example of one such approach12.
Approaches utilising diverse sources of information, such as genomic sequence and gene expression data, to identify gene structural elements are emerging. Clustering of gene expression data, followed by identification of shared sequence patterns in the upstream sequences of the genes in each cluster, enables prediction of putative promoter sequences15. Such approaches will be very useful as systematic gene expression experiments are carried out.
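The dynamic programming step mentioned above, in which predicted exons are assembled into an optimal gene structure, can be illustrated with a minimal sketch. It is not the algorithm of any particular gene finder: it assumes each candidate exon already carries a numerical score, and it only enforces that the chosen exons do not overlap, ignoring reading frame and splice-site compatibility. The `Exon` type, the `assemble` function and all coordinates and scores are hypothetical and are introduced here purely for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    """Hypothetical candidate exon: inclusive coordinates plus a score."""
    start: int
    end: int
    score: float


def assemble(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring set of mutually non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)   # best[i]: best total score using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is part of best[i]
    prev = [0] * (n + 1)      # prev[i]: how many earlier exons end before exon i starts
    for i, exon in enumerate(ordered, start=1):
        # Count earlier exons whose end lies strictly before this exon's start.
        p = bisect_right(ends, exon.start - 1, 0, i - 1)
        prev[i] = p
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i], take[i] = with_exon, True
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen exons.
    chosen: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


# Made-up candidates: two overlapping pairs of predicted exons.
candidates = [
    Exon(100, 200, 5.0),
    Exon(150, 300, 6.0),
    Exon(350, 400, 4.0),
    Exon(380, 500, 3.0),
]
total, gene = assemble(candidates)
print(total, gene)
```

Real gene assembly additionally constrains which exons may follow one another (frame consistency, donor and acceptor pairing, intron length), which changes the transition rule but not the overall table-filling scheme.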
Large-scale Genome Annotation Efforts
It is now apparent that the bottleneck in genomics is no longer in sequencing the genomes, but lies in their annotation. Large-scale annotation efforts require handling massive amounts of genome data through automated pipelines, with a need to combine diverse sources of data and methods. In addition, they require visualisation tools to manually examine the automatic annotation, since integration of human expertise to assess the validity and authenticity of all computational results goes a long way to improve the quality of gene annotation. The "Annotation Jamboree", a collaboration between Celera, the Berkeley Drosophila Genome Project, and a team of experts on the annotation of the Adh region of Drosophila, is an exemplary attempt to transform the process of manual annotation into a high-throughput operation16. Integrated pipelines for on-going genome projects require an ability to see one region of the genome as a single sequence, although the underlying analysis is carried out on each raw, submitted sequence independently. Ensembl, which annotates human genome sequence automatically and keeps the annotation up-to-date as the sequencing progresses, provides an illustrative example of such automated annotation systems17. The Genome Channel and Genome Catalog, developed and maintained by the Genome Annotation Consortium at the Oak Ridge National Laboratory, are examples of other automated systems being developed. These systems present sequences of human and other model species, as well as microbial genomes, in a richly annotated view18, and provide powerful query interfaces to examine genes in their genomic context as well as across species.
Genome Annotation Experiences with Complete Chromosomes
Judicious use of the available methods and data is required to annotate large genomic regions, with combinations of these methods being used to derive the ‘best guess’ genes. Available experimental data as well as similarity to known proteins or ESTs are used to authenticate the predicted genes. The results of annotating the chromosomes completed so far highlight the challenges to be faced in annotating complete genomes19,20. For example, the results on human chromosomes 21 and 22 taken together raise an interesting question as to the total number of genes present in the human genome. Is it as low as 40,000, in contrast to the earlier estimates of 100,000 - 140,000? Such a question makes one wonder whether an organism as complex as a human requires a large set of genes, or whether the complexity lies in the manner by which a smaller number of genes are regulated and expressed.
However, lessons learnt from the work on human chromosome 22 indicate that current automatic gene prediction tools alone are not reliable enough to predict genes accurately and precisely, but that they are valuable when combined with other information20. Gene finding programs, which are usually trained on data sets of short DNA regions, show lower accuracy values with large genome data sets comprising either experimentally well-analysed regions21 or short single-gene genomic sequences with randomly generated intergenic regions7. The predicted genes with no supporting homology or other experimental evidence continue to be a cause of worry. Ab initio gene finding must be closely integrated with experimental work to validate the predicted genes, especially such ‘orphan genes’.
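The idea of combining several prediction programs to derive ‘best guess’ genes can be sketched as a simple vote over predicted exons. The sketch below is only an illustration under strong assumptions: exons are compared by exact coordinate match, every program has an equal vote, and the coordinates are made up. The function name `consistent_exons` and the `min_support` parameter are hypothetical; the program names are used merely as labels and nothing here reflects their actual output formats.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates


def consistent_exons(
    predictions: Dict[str, Set[Exon]], min_support: int = 2
) -> List[Exon]:
    """Keep exons predicted identically by at least `min_support` programs."""
    votes: Counter = Counter()
    for exons in predictions.values():
        # Each program contributes at most one vote per exon.
        votes.update(set(exons))
    return sorted(exon for exon, n in votes.items() if n >= min_support)


# Made-up predictions for one genomic region.
predictions = {
    "MZEF": {(100, 200), (350, 400)},
    "GenScan": {(100, 200), (350, 400), (600, 700)},
    "Grail": {(100, 200), (610, 700)},
}
print(consistent_exons(predictions))
print(consistent_exons(predictions, min_support=3))
```

In practice a looser criterion, such as a minimum overlap between predicted exons or agreement on one boundary only, may be preferable, and exons lacking support from other programs would be the natural candidates for checking against EST or protein similarity evidence.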
Gene Annotation is a Continual Process
As new knowledge and data are generated (e.g. from further mRNA, EST, and protein sequencing), as ab initio methods improve, and as genome databases from more species become available for comparative genomic studies, the genome annotation needs to be updated. The genome databases need to have flexible data structures to enable continual updates of the annotations. This requires standardised formats for the results of gene annotations. C. elegans, for which the gene annotation was completed in 1998, provides a case study to illustrate the requirements of ongoing annotation updates22.
Beyond Gene Annotation
Once the genes are identified, they need to be categorised in terms of the molecular and cellular organisation of the encoded proteins. Such categorisation will help in the query of genome databases at the level of specific function(s) of a gene product, the role(s) it plays in cellular processes, and its localization and associations. As more genome databases become available, it is essential that categorisation is carried out using a controlled set of shared vocabulary terms to describe the gene products based on current knowledge. The Gene Ontology Consortium is leading the development in this direction by annotating protein sequences from mouse, Drosophila and yeast using a standardised ontology23. Furthermore, the annotated gene features need to be integrated with other biological knowledge. The Mouse Genome Informatics web site at The Jackson Laboratory provides an illustrative example, offering integrated access to mouse genome, expression and genome sequence databases24,25.
The Immediate Future and Excitement
Without doubt, in the coming years, genome-related activities are going to take place at an ever greater pace. An increasing number of completed genomes will become available, and with them more experience of genome annotation. This will lead to improvements in methods and approaches and hence to more accurate genome annotation. As well as identification of coding regions, more effort needs to be directed at identifying the more difficult regions in genomes, such as promoters and regulatory regions. Issues such as identifying alternative transcripts that involve multiple choices at the level of promoters or splicing will become prominent, especially as it is currently estimated that one in every three human genes undergoes alternative splicing. Comparative genome analysis will play a significant role in genome annotation, leading to a reduction in the number of predicted genes with no supporting evidence. More informatics-related techniques will be adopted to handle the software and data-handling needs of genome annotation.
To paraphrase T. S. Eliot†, it is certainly clear that in genomics and gene characterisation, we are not at 'the beginning of the end', but only 'the end of the beginning'!
Summary of the articles presented in this special issue
This special issue of Briefings in Bioinformatics features seven articles illustrating some of the gene annotation issues discussed so far in this editorial.
Zhang illustrates the concept and the use of discriminant analysis, a powerful statistical technique for classification, in the identification of gene structural elements27. The article discusses in particular a resultant tool, namely MZEF, which identifies exons in a given genome.
Thanaraj & Robinson discuss their specialized tool (based on decision tree models) for accurate splice site prediction5. They demonstrate the integration of their tool with other publicly available programs (currently the chosen program is Zhang’s MZEF) for exon prediction and illustrate the resulting improvements in the accuracy of exon prediction.
Gelfand et al. discuss the utility of comparative genome approaches in delineating regulatory regions in a given microbial genome28. Their approach is based on the assumption that sets of correlated genes are conserved in related species and hence that knowledge about regulation in a well-studied genome can be transferred to another genome. The approach and the results are discussed for both closely related genomes and sufficiently diverged genomes.
Werner discusses ways to identify transcription elements in large genomic sequences by a combinatory approach29 involving (i) localization of transcription factor binding sites and regulatory elements, (ii) use of comparative genomics in the identification of new regulatory elements, and (iii) construction of organization models for promoters. He presents PromoterInspector, a software tool that identifies promoter regions.
Wiehe et al. discuss the underlying rationale of the approaches for comparative genome annotation, namely that non-functional regions are more likely to accumulate mutations than functional regions30. They further present the following two tools: (i) a syntenic gene prediction program that combines the results of sequence alignment between two closely related species with the results of ab initio gene prediction programs; and (ii) PipMaker, which identifies regions of similarity between two large genomic sequences and thereby identifies conserved genomic regions.
Larimer reviews31 HOBACGEN, the homologous bacterial genes database, as a resource for comparative genome studies. Annotating the genes in a given genome requires deriving the homology relationships shared among genes from more than one genome, with the underlying task of generating and extending multiple alignments and phylogenetic trees. Web based resources such as HOBACGEN provide pre-generated families of protein sequences (classified by similarities, phylogenetic trees and taxonomy) to aid in such tasks.
Mayer et al. present an overview of the methods, strategies and resources used to analyse the genome of the flowering plant A. thaliana32. The results presented include gene annotation, chromosomal organization and comparative genome features.
Stevens et al. address the need for ontology-based knowledge representation in bioinformatics activities33. The article, in addition to reviewing the current bio-ontologies, discusses methods both to develop them and to access them effectively.
T. A. Thanaraj
Alan Robinson
Juha Muilu
Jean-Jack Riethoven
References: