Limitations of annotation enrichment


The main limitations of annotation enrichment come from the annotations themselves. Some areas of biology are annotated more thoroughly than others, with more detailed and more accurate terms available for well-known processes. At the protein level, for example, more "popular" (heavily studied) proteins are better annotated. This biases the statistical analysis towards well-studied proteins and processes.

It is also important to note that GO terms can be assigned in two ways: by a human curator who performs careful, manual annotation, or by computational approaches that build on manual annotation to infer which terms properly describe uncharacterised gene products. These computational approaches use a number of different criteria, all of which refer back to already annotated gene products, such as sequence similarity, structural similarity or phylogenetic closeness. Computationally derived annotations carry considerable weight, since they account for roughly 99% of the annotations found in GO.
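To get a feel for how much of an annotation set is computationally derived, one option is to tally the evidence codes in a GO annotation file. The sketch below is only illustrative: it assumes a GAF-style tab-separated file in which the evidence code sits in the seventh column, and it treats the IEA code ("Inferred from Electronic Annotation") as a rough proxy for computational annotation. The rows, identifiers and function names are invented for the example.

```python
from collections import Counter

# Toy GAF-like rows (tab-separated). All gene products, references and
# term assignments here are invented placeholders, not real annotations.
SAMPLE_GAF = "\n".join([
    "!gaf-version: 2.2",
    "UniProtKB\tX00001\tGENE1\tinvolved_in\tGO:0006915\tREF:placeholder1\tIDA",
    "UniProtKB\tX00002\tGENE2\tinvolved_in\tGO:0006915\tREF:placeholder2\tIEA",
    "UniProtKB\tX00003\tGENE3\tinvolved_in\tGO:0008283\tREF:placeholder2\tIEA",
    "UniProtKB\tX00004\tGENE4\tinvolved_in\tGO:0008283\tREF:placeholder3\tIMP",
])

EVIDENCE_COLUMN = 6  # zero-based index of the evidence code (7th column)


def evidence_counts(gaf_text):
    """Count annotations per evidence code, skipping '!' header lines."""
    counts = Counter()
    for line in gaf_text.splitlines():
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        counts[fields[EVIDENCE_COLUMN]] += 1
    return counts


def electronic_fraction(gaf_text):
    """Fraction of annotations tagged IEA (assumed proxy for computational)."""
    counts = evidence_counts(gaf_text)
    total = sum(counts.values())
    return counts["IEA"] / total if total else 0.0


if __name__ == "__main__":
    print(evidence_counts(SAMPLE_GAF))
    print(electronic_fraction(SAMPLE_GAF))
```

Running the same tally on a real annotation file, and repeating the enrichment with and without electronically inferred annotations, is one way to check how sensitive a result is to the annotation source.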

Simplifying the interpretation of annotation enrichment results

Another limitation of annotation enrichment is the complexity and detail of the annotation associated with large gene or protein sets. Resources such as Reactome and, especially, GO can be very detailed in their annotation, which leads to overwhelmingly complicated networks of inter-related and similar terms. There are several ways to try to unravel this complexity.

The simplest approach is to use simplified ontologies. Many tools offer this option: they use ontologies in which fine-grained terms are removed and their annotations are reassigned to broader, more general parent terms. In GO, these simplified ontologies are called GO slims.
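The idea of reassigning fine-grained terms to broader parents can be sketched in a few lines. The example below is a minimal illustration, not how any particular tool implements it: the mini-ontology, the term identifiers and the function name are all hypothetical, and a term is mapped to the first slim term met on each upward path through its parents.

```python
# Hypothetical mini-ontology: each term maps to its direct parent terms.
PARENTS = {
    "T:root": [],
    "T:metabolism": ["T:root"],
    "T:signalling": ["T:root"],
    "T:lipid_metabolism": ["T:metabolism"],
    "T:fatty_acid_oxidation": ["T:lipid_metabolism"],
    "T:kinase_cascade": ["T:signalling"],
}

# Hypothetical slim: the broad terms we want to report instead.
SLIM = {"T:metabolism", "T:signalling"}


def map_to_slim(term, parents, slim):
    """Walk up from `term` and collect the first slim term on each path."""
    found = set()
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)  # stop climbing this path
            continue
        stack.extend(parents.get(current, []))
    return found


def slim_annotations(annotations, parents, slim):
    """Replace each gene's detailed terms by the slim terms they map to."""
    result = {}
    for gene, terms in annotations.items():
        mapped = set()
        for term in terms:
            mapped |= map_to_slim(term, parents, slim)
        result[gene] = mapped
    return result


if __name__ == "__main__":
    detailed = {
        "geneA": {"T:fatty_acid_oxidation"},
        "geneB": {"T:kinase_cascade", "T:lipid_metabolism"},
    }
    print(slim_annotations(detailed, PARENTS, SLIM))
```

Terms that have no slim ancestor (here, the root itself) map to an empty set, which mirrors the information that is lost when a slim does not cover part of the ontology.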

Other tools, such as the Cytoscape apps BiNGO and ClueGO, represent the results as a network of terms, where directed edges represent term relationships as defined in the ontology used. Graph-theoretical tools can then be used to reorganise the layout of the network and uncover communities within these term networks, which helps to simplify the output. BiNGO only provides the network view, so other tools are required to simplify the analysis further. ClueGO uses network analysis tools and Cohen's kappa coefficient to group terms by similarity, which gives a simplified and much more interpretable view of the results.
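As an illustration of how Cohen's kappa can measure the similarity between two terms, the sketch below treats each term as a yes/no labelling of every gene in a background set and computes the chance-corrected agreement between the two labellings. This is a generic implementation of the statistic under that assumption, not ClueGO's own code; the gene names and the function name are made up.

```python
def cohens_kappa(genes_a, genes_b, universe):
    """Cohen's kappa for two terms seen as yes/no labels over `universe`."""
    a = set(genes_a) & set(universe)
    b = set(genes_b) & set(universe)
    n = len(set(universe))
    if n == 0:
        return 0.0

    both = len(a & b)        # genes annotated to both terms
    only_a = len(a - b)      # genes annotated to the first term only
    only_b = len(b - a)      # genes annotated to the second term only
    neither = n - both - only_a - only_b

    observed = (both + neither) / n
    expected = (
        (both + only_a) * (both + only_b)
        + (neither + only_b) * (neither + only_a)
    ) / (n * n)

    if expected == 1.0:
        # Degenerate case (e.g. both terms empty): kappa is undefined,
        # so this sketch returns 0.0 by convention.
        return 0.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    background = {f"g{i}" for i in range(1, 11)}
    term_1 = {"g1", "g2", "g3", "g4"}
    term_2 = {"g1", "g2", "g3", "g5"}
    print(cohens_kappa(term_1, term_2, background))
```

Kappa is 1 for terms annotating exactly the same genes and drops to zero or below when the overlap is no better than expected by chance, so a threshold on kappa can be used to decide which terms are grouped together.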

Finally, there are tools specifically devoted to simplifying the interpretation of annotation enrichment results. The Cytoscape EnrichmentMap app is a very good example. It can take the output of some of the most popular annotation enrichment tools, such as DAVID, BiNGO, g:Profiler or the more sophisticated GSEA, and render it as clustered networks. The tool applies clustering and automatic layout techniques to bring similar gene sets together and provide a simplified representation of the annotation enrichment results. It is especially useful when comparing results obtained from different sets, for example, those representing two different conditions.
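The notion of "similar gene sets" can be made concrete with a set-overlap score. The sketch below computes two common measures, the Jaccard index and the overlap coefficient, and keeps only the pairs of gene sets that score above a cut-off, which yields the edges of a similarity network. It is a simplified illustration under assumed inputs; the gene sets, the threshold and the function names are invented, and it does not reproduce the app's actual pipeline.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def similarity_edges(gene_sets, threshold, measure=jaccard):
    """Return (set_1, set_2, score) for every pair scoring >= threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        score = measure(gene_sets[name_a], gene_sets[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Invented enriched gene sets, used only to exercise the functions.
GENE_SETS = {
    "apoptosis": {"A", "B", "C", "D"},
    "cell death": {"A", "B", "C", "E"},
    "translation": {"X", "Y", "Z"},
}

if __name__ == "__main__":
    print(similarity_edges(GENE_SETS, threshold=0.5))
    print(similarity_edges(GENE_SETS, threshold=0.5, measure=overlap_coefficient))
```

With these toy sets, "apoptosis" and "cell death" share three of their genes and end up connected, while "translation" stays isolated; clustering the resulting network then groups redundant terms into a single theme.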

In summary, it is important to know the limitations of the annotation resource you are using to perform this type of analysis. It is also important to be aware of the inherent complexity of the results. Network analysis techniques can help simplify the interpretation of these results.