BioModels Database

BioModels Database Annotation Guidelines

This document contains a set of guidelines for the annotation of models, that is, for linking model components to terms from controlled vocabularies and to entries in other data resources.

This document has been designed with our curators in mind. For an introduction to model annotation, please refer to our annotation information page.

Selecting Adequate Accession Numbers

When annotating a component with a reference to an external resource, the first step is to identify exactly what this reference is. It should be a perennial tag. For instance, a UniProt "entry name", such as CALM_HUMAN, is not perennial: it is modified on a regular basis to better reflect the classification of the protein. The "accession", such as P62158, is on the contrary perennial. Even if some accession numbers are later downgraded from primary to secondary (for instance when database entries are merged), one can always retrieve the correct UniProt entry from those accession numbers.
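The distinction between the two kinds of UniProt identifier can be checked mechanically before an annotation is stored. The following is a minimal sketch, not part of BioModels Database: the accession pattern is the one given in the UniProt help pages, while the entry name pattern and the function name classify_uniprot_identifier are loose assumptions made for illustration.

```python
import re

# Accession pattern as documented in the UniProt help pages.
ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Assumption: an entry name is a mnemonic, an underscore and a species code.
# This is a loose heuristic, not an official definition.
ENTRY_NAME = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")


def classify_uniprot_identifier(identifier: str) -> str:
    """Return 'accession', 'entry name' or 'unknown' for a UniProt identifier."""
    if ACCESSION.match(identifier):
        return "accession"
    if ENTRY_NAME.match(identifier):
        return "entry name"
    return "unknown"


for candidate in ("P62158", "CALM_HUMAN"):
    print(candidate, "is of kind:", classify_uniprot_identifier(candidate))
```

Only identifiers classified as "accession" would be suitable for annotation under the rule above. The check is purely syntactic and does not confirm that the entry exists.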

BioModels Database follows the MIRIAM guidelines for annotation and curation, and employs MIRIAM URIs to encode cross references to external resources. The data collections currently supported by the MIRIAM Registry are listed at http://www.ebi.ac.uk/miriam/main/collections/. Some of the entries give usage examples detailing which SBML elements are most likely to be annotated with that data type. The ChEBI usage example, for instance, shows that parameter and species elements can potentially be annotated with ChEBI IDs. BioModels Database and some other tools, such as Semantic SBML and SBML editor, use this information to preselect data collections in their annotation interfaces. If you find the usage example of a data type misleading or missing, you can suggest modifications on its usage page.
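As an illustration of what such a cross reference looks like, the sketch below assembles a MIRIAM URI in its URN form from a data collection namespace and an identifier. The function name miriam_urn is hypothetical, and the namespaces used ("uniprot", "obo.go") are assumptions that should be checked against the MIRIAM Registry before use.

```python
from urllib.parse import quote


def miriam_urn(collection: str, identifier: str) -> str:
    """Build a MIRIAM URN of the form urn:miriam:<collection>:<identifier>.

    Characters that are not safe inside a URN, such as the colon in
    "GO:0045737", are percent-encoded.
    """
    return f"urn:miriam:{collection}:{quote(identifier, safe='')}"


# Namespaces below are assumptions; verify them in the MIRIAM Registry.
print(miriam_urn("uniprot", "P62158"))
print(miriam_urn("obo.go", "GO:0045737"))
```

Keeping the construction in one function means the encoding rule is applied consistently, whichever data collection the identifier comes from.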

It can be hard to find a term at an adequate level of specificity for annotation. In general, one should select the closest and most specific relevant piece of data that is still general enough to encompass all aspects of the annotated element covered by the data type. In the case of hierarchical knowledge, e.g. controlled vocabularies or classifications, one should carefully choose the level of detail. Sometimes, the finest level is acceptable. For instance, to annotate the activation of cdc2 kinase by cyclins in amphibians, one should use the Gene Ontology term GO:0045737 "positive regulation of cyclin dependent protein kinase activity", rather than the more general parent term GO:0000079 "regulation of cyclin dependent protein kinase activity". Indeed, the latter is also the parent of GO:0045736 "negative regulation of cyclin dependent protein kinase activity", which is not adequate to describe the reaction under annotation. Conversely, and considering a completely different type of knowledge, the model of the mitotic oscillator presented in Goldbeter (1991) [BIOMD0000000003 and BIOMD0000000004] describes a generic mechanism of the amphibian cell cycle. Therefore, the taxonomy classification 8292 "Amphibia" should be used, rather than the more precise 8355 "Xenopus lævis" or 8401 "Rana esculenta".

A generic annotation is always better than nothing! For instance, annotating the dissociation of MAPKKK from MAPKK with the Gene Ontology term GO:0043241 "protein complex disassembly" carries a significant amount of information when it comes to characterising the reaction. It is definitely better than no annotation at all. Another example is the annotation of a particular mRNA with KEGG C00046 "Ribonucleic acid".

Qualification of Annotation

The qualification of an annotation is important for grasping the relation between a model component and its annotation. The relationships are rarely one-to-one, and the information content of an annotation is greatly increased if one knows what it represents, rather than knowing only that it is vaguely "related to" the component.

The qualifier of an annotation should reflect the relationships between the biological objects represented by the model element and the annotation:

[Figure: relation between model and data]

The definition of all the qualifiers used by BioModels Database can be found on the BioModels.net website.

The simplest qualifications are the annotation of, or by, an abstracted entity. E.g. a species representing a "cyclin" hasVersion "cig1", "cdc13", etc. Conversely, a reaction representing "phosphorylation of cdk2" isVersionOf "phosphorylation of protein". Versions of species are physical modifications, such as conformational states, covalent modifications, etc. The same species in different compartments are not alternative versions.

Finding the correct qualifications can be tricky when the lack of directly relevant annotation forces the use of non-directly related information. To exemplify the problem, let's consider organisms:

  • A model of "xenopus", annotation by "amphibian" data: isVersionOf
  • A model of "amphibian", annotation by "xenopus" data: hasVersion
  • A model of "xenopus", annotation by "frog" data: isHomologTo
  • A model of "amphibian", annotation by "human" data: unclear. Since "human" is a species and "amphibian" a group of species, it could be treated as a hasVersion relationship, even though "human" is not a version of "amphibian"

Several sets of annotations can be created for a model component. The sets are homogeneous, and different qualifications are stored in different sets. Several sets can exist with the same qualification; they represent alternative, sometimes overlapping, annotations. For instance, if a model reaction represents the combination of three successive biochemical reactions, one can have two sets of hasPart annotations, one with three EC codes and one with three KEGG reaction identifiers. In general, only sets with the qualifiers hasPart and hasVersion should contain more than one reference. The concepts represented by the different references in one set must not overlap and should be of the same data type, if possible. In some cases, for example a complex of the protein calmodulin with Ca2+, the set has to mix references to the UniProt entry of calmodulin and the ChEBI entry for Ca2+.
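A set of annotations can be pictured as one RDF bag placed under one qualifier. The sketch below builds such a set for the calmodulin and Ca2+ example using only the Python standard library. It is an illustration under assumptions rather than the code used by BioModels Database: the function name annotation_set, the metaid "metaid_cam_ca" and the URN spellings are hypothetical, and CHEBI:29108 is taken here to be the ChEBI entry for Ca2+, which should be verified in ChEBI.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL = "http://biomodels.net/biology-qualifiers/"


def annotation_set(meta_id, qualifier, uris):
    """Return an rdf:Description holding one set of references.

    All references share a single qualifier and sit in a single rdf:Bag,
    which mirrors the rule that a set is homogeneous.
    """
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("bqbiol", BQBIOL)
    description = ET.Element(
        f"{{{RDF}}}Description", {f"{{{RDF}}}about": f"#{meta_id}"}
    )
    relation = ET.SubElement(description, f"{{{BQBIOL}}}{qualifier}")
    bag = ET.SubElement(relation, f"{{{RDF}}}Bag")
    for uri in uris:
        ET.SubElement(bag, f"{{{RDF}}}li", {f"{{{RDF}}}resource": uri})
    return description


# Hypothetical metaid and assumed URNs for calmodulin and Ca2+.
complex_annotation = annotation_set(
    "metaid_cam_ca",
    "hasPart",
    ["urn:miriam:uniprot:P62158", "urn:miriam:obo.chebi:CHEBI%3A29108"],
)
print(ET.tostring(complex_annotation, encoding="unicode"))
```

An alternative description of the same component would be a second call to annotation_set with the same qualifier, giving a second, independent bag.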

There is only one level of explicit qualification. In addition, an implicit hasVersion is embedded in the sets: if there are two sets of hasPart annotations, both sets are alternative complexes made up of their parts. When the exact description of the relation between a model component and its annotation would require a combination of several qualifiers, a precedence has to be established:

  • hasPart has precedence over hasVersion
  • isPartOf has precedence over isVersionOf
  • hasPart has precedence over isHomologTo

For example, a protein complex of "amphibian" annotated with proteins of "xenopus" should have one hasPart, rather than several hasVersion sets (Note that one hasVersion set with all the annotations would mean that they are alternative versions).
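The three precedence rules above can be written down as a small lookup. The sketch below is only an illustration: the names YIELDS_TO and explicit_qualifier are hypothetical, and the function deliberately refuses to choose when the listed rules do not settle the question.

```python
# Each key yields to the qualifiers in its value (the three rules above).
YIELDS_TO = {
    "hasVersion": {"hasPart"},
    "isVersionOf": {"isPartOf"},
    "isHomologTo": {"hasPart"},
}


def explicit_qualifier(candidates):
    """Pick the single qualifier to store when several would be needed."""
    candidates = set(candidates)
    # Keep only the qualifiers that do not yield to another candidate.
    remaining = {
        qualifier
        for qualifier in candidates
        if not (YIELDS_TO.get(qualifier, set()) & candidates)
    }
    if len(remaining) != 1:
        raise ValueError(
            f"precedence rules do not settle {sorted(candidates)}"
        )
    return remaining.pop()


# A complex of "amphibian" annotated with proteins of "xenopus".
print(explicit_qualifier(["hasPart", "hasVersion"]))
```

Raising an error for unsettled combinations, such as hasVersion together with isHomologTo, leaves the decision to the curator instead of guessing.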

Annotation with SBO terms

SBML models from Level 2 Version 2 onwards give the option of annotating elements directly with terms from the Systems Biology Ontology (SBO) using the sboTerm attribute. These annotations add another layer of semantics to a model. They are, for example, essential for creating graphical representations such as Systems Biology Graphical Notation (SBGN) diagrams, or for converting SBML to other model description formats, such as BioPAX.

For BioModels Database, the following elements should be annotated with SBO terms:

element       child of SBO term
compartment   SBO:0000290 - physical compartment
species       SBO:0000240 - material entity
reaction      SBO:0000375 - process
reactant      SBO:0000010 - reactant
product       SBO:0000011 - product
modifier      SBO:0000019 - modifier
kineticLaw    SBO:0000001 - rate law
parameter     SBO:0000002 - quantitative systems description parameter

Annotating reactants, products, modifiers, rate laws and parameters can be quite a time-consuming task, although there are tools to help with it, for example semanticSBML. Furthermore, Michael Schubert wrote a Python script that detects many rate laws automatically and annotates all of the above.

In addition to the above, the model element's sboTerm can be used to indicate the mathematical framework under which the model should be interpreted.
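A quick way to see how much SBO annotation a model still needs is to list the elements that carry no sboTerm attribute. The sketch below does this with the Python standard library only. It is not the script mentioned above: the names EXPECTED_PARENT, missing_sbo_terms and SAMPLE are hypothetical, the mapping of reactants and products to the SBML speciesReference element and of modifiers to modifierSpeciesReference is an assumption about how the listed elements appear in SBML, and the check covers only the presence of a term, not whether it descends from the expected parent.

```python
import xml.etree.ElementTree as ET

# SBML element name -> SBO term its sboTerm should descend from.
EXPECTED_PARENT = {
    "compartment": "SBO:0000290",
    "species": "SBO:0000240",
    "reaction": "SBO:0000375",
    "speciesReference": "SBO:0000010 (reactant) or SBO:0000011 (product)",
    "modifierSpeciesReference": "SBO:0000019",
    "kineticLaw": "SBO:0000001",
    "parameter": "SBO:0000002",
}


def missing_sbo_terms(sbml_text):
    """List (element, identifier, expected parent) for elements without sboTerm."""
    root = ET.fromstring(sbml_text)
    missing = []
    for element in root.iter():
        name = element.tag.rsplit("}", 1)[-1]  # drop the XML namespace
        if name in EXPECTED_PARENT and "sboTerm" not in element.attrib:
            identifier = element.get("id") or element.get("species")
            missing.append((name, identifier, EXPECTED_PARENT[name]))
    return missing


# Toy model: the compartment is annotated, the species is not.
SAMPLE = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfCompartments>
      <compartment id="cell" sboTerm="SBO:0000290"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="cdc2k" compartment="cell"/>
    </listOfSpecies>
  </model>
</sbml>"""

for entry in missing_sbo_terms(SAMPLE):
    print(entry)
```

Checking that a term really is a child of the expected parent would require the SBO hierarchy itself, which this sketch does not load.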

Annotation of the model element

The encoders are all the traceable persons who created or modified the structure of the encoded model. All encoders should be credited adequately. In particular, the initial creators should be tracked down if they are not specified explicitly, for instance in the notes elements. If the model has been taken from another data resource and no curator identity is available, one must credit the creator(s) of the resource.

BioModels Database curators should be credited as well.

Annotation of species

One should avoid annotating a species with homologs as much as possible. Sometimes a protein is not described in UniProt, but it can be derived from Ensembl. In such a situation, it is better to annotate with Ensembl than to use a homolog present in UniProt.

Often a model is fairly generic and defines only classes of molecules. The use of controlled vocabularies and hierarchical classifications, such as InterPro, can help to choose the right level of abstraction (with InterPro, take care to use the family branches, not the domain or catalytic site ones). Sometimes one can nevertheless annotate a component with a particular instance, based on the biochemistry implied in the model. For instance, Hoefnagel et al (2002) [BIOMD0000000017] only defines the species "lactate", created from pyruvate. It therefore seems reasonable to annotate it with the ChEBI term CHEBI:24996 "lactate", rather than with any specific isomer. However, one can notice that the authors mention only the lactate dehydrogenase, and not the D-lactate dehydrogenase. One can thus also annotate the species with KEGG C00186 "(S)-Lactate".

Many external resources do not offer different levels of knowledge. For instance, the UniProt database lists proteins, not protein types. When one wants to annotate a generic type of protein, one needs to list all the suitable proteins (or at least a significant subset, such as all the paralogs in one species). For instance, the model of the mitotic oscillator presented in Goldbeter (1991) [BIOMD0000000003 and BIOMD0000000004] describes a species "cdc2k" that could be annotated with P35567 (CDC21_XENLA) and P24033 (CDC22_XENLA), the two forms of cdc2 in Xenopus lævis.

One should be extremely careful not to equate the name of a species automatically with a specific biochemical entity. For instance, the creator of the model described in Curto et al (1998) [BIOMD0000000015] used the name "ATP" for the species X4 described in the paper. This species actually represents the sum Adenosine+AMP+ADP+ATP. As a consequence, not only does the species "ATP" have to be annotated four times, but the reactions involving the species "ATP" are actually sets of four different reactions (some of which never happen in Nature, but that's another story).

One has to be very careful with the hits returned by search engines. They can be very misleading, and a bad annotation unfortunately spreads widely through wrong associations. For instance, in February 2005, a search of the database BIND with "acetylcholine" returned the complex made up of p25, CDK5 and PCTAIRE-motif protein kinase 1. As far as the author (who spent 12 years working on acetylcholine receptors) knows, there is no relationship between this complex and any aspect of acetylcholine physiology.

One should only use Gene Ontology terms from the Cellular Component vocabulary to annotate species. One should never use terms from the Molecular Function or Biological Process vocabularies, even if they fit the function of the species. Those vocabularies should be used to annotate reactions, rules and events instead.

Annotation of Reactions

Although one should annotate a reaction with an EC code, a KEGG reaction, or an IntAct or BIND identifier, sometimes this is just not possible. If the reaction is an abstract summary of a linear pathway, one should annotate it with all the relevant codes, as one would annotate a multimolecular complex with the identifiers relevant for all its components. For instance, the reaction "den" of Curto et al (1998) [BIOMD0000000015] corresponds to the KEGG reactions R01072, R01127, R04144, R04208, R04209, R04325, R04463, R04559, R04560, R04591! If the reaction is a generic one, sometimes a Gene Ontology term is sufficient, such as GO:0006308 "DNA catabolism" for the reaction "dnag" of Curto et al (1998) [BIOMD0000000015]. As usual, anything is better than nothing, provided that the information is not misleading.

It happens that several possible reactions link a given set of substrates to a given set of products. Often they differ in their use of small molecules. A careful reading of the paper can help pick the right annotation. For instance, in Curto et al (1998) [BIOMD0000000015], a reaction links GMP and XMP, with ATP as a modifier. Two KEGG reactions correspond, R01230 and R01231, the former producing ammonia and the latter producing L-glutamate. However, in appendix A of the paper, one can read that the reaction "gmps" corresponds to XMP+ATP+glutamine=>GMP+AMP+Pi. Since glutamine is consumed, R01231 is the right annotation.

Sometimes, a look at the publications listed in KEGG is sufficient to guide the decision. For instance, S-adenosylmethioninamine is transformed into 5'-methylthioadenosine by two reactions, using the spermine synthase (EC 2.5.1.22) and the sym-norspermidine synthase (EC 2.5.1.23). However, a look at the references shows that the first reaction has been studied in rat and bovine, while the second has been studied in the simpler eukaryote Euglena. If you want to annotate a human pathway, as in Curto et al (1998) [BIOMD0000000015], the former is a reasonable choice.

