This document contains a set of guidelines for the annotation of a model, that is, for linking model components with terms from controlled vocabularies and entries in other data resources.
When annotating a component with a reference to an external resource, the first step is to identify exactly what this reference is. It should be a perennial tag. For instance, a UniProt "entry name", such as CALM_HUMAN, is not perennial: it is modified on a regular basis to better reflect the classification of the protein. In contrast, an "accession", such as P62158, is perennial. Even if some accession numbers are later downgraded from primary to secondary (for instance when database entries are merged), one can always retrieve the correct UniProt entry from those accession numbers.
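The distinction between the two kinds of UniProt identifier can be checked mechanically before an annotation is stored. The following is a minimal sketch, assuming the accession pattern published in the UniProt help pages and a deliberately loose, assumed pattern for entry names; the function names are hypothetical.

```python
import re

# Accession pattern as documented by UniProt (assumed to be current;
# check the UniProt help pages before relying on it).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Loose pattern for entry names such as CALM_HUMAN
# (assumption: a mnemonic, an underscore, then a species code).
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_uniprot_id(identifier: str) -> str:
    """Return 'accession', 'entry name' or 'unknown' for a UniProt identifier."""
    if ACCESSION_RE.fullmatch(identifier):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(identifier):
        return "entry name"
    return "unknown"


def is_perennial(identifier: str) -> bool:
    """Only accessions are treated as stable enough to annotate with."""
    return classify_uniprot_id(identifier) == "accession"
```

With this sketch, `is_perennial("P62158")` is true while `is_perennial("CALM_HUMAN")` is false, so an entry name would be rejected and the curator prompted to look up the accession instead.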
One should always select the closest relevant piece of data. In the case of hierarchical knowledge, e.g. controlled vocabularies or classifications, one should carefully choose the level of detail. Sometimes, the finest level is appropriate. For instance, in order to annotate the activation of cdc2 kinase by cyclins in amphibians, one should use the Gene Ontology term GO:0045737 "positive regulation of cyclin dependent protein kinase activity", rather than the more general parent term GO:0000079 "regulation of cyclin dependent protein kinase activity". Indeed, the latter is also the parent of GO:0045736 "negative regulation of cyclin dependent protein kinase activity", which is not adequate to describe the reaction under annotation. Conversely, considering a completely different type of knowledge, the model of the mitotic oscillator presented in Goldbeter (1991) [BIOMD0000000003 and BIOMD0000000004] describes a generic mechanism of the amphibian cell cycle. Therefore, the taxonomy classification 8292 "Amphibia" should be used, rather than the more precise 8355 "Xenopus lævis" or 8401 "Rana esculenta".
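Choosing the level of detail amounts to finding the most specific term that still covers everything the model describes. The sketch below illustrates this with a hand-written parent map for the taxonomy example; the map is a simplification (the intermediate ranks between species and class are omitted) and the function names are hypothetical.

```python
from typing import List, Optional

# Toy parent map: child taxon id -> parent taxon id.
# Simplification: intermediate ranks between species and class are omitted.
PARENT = {
    "8355": "8292",  # Xenopus laevis -> Amphibia
    "8401": "8292",  # Rana esculenta -> Amphibia
}


def lineage(term: str) -> List[str]:
    """Return the term followed by its ancestors, most specific first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def most_specific_common_term(terms: List[str]) -> Optional[str]:
    """Return the most specific term that covers every input term."""
    if not terms:
        return None
    common = set(lineage(terms[0]))
    for term in terms[1:]:
        common &= set(lineage(term))
    # Walk up from the first term; the first shared entry is the most specific.
    for candidate in lineage(terms[0]):
        if candidate in common:
            return candidate
    return None
```

For a model meant to cover both 8355 and 8401, the function returns 8292; for a model about a single organism it returns that organism unchanged. The same walk applies to the Gene Ontology example if the parent map holds GO:0045737 and GO:0045736 as children of GO:0000079.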
A generic annotation is better than nothing! For instance, annotating the dissociation of MAPKKK from MAPKK with the Gene Ontology term GO:0043241 "protein complex disassembly" carries a significant amount of information when it comes to characterising the reaction. It is definitely better than no annotation at all. Another example is the annotation of a particular messenger RNA with KEGG C00046 "Ribonucleic acid".
The qualification of an annotation is important to grasp the relation between a model component and its annotation. The relationship is rarely one-to-one, and the information content of an annotation is greatly increased if one knows what it represents rather than merely that it is "related to".
The qualifier of an annotation should reflect the relationships between the biological objects represented by the model element and the annotation:
The definition of all the qualifiers used by BioModels Database can be found on the BioModels.net website.
The simplest qualifications are the annotation of, or by, an abstracted entity. E.g. a species representing a "cyclin" has version "cig1", "cdc13" etc. Conversely, a reaction representing "phosphorylation of cdk2" is version of "phosphorylation of protein". Versions of a species are physical modifications, such as conformational states, covalent modifications etc. The same species in different compartments are not alternative versions.
This process can be tricky when the lack of directly relevant annotation forces the use of information that is not directly related. To exemplify the problem, let's consider organisms:
Several sets of annotations can be created for a model component. The sets are homogeneous, and different qualifications are stored in different sets. Several sets can exist with the same qualification. For instance, if a model reaction represents the combination of three successive biochemical reactions, one can have two sets of has part annotations, one with three EC codes, and one with three KEGG reaction identifiers.
There is only one level of explicit qualification. In addition, an implicit hasVersion is embedded in the sets. If there are two sets of hasPart annotations, both sets are alternative complexes made up of their parts. When the exact description of the relation between a model component and its annotation would require a combination of several qualifiers, a precedence has to be established:
For example, a protein complex of "amphibian" annotated with proteins of "xenopus" should have one Has Part set, rather than several Has Version sets (note that one Has Version set with all the annotations would mean that they are alternative versions).
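The organisation of annotations into homogeneous sets can be pictured as a small data structure. This is only an illustrative sketch: the class names, the `alternatives` method and the identifiers are hypothetical placeholders, not the storage format used by any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnnotationSet:
    """A homogeneous set: one qualifier shared by every resource in the set."""
    qualifier: str
    resources: List[str] = field(default_factory=list)


@dataclass
class Component:
    """A model component carrying any number of annotation sets."""
    name: str
    annotation_sets: List[AnnotationSet] = field(default_factory=list)

    def alternatives(self, qualifier: str) -> List[List[str]]:
        """Return the resource lists of all sets with this qualifier.

        Several sets with the same qualifier are read as alternatives.
        """
        return [s.resources for s in self.annotation_sets
                if s.qualifier == qualifier]


# Placeholder identifiers only; they do not name real database entries.
lumped_reaction = Component(
    name="lumped_reaction",
    annotation_sets=[
        AnnotationSet("hasPart", ["ec:step1", "ec:step2", "ec:step3"]),
        AnnotationSet("hasPart", ["kegg:step1", "kegg:step2", "kegg:step3"]),
    ],
)
```

Here `lumped_reaction.alternatives("hasPart")` yields two lists of three parts each, which mirrors the reading given above: each set is one complete description of the parts, and the two sets are alternative descriptions of the same component.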
The creators are the persons who created or modified the structure of the encoded model. All creators should be credited appropriately. In particular, the initial creators should be tracked down if they are not specified, for instance in notes elements. If the model has been taken from another data resource, and no curator identity is available, one must credit the creator(s) of the resource.
BioModels Database curators should be credited as well.
One should avoid annotating a species with homologs as much as possible. Sometimes a protein is not described in UniProt, but it can be derived from Ensembl. In such a situation, it is better to annotate with Ensembl than to use a homolog present in UniProt.
Often a model is quite generic and defines only classes of molecules. The use of controlled vocabularies and hierarchical classifications, such as InterPro, can help to choose the right level of abstraction (with InterPro, be careful to use the family branches, not the domain or catalytic site ones). Sometimes, one can nevertheless annotate a component with a particular instance, based on the biochemistry implied in the model. For instance, Hoefnagel et al (2002) [BIOMD0000000017] only defines the species "lactate", created from pyruvate. It therefore seems reasonable to annotate it with the ChEBI term CHEBI:24996 "lactate", rather than any specific isomer. However, one can notice that the authors mention only the lactate dehydrogenase, and not the D-lactate dehydrogenase. One can thus also annotate the species with KEGG C00186 "(S)-Lactate".
Many external resources do not offer different levels of knowledge. For instance, the UniProt database lists proteins, not protein types. When one wants to annotate a generic type of protein, one needs to list all the suitable proteins (or at least a significant subset, such as all the paralogs in one species). For instance, the model of the mitotic oscillator presented in Goldbeter (1991) [BIOMD0000000003 and BIOMD0000000004] describes a species "cdc2k", which could be annotated with P35567 (CDC21_XENLA) and P24033 (CDC22_XENLA), the two forms of cdc2 in Xenopus lævis.
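Listing every suitable protein for a generic species is easy to automate once the accessions are known. The sketch below turns a list of accessions into resolvable references; it assumes the Identifiers.org compact-identifier URL form (`https://identifiers.org/<namespace>:<accession>`), and the function name is hypothetical.

```python
from typing import Iterable, List


def to_uris(namespace: str, accessions: Iterable[str]) -> List[str]:
    """Build one reference per accession, dropping duplicates but keeping order.

    Assumption: the Identifiers.org compact-identifier URL form.
    """
    seen = set()
    uris = []
    for accession in accessions:
        if accession not in seen:
            seen.add(accession)
            uris.append(f"https://identifiers.org/{namespace}:{accession}")
    return uris


# The two forms of cdc2 in Xenopus named in the text.
CDC2K_FORMS = ["P35567", "P24033"]
cdc2k_annotations = to_uris("uniprot", CDC2K_FORMS)
```

The resulting list would be stored as a single set of annotations on the species "cdc2k", so that the two proteins are read as alternative versions of the generic species.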
One should be extremely careful not to systematically equate the name of a species with a specific biochemical entity. For instance, the creator of the model described in Curtot et al (1998) [BIOMD0000000015] used the term "ATP" to name the species X4 described in the paper. This species actually represents the sum Adenosine+AMP+ADP+ATP. As a consequence, not only does the species "ATP" have to be annotated four times, but the reactions involving the species "ATP" are actually sets of four different reactions (some of which never happen in Nature, but that's another story).
One has to be very careful with the hits returned by search engines. They can be very misleading. A bad annotation unfortunately spreads widely through wrong associations. For instance, in February 2005, a search of the BIND database with "acetylcholine" returned the complex made up of p25, CDK5 and PCTAIRE-motif protein kinase 1. As far as the author (who spent 12 years working on acetylcholine receptors) knows, there is no relationship between this complex and any aspect of acetylcholine physiology.
One should only use Gene Ontology terms coming from the Cellular Component vocabulary to annotate species. One should never use terms coming from the Molecular Function or Biological Process vocabularies, even if they fit the function of the species. Those vocabularies should be used to annotate reactions, rules and events instead.
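This rule lends itself to an automatic check. The sketch below encodes only what is stated above; the table, the function name and the lowercase namespace labels are illustrative assumptions, and component kinds not mentioned in the rule are simply rejected.

```python
# Which Gene Ontology vocabularies may annotate which kind of model component.
# Only the combinations stated in the guideline are listed (an assumption of
# this sketch: anything not listed is treated as not allowed).
ALLOWED_GO_NAMESPACES = {
    "species": {"cellular_component"},
    "reaction": {"molecular_function", "biological_process"},
    "rule": {"molecular_function", "biological_process"},
    "event": {"molecular_function", "biological_process"},
}


def go_namespace_allowed(component_kind: str, namespace: str) -> bool:
    """Return True if a GO term from `namespace` may annotate `component_kind`."""
    return namespace in ALLOWED_GO_NAMESPACES.get(component_kind, set())
```

A curation tool could call `go_namespace_allowed("species", "molecular_function")` and, on receiving a false result, suggest moving the term to the reaction in which the species takes part.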
Although one should annotate a reaction with an EC code, a KEGG reaction, or an IntAct or BIND identifier, sometimes this is just not possible. If the reaction is an abstract summary of a linear pathway, one should annotate it with all the relevant codes, as one would annotate a multimolecular complex with the identifiers relevant for all its components. For instance, the reaction "den" of Curtot et al (1998) [BIOMD0000000015] corresponds to the KEGG reactions R01072, R01127, R04144, R04208, R04209, R04325, R04463, R04559, R04560, R04591! If the reaction is a generic one, sometimes a Gene Ontology term is sufficient, such as GO:0006308 "DNA catabolism" for the reaction "dnag" of Curtot et al (1998) [BIOMD0000000015]. As usual, anything is better than nothing, provided that the information is not misleading.
Several reactions may link a given set of substrates to a given set of products. They often differ in the small molecules they use. A careful reading of the paper can help in picking the right annotation. For instance, in Curtot et al (1998) [BIOMD0000000015], a reaction links GMP and XMP, with ATP as a modifier. Two KEGG reactions correspond, R01230 and R01231, the former producing ammonia and the latter producing L-glutamate. However, in appendix A of the paper, one can read that the reaction "gmps" corresponds to XMP+ATP+glutamine=>GMP+AMP+Pi. Bingo: the glutamine-dependent reaction, R01231, is the right one.
Sometimes, a look at the publications listed in KEGG is sufficient to help the decision. For instance, S-adenosylmethioninamine is transformed into 5'-methylthioadenosine by two reactions, catalysed by the spermine synthase (EC 2.5.1.22) and the sym-norspermidine synthase (EC 2.5.1.23). However, a look at the references shows that the first reaction has been studied in rat and bovine, while the second has been studied in the simpler eukaryote Euglena. If you want to annotate a human pathway, as in Curtot et al (1998) [BIOMD0000000015], the former is a reasonable choice.