BioModels Curation Guidelines

This document contains a set of guidelines for the curation phase of a model, that is, the analysis of the model itself and its correspondence to the publication. It is assumed that the models are already encoded in a syntactically correct form: 1) they are proper XML; 2) they comply with the SBML schema; 3) they pass the SBML consistency checks.

Reading the Publication

One should always read the full text of the model's reference publication, and look at the figures only when the text is unclear. It is counter-intuitive, but the diagrams are often simplified (e.g. missing species), or they represent a biased or partial view of the paper's contents (when there are several models in the paper, for instance). Sometimes they are even completely wrong, because the drawing has been borrowed from previous work by the same team and no longer represents the current model.

A tricky step is to identify the organism, or the set of organisms, used to generate the quantitative data. Quite often this requires reading other experimental studies cited by the authors.

One should retrieve the source of the parameters actually used by the authors, in order to identify the right reacting species. This information is sometimes indirect. For instance, Levchenko et al (2000) [BIOMD0000000011 and BIOMD0000000014] state that they base their model on the one by Huang and Ferrell (1996) [BIOMD0000000009]. It would therefore be sensible to assume that the species they call MAPKKK is MOS, as in the quoted paper. In fact, they took only the structure of the model from Huang and Ferrell, and used the numerical values of Bhalla and Iyengar (DOCQS), which correspond to RAF instead (personal communication).

Semantic Curation

Check that the species are actually species and not parameters, and that all reactions are genuinely reactions rather than rules. Some models use reactions instead of rateRules to represent the change of values that are not species, such as the fraction of acetylation, the voltage, etc. Although superficially of the same form, the math elements of kineticLaw and rateRule are very different. The kineticLaw has to provide a value in substance/time, and a species has to represent a pool of entities, which can only be expressed in units of substance or substance/size. A species cannot be expressed in millivolts!

Check that the spirit of the model is preserved. In particular, one can often "simplify" a model by suppressing rules, stoichiometryMath, etc., and fixing the values of compartments, species and parameters. However, in some cases this is misleading, and can even falsify the model when a numerical value is changed. For instance, instead of properly setting the size of the compartments, one can fix them to unity and modify the rate constants and stoichiometry accordingly. The resulting model will be fine for a specific set of values, but any modification, e.g. of a compartment size, will make it produce wrong results.
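The effect of fixing a compartment to unity can be illustrated with a small numerical sketch. It assumes a hypothetical second-order reaction between two species A and B; the function names and all numerical values are invented for illustration and do not come from any curated model.

```python
import math


def rate_explicit(k, amount_a, amount_b, volume):
    """Rate in substance/time, with the compartment size kept explicit."""
    conc_a = amount_a / volume
    conc_b = amount_b / volume
    return k * conc_a * conc_b * volume


def rate_folded(k_folded, amount_a, amount_b):
    """Same reaction, compartment fixed to 1 and its size folded into k."""
    return k_folded * amount_a * amount_b


k = 0.5               # hypothetical rate constant
original_volume = 2.0
k_folded = k / original_volume  # only valid for original_volume

# Both forms agree for the volume used when folding...
print(rate_explicit(k, 4.0, 6.0, original_volume), rate_folded(k_folded, 4.0, 6.0))

# ...but the folded form ignores any later change of compartment size.
print(rate_explicit(k, 4.0, 6.0, 4.0), rate_folded(k_folded, 4.0, 6.0))
```

The folded form silently keeps the old volume inside the rate constant, which is why a later change of compartment size gives wrong results.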

Check for reaction elements where the attribute reversible is missing. The default value is true. Some people think it is the other way around, and omit the attribute to mean false. The SBML is valid, but no longer represents the intention of the authors.
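A quick way to spot such reactions is to list those that omit the attribute, so that each can be compared with the paper. The following is a minimal sketch using only the Python standard library; the SBML fragment is a stripped-down, hypothetical example (real reactions also carry reactants, products and a kineticLaw), and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down SBML fragment used only for illustration.
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="demo">
    <listOfReactions>
      <reaction id="r1" reversible="false"/>
      <reaction id="r2"/>
      <reaction id="r3" reversible="true"/>
    </listOfReactions>
  </model>
</sbml>"""


def local_name(element):
    """Return the tag name without its XML namespace."""
    return element.tag.rsplit("}", 1)[-1]


def reactions_with_implicit_reversible(sbml_text):
    """List ids of reactions that omit 'reversible' and so default to true."""
    root = ET.fromstring(sbml_text)
    return [
        element.get("id")
        for element in root.iter()
        if local_name(element) == "reaction" and element.get("reversible") is None
    ]


print(reactions_with_implicit_reversible(SBML))
```

Each reaction reported this way still has to be checked by hand against the publication; the script only shows where the default is in effect.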

Check the attributes constant and boundaryCondition of the species elements. Some people are confused about their meaning and take one for the other, and some make mistakes about their default values.
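To make the effective values visible, one can print both attributes for every species with the defaults filled in. The sketch below assumes SBML Level 2, where both constant and boundaryCondition default to false when omitted; the fragment and the species ids are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical SBML Level 2 fragment used only for illustration.
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="demo">
    <listOfSpecies>
      <species id="A" compartment="cell"/>
      <species id="B" compartment="cell" boundaryCondition="true"/>
      <species id="C" compartment="cell" constant="true" boundaryCondition="true"/>
    </listOfSpecies>
  </model>
</sbml>"""


def effective_species_flags(sbml_text):
    """Map species id -> (constant, boundaryCondition) with Level 2 defaults applied."""
    root = ET.fromstring(sbml_text)
    flags = {}
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] == "species":
            flags[element.get("id")] = (
                element.get("constant", "false") == "true",
                element.get("boundaryCondition", "false") == "true",
            )
    return flags


for species_id, (constant, boundary) in effective_species_flags(SBML).items():
    print(species_id, "constant =", constant, "boundaryCondition =", boundary)
```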

Check the units! If the units are not specified, it could well be that the built-in defaults differ from the units of the publication. In such a case, one should redefine the built-in units with unitDefinition elements.

    <unitDefinition id="substance" metaid="metaid_0000029" name="millimole (default)">
        <listOfUnits><unit kind="mole" scale="-3"/></listOfUnits>
    </unitDefinition>
    <unitDefinition id="time" metaid="metaid_0000030" name="minute (default)">
        <listOfUnits><unit kind="second" multiplier="60"/></listOfUnits>
    </unitDefinition>

It is not enough to properly set up the units of parameters and species. The kineticLaw provides a value in substance/time. Therefore substance and time should also be properly set up to reproduce the results of the paper.

Note that the best way to build a robust model is to use units that are not only consistent, but also homogeneous. Using multiple units in order to get single- or double-digit quantities is BAD PRACTICE. One should use engineering notation instead. Do not use 2 micromoles and 5 nanomoles; use 2 micromoles and 5e-3 micromoles, or 2e3 nanomoles and 5 nanomoles.
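The conversion to a single unit is simple arithmetic on decimal exponents. The following sketch reproduces the example above; the SCALE table and the convert function are illustrative names, not part of any SBML tool.

```python
import math

# Decimal exponent of each unit relative to the mole.
SCALE = {"mole": 0, "millimole": -3, "micromole": -6, "nanomole": -9}


def convert(value, from_unit, to_unit):
    """Express a quantity given in from_unit as a quantity in to_unit."""
    return value * 10.0 ** (SCALE[from_unit] - SCALE[to_unit])


# 5 nanomoles expressed in micromoles, and 2 micromoles expressed in nanomoles.
print(f"{convert(5, 'nanomole', 'micromole'):.0e} micromoles")
print(f"{convert(2, 'micromole', 'nanomole'):.0e} nanomoles")
```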

Check that the kineticLaw contains a rate-law providing amounts rather than concentrations. The key is to understand if the symbol of the species (their id) represents a numerical value in substance/size or in substance. It depends on the attributes of the compartment and the species:

                                        spatialDimensions of the compartment
hasOnlySubstanceUnits of the species    non-0 (default)     0
false (default)                         substance/size      substance
true                                    substance           substance

If the symbols of all species are expressed in substance/size, it is often sufficient to multiply the "standard" rate law by the size of the compartment (multi-compartment reactions are trickier). If the symbols of the species are expressed in substance, one can only leave the rate law as it is if it is linear. For instance, if we consider a Michaelis-Menten type of reaction:

(Vmax * Species_substance) / (Km + Species_substance)

is not equivalent to:

[(Vmax * Species_concentration) / (Km + Species_concentration)] * volume
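The difference between the two Michaelis-Menten expressions above can be checked numerically. The sketch below uses arbitrary illustrative values for Vmax, Km, the amount and the volume; it also shows that a linear (first-order) rate law gives the same result in both forms, which is why only linear laws can be left untouched.

```python
import math


def mm_on_amount(vmax, km, amount):
    """Michaelis-Menten law applied directly to an amount of substance."""
    return vmax * amount / (km + amount)


def mm_on_concentration_times_volume(vmax, km, amount, volume):
    """Michaelis-Menten law applied to a concentration, then scaled by volume."""
    concentration = amount / volume
    return vmax * concentration / (km + concentration) * volume


def linear_on_amount(k, amount):
    """First-order law applied directly to an amount of substance."""
    return k * amount


def linear_on_concentration_times_volume(k, amount, volume):
    """First-order law applied to a concentration, then scaled by volume."""
    return k * (amount / volume) * volume


# Arbitrary illustrative values.
vmax, km, k, amount, volume = 10.0, 2.0, 3.0, 4.0, 2.0

print(mm_on_amount(vmax, km, amount),
      mm_on_concentration_times_volume(vmax, km, amount, volume))
print(linear_on_amount(k, amount),
      linear_on_concentration_times_volume(k, amount, volume))
```

The two Michaelis-Menten forms only coincide when the compartment size is exactly 1.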

Whenever possible, one should use local parameters rather than global parameters. The use of global parameters contradicts the good practice of encapsulation and clutters the global namespace.

Check the stoichiometry of all species. Some models use "sink" species to simulate the creation and destruction of entities. Since ODEs do not need to be reconstructed for those species, some modelers set their stoichiometry to 0. This is not allowed by SBML.
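Zero stoichiometries are easy to find mechanically. The following is a minimal sketch using the Python standard library; the SBML fragment, the reaction id and the species ids are hypothetical, and an omitted stoichiometry attribute is assumed to take the SBML Level 2 default of 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical SBML fragment with a "sink" product whose stoichiometry is 0.
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="demo">
    <listOfReactions>
      <reaction id="degradation">
        <listOfReactants>
          <speciesReference species="A"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="sink" stoichiometry="0"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""


def local_name(element):
    """Return the tag name without its XML namespace."""
    return element.tag.rsplit("}", 1)[-1]


def zero_stoichiometry_references(sbml_text):
    """Return (reaction id, species id) pairs whose stoichiometry is 0."""
    root = ET.fromstring(sbml_text)
    hits = []
    for reaction in root.iter():
        if local_name(reaction) != "reaction":
            continue
        for reference in reaction.iter():
            if (local_name(reference) == "speciesReference"
                    and float(reference.get("stoichiometry", "1")) == 0.0):
                hits.append((reaction.get("id"), reference.get("species")))
    return hits


print(zero_stoichiometry_references(SBML))
```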

Modification of SBML Elements

While the content of most SBML elements is not meant to provide information outside the context of a simulation, some of it should nevertheless be human-readable. The most obvious example is the content of the attribute name for the elements model, species, reaction and parameter, but a sensible name for unitDefinition is also a plus. A meaningful name greatly facilitates the work of the annotator.

A quick way to edit SBML elements is to use SBMLeditor.

The name of a model should refer to the publication, and provide some hints about the topic of the model. "Elowitz2000_Repressilator" describes the oscillator built on three repressors, presented in Elowitz and Leibler (2000) [BIOMD0000000012]. "Goldbeter1991_MinMtiOscill" describes the minimal mitotic oscillator published by Albert Goldbeter in 1991 [BIOMD0000000003 and BIOMD0000000004].

When several models are presented in a paper, one should add another hint to permit disambiguation. For instance, "Levchenko2000_MAPK_Scaffold" [BIOMD0000000014] and "Levchenko2000_MAPK_noScaffold" [BIOMD0000000011] are two models proposed in Levchenko et al (2000). Similarly, "Edelstein1996_EPSP_AChSpecies" [BIOMD0000000002] relates to the simulation of an excitatory post-synaptic potential published by Stuart Edelstein in 1996, where acetylcholine is treated as a species, while "Edelstein1996_EPSP_AChEvent" [BIOMD0000000001] relates to the same model, but where acetylcholine is represented by an event.

The name of a parameter should be kept short and reflect the paper. Renaming Km1 to "Michaelis constant 1" is more confusing than helpful.

The name of a species should make it possible to identify its role within the model unambiguously. It should be long enough for this purpose, but one should remember that it will be included in human-readable formulae. Therefore its length should be reasonable, typically one word. Those words can nevertheless be complicated. In Curto et al (1998) [BIOMD0000000015], a single pool represents the concentration of ATP, ADP, AMP and adenosine, resulting in one species. The id in the original model was set to ATP. This is very confusing and misleading, because without a deep knowledge of the paper, a reader of the model would infer that the pool represents only ATP. Therefore the name was set by the curator to ATP_ADP_AMP_Ado.

All else being equal, one should use the names used in the article. However, they are often confusing (after all, the authors did not primarily design their model for databasing). In some models of Markevich et al (2004) [BIOMD0000000028, BIOMD0000000029, BIOMD0000000030, BIOMD0000000031], the two phosphorylation sites of MAPK are differentiated. Therefore, two variants of each complex MAPK_MAPKK or MAPK_MKP exist, according to the position of the enzyme, on the threonine site or the tyrosine site. Species have been named accordingly, e.g. MAPK_MAPKK_Y and MAPK_MAPKK_T, rather than M_MAPKK and M_MAPKK* as in the article.

The name of a reaction should concisely describe what it does. The reaction "activation of cdc2 kinase" describes the activation of cdc2 by cyclin, whether the cyclin is mentioned as a modifier or not (modification via a rule).

The attributes id and name are optional for the element event. However, for the sake of BioModels, it is sensible to add them when they are missing.
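Events lacking either attribute can be listed automatically before they are completed by hand. This is a minimal sketch with the Python standard library; the fragment is hypothetical and stripped down (real events also carry a trigger and event assignments), and events are reported by their position in the file because an id may be absent.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down SBML fragment used only for illustration.
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="demo">
    <listOfEvents>
      <event id="e1" name="stimulus on"/>
      <event id="e2"/>
      <event/>
    </listOfEvents>
  </model>
</sbml>"""


def events_missing_labels(sbml_text):
    """Return (position, missing attribute names) for each incomplete event."""
    root = ET.fromstring(sbml_text)
    events = [el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "event"]
    report = []
    for position, event in enumerate(events, start=1):
        missing = [attr for attr in ("id", "name") if event.get(attr) is None]
        if missing:
            report.append((position, missing))
    return report


print(events_missing_labels(SBML))
```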

The model components will be annotated. Therefore, one should not add unnecessary information to the name, such as a detailed function. If one really wants to explain something about a compartment, a parameter, a species or a reaction, the respective notes element should be used.

Although unit names are not primarily meant to be read by humans, it is sensible to homogenise them. If one uses "nanomoleperlitre" for one unit, one should use "nanomolepermin" rather than "nmpermin" for another. Remember that name is encoded in Unicode, therefore "nanomole per litre" can be written "nmol·l⁻¹". [NB: M is the abbreviation of "mole per litre", not the abbreviation of "mole".]

Article Correspondence

Each model should be run in simulators to check that it actually reflects the paper. Possible software tools (with good SBML support or with an input format provided by our converters) are CellDesigner, COPASI, Jarnac, MathSBML (requires Mathematica), RoadRunner, SBMLdeSolver, SBMLeditor, SemanticSBML and XPP-Aut.

At least one figure should be reproduced well. Pay attention to both axes of the curves. It is not sufficient that plots "look like" the figures of the paper. They should reproduce the results, including absolute values. To be validated, the results should be reproduced with a tool different from the one used to produce the original published figure.
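The requirement to match absolute values, and not only the shape of a curve, can be expressed as a point-by-point comparison. The sketch below is only an illustration: the function name, the 5% relative tolerance and all data points are assumptions, and values read off a published figure would in practice carry their own reading error.

```python
import math


def reproduces(simulated, published, rel_tol=0.05):
    """Return True if every simulated point is within rel_tol of the published one."""
    if len(simulated) != len(published):
        return False
    return all(
        math.isclose(sim, pub, rel_tol=rel_tol)
        for sim, pub in zip(simulated, published)
    )


# Hypothetical values read from a published figure at four time points.
published = [1.0, 2.5, 4.0, 3.2]

# A simulation that matches within tolerance.
close_match = [1.02, 2.45, 4.1, 3.15]

# A simulation with the same shape but the wrong scale (e.g. a unit mismatch).
same_shape_wrong_scale = [value * 1000 for value in published]

print(reproduces(close_match, published))
print(reproduces(same_shape_wrong_scale, published))
```

A purely relative tolerance is unsuitable for published values equal to zero; an absolute tolerance would have to be added for such points.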

Some information about the curation should be written down in the notes within the model element. One should at least state the simulator used to successfully run the model.

To facilitate consistent annotation later on, we advise adding as much annotation as possible during curation. One way to annotate SBML models automatically is to use SBMLannotate.