Model Hierarchy in mmCIF
Dale Tronrud (DALE@uoxray.uoregon.edu)
Thu, 2 Nov 1995 15:34:47 -0800 (PST)
There have been a number of letters written requesting an
expansion of the structure description hierarchy in mmCIF. This
letter is a long description of the issues surrounding the
implementation of these additional levels.
The current form of the mmCIF description of the contents
of the asymmetric unit of a macromolecular crystal can be
written in the form "chain/residue/atom.variant". mmCIF uses
different terminology (asym, seq_id, atom_id, and ???) but
since I find the mmCIF phrasing very confusing I will use a more
URL-like syntax for this discussion (I am not proposing that
mmCIF change).
Putting together suggestions from Frances and Herbert
Bernstein, along with comments from others, a fuller description
of the contents of a mmCIF file would be
model//chain.variant/residue.variant/atom.variant
I think this form would fill all the needs expressed in
the newsgroup so far.
The "model//" level would allow several independent models
to be placed in a single mmCIF file. In its simplest use it
could label all the individual NMR models proposed for a molecule.
The variants of the chain allow for models which have been refined
with multiple copies of each chain in an attempt to model disorder.
The difference between chain variants and models is that all the
chain variants are presumed to coexist in time or space, whereas
different models are simply different ways of interpreting the
observations.
Residue variants are used to model sequence heterogeneity.
Atom variants are used to model discretely disordered atoms.
The latter are reasonably described by the current version of
mmCIF.
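
To make the URL-like notation concrete, here is a minimal Python
sketch that splits such an address into its levels. It is only an
illustration of the syntax used in this letter, not part of mmCIF;
the names Address, split_variant and parse_address are hypothetical,
and a missing variant or model is simply reported as None.

```python
from typing import NamedTuple, Optional


class Address(NamedTuple):
    """One atom address in the letter's notation (hypothetical type)."""
    model: Optional[str]
    chain: str
    chain_variant: Optional[str]
    residue: str
    residue_variant: Optional[str]
    atom: str
    atom_variant: Optional[str]


def split_variant(field: str):
    """Split 'name.variant' into (name, variant); variant is None if absent."""
    name, sep, variant = field.partition(".")
    return name, (variant if sep else None)


def parse_address(text: str) -> Address:
    """Parse 'model//chain.variant/residue.variant/atom.variant'.

    The 'model//' prefix and every '.variant' suffix are optional.
    """
    model, sep, rest = text.partition("//")
    if not sep:
        # No model level given: the whole string is chain/residue/atom.
        model, rest = None, text
    parts = rest.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected chain/residue/atom, got {rest!r}")
    chain, chain_variant = split_variant(parts[0])
    residue, residue_variant = split_variant(parts[1])
    atom, atom_variant = split_variant(parts[2])
    return Address(model, chain, chain_variant,
                   residue, residue_variant, atom, atom_variant)


if __name__ == "__main__":
    print(parse_address("3//A1.2/123.1/CG.1"))
    print(parse_address("A1/123/CG"))
```

Residue numbers are kept as strings here so that nothing is assumed
about how a real file would label them.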
Model Issues
I can see the case for placing a whole series of NMR models
in one mmCIF file. It would be quite convenient for the mmCIF
reader to understand the organization of the models and their
relationship to one another. It is also very useful as a
program data structure to have several models described at
one time. In fact I intend to implement this level in TNT's
data structures when time presents itself.
However the number of problems posed by this level of
description for mmCIF rapidly multiplies. Even for NMR models
the problems begin. The first problem is that the statistics
for the agreement of the model and the observations must be
"loop_"ed. I don't know how NMR people measure model quality
but the equivalent for X-ray people would be to loop the R
value and all stereochemistry agreement statistics. Since model
quality will certainly be required data for any deposition
this elaboration will be unavoidable.
It rapidly becomes worse. Take the case of someone comparing
two refinement protocols. It would be most convenient to
store both models in a single file. However now the refinement
parameters, and refinement history would have to be looped over
the models. In addition the target stereochemistry libraries
must be looped because one refinement might be with EREF and
another with PROLSQ.
Now consider a study where an X-ray data set and a neutron
data set are available. It is reasonable to presume a model
could be refined against the X-ray data alone, another against
the neutron data, and a third refined against both. Very
interesting things could be learned by the comparison of the
three models. Now the data set table must be looped. (I
know the current mmCIF does not include back links to the
structure factor files but it should and I am still lobbying.)
As you can see one can construct very reasonable pairings
of models which would result in the duplication of practically
any table in mmCIF. The extreme would be to try to place the
model for T4 lysozyme and Thermolysin in one mmCIF file because
they were both solved in the same lab. In addition the pairings
would depend on the interest of authors of the mmCIF file. Others
might want to pair the same models in different ways.
If the mmCIF committee decides to implement a "model" level
in the mmCIF definition I suggest they place strict limits on
its utility. The models should be determined by applying the same
analysis procedure to the same data. The structure of the
models must be identical -- There cannot be one sequence in
one model and a different one in another. All fields describing
the agreement of a model with the data must be loop-able.
I think these are the minimal changes required to allow NMR
models and parallel SA runs to be stored in a single mmCIF
file. Trying to go beyond this would be incredibly difficult.
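
The limits suggested above could be checked mechanically. The sketch
below is one possible reading of them, under assumptions of my own:
each model is a plain dictionary with a "chains" entry (chain name
mapped to its list of residue types) and a "stats" entry (agreement
statistics by name). Neither layout nor the name check_models comes
from mmCIF. The requirement of a common data set and procedure is
not something this sketch can test.

```python
def check_models(models):
    """Return a list of problems for models meant to share one file.

    models maps a model label to a dict with two keys (assumed layout):
      "chains": chain name -> list of residue types, in order
      "stats":  statistic name -> value (e.g. an R value)
    The first model is used as the reference for the others.
    """
    problems = []
    labels = list(models)
    if not labels:
        return problems
    first = labels[0]
    reference = models[first]
    for label in labels[1:]:
        model = models[label]
        # The structure must be identical: same chains, same sequences.
        if model["chains"] != reference["chains"]:
            problems.append(f"{label}: sequence differs from {first}")
        # Agreement statistics are looped, so every model must carry
        # the same set of statistics.
        if set(model["stats"]) != set(reference["stats"]):
            problems.append(f"{label}: statistics differ from {first}")
    return problems


if __name__ == "__main__":
    models = {
        "1": {"chains": {"A": ["SER", "GLU"]}, "stats": {"R": 0.18}},
        "2": {"chains": {"A": ["SER", "GLU"]}, "stats": {"R": 0.19}},
        "3": {"chains": {"A": ["ARG", "GLU"]}, "stats": {"R": 0.20}},
    }
    for problem in check_models(models):
        print(problem)
```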
In addition I suggest that you allow the possibility of
defining, in the group defining research articles, a table
of other mmCIF data blocks (in other files) used by that
paper. With this tool the authors could cross connect the
different mmCIF files referenced in a paper to allow for the
retrieval of the related models without appending them all
in a single file. This scheme would allow different papers
to refer to different sets of mmCIF data blocks.
Chain Issues
The chain variants are introduced to allow for the description
of models where several copies of each chain are included in
the refinement in an attempt to model discrete disorder and
anisotropic B-factors. For example, you could have a hemoglobin
molecule in the asymmetric unit, resulting in two Alpha chains
(A1 and A2) and two Beta chains (B1 and B2). If a model was
refined with 8 copies of each chain the model would contain
A1.1, A1.2, ... B1.1, ... and B2.8.
This addition is fairly straightforward but there are a few
problems to watch. First, it should be a requirement that all
the chain variants be of exactly the same type. A1.1 should
be the same "entity" as A1.3. In fact the loop defining the
entities for each chain (asym) should not mention the variants
at all.
Second, it should be possible to ignore this level without
having to mark something in the mmCIF file. Most models will
not use this level of description -- Their mmCIF files should
not be cluttered up by being forced to state that all atoms
in chain A1 are in variant 1 if there is no other. I think
even including the "." in one column would be too confusing.
I do not know the details of mmCIF well enough to know how to
construct the default variant.
Third, if a mmCIF file uses the default variant then the
use of a nondefault variant for that same chain should be
illegal. For example, if there is a chain A1 (no variant
indicator) there cannot also be an A1.1.
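
The third rule is easy to test in a program. A minimal sketch,
assuming labels are written in the "name.variant" notation of this
letter (the function name find_mixed_variants is hypothetical):

```python
from collections import defaultdict


def find_mixed_variants(labels):
    """Return names used both with and without a variant suffix.

    labels is an iterable of strings such as "A1" or "A1.3".
    A name is reported when the bare (default) form appears
    alongside at least one explicit variant of the same name.
    """
    seen = defaultdict(set)
    for label in labels:
        name, sep, variant = label.partition(".")
        # None stands for the default (unmarked) variant.
        seen[name].add(variant if sep else None)
    return sorted(name for name, variants in seen.items()
                  if None in variants and len(variants) > 1)


if __name__ == "__main__":
    # A1 is illegal here: it appears both bare and as A1.1.
    print(find_mixed_variants(["A1", "A1.1", "B1.1", "B1.2", "B2"]))
```

The same test would apply unchanged to residue labels within one
chain, since the rule proposed for them is identical.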
Residue Issues
The residue variant field is suggested to handle the problem
of heterogeneity of peptide sequence. If residue 123 is
sometimes a SER and other times an ARG one would define a 123.1
which is identified as SER and a 123.2 as ARG.
It would be more chemically proper to define two chains
A1 and A2 which are different entities but are constrained to
have identical parameters everywhere except for occupancies and
residue 123. While this is a more complete description of the
model the complexity of the constraint makes it impractical.
If my previous suggestion for the explicit definition of
connectivity of residues in entities is adopted the sequence
of these things (along with a great number of other things)
becomes quite easy. In the table of residue types there would
simply be an entry for 123.1 and 123.2. In the connectivity
table links would be defined between 122 and 123.1, 122 and 123.2,
123.1 and 124, and 123.2 and 124.
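
As a sketch of how such a connectivity table could be generated,
assume the chain is given as a list of positions, each holding the
residue labels present at that position. This layout and the names
variant_of and residue_links are illustrative only. Where two
neighboring positions both carry variants, only matching variant
numbers are linked, which is one reading of the rule that 123.1
goes with 124.1.

```python
def variant_of(label):
    """Return the variant part of 'residue.variant', or None if absent."""
    name, sep, variant = label.partition(".")
    return variant if sep else None


def residue_links(positions):
    """Build (from, to) links between consecutive sequence positions.

    positions is a list of lists of residue labels, for example
    [["122"], ["123.1", "123.2"], ["124"]].
    """
    links = []
    for here, following in zip(positions, positions[1:]):
        for a in here:
            for b in following:
                va, vb = variant_of(a), variant_of(b)
                # Two explicit variants are linked only if they match.
                if va is not None and vb is not None and va != vb:
                    continue
                links.append((a, b))
    return links


if __name__ == "__main__":
    for link in residue_links([["122"], ["123.1", "123.2"], ["124"]]):
        print(link)
```

For the example in the text this yields the four links listed
above: 122 to 123.1, 122 to 123.2, 123.1 to 124 and 123.2 to 124.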
The first caveat (not in the PDB sense I hope) is that once
again we need a default for the residue variant so that this
field does not lead to confusion for the vast majority of models
which have no need of this construction. If one version of a
residue is declared to be a variant then the default cannot be
used for that residue in another place. You should not allow both
123 and 123.1 in the same chain. They must be 123.1 and 123.2.
In addition, it should be recognized that residue variant
123.1 goes with variant 124.1. This definition is parallel
to that of the atom variant case already implemented.
My final point is a matter of usage. My example above
was that 123.1 is SER and 123.2 is ARG. Suppose that the
ARG side chain makes a salt bridge to 224 (GLU) and that 224
assumes one conformation for 123.1 (SER) and another for 123.2
(ARG). The temptation is to model 224 (GLU) with discrete
disorder using the atom variants. This would not be a proper
description of the model. The proper description is to define
a 224.1 (GLU) and a 224.2 (GLU). Provision must be made in
the definition of heterogeneous sequences for such cases where
both "variants" have the same residue type.
Atom Issues
Discrete disorder is modeled quite well in the current mmCIF.
However I would prefer that there was a way to ignore the atom
variant column in mmCIF files which contain no discretely disordered
components. If possible, there should be a check that one does
not define a CG and a CG.1. If there is a disordered atom then
all copies of the atom must be labeled disordered.
Dale Tronrud