Model Data Blocks and Diffraction Data Blocks
Dale Tronrud (DALE@nickel.uoregon.edu)
Mon, 16 Oct 1995 23:40:50 -0700 (PDT)
I want to discuss some ideas I have about the separation
of the mmCIF data into a diffraction data block and a model
data block. I know the committee will not want to hear such
basic matters being questioned at this time. All I want
to accomplish is to make the community aware of the problems.
As I understand the situation, the mmCIF committee and the
PDB have decided to allow the diffraction data to be stored
in one file and models to be stored in another. I think it
is very reasonable to divide the data in this fashion because
of the different natures of the two kinds of data. However,
whenever you split data you have to examine each data group
and decide which direction it must go. I disagree with the
details of this split as it appears to be implemented.
First, we must recognize that the mapping between these two
data blocks is many to many. For a given diffraction data set
there will be many models. For a given model there may be
several data sets upon which it is based (e.g. an X-ray data
block and a neutron data block, or X-ray and NMR data). As I
understand the current division for these files, the diffraction
file contains the ID of the model data block but the model data
block does not contain a pointer to the diffraction data block.
It is difficult to place a table of models in the diffraction
data block because this file could be constructed prior to the
solution of the structure (and the structure might never be solved
leaving an empty list). In my work I generate many models which
need to be passed between program packages and would in a perfect
world be in mmCIF. The table in the diffraction mmCIF file would
have to be updated almost daily. I suggest that the list of
models which depend on a diffraction data block be optional. In
an archive where both model and diffraction data blocks are stored
you could put in a table complete within that restricted universe
of models. In the lab you would not maintain this table.
However, it would be easy to have the diffraction data blocks
listed in the model data block because the programs need to know
that information anyway. This field should be mandatory for
any model based on diffraction data. However, because there may
be multiple diffraction data blocks, the definition of this
dependency must be in a loop construction.
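
As a rough illustration of the direction of the links proposed
above, here is a small Python sketch. Each model data block carries
a loop of the diffraction data blocks it depends on, and the
optional reverse table (diffraction block to models) is derived
only where a closed set of models exists, such as an archive. All
block IDs and the item name _model_diffraction_link.diffraction_block_id
are hypothetical and are not taken from the mmCIF dictionary.

```python
from collections import defaultdict

# Hypothetical model data blocks. Each one lists, as a loop, the
# diffraction data blocks it was refined against.
model_blocks = {
    "model_A": ["xray_1"],
    "model_B": ["xray_1", "neutron_1"],
    "model_C": [],  # e.g. a theoretical model with no diffraction data
}


def models_by_diffraction_block(models):
    """Derive the optional reverse table: diffraction block -> models."""
    reverse = defaultdict(list)
    for model_id, diffraction_ids in models.items():
        for diffraction_id in diffraction_ids:
            reverse[diffraction_id].append(model_id)
    return {key: sorted(value) for key, value in reverse.items()}


def format_loop(model_id, diffraction_ids):
    """Write the dependency as a CIF-style loop (item name is made up)."""
    lines = [
        f"data_{model_id}",
        "loop_",
        "_model_diffraction_link.diffraction_block_id",
    ]
    lines.extend(diffraction_ids)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_loop("model_B", model_blocks["model_B"]))
    print(models_by_diffraction_block(model_blocks))
```

The reverse table is computed rather than maintained by hand, which
is why it can stay optional in the lab and still be complete in an
archive.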
Because many models will be based upon any particular diffraction
file the calculated F's cannot be stored with the observed F's.
While I recognize some (but not much) utility to storing the
calculated F's, if you are going to have them they must be in
the model's data block; they are a property of the model alone.
The many-to-many relationship between model and diffraction data
cannot be represented when the Fc's are in the diffraction data
file.
The second problem is which data groups go into each data
block?
Currently the data collection, data reduction, and agreement
statistics are stored in the model file. These data belong in
the diffraction file. They do not change when the model changes
and it would be redundant to write the same values over and over
again. You also would have to place it all in loops to cover each
diffraction data set. With this information in the diffraction
data block life becomes much simpler. You do not have to place
your statistics inside of loops, nor do you have the confusion
of listing diffraction intensities from multiple crystals in a
single loop.
For structures currently in the PDB without deposited structure
factors one could construct small diffraction data blocks to
contain the statistics but without structure factors. These
would be like the current PDB files which contain no coordinates.
However they would contain the proper cross links (hyperlinks?)
and data dependencies. Their presence would make clear the huge
gaps in deposition of these data and might encourage some people
to deposit older diffraction data sets.
The final point I would like to make is the most outlandish,
but does come as a natural progression of these thoughts. The
cell constants are not a property of the model and should not
be stored in the model's data block. If you have a model which
was refined against two diffraction patterns, in all likelihood
you will have two different sets of cell constants. In the
models already deposited the sets will be very similar but there
are cases where people have refined models with constrained NCS
between nonisomorphic crystal forms. In such a case each
diffraction pattern would not only have unrelated cell constants
but different space groups. The cell constants and space group
belong in the diffraction data block.
Placing the cell constants in the diffraction data block
immediately solves another problem. What are the cell constants
of an NMR structure? The concept does not apply. If you
place the cell constants in the diffraction data block you can
make their presence mandatory and not affect NMR or theoretical
models' validation. Currently the provision of the cell constants
cannot be mandatory.
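
To make the validation argument concrete, here is a minimal Python
sketch. It assumes a hypothetical in-memory representation in which
each data block is a dictionary with a "kind" and an optional "cell"
entry; the key names are invented for illustration. The cell
constants are required only for diffraction data blocks, so NMR and
theoretical model blocks pass without them.

```python
# Hypothetical names for the six cell constants.
CELL_ITEMS = (
    "length_a", "length_b", "length_c",
    "angle_alpha", "angle_beta", "angle_gamma",
)


def missing_cell_items(block):
    """Return the mandatory cell items that a block fails to provide.

    Only diffraction data blocks are required to carry cell constants;
    any other kind of block is never reported as incomplete.
    """
    if block.get("kind") != "diffraction":
        return []
    cell = block.get("cell", {})
    return [name for name in CELL_ITEMS if name not in cell]


if __name__ == "__main__":
    nmr_model = {"kind": "model", "method": "NMR"}
    partial = {"kind": "diffraction", "cell": {"length_a": 50.0}}
    print(missing_cell_items(nmr_model))
    print(missing_cell_items(partial))
```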
Related to the cell constants is the deorthogonalization
matrix. In fact the deorthogonalization matrix is a composite
of two things, the cell constants and the convention. The cell
constants are a function of the diffraction data block (which
indicates that there cannot be a single deorthogonalization matrix
because there may be more than one crystal type). This implies that
the deorthogonalization matrix should be in the diffraction data
block. However, it is possible that differing conventions might be
used in different models, implying that the orthogonalization
convention should be in the model data block. Since mmCIF seems to
want the matrix and not its convention, you must have a loop
construction in the model data block which identifies each
diffraction data block and the deorthogonalization convention used
to move the model into that crystal's coordinate system.
It would be cleaner to simply list the convention and not the
matrix but I don't know of a good way to do this in general.
Currently mmCIF has the cell constants, the convention, and the
matrix listed (or listable). This information is redundant and
should be consistent with itself. Without a standard form to
describe the convention I would not like to be assigned the job
of writing the validation software.
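
For one fixed convention the consistency check is straightforward,
as the following Python sketch suggests. It assumes the common
convention with a along x and b in the xy plane, which is only one
of the conventions a model might use; the function names are
invented for illustration. The deorthogonalization matrix (orthogonal
to fractional coordinates) is rebuilt from the cell constants and
compared with the stored matrix. The hard part described above, a
standard way to name the convention, is not addressed here.

```python
import math


def deorthogonalization_matrix(a, b, c, alpha, beta, gamma):
    """Orthogonal -> fractional matrix for a along x, b in the xy plane.

    Lengths are in Angstroms and angles in degrees. The convention is
    an assumption of this sketch, not something read from a file.
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    # Unit cell volume from the six cell constants.
    volume = a * b * c * math.sqrt(
        1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg
    )
    return [
        [1.0 / a, -cg / (a * sg), b * c * (ca * cg - cb) / (volume * sg)],
        [0.0, 1.0 / (b * sg), a * c * (cb * cg - ca) / (volume * sg)],
        [0.0, 0.0, a * b * sg / volume],
    ]


def is_consistent(stored, cell, tol=1e-6):
    """Check a stored matrix against the one implied by the cell constants."""
    expected = deorthogonalization_matrix(*cell)
    return all(
        abs(stored[i][j] - expected[i][j]) <= tol
        for i in range(3)
        for j in range(3)
    )


if __name__ == "__main__":
    cell = (10.0, 20.0, 30.0, 90.0, 90.0, 90.0)
    stored = [[0.1, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, 1.0 / 30.0]]
    print(is_consistent(stored, cell))
```

A model refined against several crystals would need one such check
per diffraction data block, which is the loop construction described
above.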
If there is interest in this approach I could put more time
into filling in the details.
Dale Tronrud