mmCIF Issues Pending

Paula Fitzgerald (paula_fitzgerald@Merck.Com)
Thu, 31 Aug 95 14:58:37 EDT

Hello again -

It seemed useful to me to make a distinction between that which I have done
and that which remains to be dealt with, which is why I am keeping them in
separate messages.  The discussions in this message tend to be longer, so
what I will do is introduce each new section with my usual separator of

- - - - -

Frances Bernstein writes:

     Under save__atom_site.calc_flag you have

;              A standard code to signal if the site data has been determined
               by diffraction data or calculated from the geometry of
               surrounding sites, or has been assigned dummy coordinates. The
               abbreviation 'c' may be used in place of 'calc'.

    _item_enumeration.detail      d
                                 'determined from diffraction measurements'
                                 'calculated from molecular geometry'
                                 'abbreviation for "calc"'
                                 'dummy site with meaningless coordinates'

My personal suggestion would be not to allow any abbreviations here.  Seeing
'c' I probably would think 'calculated' but seeing 'd' one could easily think
'dummy' instead of 'determined' or 'diffraction'.  I would suggest 'det' (or
'diff' or 'data' or 'meas'), 'calc', and 'dum' as the codes.  My personal
preference would be 'meas'.  If you really want one letter codes then why not
use 'm' or 'meas', 'c' or 'calc', and 'd' or 'dum'?

- -

I think there is some history here, and we have to keep c and calc in order
to be able to read files written under the definitions in the original CIF
core dictionary.  But this will take some looking into.
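
As a minimal sketch of the compatibility concern, a reader could accept the
legacy abbreviation 'c' and map it to 'calc' on input.  The names below are
hypothetical, and the code 'dum' for dummy sites is an assumption (the quoted
enumeration shows only 'd' and the descriptions).

```python
# Hypothetical sketch; these names are not from any existing mmCIF software.
# ASSUMPTION: the accepted codes are 'd', 'calc' and 'dum', with 'c' kept
# only as a legacy abbreviation for 'calc'.
LEGACY_ALIASES = {"c": "calc"}
CANONICAL_FLAGS = {"d", "calc", "dum"}


def normalize_calc_flag(value):
    """Return the canonical calc_flag code, accepting the legacy 'c'."""
    code = value.strip()
    code = LEGACY_ALIASES.get(code, code)
    if code not in CANONICAL_FLAGS:
        raise ValueError(f"unrecognized calc_flag code: {value!r}")
    return code
```

With this approach old files that use 'c' still read cleanly, while a new
code such as 'meas' would be rejected until the dictionary defines it.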

- - - - -

Frances Bernstein writes:

     I am trying to understand how mmCIF handles microheterogeneity because
we have already had that in at least one entry.  After looking at the
dictionary I have a few questions:

1. The item _entity_poly_seq.hetero is described as

;              A flag to indicate whether or not this monomer in the polymer is
               heterogeneous in sequence.  This would be a rare phenomenon.

and it is not mandatory.  Shouldn't it be mandatory if there is
microheterogeneity?  This leads to a more general issue:  I could only find
yes or no as values in the _item.mandatory_code fields throughout the
dictionary.  Should there be a way to show that something is mandatory under
certain conditions.  (Note also that microheterogeneity does not occur
often in PDB entries but I think "rare" might be too extreme.)

2. I am not completely clear on how you propose to handle microheterogeneity.
When I look at

;              Data items in the ENTITY_POLY_SEQ category specify the sequence
               of monomers in a polymer.  Allowance is made for the possibility
               of microheterogeneity in a sample by allowing a given sequence
               number to be correlated with more than one monomer id - the
               corresponding ATOM_SITE entries should reflect this

it seems to say that, in the case of microheterogeneity, one should
repeat _entity_poly_seq.num with the same residue number for each possible
residue, as follows:

   1   1  ALA
   1   2  GLY
   1   3  SER
   1   3  VAL
   1   4  PRO

in the case of there being SER/VAL microheterogeneity at residue 3.
If this is the representation that is intended, then there appears to be a
conflict with

;              The value of _entity_poly_seq.num must uniquely and sequentially
               identify a record in the ENTITY_POLY_SEQ list.

               Note that this item must be a number, and that the sequence
               numbers must progress in increasing numerical order.

which does not allow for a number to be repeated.

If I understood the intended representation of microheterogeneity in the
entity_poly_seq section, then should the atom_site information basically
look like this?

    ATOM N  N   SER  A   3  .  23.664  33.855  16.884  1.00  22.08  .  1   3
    ATOM C  CA  SER  A   3  .  22.623  34.850  17.093  1.00  23.44  .  1   3
    ATOM C  C   SER  A   3  .  22.657  35.113  18.610  1.00  25.77  .  1   3
    ATOM O  O   SER  A   3  .  23.123  34.250  19.406  1.00  26.28  .  1   3
    ATOM C  CB  SER  A   3  .  21.236  34.463  16.492  1.00  22.67  .  1   3
    ATOM N  N   VAL  A   3  .  23.664  33.855  16.884  1.00  22.08  .  1   3
    ATOM C  CA  VAL  A   3  .  22.623  34.850  17.093  1.00  23.44  .  1   3
    ATOM C  C   VAL  A   3  .  22.657  35.113  18.610  1.00  25.77  .  1   3
    ATOM O  O   VAL  A   3  .  23.123  34.250  19.406  1.00  26.28  .  1   3
    ATOM C  CB  VAL  A   3  .  21.236  34.463  16.492  1.00  22.67  .  1   3

I particularly care about the fields:

- -

Frances rightly points out that there is a problem with our current mode of
representing microheterogeneity.  I thought we had done this correctly when
the data items were first created, but later I had one of those horrible
realizations that the pointers were not clean in this regard.  I'm still not
sure how to solve the problem, but we will eventually find a way.
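
The conflict can be illustrated with a short sketch that finds sequence
numbers carrying more than one monomer.  The function name and the column
reading (entity id, sequence number, monomer id) are assumptions made for
illustration; this only exposes the conflict, it does not resolve it.

```python
from collections import defaultdict


def find_repeated_seq_nums(rows):
    """Return {(entity_id, num): [mon_ids]} for numbers used more than once.

    ASSUMPTION: each row is (entity_id, num, mon_id), matching the order of
    the columns in the example sequence list.
    """
    monomers = defaultdict(list)
    for entity_id, num, mon_id in rows:
        monomers[(entity_id, num)].append(mon_id)
    return {key: ids for key, ids in monomers.items() if len(ids) > 1}


# The SER/VAL example: sequence number 3 appears twice.
rows = [
    (1, 1, "ALA"),
    (1, 2, "GLY"),
    (1, 3, "SER"),
    (1, 3, "VAL"),
    (1, 4, "PRO"),
]
repeats = find_repeated_seq_nums(rows)
```

Any non-empty result means that _entity_poly_seq.num alone cannot serve as
a unique key for that list.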

- - - - -

Frances Bernstein writes:

     In file
Helen has a residue +A in the field _atom_site.label_comp_id.  She also
has a section for CHEM_COMP that includes +A and describes it.
The mmCIF dictionary description says that _atom_site.label_comp_id
is a pointer to in the CHEM_COMP category.  When I look
at in the dictionary it says

;              The value of must uniquely identify each item in
               the CHEM_COMP list.

               For protein polymer entities, this is the three-letter code for
               amino acids.

               For nucleic acid polymer entities, this is the one-letter code
               for the bases.

     Thus I am puzzled by the fact that the entry used +A when the dictionary
appears to say that this field should be the one-letter code.  Or should
the dictionary be modified to allow things like +A?

- -

Here we can probably solve the logical problem just by rewording the
definition, but I want a chance to consult with Helen about this before
doing something that still might not be correct.
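
A small sketch shows why this is a problem of wording rather than of
pointers: an id such as +A satisfies the pointer as long as it is listed in
CHEM_COMP, yet it fails the "one-letter code" description.  Both function
names are hypothetical, and the check assumes plain lists of id strings.

```python
def undefined_comp_ids(atom_site_comp_ids, chem_comp_ids):
    """Return sorted ids used in atom_site but not defined in CHEM_COMP."""
    return sorted(set(atom_site_comp_ids) - set(chem_comp_ids))


def not_one_letter(comp_ids):
    """Return sorted ids that are not a single alphabetic character."""
    return sorted(
        {cid for cid in comp_ids if not (len(cid) == 1 and cid.isalpha())}
    )


# ASSUMPTION: illustrative id lists, not taken from the entry in question.
chem_comp = ["A", "C", "G", "U", "+A"]
atom_site = ["A", "+A", "G"]

dangling = undefined_comp_ids(atom_site, chem_comp)  # pointer check
odd_codes = not_one_letter(atom_site)                # wording check
```

Here the pointer check finds nothing wrong, while the wording check flags
+A, which is the mismatch Frances describes.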

- - - - -

Eldon Ulrich writes -

I have a few questions on constructing chemical structures.

1.  How would a mixed polymer of nucleic acids and deoxynucleic acids be
described?  Would one type of monomer be considered standard and the others
given non-standard ids that would then be linked to the standard structures?

2.  Within the ENTITY and CHEM_LINK_BOND sections it does not seem possible to
describe how a non-standard amino acid is linked to adjacent monomers.  For
example, how would one describe an iso-aspartyl group linked through the
side-chain carboxyl to the following amino acid?  I could not find a way to
get from this section back to a specific set of two residues in the sequence
of a polymer.

- -

I think I can answer Eldon's questions by just sitting calmly for a moment
and thinking about them, but I don't have that moment right now, so this too
will have to wait.

- - - - -

I also have pending a series of questions from Dale Tronrud, but since
I haven't even begun to think about them yet, I haven't included them in
this summary.

If you guys have thoughts about the issues outlined above, don't be shy 
about letting us know.  Talk to you all soon.
 Dr. Paula M. D. Fitzgerald  ______________ voice and FAX: (908) 594-5510
   Merck Research Laboratories ______________ email:
     P.O. Box 2000, Ry50-105     ______________ or           
       Rahway, NJ 07065  USA 
         (for express mail use 126 E. Lincoln Ave. instead of P. O. Box 2000)