Model Hierarchy in mmCIF

Dale Tronrud (DALE@uoxray.uoregon.edu)
Thu, 2 Nov 1995 15:34:47 -0800 (PST)
	   There have been a number of letters written requesting an
	expansion of the structure description hierarchy in mmCIF.  This
	letter is a long description of the issues surrounding the
	implementation of these additional levels.

	   The current form of the mmCIF description of the contents
	of the asymmetric unit of a macromolecular crystal can be
	written in the form "chain/residue/atom.variant".  mmCIF uses
	different terminology (asym, seq_id, atom_id, and ???) but
	since I find the mmCIF phrasing very confusing I will use a more
	URL-like syntax for this discussion (I am not proposing that
	mmCIF change).

	   Putting together suggestions from Frances and Herbert
	Bernstein along with comments from others, a fuller description
	of the contents of an mmCIF file would be

		model//chain.variant/residue.variant/atom.variant

	I think this form would fill all the needs expressed in
	the newsgroup so far.
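
	   As an illustration only, the URL-like form above can be parsed
	mechanically.  The following Python sketch assumes that "." never
	appears inside a name and that an omitted variant means "default".
	The AtomPath class and both function names are hypothetical and
	are not part of mmCIF.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AtomPath:
    """One fully qualified atom in the proposed hierarchy (hypothetical type)."""
    model: str
    chain: str
    chain_variant: Optional[str]
    residue: str
    residue_variant: Optional[str]
    atom: str
    atom_variant: Optional[str]


def split_variant(field: str) -> Tuple[str, Optional[str]]:
    """Split 'name.variant' into (name, variant); variant is None if absent."""
    name, sep, variant = field.partition(".")
    if not name or (sep and not variant):
        raise ValueError(f"malformed field: {field!r}")
    return name, (variant if sep else None)


def parse_atom_path(path: str) -> AtomPath:
    """Parse 'model//chain[.v]/residue[.v]/atom[.v]' (discussion syntax only)."""
    model, sep, rest = path.partition("//")
    if not sep or not model:
        raise ValueError(f"missing 'model//' prefix: {path!r}")
    parts = rest.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected chain/residue/atom, got: {rest!r}")
    (chain, cv), (res, rv), (atom, av) = (split_variant(p) for p in parts)
    return AtomPath(model, chain, cv, res, rv, atom, av)
```

	For example, parse_atom_path("1//A1.2/123.1/CG.1") yields model
	"1", chain "A1" variant "2", residue "123" variant "1", and atom
	"CG" variant "1", while "1//A1/123/CG" leaves all three variants
	unset.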

	   The "model//" level would allow several independent models
	to be placed in a single mmCIF file.  In its simplest use it
	could label all the individual NMR models proposed for a molecule.
	The chain variants allow for models which have been refined
	with multiple copies of each chain in an attempt to model disorder.
	The difference between chain variants and models is that all the
	chain variants are presumed to coexist in time or space, whereas
	different models are simply different ways of interpreting the
	observations.

	   Residue variants are used to model sequence heterogeneity.
	Atom variants are used to model discretely disordered atoms.
	The latter are reasonably described by the current version of
	mmCIF.

	Model Issues

	   I can see the case for placing a whole series of NMR models
	in one mmCIF file.  It would be quite convenient for the mmCIF
	reader to understand the organization of the models and their
	relationship to one another.  It is also very useful as a
	program data structure to have several models described at
	one time.  In fact I intend to implement this level in TNT's
	data structures when time presents itself.

	   However the number of problems posed by this level of
	description for mmCIF rapidly multiplies.  Even for NMR models
	the problems begin.  The first problem is that the statistics
	for the agreement of the model and the observations must be
	"loop_"ed.  I don't know how NMR people measure model quality
	but the equivalent for X-ray people would be to loop the R
	value and all stereochemistry agreement statistics.  Since model
	quality will certainly be required data for any deposition
	this elaboration will be unavoidable.

	   It rapidly becomes worse.  Take the case of someone comparing
	two refinement protocols.  It would be most convenient to
	store both models in a single file.  However now the refinement
	parameters, and refinement history would have to be looped over
	the models.  In addition the target stereochemistry libraries
	must be looped because one refinement might be with EREF and
	another with PROLSQ.

	   Now consider a study where an X-ray data set and a neutron
	data set are available.  It is reasonable to presume a model
	could be refined against the X-ray data alone, another against
	the neutron data, and a third refined against both.  Very
	interesting things could be learned by the comparison of the
	three models.  Now the data set table must be looped.  (I
	know the current mmCIF does not include back links to the
	structure factor files but it should and I am still lobbying.)

	   As you can see one can construct very reasonable pairings
	of models which would result in the duplication of practically
	any table in mmCIF.  The extreme would be to try to place the
	model for T4 lysozyme and Thermolysin in one mmCIF file because
	they were both solved in the same lab.  In addition the pairings
	would depend on the interest of authors of the mmCIF file.  Others
	might want to pair the same models in different ways.

	   If the mmCIF committee decides to implement a "model" level
	in the mmCIF definition I suggest they place strict limits on
	its utility.  The models should be determined using the same
	analysis procedure applied to the same data.  The structure of the
	models must be identical -- There cannot be one sequence in
	one model and a different one in another.  All fields describing
	the agreement of a model with the data must be loop-able.
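
	   The requirement that all models share one structure could be
	checked mechanically.  The Python sketch below assumes each model
	is held as a list of (chain, residue number, residue type) tuples,
	a representation chosen here purely for illustration.  The function
	name is hypothetical.

```python
def models_share_structure(models):
    """Return True if every model lists the same residues in the same order.

    `models` maps a model label to a list of
    (chain, residue_number, residue_type) tuples.
    """
    sequences = list(models.values())
    # Compare every model against the first one; an empty or
    # single-model collection is trivially consistent.
    return all(seq == sequences[0] for seq in sequences[1:])
```

	Two models that differ only in coordinates would pass this check.
	Two models with a SER in one and an ARG in the other at the same
	position would not.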

	   I think these are the minimal changes required to allow NMR
	models and parallel SA runs to be stored in a single mmCIF
	file.  Trying to go beyond this would be incredibly difficult.

	   In addition I suggest that you allow the possibility of
	defining, in the group defining research articles, a table
	of other mmCIF data blocks (in other files) used by that
	paper.  With this tool the authors could cross connect the
	different mmCIF files referenced in a paper to allow for the
	retrieval of the related models without appending them all
	in a single file.  This scheme would allow different papers
	to refer to different sets of mmCIF data blocks.

	Chain Issues

	   The chain variants are introduced to allow for the description
	of models where several copies of each chain are included in
	the refinement in an attempt to model discrete disorder and
	nonisotropic B-factors.  For example, you could have a hemoglobin
	molecule in the asymmetric unit, resulting in two Alpha chains
	(A1 and A2) and two Beta chains (B1 and B2).  If a model was
	refined with 8 copies of each chain the model would contain
	A1.1, A1.2, ... B1.1, ... and B2.8.

	   This addition is fairly straightforward but there are a few
	problems to watch.  First, it should be a requirement that all
	the chain variants be of exactly the same type.  A1.1 should
	be the same "entity" as A1.3.  In fact the loop defining the
	entities for each chain (asym) should not mention the variants
	at all.
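
	   A short Python sketch of the hemoglobin example may make the
	labeling concrete.  It assumes the "chain.variant" notation used
	in this letter.  The entity names "alpha" and "beta", the table
	ENTITY_OF_CHAIN, and both function names are hypothetical.

```python
from itertools import product

# Hypothetical entity table keyed by chain only, never by chain variant.
ENTITY_OF_CHAIN = {"A1": "alpha", "A2": "alpha", "B1": "beta", "B2": "beta"}


def chain_variant_labels(chains, copies):
    """Return labels such as 'A1.1' ... 'B2.8' for `copies` variants per chain."""
    return [f"{c}.{k}" for c, k in product(chains, range(1, copies + 1))]


def entity_of(label):
    """Look up the entity of a chain label, ignoring any '.variant' suffix."""
    chain = label.partition(".")[0]
    return ENTITY_OF_CHAIN[chain]


labels = chain_variant_labels(["A1", "A2", "B1", "B2"], 8)
```

	Here labels runs from "A1.1" to "B2.8" and holds 32 entries.
	Because entity_of strips the variant before the lookup, A1.1 and
	A1.3 necessarily resolve to the same entity.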

	   Second, it should be possible to ignore this level without
	having to mark something in the mmCIF file.  Most models will
	not use this level of description -- Their mmCIF files should
	not be cluttered up by being forced to state that all atoms
	in chain A1 are in variant 1 if there is no other.  I think
	even including the "." in one column is too confusing.  I do not
	know the details of mmCIF well enough to know how to construct
	the default variant.

	   Third, if an mmCIF file uses the default variant then the
	use of a nondefault variant for that same chain should be
	illegal.  For example, if there is a chain A1 (no variant
	indicator) there cannot also be an A1.1.
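
	   This rule is easy to test.  The Python sketch below assumes
	labels are written in the "name.variant" notation of this letter,
	with a bare name standing for the default variant.  The function
	name is hypothetical.

```python
from collections import defaultdict


def mixed_default_and_variant(labels):
    """Return names that appear both bare ('A1') and with a variant ('A1.1')."""
    seen = defaultdict(set)
    for label in labels:
        name, sep, variant = label.partition(".")
        # Record None for the default (no '.variant' suffix).
        seen[name].add(variant if sep else None)
    return sorted(n for n, vs in seen.items() if None in vs and len(vs) > 1)
```

	Given the labels A1, A1.1, B1.1, B1.2 and B2, only A1 is reported.
	The same test would apply unchanged to residue labels (123 and
	123.1) and atom labels (CG and CG.1).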

	Residue Issues

	   The residue variant field is suggested to handle the problem
	of heterogeneity of peptide sequence.  If residue 123 is
	sometimes a SER and other times an ARG one would define a 123.1
	which is identified as SER and a 123.2 as ARG.

	   It would be more chemically proper to define two chains
	A1 and A2 which are different entities but are constrained to
	have identical parameters everywhere except for occupancies and
	residue 123.  While this is a more complete description of the
	model the complexity of the constraint makes it impractical.

	   If my previous suggestion for the explicit definition of
	connectivity of residues in entities is adopted, the description
	of these variants (along with a great number of other things)
	becomes quite easy.  In the table of residue types there would
	simply be an entry for 123.1 and 123.2.  In the connectivity
	table links would be defined between 122 and 123.1, 122 and 123.2,
	123.1 and 124, and 123.2 and 124.
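
	   The two tables for this example can be sketched in Python as
	follows.  The dictionary RESIDUE_TYPE and the function name are
	hypothetical, and the sketch assumes a single unvaried residue on
	each side of the heterogeneous position.

```python
# Hypothetical residue-type table with one entry per variant.
RESIDUE_TYPE = {"123.1": "SER", "123.2": "ARG"}


def variant_links(prev_res, variants, next_res):
    """Return connectivity links joining each variant to both neighbours."""
    links = [(prev_res, v) for v in variants]   # incoming links
    links += [(v, next_res) for v in variants]  # outgoing links
    return links


links = variant_links("122", sorted(RESIDUE_TYPE), "124")
```

	The resulting links are (122, 123.1), (122, 123.2), (123.1, 124)
	and (123.2, 124), matching the four links listed above.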

	   The first caveat (not in the PDB sense I hope) is that once
	again we need a default for the residue variant so that this
	field does not lead to confusion for the vast majority of models
	which have no need of this construction.  If one version of a
	residue is declared to be a variant then the default cannot be
	used for that residue in another place.  You should not allow both
	123 and 123.1 in the same chain.  They must be 123.1 and 123.2.

	   In addition, it should be recognized that residue variant
	123.1 goes with variant 124.1.  This definition is parallel
	to that of the atom variant case already implemented.

	   My final point is a matter of usage.  My example above
	was that 123.1 is SER and 123.2 is ARG.  Suppose that the
	ARG side chain makes a salt bridge to 224 (GLU) and that 224
	assumes one conformation for 123.1 (SER) and another for 123.2
	(ARG).  The temptation is to model 224 (GLU) with discrete
	disorder using the atom variants.  This would not be a proper
	description of the model.  The proper description is to define
	a 224.1 (GLU) and a 224.2 (GLU).  Provision must be made in
	the definition of heterogeneous sequences for such cases where
	both "variants" have the same residue type.

	Atom Issues

	   Discrete disorder is modeled quite well in the current mmCIF.
	However I would prefer that there were a way to ignore the atom
	variant column in mmCIF files which contain no discretely disordered
	components.  If possible, there should be a check that one does
	not define a CG and a CG.1.  If there is a disordered atom then
	all copies of the atom must be labeled disordered.

							Dale Tronrud