Molecular structure definitions
Eldon Ulrich (elu@gir.nmrfam.wisc.edu)
Wed, 6 Sep 1995 11:54:23 -0500
Molecular Structure Descriptions:
I want to add my support to the comments Dale Tronrud has made concerning
the need to describe a molecule and its structure. I feel it is worthwhile to
revisit the overall goal of describing molecules or entities, and a proposed set
of tokens is outlined below. Within the current mmCIF, it appears that only
single-chain polymers and non-polymers can be described with tokens whose
definitions do not assume that a structure has been determined. In most cases,
the complexity of the system being studied is known to a higher degree than
this before the data is collected (multimeric molecular structures or complex
molecules are known to be present). To describe these structures, as Dale has
pointed out, additional tokens are needed to define bonds that link polymers.
Tokens are also needed to describe the polymers that associate to form a
quaternary molecular structure, to name those structures, and to provide
references to nomenclature systems and other databases.
I am developing a data deposition form to be used by NMR spectroscopists for
submitting information to BioMagResBank and a flat-file format for distributing
data from the databank. We are using the STAR format and want to use data
tokens compatible with the mmCIF tokens, if not identical, wherever possible.
In designing the form, it is useful to define, using one supercategory of
tokens, the complete chemical structure for each molecule in the system studied.
Unfortunately, the _entity, _chem_bond, and _chem_link categories do not
appear to have all of the tokens needed. Below (starting with the comment
"Biological system definition") are listed a large number of tokens in the
format of the deposition form I am constructing. Many of these tokens or
equivalent ones are available in mmCIF, but many are not. I propose that
constructs be added to mmCIF that will allow higher order molecular structures
to be described before any experimental data is implied. I am not concerned
that the actual constructs adopted follow the outline listed below. However, I
would be interested in comments on this outline, as we plan to make it available
to the NMR community in the near future. Please remember that this is a draft
and is in the format of a deposition form, not a dictionary.
With these tokens, the intent is to be able to describe hemoglobin, hexose
kinase, lipoproteins, peptidoglycans, enzyme inhibitor complexes, and hopefully
a wide variety of other structures. The _system tokens are included, because I
need to describe solution systems that may involve the transient interaction
between two or more entities.
All of the structural information has been unified under the 'entity' umbrella.
I would prefer to use the term 'mol' instead of entity for two reasons: 1) entity
is a very broad term that can cover a wide variety of objects; 2) entity has a
relatively well defined meaning in the database world that causes confusion
when discussing schemas and file formats. The _chem_link_bond tokens in
mmCIF are now listed as _entity_chem_link. Link_bond seemed redundant, but
this is also a minor issue. Also, the term 'label' has been used in many tokens
where `id' is used in mmCIF. Within our database, we have reserved the term
`id' for numeric values that simply identify a particular instance of a
relationship or row in a table and carry no implied meaning, unlike `VAL'
used as an `id' for a chemical compound structure in mmCIF. I think in mmCIF
`id' tokens at times do carry contextual information and at other times are
intended as just row markers, and that this can be confusing.
A category `_bond_site' has been constructed similar to `_atom_site' for use in
loops where data relevant to individual bonds may be listed.
Other Considerations:
1. The abbreviation `comp' is used both for compounds and for computers.
Possibly `cmptr' could be used in computer-related tokens.
2. `exptl' with some fonts appears to mean `experiment one'. Could this be
abbreviated further to `expt' without causing many problems?
3. Additional tokens:
_citation.CAS_AN Chemical Abstracts number
_citation.book_city Is in publisher token, but I think
would be useful on its own.
_citation.keywords Often useful for searching
4. The term monomer or abbreviation `mon' can be confused as meaning either
the monomeric unit of a polymer or a monomer in a dimeric or higher order
structure. Since `mon' is used relatively infrequently in the current mmCIF,
`residue' might be substituted.
5. I would recommend using Chemical Abstracts abbreviations for journals as a
standard.
Best Regards,
Eldon
###########################
# Biological system definition #
###########################
# This section defines the macromolecules and small molecules that form the
# system reported on in this entry. The system may consist of a single molecule,
# such as ribonuclease, but may also be defined by several molecules, as in the
# case of a study involving the tryptophan repressor - DNA operator complex
# formed in the presence of tryptophan. Include the molecules for which NMR data
# is provided and those molecules that are significant for the study. Do not
# include buffers, salts, solvents, etc., as these should be described in the
# section where the contents of each sample are listed.
save_system
_System.name
_System.detail
loop_
_System_constit.name
_System_constit.label
_System_constit.function
stop_
save_
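As an illustration of how the system saveframe above might be filled in for the tryptophan repressor example, here is a minimal Python sketch that writes one out. It is not part of the deposition form: the function names `quote` and `render_system`, the labels, and the function descriptions are all hypothetical, and the quoting rule is a simplification of STAR quoting (it only handles embedded whitespace).

```python
def quote(value):
    """Wrap a value in single quotes if it contains whitespace (simplified rule)."""
    return f"'{value}'" if any(ch.isspace() for ch in value) else value


def render_system(name, detail, constituents):
    """Return the text of a save_system frame.

    constituents is a list of (name, label, function) tuples, one per row
    of the _System_constit loop.
    """
    lines = [
        "save_system",
        f"_System.name {quote(name)}",
        f"_System.detail {quote(detail)}",
        "loop_",
        "_System_constit.name",
        "_System_constit.label",
        "_System_constit.function",
    ]
    for row in constituents:
        lines.append(" ".join(quote(value) for value in row))
    lines += ["stop_", "save_"]
    return "\n".join(lines)


# Hypothetical values, for illustration only.
frame = render_system(
    "trp repressor - operator complex",
    "formed in the presence of tryptophan",
    [
        ("tryptophan repressor", "trp_repressor", "repressor"),
        ("DNA operator", "trp_operator", "operator"),
        ("tryptophan", "TRP", "corepressor"),
    ],
)
print(frame)
```

Buffers, salts, and solvents are deliberately absent from the constituent list, in line with the instructions in the comment block above.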
######################
# Molecule definitions #
######################
# Three classes of molecular structure are defined in this section: multimeric
# molecules, complex molecules, and simple or homogeneous molecules. Multimeric
# molecules are defined as macromolecules that have quaternary structure and are
# formed by the association of two or more subunits, each of which may have
# various degrees of complexity. Complex molecules are constructed of covalently
# linked homo- or hetero-polymers, of complexes involving tightly associated but
# non-covalently linked molecules, or of non-polymer compounds bound to one or
# more polymers. Homogeneous molecules are either polymeric or non-polymeric.
# Polymeric molecules of this class are constructed of one type of monomer (e.g.
# amino acids, deoxyribonucleotides, ribonucleotides, or carbohydrates).
# Examples of non-polymeric molecules would include free amino acids, enzyme
# prosthetic groups, substrates, inhibitors, and other small molecules.
# A category, `_entity_chem_struct', is available for defining the atoms and
# bonds that make up a small molecule, a monomer found in a polymer, or a
# molecular fragment that is part of a complex molecule.
#
# Create a `saveframe' block for each molecule that comprises the system being
# studied.
######################
# Multimeric molecules #
######################
save_<_entity_multimeric.label>
_entity_multimeric.label
_entity_multimeric.name
_entity_multimeric.details
loop_
_entity_multimeric.subunit_unique_label
_entity_multimeric.subunit_unique_name
_entity_multimeric.subunit_label
stop_
loop_
_entity_multimeric.reference.database_label
_entity_multimeric.reference.database_code
stop_
loop_
_entity_multimeric.class_system_name # Enzyme Commission; CAS registry
_entity_multimeric.class_system_code # EC number; CAS registry number
stop_
loop_
_entity_multimeric.synonym
stop_
save_
# The following saveframe is used to declare the molecules that are present in
# each subunit of a multimeric macromolecule. Each subunit found in a multimeric
# macromolecule must be assigned a unique identifier so that the locations of
# individual atoms can be described when specific data are listed later in the
# form.
save_<_entity_multimeric.subunit_label>
_entity_multimeric.subunit_label
_entity_multimeric.subunit_name
loop_
_entity_multimeric.subunit_member_unique_label
_entity_multimeric.subunit_member_label
stop_
loop_
_entity_multimeric.subunit_synonym
stop_
save_
#############################
# Complex Molecular structures #
#############################
# Two types of complex molecules are defined: those that consist of multiple
# covalently linked polymer structures (for instance, insulin, CD2, and
# erythropoietin), and those that are defined by prosthetic groups or other
# molecules and atoms covalently or tightly associated with a polymer
# (cytochrome c, flavodoxin, calmodulin with bound calcium, the alpha chain of
# hemoglobin, etc.).
save_<_entity_complex.label>
_entity_complex.label
_entity_complex.common.name
_entity_complex.formula_weight
_entity_complex.details
loop_
_entity_complex.member_unique_label
_entity_complex.member_label
stop_
loop_
_entity_complex_reference.database_name
_entity_complex_reference.database_code
_entity_complex_reference.details
stop_
loop_
_entity_complex.class_system_name
_entity_complex.class_system_code
stop_
loop_
_entity_complex.synonym
stop_
save_
#######################################################
# Single chain simple polymeric molecules and small molecules #
#######################################################
# Molecules are complete chemical structures of either polymer or non-polymer
# type. Ribonuclease, lysozyme, water, acetone, and dioxane are a few examples.
save_<_entity.label>
_entity.label
_entity.common.name
_entity_chem_struct.label
_entity.type
_entity_poly.type
_entity.formula_weight
_entity.details
loop_
_entity_reference.database_name
_entity_reference.database_code
_entity_reference.details
stop_
loop_
_entity.class_system_name
_entity.class_system_code
stop_
# The sequence of a polymeric molecule is provided in the following loop.
# Standard one-letter or three-letter nomenclature for amino acids and
# nucleotides will be assumed unless indicated otherwise. Any non-standard
# monomers should be given unique labels and should have their chemical
# structure and linkage to the polymer described in the section following this
# loop where unique chemical compounds are described.
loop_
_entity_poly_seq.num
_entity_poly_seq.position_label # Author-defined sequence position label.
_entity_poly_seq.mol_chem_struct.label
stop_
loop_
_entity.synonym
stop_
save_
# Structures for complete chemical compounds that are non-polymer entities, or
# for fragments of chemical compounds that are monomers used to form polymers,
# and their linkage within the polymer.
save_<_entity_chem_struct.label>
_entity_chem_struct.label
_entity_chem_struct.name
_entity_chem_struct.detail
# List the atoms that comprise the compound, their chirality and formal charge.
# Include protons in the atom list.
loop_
_entity_chem_struct_atom.atom_label
_entity_chem_struct_atom.chirality
_entity_chem_struct_atom.charge
stop_
# List the bonds, and their type, that link the atoms in the compound. Include
# bonds to protons in the list.
loop_
_entity_chem_struct_bond.label
_entity_chem_struct_bond.atom_label_atom_one
_entity_chem_struct_bond.atom_label_atom_two
_entity_chem_struct_bond.value_order
stop_
save_
#########################
# Molecule chemical links #
#########################
# The covalent bonds that link non-standard monomers within a polymer, one
# polymer to another, or chemical compounds that are covalently attached to a
# polymer are listed here.
save_molecule_chemical_links
loop_
_entity_chem_link.label
_entity_chem_link.mol_multimeric.label_atom_one
_entity_chem_link.mol_multimeric.subunit_unique_label_atom_one
_entity_chem_link.mol_multimeric.subunit_member_unique_label_atom_one
_entity_chem_link.mol_complex.label_atom_one
_entity_chem_link.mol_complex.member_unique_label_atom_one
_entity_chem_link.mol.label_atom_one
_entity_chem_link.mol_poly_seq.num_atom_one
_entity_chem_link.mol_chem_struct.label_atom_one
_entity_chem_link.mol_chem_struct_atom.atom_label_atom_one
_entity_chem_link.mol_multimeric.label_atom_two
_entity_chem_link.mol_multimeric.subunit_unique_label_atom_two
_entity_chem_link.mol_multimeric.subunit_member_unique_label_atom_two
_entity_chem_link.mol_complex.label_atom_two
_entity_chem_link.mol_complex.member_unique_label_atom_two
_entity_chem_link.mol.label_atom_two
_entity_chem_link.mol_poly_seq.num_atom_two
_entity_chem_link.mol_chem_struct.label_atom_two
_entity_chem_link.mol_chem_struct_atom.atom_label_atom_two
_entity_chem_link.value_order
stop_
save_
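Each end of a link is addressed by the same set of tokens, once with an `_atom_one` suffix and once with `_atom_two`, so a loop header of this kind is easy to get out of step. As a minimal sketch (Python, not part of the proposal; the function name `unpaired_tokens` and the shortened sample header are hypothetical), the following reports any token whose partner for the other atom is missing:

```python
def unpaired_tokens(tokens):
    """Return tokens ending in _atom_one/_atom_two that lack a partner.

    The result keeps the order of first appearance and lists each
    offending token once.
    """
    partner_suffix = {"_atom_one": "_atom_two", "_atom_two": "_atom_one"}
    present = set(tokens)
    unpaired = []
    for token in tokens:
        for suffix, other in partner_suffix.items():
            if token.endswith(suffix):
                partner = token[: -len(suffix)] + other
                if partner not in present and token not in unpaired:
                    unpaired.append(token)
    return unpaired


# Shortened, hypothetical header: the poly_seq token for atom two is missing.
header = [
    "_entity_chem_link.label",
    "_entity_chem_link.mol.label_atom_one",
    "_entity_chem_link.mol_poly_seq.num_atom_one",
    "_entity_chem_link.mol.label_atom_two",
    "_entity_chem_link.value_order",
]
print(unpaired_tokens(header))
```

Tokens without either suffix, such as the label and bond order, are ignored by the check.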
################################
# Unique atom identification labels #
################################
# The following loop can be used to define an identification label for unique
# atoms within a molecular structure. In constructing the data tables found
# below, the atom identification label can be used instead of repeating the large
# number of tokens required to define specific atoms.
save_atom_identification_labels
loop_
_Atom_site.id
_Atom_site.mol_multimeric.label
_Atom_site.mol_multimeric.subunit_unique_label
_Atom_site.mol_multimeric.subunit_member_unique_label
_Atom_site.mol_complex.label
_Atom_site.mol_complex.member_unique_label
_Atom_site.mol.label
_Atom_site.mol_poly_seq.num
_Atom_site.mol_chem_struct.label
_Atom_site.mol_chem_struct_atom.atom_label
stop_
save_
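To make the intent of the atom identification loop concrete, here is a minimal Python sketch (not part of the deposition form). It assumes each row of the loop is held as a record keyed by `_Atom_site.id`; the class name `AtomSite`, its field names, the helper `describe`, and the sample values are hypothetical. Later data tables would then carry only the id and resolve it when the full address of the atom is needed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AtomSite:
    # One field per token in the loop (after the id), in the same order.
    # None marks a level that does not apply to the atom.
    multimeric_label: Optional[str]
    subunit_unique_label: Optional[str]
    subunit_member_unique_label: Optional[str]
    complex_label: Optional[str]
    complex_member_unique_label: Optional[str]
    mol_label: str
    poly_seq_num: Optional[int]
    chem_struct_label: str
    atom_label: str


# Hypothetical rows for a single-chain molecule, keyed by _Atom_site.id.
atom_sites = {
    1: AtomSite(None, None, None, None, None, "ribonuclease", 12, "HIS", "HE1"),
    2: AtomSite(None, None, None, None, None, "ribonuclease", 12, "HIS", "HD2"),
}


def describe(atom_id):
    """Expand an atom id into a readable molecule:position:residue:atom string."""
    site = atom_sites[atom_id]
    return f"{site.mol_label}:{site.poly_seq_num}:{site.chem_struct_label}:{site.atom_label}"


print(describe(1))
```

Keeping the id as a plain integer with no meaning of its own follows the convention described earlier, where `id' is reserved for row markers and `label' for values that carry meaning.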
################################
# Unique bond identification labels #
################################
# As for atoms, this loop can be used to define an identification label for
# unique bonds in a molecular structure.
save_bond_identification_labels
loop_
_Bond_site.id
_Bond_site.mol_multimeric.label_atom_one
_Bond_site.mol_multimeric.subunit_unique_label_atom_one
_Bond_site.mol_multimeric.subunit_member_unique_label_atom_one
_Bond_site.mol_complex.label_atom_one
_Bond_site.mol_complex.member_unique_label_atom_one
_Bond_site.mol.label_atom_one
_Bond_site.mol_poly_seq.num_atom_one
_Bond_site.mol_chem_struct.label_atom_one
_Bond_site.mol_chem_struct_atom.atom_label_atom_one
_Bond_site.mol_multimeric.label_atom_two
_Bond_site.mol_multimeric.subunit_unique_label_atom_two
_Bond_site.mol_multimeric.subunit_member_unique_label_atom_two
_Bond_site.mol_complex.label_atom_two
_Bond_site.mol_complex.member_unique_label_atom_two
_Bond_site.mol.label_atom_two
_Bond_site.mol_poly_seq.num_atom_two
_Bond_site.mol_chem_struct.label_atom_two
_Bond_site.mol_chem_struct_atom.atom_label_atom_two
stop_
save_