
Consultative Document for Deposition of Structure Factors

20-November-1998


Summary

This document describes a policy for the deposition of experimental data connected with X-ray crystallographic experiments on macromolecules. This document contains a substantial amount of detail.


The current PDB policy for the deposition of structure factors states:

It is very important that the structure factors are deposited. PDB will soon begin using these data for structure validation.

You may choose to delay release of your structure factors or NMR restraints for up to four years from the date of publication. You must notify the PDB when your paper is published. If you wish the hold to be removed earlier, you must notify the PDB.

PDB has chosen to follow the IUCr guidelines which state that coordinates may be held (before release) no longer than one (1) year and structure factors may be held no longer than four (4) years from the date of publication. PDB is applying the same guideline to NMR restraints data, allowing a maximum hold of four (4) years.

See also the Protein Data Bank quarterly newsletter (January 1998) for an article on structure factors and the PDB by Joel Sussman. The initiative to archive structure factors for X-ray diffraction studies was the result of strenuous efforts by Joel Sussman of the PDB; see, for example, the Nature note: Baker, E. N., Blundell, T. L., Vijayan, M., Dodson, E., Dodson, G., Gilliland, G. L. & Sussman, J. L. (1996). Crystallographic Data Deposition. Nature 379, 202.

At present approximately 85% of the X-ray entries are deposited with structure factors and therefore the PDBe recommendation is:

The PDBe Group Policy Proposal is that for X-ray diffraction entries structure factor deposition should be mandatory. The PDBe recommends continuing the hold policy adopted by the wwPDB.

The issues then are:

  1. The format of deposited structure factors
  2. The information content of the deposited file(s)
  3. The use of structure factors in validation at the time of deposition
  4. The format and information content of released structure factor files
  5. The use of structure factors to derive 'confidence levels' within a relational data base for use as search criteria and analysis of data in a relational database
  6. The method for the deposition of multiple sets of structure factors, e.g. MAD or MIR data sets related to the coordinates held for an entry
  7. Adapting to future changes in refinement protocols where multiple data sets are refined together.

1. The format of deposited structure factors

and

4. The format and information content of released structure factor files


The current PDBe policy for AutoDep states:

We encourage you to send your structure factors in ASCII format. If you are using CCP4, we suggest that you run the MTZ2various routine to convert the binary file into ASCII (see the CCP4 web page for mtz2various, which produces reflection files for MULTAN, SHELX, TNT, X-PLOR, CIF or other ASCII formats). For complete information on the CIF format that PDB is now using for structure factors, see the structure factor CIF dictionary on PDBe's Web Site (here the PDBe mirror URL is given).

See also Protein Data Bank quarterly newsletter for

  • October 1995 for an article on the extensions made by the PDB to the mmCIF dictionary to describe standard structure factor definitions. This work was presented by its primary developer, Dr. Vivian Stojanoff of the BNL Biology Department, at the CIF Workshop held during the ACA Annual Meeting in Montreal, Canada.
  • January 1998 for an article Writing Structure Factors in mmCIF using CCP4 by Peter Keller, (the PDBe mirror URL is given here).

The current PDB distribution format for the released structure factors is the result of an extensive cleanup effort of the legacy SF files held at the PDB. Before the PDB designed a standard, files were deposited in a large number of different formats. This work was initially carried out largely by Dr. Jiansheng Jiang at the PDB. The format is an extension to mmCIF with added information extracted from the entries.

Further work is currently being carried out by the RCSB and the PDBe. Much of this work is prompted by Gerard DVD Kleywegt, Alwyn Jones and Mark Harris at Uppsala University, where a service, the Uppsala Electron Density Server, processes all structure factors. It is hoped that most of the structure factor file problems found by this work will be corrected and the files re-released.

These files are available from the PDB via

ftp ftp.ebi.ac.uk
cd pub/databases/rcsb/pdb-remediated/data/structures/all/structure_factors
			  

The format adopted uses mmCIF data names for the structure factor values. The essential CELL and SYMMETRY information, together with other annotations, is presented in PDB Format records, as mmCIF comments.

An example PDB file is available.

The PDBe Group Policy Proposal is that both deposition files and re-distribution files be in mmCIF format throughout. mmCIF tokens should be used for the structure factor data items and for the associated annotation, which includes the cell dimensions, space group symmetry and the relationship between the structure factor file and the coordinate file for an entry.

Software currently in use for macromolecular crystallography is already capable of producing this format for deposition. For example, CNS has a macro that will produce a deposition file containing the associated annotation and structure factors, as in the example CNS output [this is the work of Paul Adams and Ralf Grosse-Kunstleve].

Note: The current XPLOR and CNS structure factor files are not encouraged as deposition files, as they are in essence macros to be read by the program control scripts and do not contain essential cell and symmetry information.

For CCP4, the mtz2various procedure will produce a file in this format, for example using the script:

mtz2various hklin 1718.mtz hklout 1718.cif <<eof
OUTPUT CIF data_1718
LABI FP=mut1718_F SIGFP=mut1718_SIGF
END
eof

which gives this output.

The deposition of binary CCP4 MTZ files has been raised before. There are several problems in handling deposited binary MTZ files:

  • The browser upload via PDB's AutoDep2.1 web input tool has not always recognised these files as binary, and they are not always readable at the deposition site with, for example, MTZDUMP.
  • MTZ files written after CCP4 revision 1.18 are machine-architecture independent, using a machine stamp introduced in September 1993. Entries are still being deposited with files that predate this revision, and not all research sites consistently update their software to keep track of changes in all software packages.
  • The main problem is that an MTZ file may contain many columns of information, and there is no guaranteed error-free procedure for mapping MTZ column names to the mmCIF data names that correctly identify the structure factors against which the associated coordinates were refined.

The PDBe Group Policy Proposal is therefore NOT to support the deposition of binary MTZ files with the current CCP4 implementation. It is reasonable to request files in the well-defined mmCIF format, for which conversion software already exists. These methods are simple to use and give clearly defined and automatically parsable information.


2. The information content of the deposited file(s)


The PDBe Group Policy Proposal is to request that the minimum information content should be h,k,l, Fobs and sigma_Fobs.
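As a rough illustration of how this minimum content could be tested, the following Python sketch scans the text of an mmCIF file for the reflection data names. The function name `missing_minimum_items` is hypothetical, and accepting both the `_refln.F_meas` style names used in the examples in this document and the `_au` variants is an assumption of the sketch, not a statement of how any deposition service works.

```python
# Each tuple lists the data names accepted for one required item.
# Accepting the "_au" variants alongside the plain names is an assumption.
REQUIRED_ITEMS = [
    ("_refln.index_h",),
    ("_refln.index_k",),
    ("_refln.index_l",),
    ("_refln.F_meas", "_refln.F_meas_au"),
    ("_refln.F_meas_sigma", "_refln.F_meas_sigma_au"),
]


def missing_minimum_items(cif_text):
    """Return the required items (as tuples of accepted names) not found in cif_text."""
    present = set()
    for line in cif_text.splitlines():
        fields = line.split()
        # A data name is the first token on its line and starts with "_refln."
        if fields and fields[0].startswith("_refln."):
            present.add(fields[0])
    return [names for names in REQUIRED_ITEMS
            if not any(name in present for name in names)]
```

An empty list means that h, k, l, Fobs and sigma_Fobs are all declared; it does not check that the data rows themselves are complete.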

Ideally, additional information would be supplied that is sufficient to re-generate the final electron density map and to repeat the final refinement. The PDBe accepts that the nature and extent of additional data items should be decided by the crystallographers who produce the data.

A complete list of the currently defined mmCIF structure factor data names is given here and links to the full definitions provided by NDB mmCIF web documents are given below. Other links may be found from any PDB mirror site by looking at the file mmcif.html (the PDBe Mirror URL is given here).

An uploaded file containing any of these data tags and associated values will be automatically processed by both the RCSB and PDBe deposition services that are under development. Currently the PDB's AutoDep2.1 service would also be able to handle this type of information as the structure factor files are processed by annotators.


3. The use of structure factors in validation at the time of deposition


Currently, structure factors are not used in the PDB's AutoDep deposition procedure. However, work is under way to use them with Alwyn Jones' density server software.

The PDBe Group Policy Proposal is to use structure factors at the point of deposition only in a check that the coordinates match the structure factors: that is, that the cell dimensions and space group symmetry agree, and that a standard R-factor calculation gives a value comparable with the deposited value.
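The R-factor comparison can be sketched as follows, using the conventional definition R = sum(| |Fobs| - k|Fcalc| |) / sum(|Fobs|). This is a minimal Python sketch and not the procedure used by any deposition service: the single overall scale factor k = sum(|Fobs|) / sum(|Fcalc|), the function names and the tolerance of 0.05 are all assumptions made for illustration.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor with one overall scale factor (an assumed, simple scaling)."""
    if len(f_obs) != len(f_calc) or len(f_obs) == 0:
        raise ValueError("need two non-empty lists of equal length")
    sum_obs = sum(abs(f) for f in f_obs)
    sum_calc = sum(abs(f) for f in f_calc)
    if sum_obs == 0 or sum_calc == 0:
        raise ValueError("amplitudes sum to zero")
    scale = sum_obs / sum_calc  # puts Fcalc on the scale of Fobs
    diff = sum(abs(abs(o) - scale * abs(c)) for o, c in zip(f_obs, f_calc))
    return diff / sum_obs


def comparable(calculated_r, deposited_r, tolerance=0.05):
    """True if the recalculated R is within a (hypothetical) tolerance of the deposited R."""
    return abs(calculated_r - deposited_r) <= tolerance
```

In practice the calculated amplitudes would come from the deposited coordinates, and the scaling would normally be resolution dependent; the sketch only shows the form of the comparison.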

At some stage validation for deposition may be extended to use structure factors, giving the depositor an opportunity to comment on a possible gross difference between density correlation and expected values. However, in all cases we are keenly aware that a structure determination was carried out to solve a particular problem. It can be argued that no structure is ever finished and one can always tinker away at improving the refinement. The deposited coordinates are a model from a particular experiment carried out for a particular set of reasons. Validation is twofold: firstly, to point out at deposition time that there may be extreme geometrical deviations which may be corrected; and secondly, to give confidence levels, globally and per chain, per residue and per atom, that can be used in selective search methods and in evaluating the properties of a hit list. Structure factor validation and density correlation factors can be held within the relational database; they are not held in the PDB formatted flatfile.

Note: All PDBe validation procedures used will be those recommended by research initiatives such as the CRITQUAL initiative [CRITQUAL, an EU-supported network, CT96-0189: Coordinator Wilson (York), Jones (Uppsala), Kaptein (Utrecht), Lamzin (EMBL-HH), Thornton (London), Vriend (EMBL-HD), Wodak (Brussels)]; this would include any decision to use tools such as the SFCHECK procedure. See, for example, the Uppsala Electron Density Server and the use of SFCHECK.



5. The use of structure factors to derive 'confidence levels' within a relational data base for use as search criteria and analysis of data in a relational database


The PDBe Group Policy Proposal is to make available the SQL for its relational database and all application software used to derive information held in the data base tables. The PDBe will treat derived structure factor information in much the same way that, for example, B-values can be used as a measure of model quality. Search methods will be available that use, for example, density correlation values as optional selection criteria.


6. The method for the deposition of multiple sets of structure factors, e.g. MAD or MIR data sets related to the coordinates held for an entry


The deposition of derivative and MAD data sets would be welcome at the deposition centres. However, the data have to be labelled in such a way as to be useful. Simply uploading to the deposition tool a number of files created by the current software converters (e.g. MTZ2VARIOUS) would not give the deposition archive centres sufficient information to relate the various data sets automatically and correctly to the coordinates and experimental method(s) given in the annotated PDB entry. The data in the different files need to be tagged with the correct relationships. To solve this, the PDBe harvest concept is now being pursued by all the deposition centres to encourage software developers to allow for incorporation of common labels for a project.

The mmCIF structure does not allow multiple data sets within the same data_ block (multiple data_ blocks are allowed within the same file). The CCP4 convention for handling multiple data sets, which allows several F_obs values to be related to the same h, k, l columns, has no equivalent in either mmCIF or CIF rules, i.e. one cannot present data in the form,

data_my_entry 
loop_
_refln.index_h
_refln.index_k
_refln.index_l
_refln.F_meas_[native] 
_refln.F_meas_sigma_[native]
_refln.F_meas_[Derivative_Pb] 
_refln.F_meas_sigma_[Derivative_Pb]
_refln.F_meas_[Derivative_Hg] 
_refln.F_meas_sigma_[Derivative_Hg]
			  

one would be required to deposit in mmCIF as

data_my_entry_native 
 loop_
    _refln.index_h
    _refln.index_k
    _refln.index_l
    _refln.F_meas 
    _refln.F_meas_sigma
data_my_entry_Derivative_Pb
 loop_
    _refln.index_h
    _refln.index_k
    _refln.index_l
    _refln.F_meas
    _refln.F_meas_sigma
data_my_entry_Derivative_Hg
 loop_
   _refln.index_h
   _refln.index_k
   _refln.index_l
   _refln.F_meas 
   _refln.F_meas_sigma
			  

Alternatively each derivative may be deposited using the mmCIF Category Group _phasing_mir items with for example,

loop_
_phasing_mir_refln.index_h
_phasing_mir_refln.index_k
_phasing_mir_refln.index_l
_phasing_mir_refln.der_id
_phasing_mir_refln.F_meas_au
_phasing_mir_refln.F_calc_au
_phasing_mir_refln.phase_calc
_phasing_mir_refln.F_meas_sigma_au
0       0       4       HgCl4    197.8   1351.0   -180.0     1.9
0       0       4       AgNO3_1  206.7   1462.0   -180.0    12.3
0       0       4       AgNO3_2  367.3   1551.0   -180.0    36.7
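To show how the `der_id` column ties each row of such a loop to its derivative, here is a minimal Python sketch that groups the data rows by derivative. The function name `group_by_derivative` is hypothetical, and the sketch assumes simple whitespace-separated values with no quoted strings, which holds for the rows above but not for mmCIF in general.

```python
from collections import defaultdict


def group_by_derivative(data_names, rows):
    """Group loop rows by _phasing_mir_refln.der_id.

    data_names: the loop's data names, in column order.
    rows: data lines with whitespace-separated values (no quoting handled).
    """
    der_column = data_names.index("_phasing_mir_refln.der_id")
    grouped = defaultdict(list)
    for row in rows:
        fields = row.split()
        if len(fields) != len(data_names):
            raise ValueError("row does not match the loop header: " + row)
        # Keep every value under its data name so nothing is lost in grouping.
        grouped[fields[der_column]].append(dict(zip(data_names, fields)))
    return dict(grouped)
```

Applied to the three rows above, this yields one group each for HgCl4, AgNO3_1 and AgNO3_2, each holding the single reflection 0 0 4.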
			  

Complete examples are given in the PDB documentation (see above).

Within each data_block there will be the correct cell dimensions and symmetry for each data set (and other unique properties such as wavelength).
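The one-block-per-data-set layout can be sketched in Python as follows. This is an illustration under stated assumptions rather than a converter in actual use: the function name `format_dataset_blocks` and the shape of the input dictionary are hypothetical, the `_refln` names follow the examples in this document, and the `_cell` and `_symmetry` names are the standard mmCIF items for cell dimensions and space group.

```python
def format_dataset_blocks(entry_id, datasets):
    """Write one data_ block per data set, each with its own cell, symmetry and reflections.

    datasets maps a data set name to a dict with keys:
      "cell"        -> (a, b, c, alpha, beta, gamma)
      "space_group" -> Hermann-Mauguin symbol, e.g. "P 21 21 21"
      "reflections" -> list of (h, k, l, F, sigma_F)
    """
    lines = []
    for name, dataset in datasets.items():
        a, b, c, alpha, beta, gamma = dataset["cell"]
        space_group = dataset["space_group"]
        lines.append("data_" + entry_id + "_" + name)
        lines.append(f"_cell.length_a {a:.3f}")
        lines.append(f"_cell.length_b {b:.3f}")
        lines.append(f"_cell.length_c {c:.3f}")
        lines.append(f"_cell.angle_alpha {alpha:.2f}")
        lines.append(f"_cell.angle_beta {beta:.2f}")
        lines.append(f"_cell.angle_gamma {gamma:.2f}")
        # The symbol contains spaces, so it is written as a quoted string.
        lines.append(f"_symmetry.space_group_name_H-M '{space_group}'")
        lines.append("loop_")
        lines.extend(["_refln.index_h", "_refln.index_k", "_refln.index_l",
                      "_refln.F_meas", "_refln.F_meas_sigma"])
        for h, k, l_index, f_meas, sigma in dataset["reflections"]:
            lines.append(f"{h} {k} {l_index} {f_meas:.2f} {sigma:.2f}")
    return "\n".join(lines) + "\n"
```

Because each block repeats the cell and symmetry, a derivative whose cell differs slightly from the native is described correctly without reference to any other block.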

CCP4 are developing a new MTZLIB that will have an extended header to connect each column with a project_name (the in-house equivalent to a PDB idcode) and a data_set_name (the in-house unique identifier per data set associated with the project_name ). The extended header will also carry cell dimensions and symmetry per data set. Once this MTZLIB version is in use by research groups then the automatic deposition of multiple datasets should become simple.


7. Adapting to future changes in refinement protocols where multiple data sets are refined together


The methods used in the determination of macromolecular structures are continually improving and changing. Future depositions may include more joint refinement methods, for example electron microscopy used together with X-ray diffraction. Another example is the development of software to refine multiple sets of structure factors and coordinates in the same run. For, say, a mutant and native pair of structures, this can give a refinement that reinforces common features and accentuates the differences. This leads to a situation where the notion of a single PDB entry as one set of coordinates and the one set of structure factors that the coordinates were refined against is no longer easily mapped. The mmCIF structure is also not currently capable of holding more than one set of information per data block by using the same category twice.

The PDBe development data base will be flexible and capable of being extended to map this type of future deposition.

Even in the short term there may well be other innovations in refinement that will require changes to the data base and the export format(s). Developments are anticipated by the flexible design of the PDBe relational database.

If you have any comments about this draft, please contact the PDBe Group at pdbhelp@ebi.ac.uk.