EMDB data model

Current map and core data model formats


Dictionary development for the Electron Microscopy Data Bank

New deposition and annotation (D&A) system


The Worldwide Protein Data Bank (wwPDB) and EMDataBank are jointly developing a new deposition and annotation (D&A) system. The aim of this new system is to facilitate the process of deposition of biomacromolecular structure data and to provide tools for validation. With an expected life span of at least 10 years for the new D&A system, the underlying data model used to describe EM experiments needs to be able to capture the important aspects of the various EM methodologies and needs to be sufficiently flexible to adapt to changes and new developments that are bound to occur in this rapidly evolving field.

The new data model has been implemented, and will be maintained, as an XML schema. For the purposes of the D&A system, it will be translated into mmCIF.


Core data model


Macromolecule and complex description: Macromolecules correspond to basic types (protein, DNA, RNA, saccharide, lipid, ligand, EM label) that match the wwPDB representation. A complex is any combination of macromolecules. This multilevel organization can be used to describe samples in single-particle EM, macromolecular tomography, and cellular tomography.


Better connection to other biological databases: Currently, EMDB entries are linked to external resources such as Gene Ontology, InterPro, PubMed, digital object identifiers, NCBI Taxonomy and wwPDB. In a continuing effort to increase the value of EMDB entries, we are assessing links to other biological databases:

  • Citations: PubMed, digital object identifier
  • Ligands: PubChem, DrugBank, ChEBI, ChEMBL
  • Lipids: LIPID MAPS Structure Database (LMSD)
  • Complexes: Gene Ontology
  • Proteins: InterPro, UniProt, Enzyme Classification, wwPDB
  • DNA/RNA: RefSeq, GenBank
  • Carbohydrates: CarbBank
  • Taxonomy: NCBI taxonomy

Better support for advances in the EM field:

  • IHRSR helical processing
  • Subtomogram averaging processing
  • Tomography processing
  • New microscopy devices, etc.

Segmentation data model

Introduction

Segmentation is the decomposition of 3D volumes into regions that can be associated with defined objects. Following several consultations with the EM community (Patwardhan et al., 2012; Patwardhan et al., 2014; Patwardhan et al., 2017), the EMDB is developing tools to support the deposition of volume segmentations with structured biological annotation, defined here as the association of data with identifiers (e.g., accession codes from UniProt) and ontologies taken from well-established bioinformatics resources. To our knowledge, none of the segmentation formats widely used in electron microscopy and related fields currently supports structured biological annotation. Third-party use of segmentations is further impeded by the number of segmentation file formats in use and their lack of interoperability. EMDB has therefore proposed an open segmentation file format, EMDB-SFF, which captures basic segmentation data from application-specific segmentation file formats and provides the means for structured biological annotation. In this way, EMDB-SFF will not only enable the deposition of segmentations but also act as a file interchange format between applications and facilitate the analysis of 3D reconstructions. Furthermore, EMDB-SFF supports the description of multiple transforms for a segment, so that a single segment can describe the placement of a sub-tomogram average at several locations in a tomographic reconstruction.
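To make the idea of multiple transforms per segment concrete, the following is a minimal sketch rather than the EMDB-SFF implementation. It assumes each transform is a 3x4 affine matrix (a 3x3 rotation part followed by a translation column); the function names and the example matrices are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
# Assumed layout: three rows, each [r0, r1, r2, t] (rotation part, then translation).
Matrix3x4 = Sequence[Sequence[float]]


def apply_transform(matrix: Matrix3x4, point: Point) -> Point:
    """Apply a 3x4 affine transform to a single 3D point."""
    x, y, z = point
    out = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in matrix]
    return (out[0], out[1], out[2])


def place_copies(transforms: List[Matrix3x4], points: List[Point]) -> List[List[Point]]:
    """Return one transformed copy of `points` for each transform."""
    return [[apply_transform(m, p) for p in points] for m in transforms]


# Hypothetical data: two placements of the same sub-tomogram average.
shifted = [[1, 0, 0, 10], [0, 1, 0, 20], [0, 0, 1, 30]]      # pure translation
rotated_z = [[0, -1, 0, 5], [1, 0, 0, 5], [0, 0, 1, 0]]      # 90 degrees about z, then shift

average_points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
copies = place_copies([shifted, rotated_z], average_points)
```

The geometry of the average is stored once, and each transform yields one placed copy in the coordinate frame of the tomogram.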


Model

EMDB-SFF files have the following features:
  • Segmentation metadata:
    • name
    • version (of schema)
    • details (free-form text)
    • global external references, e.g. specimen scientific identifier
    • bounding box
    • primary descriptor contained, i.e. one of ‘threeDVolume’, ‘meshList’, ‘contourList’, or ‘shapePrimitiveList’ (see schema documentation)
    • path to original segmentation file
    • list of transforms referenced by segments, e.g. the transform that places a sub-tomogram average in the tomogram
  • Hierarchical ordering of segments through the use of segment IDs and parent IDs;
  • Four geometrical representations of segments (volumes, contours, meshes, shapes);
  • Can store subtomogram averages and how they map into the parent tomogram through the use of transforms;
  • List of associated external references per segment;
  • List of associated complexes and macromolecules in a related EMDB entry.
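The hierarchical ordering through segment IDs and parent IDs can be illustrated with a short sketch. The class and field names below are hypothetical and are not the schema's element names; the convention that a parent ID of 0 marks a top-level segment is an assumption made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    # Illustrative fields only; not the EMDB-SFF element names.
    segment_id: int
    parent_id: int  # assumption: 0 means "no parent" (top-level segment)
    name: str


def children_by_parent(segments: List[Segment]) -> Dict[int, List[int]]:
    """Group segment IDs under the ID of their parent."""
    tree = defaultdict(list)
    for seg in segments:
        tree[seg.parent_id].append(seg.segment_id)
    return dict(tree)


def descendants(tree: Dict[int, List[int]], root_id: int) -> List[int]:
    """List every segment below `root_id`, depth first, in stored order."""
    found = []
    stack = list(reversed(tree.get(root_id, [])))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(tree.get(current, [])))
    return found


# Hypothetical segmentation of one organelle.
segments = [
    Segment(1, 0, "mitochondrion"),
    Segment(2, 1, "outer membrane"),
    Segment(3, 1, "inner membrane"),
    Segment(4, 3, "crista"),
]
tree = children_by_parent(segments)
below_mitochondrion = descendants(tree, 1)
```

Storing only a parent ID on each segment keeps the file flat while still allowing a tool to rebuild the full tree when it is needed.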

Each segment in a segmentation can consist of two types of descriptors:

  • textual descriptors;
  • geometric descriptors.

Textual descriptors consist of either free-form text or standardised terms. Standardised terms should be taken from a published ontology or list of identifiers.


Geometric descriptors can take one or more of the following representations:

  • ‘threeDVolume’ for 3D volumes;
  • ‘contourList’ for lists of contours each of which is a series of 3D points;
  • ‘meshList’ for lists of meshes each of which consists of a set of vertices and polygons;
  • ‘shapePrimitiveList’ for lists of shape primitives (ellipsoid, cuboid, cone, cylinder).
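As a rough illustration of how textual and geometric descriptors might sit together on one segment, the sketch below models only two of the four geometric representations. All class and field names are hypothetical and do not reproduce the schema; the Gene Ontology accession is included purely as an example of a standardised term.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class ExternalReference:
    resource: str   # name of the ontology or database (illustrative)
    accession: str  # identifier within that resource


@dataclass
class Mesh:
    vertices: List[Point]
    polygons: List[Tuple[int, ...]]  # each polygon indexes into `vertices`


@dataclass
class SegmentSketch:
    description: str = ""  # free-form textual descriptor
    external_references: List[ExternalReference] = field(default_factory=list)
    contour_list: List[List[Point]] = field(default_factory=list)
    mesh_list: List[Mesh] = field(default_factory=list)

    def representations(self) -> List[str]:
        """Names of the geometric representations this segment carries."""
        present = []
        if self.contour_list:
            present.append("contourList")
        if self.mesh_list:
            present.append("meshList")
        return present


def bounding_box(segment: SegmentSketch) -> Optional[Tuple[Point, Point]]:
    """Axis-aligned bounds over all contour points and mesh vertices."""
    points = [p for contour in segment.contour_list for p in contour]
    points += [v for mesh in segment.mesh_list for v in mesh.vertices]
    if not points:
        return None
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Hypothetical segment with one contour and one single-triangle mesh.
segment = SegmentSketch(
    description="manually traced organelle",
    external_references=[ExternalReference("go", "GO:0005739")],
    contour_list=[[(1.0, 1.0, 2.0), (2.0, 5.0, 2.0), (-1.0, 1.0, 2.0)]],
    mesh_list=[Mesh([(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 3.0, 0.0)], [(0, 1, 2)])],
)
box = bounding_box(segment)
```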

Download

The current schema (version 0.6.0a4) is available here.

Documentation

Complete documentation of the schema is available here.

Segmentation and Transformations Working Group

A working group (segtrans-wg) has been set up to receive contributions from EM practitioners, software developers and ontologists. Please sign up here.


Auxiliary Tools

sfftk

sfftk provides a shell command and a Python API to process EMDB-SFF files.

The following utilities are available using sfftk:

  • sff convert: Conversion of application-specific segmentation file formats to EMDB-SFF. Currently, sfftk supports the following formats:
    • AmiraMesh (.am)
    • Amira HyperSurface (.surf)
    • Segger (.seg)
    • EMDB Map masks (.map)
    • Stereolithography (.stl)
    • IMOD (.mod)
  • sff notes: Annotation of EMDB-SFF files.
  • sff view: Brief summaries of segmentation files.

Download

The latest development version (version 0.1.dev0) of sfftk may be installed from PyPI; the source may be obtained from CCP-EM.


Publications

  1. Patwardhan, Ardan, Robert Brandt, Sarah J. Butcher, Lucy Collinson, David Gault, Kay Grünewald, Corey Hecksel et al. Building bridges between cellular and molecular structural biology. eLife 6 (2017).
  2. Patwardhan, Ardan, Alun Ashton, Robert Brandt, Sarah Butcher, Raffaella Carzaniga, Wah Chiu, Lucy Collinson et al. A 3D cellular context for the macromolecular world. Nature Structural & Molecular Biology 21, no. 10 (2014): 841-845.
  3. Patwardhan, Ardan, José-Maria Carazo, Bridget Carragher, Richard Henderson, J. Bernard Heymann, Emma Hill, Grant J. Jensen et al. Data management challenges in three-dimensional EM. Nature Structural & Molecular Biology 19, no. 12 (2012): 1203-1207.

Fourier shell correlation (FSC) model

Fourier shell correlation is the most widely used method for assessing the resolution of maps deposited in the EMDB archive.


  • Current FSC model
  • Proposed FSC model

The FSC curve depends on a number of parameters, such as the mask and symmetry applied, and its interpretation depends on the threshold criterion used. The new FSC data model provides the means to describe these factors comprehensively.
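The dependence on the threshold criterion can be seen in a small sketch that reads a resolution off an FSC curve. This is an illustration under stated assumptions, not the EMDB procedure: the curve values are invented, the function name is hypothetical, and the crossing point is found by simple linear interpolation. The thresholds 0.5 and 0.143 are two commonly used criteria.

```python
from typing import List, Optional, Tuple


def resolution_at_threshold(
    curve: List[Tuple[float, float]], threshold: float
) -> Optional[float]:
    """Resolution (in Å) at the first point where the FSC drops below `threshold`.

    `curve` holds (spatial frequency in 1/Å, correlation) pairs sorted by
    increasing frequency. Returns None if the curve never crosses the threshold.
    """
    for (f0, c0), (f1, c1) in zip(curve, curve[1:]):
        if c0 >= threshold > c1:
            # Linear interpolation between the two samples around the crossing.
            fraction = (c0 - threshold) / (c0 - c1)
            crossing = f0 + fraction * (f1 - f0)
            return 1.0 / crossing if crossing > 0 else None
    return None


# Hypothetical FSC curve; real curves come from two independent half-maps.
fsc_curve = [(0.05, 1.0), (0.10, 0.9), (0.15, 0.5), (0.20, 0.1), (0.25, 0.0)]

res_half = resolution_at_threshold(fsc_curve, 0.5)
res_0143 = resolution_at_threshold(fsc_curve, 0.143)
```

The same curve gives a different resolution estimate for each criterion, which is why the threshold, mask and symmetry need to be recorded alongside the curve itself.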