EMPIAR data re-use case study

Gerard Kleywegt, Ardan Patwardhan & Andrii Iudin, EMBL-EBI

Latest update: 19 February 2020

Entry EMPIAR-10061 is a good example of why archiving raw cryo-EM (and general bioimaging) data is important: it enables new science, and it facilitates and accelerates methods development in this rapidly evolving field.

Dataset

Entry EMPIAR-10061 (https://empiar.org/10061) contains the raw cryo-EM data underpinning a breakthrough structure that was, at the time, the highest-resolution cryo-EM structure available. This is the 2.2 Å resolution structure of β-galactosidase from the group of Sriram Subramaniam:

  • 2.2 Å resolution cryo-EM structure of β-galactosidase in complex with a cell-permeant inhibitor. Bartesaghi A, Merk A, Banerjee S, Matthies D, Wu X, Milne JL, Subramaniam S, Science 348 1147-1151 (2015).
  • PMID: 25953817
  • EMDB entry EMD-2984 (image below, left)
  • PDB entry 5a1a (image below, right)

At present, the PDB structure has been cited in 166 publications (56 reviews and 110 articles), the EMDB map in 48 publications and the EMPIAR dataset in 8 publications.

The dataset is 12.4 TB in size (until 2019 it was the largest dataset in EMPIAR), so hosting it and downloading it through ftp would not have been trivial. Even with fast Aspera or Globus transfers from EMBL-EBI, a full download can take anywhere from a couple of hours to as much as a week, depending on the speed of the connection; with ftp it would take much longer.
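The range of download times quoted above can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes decimal terabytes (1 TB = 10^12 bytes), a sustained average throughput with no protocol overhead, and three hypothetical link speeds that are not taken from EMPIAR.

```python
# Rough transfer-time estimate for a 12.4 TB dataset.
# Assumptions (not from the source): 1 TB = 10**12 bytes, a sustained
# average throughput, and purely illustrative link speeds.

DATASET_BYTES = 12.4e12


def transfer_hours(size_bytes: float, mbit_per_s: float) -> float:
    """Return the hours needed to move size_bytes at a sustained rate in Mbit/s."""
    bits = size_bytes * 8
    seconds = bits / (mbit_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    for rate in (100, 1_000, 10_000):  # hypothetical sustained rates in Mbit/s
        hours = transfer_hours(DATASET_BYTES, rate)
        print(f"{rate:>6} Mbit/s: {hours:8.1f} h ({hours / 24:5.1f} days)")
```

Under these assumptions, a sustained 10 Gbit/s connection moves the dataset in under three hours, whereas finishing within a week requires an average of roughly 165 Mbit/s.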

Re-processing

Because the raw data is publicly available in EMPIAR, other cryo-EM researchers have been able to download, analyse and re-process it. At the time of publication, there was some uncertainty in the community about the claimed resolution. With the data available, anyone with an interest could download and re-process it, and quite a few groups have done so. Some of them have deposited the resulting map in EMDB (and sometimes also the model in the PDB). Two notable examples are:

  • The original authors re-processed their own data in 2018; the resulting 1.9 Å map is in EMDB as entry EMD-7770 (image below, left) and the model is in the PDB as entry 6cvm (image below, right). This work is described in the paper with PMID 29754826.
  • Sjors Scheres is the author of the Relion software, the most-used program for image processing in single-particle cryo-EM studies (~90% of all EMDB entries released in 2019 used Relion). He has used the EMPIAR dataset on at least two occasions:
    • To obtain a 2.2 Å map using Relion version 2.0 in 2016 – this map is in EMDB as entry EMD-4116 (paper 27845625) (image below, left).
    • To test version 3.0 of Relion in 2018 – the data was re-processed to 1.9 Å and the map deposited in EMDB as entry EMD-0153 (paper 30412051) (image below, right).

Other re-uses

Other examples that demonstrate how this EMPIAR entry has helped scientists develop new or improved (software) methods include:

Additional types of re-use that have occurred for EMPIAR datasets (but not necessarily this one) include:

  • Use in community challenges where data from EMPIAR is downloaded by multiple independent labs and re-processed using local procedures and software, and the resulting maps (and sometimes models built into those maps) are compared in a community meeting.
  • Use in software tutorials and other training materials.
  • Local use to train new students etc. in the processing of cryo-EM data.
  • Validation of the published map or model obtained with a dataset.

If you know of any additional re-uses of this dataset, please let us know.

Download statistics

Although we don’t track individual downloads, we have been able to determine that between 1 April 2018 and 1 December 2019 (20 months), 108,000 files from this entry were downloaded by ftp and 127,000 through Aspera (Globus download statistics are not available). The entry itself contains 3,079 files, but users can choose to download any or all of them. Assuming all users downloaded the entire dataset, this corresponds to ~75 complete downloads in this period. The entry was released in May 2016, and it seems likely that it was downloaded at least as many times in the period May 2016 to March 2018.

The statistics do not include files that have been downloaded through the EMPIAR mirror in Osaka.
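The estimate of complete downloads follows from simple arithmetic on the figures quoted above. The sketch below reproduces it; the variable and function names are illustrative, and it keeps the text's simplifying assumption that every user downloaded all files in the entry.

```python
# Equivalent number of complete downloads of EMPIAR-10061,
# using the file counts quoted in the text (1 April 2018 to 1 December 2019).
# Assumption (as in the text): every user downloaded the entire dataset.

FTP_FILES = 108_000      # files downloaded by ftp
ASPERA_FILES = 127_000   # files downloaded through Aspera
FILES_IN_ENTRY = 3_079   # files in the entry
MONTHS = 20              # length of the reporting period


def equivalent_full_downloads(files_downloaded: int, files_in_entry: int) -> float:
    """Return how many complete copies the downloaded file count corresponds to."""
    return files_downloaded / files_in_entry


total_files = FTP_FILES + ASPERA_FILES
full_downloads = equivalent_full_downloads(total_files, FILES_IN_ENTRY)

print(f"Total files downloaded:    {total_files}")
print(f"Equivalent full downloads: {full_downloads:.1f}")
print(f"Average per month:         {full_downloads / MONTHS:.1f}")
```

The result is about 76 complete downloads, in line with the rounded figure of ~75 given above, or just under four per month. Since Globus and Osaka mirror downloads are not counted, this is a lower bound on actual use.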