Submitting epigenomic data

This page presents a checklist of the minimal information that we expect from data submitters to the European Nucleotide Archive (ENA) when describing raw data sets from next-generation sequencing platforms used in high-throughput studies of epigenetic features. The checklist is intended as practical help for those preparing their data for submission to the ENA. We do not claim that the information marked as mandatory in the list below is sufficient for successful reproduction of experimental findings; the broader reporting standard, MINSEQE, exists to serve that purpose. Because MINSEQE compliance, and greater utility in general, may require information beyond the minimal checklist presented here, we expect that our epigenomics data submitters will also use other publication and presentation resources, and that the ENA submission of raw data represents just one of several components of reporting.

Many items on this checklist are also described elsewhere on the ENA website, as part of the read domain XML schema descriptions and further documentation. Many of the mandatory items are formally described in our schema documentation, but are repeated here to make the checklist a single point of reference for epigenomics data submissions. A small number of fields are not formally required in our schemas, but have been developed through our experience of capturing data from epigenomics studies and through observation of community practice, not least that of the ENCODE consortium.

We present this checklist as a living document that we expect to edit and update over time in response to emerging methods and practices and to community feedback, which we welcome at datasubs@ebi.ac.uk.

Both interactive and programmatic tools are available to aid in the submission of epigenomics data to ENA. For those using the interactive submission tool, Webin, the checklist presented here should serve as an indication of the information that needs to be to hand to complete the submission. For users of the programmatic submission tool, a mapping between checklist items and XML elements is presented at the end of this page. Programmatic users should also note that a small number of additional mandatory elements, described in the XML schemas, may be required. Access and further details about all submission types are available here, and submission enquiries and requests for assistance are welcome at datasubs@ebi.ac.uk.

Minimal information checklist

The checklist contains mandatory, recommended and optional fields and is subdivided into four categories: study, sample and sample processing, sequencing library, and sequencing data.

Checklist fields for study

Field Description
Mandatory fields
M-1. Study title A short informative title of the epigenomics study; typically akin to a publication title.
M-2. Investigator name The name of the primary investigator.
M-3. Investigator e-mail The e-mail address for the primary investigator.
M-4. Center name The name of the centre in which the primary investigator has worked.
M-5. Study description A detailed description of the study; typically akin to a publication abstract.
Recommended fields
R-1. Study type The type of the study; expected to be Epigenetics.
Optional fields
O-1. Release date The intended release date from pre-publication confidentiality.

Checklist fields for sample and sample processing

Field Description
Mandatory fields
M-1. Taxonomic identifier Species or infraspecies taxonomic name of the sampled organism or the taxonomic identifier taken from the NCBI Taxonomy. More information about taxonomy search and browsing is available here.
M-2. Strain name Strain name of the sampled organism, for prokaryotes or samples from Mus musculus.
M-3. Cell line Name of the cell line, if used.
Recommended fields
R-1. Organ or tissue source Organ or tissue source of the sampled material.
R-2. Epitope tag Details of epitope tagging approach, if used, including nature of tagged gene promoter and expression level.
R-3. Cell line growth conditions Cell line growth conditions and characteristics, such as passage number, observed doubling time and density.
R-4. Physical sample source Physical source of sample, such as stock centre and germplasm collection.
Optional fields
O-1. Phenotypic attributes Phenotypic attributes of the sampled organism of relevance to the study.

Checklist fields for sequencing library

Field Description
Mandatory fields
M-1. Experimental design description A brief experimental design description.
M-2. Epigenomics method The epigenomics method that has been used, such as ChIP-Seq and MeDIP-Seq.
M-3. Library source The library source; expected to be genomic.
M-4. Library selection The method of library selection, such as 5-methylcytidine antibody and ChIP.
M-5. Antibody name Antibody name, if used.
M-6. Library layout The library layout; expected to be unpaired reads.
M-7. Platform/Model Sequencing vendor platform and instrument model, such as Illumina HiSeq 2000 or AB SOLiD 5500.
Recommended fields
R-1. Post amplification validation Description of post-amplification validation steps to ensure unbiased representation.
R-2. Antibody lot number The antibody lot number.
R-3. Antibody provider The source of the antibody.

Checklist fields for sequencing data

Field Description
Mandatory fields
M-1. Data files Fastq-formatted data files or aligned BAM files; for BAM files, the ENA accession for the reference sequence should be indicated within the BAM file.
M-2. MD5 checksum MD5 checksum for each data file.
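
As an illustration of item M-2, the following minimal sketch computes an MD5 checksum for a data file using Python's standard hashlib module. The function name md5_of_file and the chunk size are our own choices for the example and are not part of any ENA tool.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large Fastq or BAM files
        # are never loaded into memory in one piece.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file on each data file before upload gives the value to report alongside that file.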

Checklist mapping to XML

Checklist information is inserted where possible into specific XML fields. Where specific XML fields do not exist, the schema for each XML object supports extensions using a TAG:VALUE system that appears under STUDY_ATTRIBUTES, SAMPLE_ATTRIBUTES, EXPERIMENT_ATTRIBUTES and RUN_ATTRIBUTES. The TAG:VALUE pairs are used to capture the remaining checklist information using the intuitively understandable TAG names listed below. URLs to further information should be inserted into the STUDY_LINKS, SAMPLE_LINKS, EXPERIMENT_LINKS or RUN_LINKS structure of the respective XML object. Please refer to the schema documentation for further details.
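
The TAG:VALUE extension mechanism can be sketched as follows, using Python's xml.etree.ElementTree. This is an illustration only: the helper name add_attribute is hypothetical, the VALUE element as a sibling of TAG is our reading of the TAG:VALUE system (the mappings below show only TAG), and the example values are invented. The schema documentation remains the authority.

```python
import xml.etree.ElementTree as ET


def add_attribute(parent, object_name, tag, value):
    """Append one TAG:VALUE pair under <OBJECT>_ATTRIBUTES of `parent`.

    `object_name` is STUDY, SAMPLE, EXPERIMENT or RUN.
    """
    container_name = f"{object_name}_ATTRIBUTES"
    container = parent.find(container_name)
    if container is None:
        # Create the container the first time an attribute is added.
        container = ET.SubElement(parent, container_name)
    attribute = ET.SubElement(container, f"{object_name}_ATTRIBUTE")
    ET.SubElement(attribute, "TAG").text = tag
    # Assumption: the value is held in a VALUE element next to TAG.
    ET.SubElement(attribute, "VALUE").text = value
    return attribute


# Illustrative values only.
sample = ET.Element("SAMPLE")
add_attribute(sample, "SAMPLE", "Cell line", "HeLa")
add_attribute(sample, "SAMPLE", "Tissue type", "cervix")
print(ET.tostring(sample, encoding="unicode"))
```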

Field mappings for study

Field Mapping to XML
Mandatory fields
M-1. Study title /STUDY_SET/STUDY/DESCRIPTOR/STUDY_TITLE
M-2. Investigator name /STUDY_SET/STUDY/STUDY_ATTRIBUTES/STUDY_ATTRIBUTE/TAG[Investigator name]
M-3. Investigator e-mail /STUDY_SET/STUDY/STUDY_ATTRIBUTES/STUDY_ATTRIBUTE/TAG[Investigator email]
M-4. Center name /STUDY_SET/STUDY/@center_name
M-5. Study description /STUDY_SET/STUDY/DESCRIPTOR/STUDY_ABSTRACT
Recommended fields
R-1. Study type /STUDY_SET/STUDY/DESCRIPTOR/STUDY_TYPE/@existing_study_type
Optional fields
O-1. Release date /SUBMISSION/ACTIONS/ACTION/HOLD
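
To show how the study mappings fit together, here is a minimal sketch that assembles a STUDY_SET document from the paths listed above. The function name build_study_set and all example values are hypothetical. The order of elements inside DESCRIPTOR and the VALUE element are assumptions to check against the schema, and, as noted earlier, the schema may require further mandatory elements that are not shown. The release date is left out because it belongs to the separate SUBMISSION object.

```python
import xml.etree.ElementTree as ET


def build_study_set(title, abstract, center_name, investigator, email,
                    study_type="Epigenetics"):
    """Build a STUDY_SET element covering study fields M-1 to M-5 and R-1."""
    study_set = ET.Element("STUDY_SET")
    # M-4: centre name is an attribute of STUDY.
    study = ET.SubElement(study_set, "STUDY", center_name=center_name)

    descriptor = ET.SubElement(study, "DESCRIPTOR")
    ET.SubElement(descriptor, "STUDY_TITLE").text = title          # M-1
    ET.SubElement(descriptor, "STUDY_TYPE",
                  existing_study_type=study_type)                  # R-1
    ET.SubElement(descriptor, "STUDY_ABSTRACT").text = abstract    # M-5

    # M-2 and M-3 have no dedicated element, so they use TAG:VALUE pairs.
    attributes = ET.SubElement(study, "STUDY_ATTRIBUTES")
    for tag, value in (("Investigator name", investigator),
                       ("Investigator email", email)):
        attribute = ET.SubElement(attributes, "STUDY_ATTRIBUTE")
        ET.SubElement(attribute, "TAG").text = tag
        ET.SubElement(attribute, "VALUE").text = value  # assumed element name
    return study_set


# Illustrative values only.
study_set = build_study_set(
    title="Histone modification profiling in an example cell line",
    abstract="A hypothetical abstract describing the study.",
    center_name="EXAMPLE_CENTRE",
    investigator="Jane Doe",
    email="jane.doe@example.org",
)
print(ET.tostring(study_set, encoding="unicode"))
```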

Field mappings for sample and sample processing

Field Mapping to XML
Mandatory fields
M-1. Taxonomic identifier /SAMPLE_SET/SAMPLE/SAMPLE_NAME/TAXON_ID
M-2. Strain name /SAMPLE_SET/SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG[Strain]
M-3. Cell line /SAMPLE_SET/SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG[Cell line]
Recommended fields
R-1. Organ or tissue source /SAMPLE_SET/SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG[Tissue type]
R-2. Epitope tag /SAMPLE_SET/SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG[Epitope tag]
R-3. Cell line growth conditions /SAMPLE_SET/SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG[Cell growth]
R-4. Physical sample source /SAMPLE_SET/SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG[Sample source]
Optional fields
O-1. Phenotypic attributes /SAMPLE_SET/SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG[Phenotype:<name of phenotype>]
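
The sample mappings can also be read in the other direction. The sketch below extracts checklist fields from a small, hand-written SAMPLE_SET document. The function name read_sample_fields and the example values are hypothetical (10090 is the NCBI Taxonomy identifier for Mus musculus), and the VALUE element is assumed to sit next to TAG as in the TAG:VALUE system.

```python
import xml.etree.ElementTree as ET

# Hand-written example document; values are illustrative only.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE>
    <SAMPLE_NAME><TAXON_ID>10090</TAXON_ID></SAMPLE_NAME>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>Strain</TAG><VALUE>C57BL/6</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>Tissue type</TAG><VALUE>liver</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def read_sample_fields(xml_text):
    """Return a dict of checklist fields found in the first SAMPLE."""
    sample = ET.fromstring(xml_text).find("SAMPLE")
    # M-1 has a dedicated element under SAMPLE_NAME.
    fields = {"Taxonomic identifier": sample.findtext("SAMPLE_NAME/TAXON_ID")}
    # All remaining checklist fields are TAG:VALUE pairs keyed by TAG.
    for attribute in sample.iterfind("SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE"):
        fields[attribute.findtext("TAG")] = attribute.findtext("VALUE")
    return fields


fields = read_sample_fields(SAMPLE_XML)
print(fields)
```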

Field mappings for sequencing library

Field Mapping to XML
Mandatory fields
M-1. Experimental design description /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG[Experimental design]
will also be copied to:
/EXPERIMENT_SET/EXPERIMENT/DESIGN/DESIGN_DESCRIPTION
M-2. Epigenomics method /EXPERIMENT_SET/EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_STRATEGY
M-3. Library source /EXPERIMENT_SET/EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SOURCE
M-4. Library selection /EXPERIMENT_SET/EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SELECTION
M-5. Antibody name /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG[Antibody]
M-6. Library layout /EXPERIMENT_SET/EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_LAYOUT
M-7. Platform/Model /EXPERIMENT_SET/EXPERIMENT/PLATFORM
Recommended fields
R-1. Post amplification validation /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG[Amplification validation]
R-2. Antibody lot number /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG[Antibody lot]
R-3. Antibody provider /EXPERIMENT_SET/EXPERIMENT/EXPERIMENT_ATTRIBUTES/EXPERIMENT_ATTRIBUTE/TAG[Antibody provider]
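
Before submitting, it can be useful to check that an EXPERIMENT carries the mandatory library fields. The following sketch tests only for the presence of the elements named in the mappings above; it does not validate their content or replace schema validation. The names MANDATORY_PATHS and missing_fields and the example document are hypothetical. M-5 (antibody name) is left out because it applies only if an antibody was used.

```python
import xml.etree.ElementTree as ET

# Checklist item -> path relative to EXPERIMENT, taken from the mappings above.
MANDATORY_PATHS = {
    "M-1. Experimental design description": "DESIGN/DESIGN_DESCRIPTION",
    "M-2. Epigenomics method": "DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_STRATEGY",
    "M-3. Library source": "DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SOURCE",
    "M-4. Library selection": "DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SELECTION",
    "M-6. Library layout": "DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_LAYOUT",
    "M-7. Platform/Model": "PLATFORM",
}


def missing_fields(experiment):
    """List the checklist items whose element is absent from `experiment`."""
    return [name for name, path in MANDATORY_PATHS.items()
            if experiment.find(path) is None]


# Deliberately incomplete example: no LIBRARY_LAYOUT and no PLATFORM.
EXPERIMENT_XML = """<EXPERIMENT_SET><EXPERIMENT>
  <DESIGN>
    <DESIGN_DESCRIPTION>ChIP-Seq of H3K4me3</DESIGN_DESCRIPTION>
    <LIBRARY_DESCRIPTOR>
      <LIBRARY_STRATEGY>ChIP-Seq</LIBRARY_STRATEGY>
      <LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
      <LIBRARY_SELECTION>ChIP</LIBRARY_SELECTION>
    </LIBRARY_DESCRIPTOR>
  </DESIGN>
</EXPERIMENT></EXPERIMENT_SET>"""

experiment = ET.fromstring(EXPERIMENT_XML).find("EXPERIMENT")
print(missing_fields(experiment))
# prints ['M-6. Library layout', 'M-7. Platform/Model']
```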

Field mappings for sequencing data

Field Mapping to XML
Mandatory fields
M-1. Data files /RUN_SET/RUN/DATA_BLOCK/FILES/FILE/@filename, @filetype
M-2. MD5 checksum /RUN_SET/RUN/DATA_BLOCK/FILES/FILE/@checksum, @checksum_method
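
The sequencing data mappings place everything in attributes of FILE. A minimal sketch of building such a RUN_SET is shown below. The function name build_run_set, the file name and the attribute values "fastq" and "MD5" are assumptions for illustration; the permitted values are defined in the schema documentation. The checksum shown is simply the MD5 of an empty file.

```python
import xml.etree.ElementTree as ET


def build_run_set(files):
    """Build a RUN_SET with one FILE element per data file.

    `files` is an iterable of (filename, filetype, md5_hex) tuples.
    """
    run_set = ET.Element("RUN_SET")
    run = ET.SubElement(run_set, "RUN")
    data_block = ET.SubElement(run, "DATA_BLOCK")
    files_element = ET.SubElement(data_block, "FILES")
    for filename, filetype, md5_hex in files:
        # M-1 maps to filename and filetype; M-2 to checksum and checksum_method.
        ET.SubElement(files_element, "FILE",
                      filename=filename,
                      filetype=filetype,
                      checksum_method="MD5",
                      checksum=md5_hex)
    return run_set


# Illustrative values only.
run_set = build_run_set([
    ("example_reads.fastq.gz", "fastq", "d41d8cd98f00b204e9800998ecf8427e"),
])
print(ET.tostring(run_set, encoding="unicode"))
```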