Read domain XML 1.5 metadata format

The SRA 1.5 XML replaced SRA 1.4 XML on 3rd of March 2013. A large number of deprecated fields were removed from SRA 1.5 XML (for more information please see below). The Analysis XML and Dataset XML underwent a number of additional changes on the 15th of July 2013 (for more information please see below). A number of additional changes are planned for Project XML. These will be followed by the deprecation of Study XML as a submission format in favour of Project XML (for more information please see below).

If you have any questions or concerns about the change please contact datasubs@ebi.ac.uk.

 

XML Schema Description
SRA.submission.xsd A submission action to be performed by the archive.
SRA.sample.xsd Detailed information about the sequenced sample. Samples can be used in any number of experiments.
SRA.study.xsd A study groups together experiments or analyses for public data release purposes.
SRA.experiment.xsd An experiment contains instrument and library preparation information and groups together one or more runs.
SRA.run.xsd A run contains sequencing reads submitted in data files (e.g. BAM or CRAM).

SRA.analysis.xsd An analysis contains secondary analysis results, for example read alignments (BAM or CRAM), sequence variations (VCF) or sequence annotations (TAB).
SRA.common.xsd Common types used in other SRA XML schemas.
EGA.dac.xsd A European Genome-phenome Archive (EGA) data access committee (DAC). Required for authorized access submissions.
EGA.policy.xsd A European Genome-phenome Archive (EGA) data access policy. Required for authorized access submissions.
EGA.dataset.xsd A European Genome-phenome Archive (EGA) data set. Required for authorized access submissions.

Proposed changes to Project XML

  • Replaced COLLABORATORS element with ORGANIZATIONS. An organisation can be either owner or participant. This structure is more conveniently exchanged with our INSDC partners.
  • Added GRANTS element with AGENCY, IDENTIFIER and optional TITLE details.
  • Removed UNSTRUCTURED_CITATION element and renamed STRUCTURED_CITATION element to CITATION.
  • Removed AUTHORS element and moved AUTHOR and CONSORTIUM to appear under CITATION.
  • Removed 'material' attribute from SUBMISSION_PROJECT. This information should be tracked on the experiment level rather than on the study level.
  • Removed 'selection' attribute from SUBMISSION_PROJECT. This information should be tracked on the experiment level rather than on the study level.
  • Removed SEQUENCING_PROJECT element and moved LOCUS_TAG_PREFIX to appear under SUBMISSION_PROJECT.
  • Removed OBJECTIVE element.
  • Removed RELATED_CHROMOSOMES element.

Replacement of Study XML by Project XML

The Study XML will be deprecated as a submission format for programmatic submitters by the end of 2013. All interactive Webin submitters have already been migrated to use Project XML rather than Study XML.

Field mappings from Study XML to Project XML are:

  • STUDY_SET and STUDY will appear as PROJECT_SET and PROJECT elements.
  • 'alias', 'center_name', 'broker_name' and 'accession' attributes in STUDY element will appear under PROJECT element.
  • IDENTIFIERS element in STUDY will appear under PROJECT element.
  • STUDY_TITLE will appear as PROJECT/TITLE element.
  • STUDY_ABSTRACT and STUDY_DESCRIPTION will appear as PROJECT/DESCRIPTION element. These two fields are merged into one.
  • STUDY_TYPE field may not appear as part of Project XML. This field will be either deprecated or kept with an improved vocabulary (under discussion).
  • CENTER_NAME element will not appear as part of Project XML. This field will be deprecated.
  • CENTER_PROJECT_NAME element will not appear as part of Project XML. This field will be deprecated.
  • PROJECT_ID element will not appear as part of Project XML. This field will be deprecated.
  • RELATED_STUDIES element will appear as RELATED_PROJECTS element.
  • STUDY_LINK element will appear as PROJECT_LINK element.
  • STUDY_ATTRIBUTES element will appear as PROJECT_ATTRIBUTES element.
  • There are two types of projects: SUBMISSION_PROJECT and UMBRELLA_PROJECT. Only SUBMISSION_PROJECT can directly contain data and corresponds to the Study XML.
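
The field mappings above can be sketched as a small lookup. This is only an illustration: it assumes the study has already been read into a flat dictionary of field name to value (a hypothetical representation, not the nesting used by the real schemas), and the function name study_fields_to_project is made up for this example.

```python
# Element renames taken from the mapping list above.
RENAMES = {
    "STUDY_SET": "PROJECT_SET",
    "STUDY": "PROJECT",
    "STUDY_TITLE": "TITLE",
    "RELATED_STUDIES": "RELATED_PROJECTS",
    "STUDY_LINK": "PROJECT_LINK",
    "STUDY_ATTRIBUTES": "PROJECT_ATTRIBUTES",
}
# These two fields are merged into a single DESCRIPTION.
MERGED_INTO_DESCRIPTION = ("STUDY_ABSTRACT", "STUDY_DESCRIPTION")
# Elements that will not appear in Project XML. Keys are case-sensitive,
# so the lower-case 'center_name' attribute is kept.
DROPPED = ("CENTER_NAME", "CENTER_PROJECT_NAME", "PROJECT_ID")


def study_fields_to_project(study):
    """Map a flat dict of Study XML fields to Project XML field names.

    Hypothetical helper: the flat dict is an assumption of this sketch.
    STUDY_TYPE is still under discussion, so it is passed through unchanged.
    """
    project = {}
    description_parts = []
    for name, value in study.items():
        if name in DROPPED:
            continue
        if name in MERGED_INTO_DESCRIPTION:
            description_parts.append(value)
        else:
            project[RENAMES.get(name, name)] = value
    if description_parts:
        project["DESCRIPTION"] = "\n\n".join(description_parts)
    return project
```

Attributes such as 'alias' and 'accession' keep their names, so they fall through the lookup untouched.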

Additional fields in Project XML are:

  • PUBLICATIONS: citation references to PubMed, PubMed Central and DOI and support for collecting pre-publication citation information.
  • GRANTS: support for collecting grant information.
  • ORGANIZATIONS: support for collecting project owner and other participant details.
  • UMBRELLA_PROJECT: support for grouping other studies together.

Changes to Analysis XML on 15 July 2013

New schema effective from 15 July 2013: SRA.analysis.xsd

General changes

  • CHECKLIST element now supports IDENTIFIERS element.

SEQUENCE_ANNOTATION analysis type

  • Added 'wig', 'bed' and 'gff' filetypes to be used only for SEQUENCE_ANNOTATION.
  • The files must be gzipped before submission.
  • The file suffixes must be .bed.gz, .wig.gz, and .gff.gz.

SEQUENCE_VARIATION analysis type

  • Added PROGRAM, PLATFORM, IMPUTATION, EXPERIMENT_TYPE elements.
  • Allowed values for EXPERIMENT_TYPE are:
    • Whole genome sequencing
    • Exome sequencing
    • Genotyping by array
    • Curation
  • Added 'readme_file' filetype used for SEQUENCE_VARIATION and REFERENCE_ALIGNMENT.
  • Added 'vcf_aggregate' filetype used only for SEQUENCE_VARIATION.
  • Added 'tabix' filetype used only for SEQUENCE_VARIATION.
  • File suffix for 'vcf' and 'vcf_aggregate' filetypes must be '.vcf.gz'.
  • File suffix for 'tabix' filetype must be '.tbi'.
  • Only one file of type 'vcf' or 'vcf_aggregate' is allowed in an analysis.
  • Added 'other' filetype used only for SEQUENCE_VARIATION.
  • Any number of files with 'other' filetype are allowed in an analysis.
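
The file rules above for SEQUENCE_VARIATION could be checked before submission along the following lines. This is a minimal sketch, not the archive's validator: the function name and the (filename, filetype) tuple input are assumptions of this example, and "only one file" is read here as "at most one".

```python
# Suffix rules stated in the list above.
REQUIRED_SUFFIX = {
    "vcf": ".vcf.gz",
    "vcf_aggregate": ".vcf.gz",
    "tabix": ".tbi",
}


def check_variation_files(files):
    """Return a list of problems for (filename, filetype) pairs.

    Hypothetical helper covering only the suffix and count rules.
    """
    problems = []
    for filename, filetype in files:
        suffix = REQUIRED_SUFFIX.get(filetype)
        if suffix and not filename.endswith(suffix):
            problems.append(
                f"{filename}: filetype '{filetype}' requires suffix '{suffix}'"
            )
    # 'vcf' and 'vcf_aggregate' share a single slot in an analysis.
    vcf_count = sum(1 for _, t in files if t in ("vcf", "vcf_aggregate"))
    if vcf_count > 1:
        problems.append("only one 'vcf' or 'vcf_aggregate' file is allowed")
    return problems
```

Files of type 'other' are never counted or suffix-checked, since any number of them is allowed.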

REFERENCE_ALIGNMENT analysis type

  • Added 'readme_file' filetype to be used only for SEQUENCE_VARIATION and REFERENCE_ALIGNMENT.

SAMPLE_PHENOTYPE analysis type

  • Added new analysis type: SAMPLE_PHENOTYPE.
  • Only allowed for EGA submissions.
  • Added 'phenotype_file' filetype to be used only for SAMPLE_PHENOTYPE.
  • Each SAMPLE_PHENOTYPE analysis must have one 'phenotype_file'.

SEQUENCE_ASSEMBLY analysis type

  • Added new analysis type: SEQUENCE_ASSEMBLY.
  • Added NAME, PARTIAL, COVERAGE, PROGRAM, PLATFORM, MIN_GAP_LENGTH elements.
  • Added the following filetypes to be used only for SEQUENCE_ASSEMBLY:
    • contig_fasta (0 or 1 occurrences)
    • contig_flatfile (0 or 1 occurrences)
    • scaffold_fasta (0 or 1 occurrences)
    • scaffold_flatfile (0 or 1 occurrences)
    • scaffold_agp (0 or 1 occurrences)
    • chromosome_fasta (0 or 1 occurrences)
    • chromosome_flatfile (0 or 1 occurrences)
    • chromosome_agp (0 or 1 occurrences)
    • chromosome_list (0 or 1 occurrences)
    • unlocalised_contig_list (0 or 1 occurrences)
    • unlocalised_scaffold_list (0 or 1 occurrences)
  • The following file types are mutually exclusive:
    • contig_fasta and contig_flatfile
    • scaffold_fasta and scaffold_agp
    • chromosome_fasta and chromosome_agp
  • unlocalised_contig_list filetype requires contig_fasta or contig_flatfile filetype.
  • unlocalised_scaffold_list filetype requires scaffold_fasta, scaffold_flatfile or scaffold_agp filetype.
  • Each filetype must occur at most once.
  • The files must be gzipped before submission.
  • The file suffix for '*_fasta' filetypes must be .fasta.gz.
  • The file suffix for '*_agp' filetypes must be .agp.gz.
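
The SEQUENCE_ASSEMBLY rules above combine occurrence limits, mutual exclusions, dependencies and suffixes, so a pre-submission check may be useful. The sketch below is an illustration under assumptions: the function name and the (filename, filetype) tuple input are invented for this example, and gzip compression is only checked through the file suffix.

```python
from collections import Counter

# Pairs that must not appear together in one analysis.
MUTUALLY_EXCLUSIVE = [
    ("contig_fasta", "contig_flatfile"),
    ("scaffold_fasta", "scaffold_agp"),
    ("chromosome_fasta", "chromosome_agp"),
]
# A list filetype needs at least one of the filetypes in its set.
REQUIRES = {
    "unlocalised_contig_list": {"contig_fasta", "contig_flatfile"},
    "unlocalised_scaffold_list": {
        "scaffold_fasta", "scaffold_flatfile", "scaffold_agp",
    },
}


def check_assembly_files(files):
    """Return a list of problems for (filename, filetype) pairs."""
    problems = []
    counts = Counter(filetype for _, filetype in files)
    for filetype, n in counts.items():
        if n > 1:
            problems.append(f"'{filetype}' occurs {n} times; at most once is allowed")
    for first, second in MUTUALLY_EXCLUSIVE:
        if first in counts and second in counts:
            problems.append(f"'{first}' and '{second}' are mutually exclusive")
    for filetype, needed in REQUIRES.items():
        if filetype in counts and needed.isdisjoint(counts):
            problems.append(f"'{filetype}' requires one of: {', '.join(sorted(needed))}")
    for filename, filetype in files:
        if filetype.endswith("_fasta") and not filename.endswith(".fasta.gz"):
            problems.append(f"{filename}: '{filetype}' requires suffix .fasta.gz")
        if filetype.endswith("_agp") and not filename.endswith(".agp.gz"):
            problems.append(f"{filename}: '{filetype}' requires suffix .agp.gz")
    return problems
```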

Changes to Dataset XML on 15th of July 2013

New schema effective from 15 July 2013: EGA.dataset.xsd

  • Made DATASET_TYPE mandatory.
  • Allowed values for DATASET_TYPE are:
    • Whole genome sequencing
    • Exome sequencing
    • Genotyping by array
    • Transcriptome profiling by high-throughput sequencing
    • Transcriptome profiling by array
    • Amplicon sequencing
    • Methylation binding domain sequencing
    • Methylation profiling by high-throughput sequencing
    • Phenotype information
    • Study summary information
    • Genomic variant calling

Changes between SRA 1.5 XML and SRA 1.4 XML

SRA Webin submitters were largely unaffected by the change. They will only observe some changes in the instrument model, library strategy and library selection choices. Only minor adaptations to programmatic submission pipelines are expected. Please note that the FILES element has now been removed from Submission XML, as announced with SRA 1.4 XML. Consequently, the file 'checksum' and 'checksum_method' attributes are now mandatory in Run XML (they were already mandatory in Analysis XML). Also, please note that the library strategy in Experiment XML has been made mandatory.

Instrument model changes

  • Removed LS454 instrument model '454 GS FLX Plus'. Please use '454 GS FLX+' instead.
  • Removed ABI_SOLID instrument model 'AB SOLiD 5500'. Please use 'AB 5500 Genetic Analyzer' instead.
  • Removed ABI_SOLID instrument model 'AB SOLiD 5500xl'. Please use 'AB 5500xl Genetic Analyzer' instead.

Library source changes

  • Removed 'NON GENOMIC' library source.

Library strategy changes

  • New library strategy ncRNA-Seq: Capture of other non-coding RNA types, including post-translation modification types such as snRNA (small nuclear RNA) or snoRNA (small nucleolar RNA), or expression regulation types such as siRNA (small interfering RNA) or piRNA/piwi/RNA (piwi-interacting RNA).
  • New library strategy SELEX: Systematic Evolution of Ligands by EXponential enrichment.
  • New library strategy RIP-Seq: Direct sequencing of RNA immunoprecipitates (includes CLIP-Seq, HITS-CLIP and PAR-CLIP).
  • New library strategy ChIA-PET: Direct sequencing of proximity-ligated chromatin immunoprecipitates.

Library selection changes

  • New library selection 'repeat fractionation': Selection for less repetitive (and more gene rich) sequence through Cot filtration (CF) or other fractionation techniques based on DNA kinetics.
  • New library selection 'repeat fractionation' replaces: CF-S, CF-M, CF-H, CF-T.
  • Corrected library selection 'DNAse' to 'DNase'.
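
Programmatic submitters with existing metadata may want to rewrite the removed instrument model and library selection values listed above. The following is a minimal sketch of such a lookup; the dictionary and function names are hypothetical, and 'NON GENOMIC' is left out because no replacement is named for it.

```python
# Removed instrument models and their stated replacements.
INSTRUMENT_MODEL_REPLACEMENTS = {
    "454 GS FLX Plus": "454 GS FLX+",
    "AB SOLiD 5500": "AB 5500 Genetic Analyzer",
    "AB SOLiD 5500xl": "AB 5500xl Genetic Analyzer",
}

# Library selection values replaced or corrected in SRA 1.5 XML.
LIBRARY_SELECTION_REPLACEMENTS = {
    "CF-S": "repeat fractionation",
    "CF-M": "repeat fractionation",
    "CF-H": "repeat fractionation",
    "CF-T": "repeat fractionation",
    "DNAse": "DNase",
}


def migrate_value(value, replacements):
    """Return the SRA 1.5 value for an SRA 1.4 value, or the value unchanged."""
    return replacements.get(value, value)
```

For example, migrate_value("CF-H", LIBRARY_SELECTION_REPLACEMENTS) gives 'repeat fractionation', while values that were not changed pass through as they are.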

Changes to Submission XSD

  • Removed FILES from submission.
  • Made 'schema' attribute mandatory for MODIFY action.
  • Removed 'target' attribute in MODIFY action.
  • Removed 'HoldForPeriod' attribute from HOLD action.
  • Removed 'CLOSE' action.
  • Removed 'notes' attribute from ADD, MODIFY, SUPPRESS, HOLD, RELEASE, VALIDATE actions.
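
An existing Submission XML can be scanned for the constructs removed above before it is resubmitted. The sketch below uses Python's standard xml.etree.ElementTree module; the function name is hypothetical, and the layout of the sample document (SUBMISSION, ACTIONS, ACTION wrappers and the 'source' attribute) is an assumption made for illustration, so please check it against SRA.submission.xsd.

```python
import xml.etree.ElementTree as ET

# Actions from which the 'notes' attribute was removed.
ACTIONS_WITHOUT_NOTES = ("ADD", "MODIFY", "SUPPRESS", "HOLD", "RELEASE", "VALIDATE")

# Assumed document layout, used only to demonstrate the check.
SAMPLE = """<SUBMISSION>
  <ACTIONS>
    <ACTION><MODIFY source="run.xml" target="x" notes="fix"/></ACTION>
    <ACTION><HOLD HoldForPeriod="30"/></ACTION>
  </ACTIONS>
</SUBMISSION>"""


def find_removed_constructs(submission_xml):
    """Return a list of SRA 1.4 constructs that SRA 1.5 no longer accepts."""
    root = ET.fromstring(submission_xml)
    problems = []
    if root.find(".//FILES") is not None:
        problems.append("FILES element has been removed")
    if root.find(".//CLOSE") is not None:
        problems.append("CLOSE action has been removed")
    for modify in root.iter("MODIFY"):
        if "schema" not in modify.attrib:
            problems.append("MODIFY requires the 'schema' attribute")
        if "target" in modify.attrib:
            problems.append("MODIFY no longer has a 'target' attribute")
    for hold in root.iter("HOLD"):
        if "HoldForPeriod" in hold.attrib:
            problems.append("HOLD no longer has a 'HoldForPeriod' attribute")
    for name in ACTIONS_WITHOUT_NOTES:
        for action in root.iter(name):
            if "notes" in action.attrib:
                problems.append(f"{name} no longer has a 'notes' attribute")
    return problems
```

Calling find_removed_constructs(SAMPLE) reports four problems: the missing 'schema', the 'target' and 'notes' attributes on MODIFY, and 'HoldForPeriod' on HOLD.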

Changes to Run XSD

  • Made 'checksum_method' attribute mandatory for submitted files.
  • Made 'checksum' attribute mandatory for submitted files.
  • Removed 'GAP_DESCRIPTOR' element.
  • Removed 'DATA_SERIES_LABEL' element (effected at EBI in SRA 1.3 XML).
  • Removed 'illumina_native_fastq' filetype. Use 'fastq' instead.
  • Removed 'name', 'sector', 'region' attributes from DATA_BLOCK (effected at EBI in SRA 1.3 XML).
  • Removed 'total_spots', 'total_reads', 'number_channels', 'format_code', 'serial' attributes from DATA_BLOCK element (effected at EBI in SRA 1.3 XML).
  • Removed 'instrument_name' attribute.
  • Removed already deprecated 'instrument_model' attribute (effected at EBI in SRA 1.3 XML).
  • Removed already deprecated 'run_file' and 'total_data_blocks' attributes (effected at EBI in SRA 1.3 XML).
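
Since 'checksum' and 'checksum_method' are now mandatory for submitted files, pipelines need to compute a checksum for every data file. The sketch below shows one way to do this with Python's hashlib. The FILE element name, the 'filename' and 'filetype' attribute names and the "MD5" value are assumptions of this example and should be checked against SRA.run.xsd; the helper names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_element(path, filetype):
    """Build a FILE element carrying the mandatory checksum attributes.

    Element and attribute names other than 'checksum' and
    'checksum_method' are assumptions of this sketch.
    """
    return ET.Element("FILE", {
        "filename": path,
        "filetype": filetype,
        "checksum_method": "MD5",
        "checksum": md5_of_file(path),
    })
```

Reading in chunks keeps memory use flat for large BAM, CRAM or fastq files.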

Changes to Experiment XSD

  • Removed 'expected_number_runs', 'expected_number_spots' and 'expected_number_reads' attributes.
  • Removed 'GAP_DESCRIPTOR' element.
  • Made LIBRARY_STRATEGY attribute mandatory.
  • Removed ORIENTATION attribute from PAIRED.
  • Removed 'NON GENOMIC' library source.
  • New library selection 'repeat fractionation': Selection for less repetitive (and more gene rich) sequence through Cot filtration (CF) or other fractionation techniques based on DNA kinetics.
  • Made POOLING_STRATEGY a free text string.
  • Removed BASE_CALLS, QUALITY_SCORES from PROCESSING.

Changes to Common XSD

  • Removed SRA_LINK, DDBJ_LINK, ENA_LINK elements.
  • AttributeType: Made VALUE element optional.
  • SpotDescriptorType: Removed already deprecated 'SPOT_DECODE_METHOD' element.
  • SpotDescriptorType: Removed already deprecated 'NUMBER_OF_READS_PER_SPOT' element.
  • SpotDescriptorType: Removed READ_SPEC/CYCLE_COORD.
  • SpotDescriptorType: Removed ADAPTER_SPEC element.
  • SpotDescriptorType: Removed EXPECTED_BASECALL element (use EXPECTED_BASECALL_TABLE instead).
  • PlatformType: Removed optional SEQUENCE_LENGTH from ILLUMINA and ABI_SOLID platforms.
  • PlatformType: Removed optional FLOW_COUNT from LS454, HELICOS platforms.
  • PlatformType: Removed deprecated CYCLE_COUNT element from ILLUMINA and ABI_SOLID platforms.
  • PlatformType: Removed optional CYCLE_SEQUENCE element from ILLUMINA platform. 
  • PlatformType: Removed optional KEY_SEQUENCE and FLOW_SEQUENCE elements from LS454 platform and FLOW_SEQUENCE from HELICOS platform.
  • PlatformType: Removed optional COLOR_MATRIX and COLOR_MATRIX_CODE from ABI_SOLID platform.
  • Removed 'none' COMPLETE_GENOMICS instrument model.
  • Removed 'none' PACBIO_SMRT instrument model.

Changes to SRA XML Schema on 11th of August 2014

SRA.experiment.xsd

Added the following:

  • new platform OXFORD_NANOPORE
  • 'MinION' as instrument_model for platform OXFORD_NANOPORE
  • 'GridION' as instrument_model for platform OXFORD_NANOPORE
  • 'HiSeq X Ten' as instrument_model for platform ILLUMINA
  • 'NextSeq 500' as instrument_model for platform ILLUMINA
  • New library strategy 'RAD-Seq' : RAD (Restriction site Associated DNA) Sequencing is a method for sampling the genomes of multiple individuals in a population using next generation DNA sequencing
  • New library selection 'Oligo-dT' : Select primarily messenger RNA, which conveniently is polyadenylated so these transcripts can be captured with oligo-dT beads (mRNA-seq)
  • New library selection 'Inverse rRNA selection' : Remove the ribosomal transcripts by inverse selection: you capture them by annealing with specific oligos, also bound to beads, and then discard that (total RNA-seq)

SRA.analysis.xsd

  • Added EXPERIMENT_REF as a new optional element
  • For REFERENCE_ALIGNMENT and SEQUENCE_VARIATION the ASSEMBLY element is made optional
  • For REFERENCE_ALIGNMENT and SEQUENCE_VARIATION the SEQUENCE element is made optional
  • Added 'transcriptomics' to SEQUENCE_VARIATION/EXPERIMENT_TYPE element

EGA.dataset.xsd

  • Made DATASET_TYPE optional.
  • Made TITLE mandatory.
  • Added 'Chromatin accessibility profiling by high-throughput sequencing' as DATASET_TYPE
  • Added 'Histone modification profiling by high-throughput sequencing' as DATASET_TYPE

EGA.dac.xsd

  • Added 'main_contact' as attribute for CONTACT: accepts boolean value

Some of the above changes are reflected in SRA.common.xsd.
