Prepare XMLs

Sequence submissions consist of metadata XML documents and read data files.

The metadata XMLs can be created directly and submitted using the programmatic submission service. 

The goal of this page is to provide sufficient information to submitters to be able to create the metadata XML documents required for programmatic submissions.

The latest XML schemas for metadata objects are available here (1.5).

Please note that the EGA uses the XML schemas maintained at the European Nucleotide Archive (ENA).

 

Please find examples of the metadata XML documents below. Any questions should be directed to ega-helpdesk@ebi.ac.uk.

 

Raw data submissions

A typical raw (unaligned) sequence read submission consists of 8 XMLs: Submission, Study, Sample, Experiment, Run, DAC, Policy and Dataset XML.

A submission does not have to contain all eight XMLs. For example, it is possible to submit only samples or a study to be referenced in the future.

Please note that whatever the submission scenario, you will always require the following:

  1. A Submission XML to describe the action (ADD/VALIDATE) and to specify an EGA submission (PROTECT)
  2. A Dataset XML, submitted as a separate transaction after the run accessions have been obtained from the submission of a Run XML.

When technical reads (e.g. barcodes, adaptors or linkers) are included in the submitted raw sequences, a spot descriptor must be submitted to describe the position of the technical reads so that they can be removed. The following data files can be submitted without providing spot descriptor information in the experiment/run XML:

  • BAM files (single reads)
  • SFF files (single reads without barcodes)
  • Fastq files (single reads without any technical reads)
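For submissions that do include technical reads, the spot descriptor sits in the experiment XML. The fragment below is only a sketch, assuming an 8-base barcode at the start of each read followed by the application read; the coordinates are illustrative and the element names should be checked against the current schema.

```xml
<!-- Abbreviated fragment of an experiment XML; values are illustrative assumptions -->
<SPOT_DESCRIPTOR>
    <SPOT_DECODE_SPEC>
        <!-- Technical read: barcode assumed to occupy bases 1-8 -->
        <READ_SPEC>
            <READ_INDEX>0</READ_INDEX>
            <READ_CLASS>Technical Read</READ_CLASS>
            <READ_TYPE>BarCode</READ_TYPE>
            <BASE_COORD>1</BASE_COORD>
        </READ_SPEC>
        <!-- Application read: assumed to start at base 9 -->
        <READ_SPEC>
            <READ_INDEX>1</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Forward</READ_TYPE>
            <BASE_COORD>9</BASE_COORD>
        </READ_SPEC>
    </SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR>
```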
 

Analysis sequence submissions

A typical EGA analysis data submission consists of 7 EGA XMLs: Submission, Study, Sample, Analysis, DAC, Policy and Dataset XML.

A submission does not have to contain all 7 XMLs. For example, it is possible to submit only samples or a study to be referenced in the future.

Please note that whatever the submission scenario, you will always require the following:

  1. A Submission XML to describe the action (ADD/VALIDATE) and to specify an EGA submission (PROTECT)
  2. A Dataset XML, submitted as a separate transaction after the analysis accessions have been obtained from the submission of an Analysis XML.

We accept three different types of analysis data submissions:

  • BAM files (for multiple read alignments)
  • VCF files (for sequence variations)
  • Phenotype files (in any format)

     

In all cases samples must be created to refer to the samples used within the BAM, VCF and phenotype files.

 

Identifying objects

Every object is uniquely identified within the submission account using the alias attribute.

Once an object has been submitted, no other object of the same type can use the same alias within the submission account. 

The aliases are used in submissions to make references between different objects. One object references another object's alias using the refname attribute.

For example, if a sample has the alias "sample1", an experiment can reference this sample by using refname="sample1".
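The alias and refname relationship can be sketched as follows. Both fragments are abbreviated (other required elements are omitted), and the aliases and center name are hypothetical placeholders rather than values from a real submission.

```xml
<!-- sample.xml (abbreviated): the sample declares its alias -->
<SAMPLE_SET>
    <SAMPLE alias="sample1" center_name="MY_CENTER">
        <!-- sample name, attributes, etc. omitted -->
    </SAMPLE>
</SAMPLE_SET>

<!-- experiment.xml (abbreviated): the experiment points at that alias -->
<EXPERIMENT_SET>
    <EXPERIMENT alias="experiment1" center_name="MY_CENTER">
        <STUDY_REF refname="study1"/>
        <DESIGN>
            <!-- design description and library details omitted -->
            <SAMPLE_DESCRIPTOR refname="sample1"/>
        </DESIGN>
    </EXPERIMENT>
</EXPERIMENT_SET>
```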

 

Identifying submitters

The center_name attribute defines the submitting institution.

The center names are controlled acronyms provided to the account holders when the account is first generated for an institute. 

If the submitter is brokering a submission for another institute, the center name should reflect the institute where the data was generated.

If the sequencing has been contracted to another party, the run_center or analysis_center attributes can be used to provide this information.
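As an illustration, a run whose sequencing was contracted out might carry both attributes as in the abbreviated fragment below. The center acronyms and aliases are hypothetical; real center names are the controlled acronyms issued with the submission account.

```xml
<!-- Abbreviated run XML fragment; INSTITUTE_A and PROVIDER_B are placeholder acronyms -->
<RUN alias="run1" center_name="INSTITUTE_A" run_center="PROVIDER_B">
    <EXPERIMENT_REF refname="experiment1"/>
    <!-- data file details omitted -->
</RUN>
```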

 

Example XMLs 

 

Submission XML

The submission XML is used to validate, submit or update any number of other objects. The submission XML refers to other XMLs.  It is not possible to set a release or hold date for EGA submissions.  EGA studies are not released unless the authorised submission contact named on the submission statements instructs ega-helpdesk@ebi.ac.uk to release the study and all associated datasets.  The EGA does not release a study unless it has associated dataset/s.

New submissions use the ADD action to submit new objects. Object updates are done using the MODIFY action, and objects can be validated using the VALIDATE action.
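A minimal sketch of a submission XML that adds a study and a sample and marks the submission as an EGA submission is shown below. The alias, center name and file names are placeholder assumptions; the downloadable example and the schema remain the reference for the exact structure.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION_SET>
    <!-- alias and center_name are hypothetical placeholder values -->
    <SUBMISSION alias="my_submission_1" center_name="MY_CENTER">
        <ACTIONS>
            <!-- One ADD action per metadata XML file being submitted -->
            <ACTION>
                <ADD source="study.xml" schema="study"/>
            </ACTION>
            <ACTION>
                <ADD source="sample.xml" schema="sample"/>
            </ACTION>
            <!-- PROTECT identifies this as an EGA submission -->
            <ACTION>
                <PROTECT/>
            </ACTION>
        </ACTIONS>
    </SUBMISSION>
</SUBMISSION_SET>
```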

Once a submission has been released it can be withdrawn from access only by contacting us at ega-helpdesk@ebi.ac.uk.

Download submission XML example.

 

Study XML 

The study XML is used to describe the study in some detail. The study contains a title, a study type and an abstract as it would appear in a publication.

 

Download study XML example

 

Please use the following notation when including PubMed citations in Study XML:

<STUDY_LINKS>
    <STUDY_LINK>
        <XREF_LINK>
            <DB>PUBMED</DB>
            <ID>18987735</ID>
        </XREF_LINK>
    </STUDY_LINK>
</STUDY_LINKS>


Sample XML

The sample XML is used to describe the sequenced samples. The mandatory fields include information about the taxonomy of the sample, gender, subject_id and phenotype.

Mandatory attribute fields for each sample:

<SAMPLE_ATTRIBUTE>
    <TAG>gender</TAG>
    <VALUE>female/male/unknown</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
    <TAG>phenotype</TAG>
    <VALUE>Free text, EFO term recommended</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
    <TAG>subject_id</TAG>
    <VALUE>FREE TEXT</VALUE>
</SAMPLE_ATTRIBUTE>

The sample is one of the most important objects to describe biologically. It is highly recommended that “TAG-VALUE” pairs are generated to describe the sample in as much detail as possible.

Where possible, use the Experimental Factor Ontology (EFO) to describe your phenotypes.  Phenotypes considered essential for understanding the data submission should be provided.  Each phenotype described should be listed as a separate sample attribute <SAMPLE_ATTRIBUTE> </SAMPLE_ATTRIBUTE>.  There is no limit to the number of phenotypes that can be submitted.
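For instance, a sample with two phenotypes would list each one in its own attribute, roughly as in the fragment below. The values are placeholders, not real ontology terms.

```xml
<!-- Abbreviated fragment of a sample XML; VALUE contents are placeholders -->
<SAMPLE_ATTRIBUTES>
    <SAMPLE_ATTRIBUTE>
        <TAG>phenotype</TAG>
        <VALUE>first phenotype (EFO term where available)</VALUE>
    </SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE>
        <TAG>phenotype</TAG>
        <VALUE>second phenotype (EFO term where available)</VALUE>
    </SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
```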

If a suitable EFO accession cannot be found for your phenotype attribute, please consider using another controlled ontology database before using free text.

Download sample XML example

 

Experiment XML

The experiment XML is used to describe the experimental setup including instrument platform and model details, library preparation details, and any additional information required to correctly interpret the submitted data. Where any of these values differ between runs, a new experiment object must exist. Each experiment references a study and a sample by alias, or if previously-submitted, by accession. Pooled data must be demultiplexed by barcode for submission.

Download experiment XML example (Illumina single read)
Download experiment XML example (Illumina paired read)
Download experiment XML example (454 unpooled single reads, SFF files)
Download experiment XML example (454 unpooled paired reads, SFF files)
Download experiment XML example (454 pooled single reads)

Download experiment XML example (Complete Genomics)

 

Run XML

The run XML is used to associate data files with experiments and typically comprises a single data file. Please note that pooled samples should be demultiplexed prior to submission and submitted as different runs.

Download run XML example

Download run XML Complete Genomics example

 

Read alignment (BAM) Analysis XML

The Analysis XML can be used to submit BAM alignments to the EGA. Only one BAM file can be submitted in each analysis, and the samples used within the BAM read groups must be associated with Samples. In addition, the Analysis must be associated with a Study. Ideally, the BAM file is associated with an INSDC reference assembly and sequences, either by using accessions (as for the reference sequences in the example below) or by using commonly used labels (as for the reference assembly in the example below). The BAM index can be submitted together with the BAM. If the BAM index file is not submitted, it will be created by the EGA. The md5 checksums for the .bam and .bai files can be provided within the Analysis XML or in the files .bam.md5, .bam.gpg.md5 and .bai.md5, .bai.gpg.md5.

Download analysis XML (BAM alignments)


Sequence variation (VCF) Analysis XML

The Analysis XML can be used to submit VCF files to the EGA. Only one VCF file can be submitted in each analysis, and the samples used within the VCF files must be associated with Samples. In addition, the Analysis must be associated with a Study. Ideally, the VCF file is associated with an INSDC reference assembly and sequences, either by using accessions (as for the reference sequences in the example below) or by using commonly used labels (as for the reference assembly in the example below). The md5 checksums for the .vcf file can be provided within the Analysis XML or in the files .vcf.md5, .vcf.gpg.md5.

Download analysis XML (VCF)

 

Phenotype files

The Analysis XML can be used to submit phenotype files to the EGA. Only one phenotype file can be submitted in each analysis, and the samples used within the phenotype files must be associated with EGA Samples. In addition, the EGA Analysis must be associated with an EGA Study. The md5 checksums for the phenotype file can be provided within the Analysis XML. Md5sum values DO NOT need to be provided in the XML if the EGA Webin Data Uploader tool has been used to upload files.

Download analysis XML (Phenotype)

 

DAC XML

The DAC XML describes the Data Access Committee (DAC) affiliated to the data submission. The DAC may consist of a group or a single individual and is responsible for the data access decisions based on the application procedure described in the Policy XML.

The DAC is typically formed from the same organization that monitored the collection and analyses of the data or a designate of this organization.

Users apply directly to the DAC for data access.  Once directly instructed by the DAC, the EGA will then provide secure access to the data through individual EGA accounts.

A DAC XML does not need to be provided if your submission is affiliated to an existing EGA DAC.  Please see here for the current list of DACs at EGA. 

Further information on DACs can be found here.

Download DAC XML example.

 

Policy XML

The Policy XML describes the Data Access Agreement (DAA) to be affiliated to the named Data Access Committee.   Examples of Data Access Agreements can be found here.

Download policy XML example

 

Dataset XML

The dataset XML describes the data files, defined by the Run XML and Analysis XML, that make up the dataset and links the collection of data files to a specified Policy. The dataset XML is the final metadata object to be submitted.
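The linkage can be pictured with the abbreviated sketch below, which assumes the run accessions have already been returned by an earlier Run XML submission. The alias, title, accession placeholders and policy reference are hypothetical, other required elements (such as the dataset type) are omitted, and the element names should be confirmed against the downloadable example.

```xml
<!-- Abbreviated dataset XML sketch; all identifiers are placeholders -->
<DATASETS>
    <DATASET alias="dataset1" center_name="MY_CENTER">
        <TITLE>Placeholder dataset title</TITLE>
        <!-- dataset type and description omitted -->
        <!-- Runs registered earlier, referenced by the accessions they were assigned -->
        <RUN_REF accession="RUN_ACCESSION_1"/>
        <RUN_REF accession="RUN_ACCESSION_2"/>
        <!-- Policy that governs access to these files -->
        <POLICY_REF refname="policy1"/>
    </DATASET>
</DATASETS>
```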

Please consider the number of datasets that your submission consists of, for example, a case control study is likely to consist of at least two datasets.  In addition, we suggest that multiple datasets should be described for studies using the same samples but different sequence technologies.  Please contact EGA Helpdesk for further assistance.

Download dataset XML 

 

What happens after the submission of a Dataset XML?

Once you have completed the registration of your dataset/s, please contact ega-helpdesk@ebi.ac.uk to provide a release date for your study. Datasets affiliated to existing studies that have already been released should automatically be released.

Please note that all datasets affiliated to unreleased studies are automatically placed on hold until the authorised submitter or DAC contact instructs ega-helpdesk@ebi.ac.uk to release the study.

When your study progresses to our live site, the named DAC contacts will be given access to the EGA DAC admin tools to create and manage EGA accounts with access permissions to the dataset/s affiliated to the study.

Further information regarding the role of the Data Access Committee can be found here.

Finally, your data is archived within our databases and prepared for encrypted distribution upon the request of permitted EGA account holders.

We strongly advise you NOT to delete your data until we confirm that your data has been successfully archived.