Sequencing Submissions

1. Types of data that can be submitted
2. What data files to submit, and how
3. BAM file requirements
4. Example submissions
5. More help on sequencing submissions


1. Types of data that can be submitted

ArrayExpress accepts submissions of non-human and human non-identifiable functional genomics data generated using high-throughput sequencing (HTS) assays such as RNA-seq and ChIP-seq. To submit to ArrayExpress, all you need to do is fill in a simple spreadsheet (easily editable in any spreadsheet program, e.g. Microsoft Excel) and transfer your raw data files to us. Submissions without raw data files will not be accepted.

The meta-data about your experiment (e.g. experiment description, sample annotation, wet- and dry-lab protocols) will be stored at ArrayExpress, and the raw data files (e.g. fastq files) are eventually stored at the Sequence Read Archive (SRA) of the European Nucleotide Archive (ENA). ArrayExpress will transfer the raw data files to the ENA for you, so you do not need to submit those files separately to the ENA. You can also send us processed data (i.e. data derived from the raw reads, e.g. BAM alignment files, differential expression data, expression values linked to genome coordinates, etc.). Depending on the file format, processed data will be stored at either ArrayExpress or the ENA.

If you have potentially identifiable human sequencing data, you need to submit it to the European Genome-phenome Archive (EGA) and not to ArrayExpress. The EGA will supply you with a template for submission and store human identifiable data securely. It will then pass the non-identifiable data to us as shown in the diagram below.

Figure: Diagram of the submission of different sequencing data types to ArrayExpress or the European Genome-phenome Archive.

2. What data files to submit, and how

Prepare your experiment's meta-data in MAGE-TAB format in the spreadsheet. You don't have to construct the spreadsheet from scratch: go to the MAGE-TAB submission tool and use its template-generation software to create a template spreadsheet tailored to your experiment. While creating the template spreadsheet, make sure you tick the "UHTS experiment" box. Take your time to fill in the spreadsheet offline, and upload it using the same tool when you are ready. To find out more about the MAGE-TAB format and how to fill in the spreadsheet, please take a look at the MAGE-TAB help page or this interactive tutorial: ArrayExpress:Submitting data using MAGE-TAB.

For raw data files, please prepare them according to ENA specifications (e.g. each individual fastq file should be compressed by gzip or bzip2). Data files which do not satisfy ENA's requirements will not be accepted. This is a developing field and the specifications are updated regularly, so please do check them every time you submit to us. If you are submitting BAM files as raw data files, please read this important documentation on BAM file specification.
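As an illustration of the compression step only, here is a minimal Python sketch that gzips a single fastq file. The function name gzip_fastq and the choice to write the .gz file next to the original are assumptions made for this example; the ENA specifications remain the authority on what is accepted.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(fastq_path: str) -> Path:
    """Compress one fastq file to <name>.gz beside the original; return the new path.

    Hypothetical helper for illustration, not part of any ArrayExpress tool.
    """
    source = Path(fastq_path)
    target = source.with_name(source.name + ".gz")
    # Stream the data in chunks so large read files are never held in memory at once.
    with open(source, "rb") as raw, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)
    return target
```

Compressing the same fastq file again later will generally produce a different .gz file, because the gzip header records a timestamp, so it is safest to compress once and keep that exact file for transfer.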

As sequence files tend to be very large, we implement a file integrity validation step before sending the files to the ENA on your behalf. For each sequence file that you transfer to ArrayExpress, please calculate its MD5 checksum and enter the value in the "Comment[MD5]" column of the SDRF section of your MAGE-TAB spreadsheet. We need MD5 checksums because they act as the files' digital "fingerprints" and it is very unlikely that two non-identical files would generate the same "fingerprint". The checksum allows us to verify that each file has not been corrupted during FTP transfer. For fastq files, please calculate the checksum from the actual compressed file (e.g. fastq.gz or fastq.bz2) that is sent to us, not from the uncompressed fastq file or a re-compressed version created at a later time. How to calculate MD5 checksums: Windows user example, Mac user example, Linux user example.
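If you prefer a script to the platform-specific tools, the checksum can also be computed with Python's standard hashlib module. This is a minimal sketch; the function name md5_checksum and the 1 MiB chunk size are assumptions chosen for the example.

```python
import hashlib


def md5_checksum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`.

    Hypothetical helper: point it at the compressed file you will actually
    transfer (e.g. fastq.gz), not at the uncompressed fastq file.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so very large sequence files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is the value to enter in the "Comment[MD5]" column for that file.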

Once you have the files ready, send them to ArrayExpress by FTP. You can either upload the files one by one to the FTP site, or upload a single archive (e.g. .tar.gz file). (If you upload a single archive, please make sure the sequence files are not "hidden" in a multi-layer directory structure as it will be very tedious to extract constituent files.) When the FTP transfer has finished, please email us at miamexpress@ebi.ac.uk to tell us about the files so that we can retrieve the files from the FTP site.
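If you choose the single-archive route, the sketch below shows one way to build a .tar.gz in which every sequence file sits at the top level, whatever directory it came from. The function name build_flat_archive is an assumption made for this example, and the duplicate-name check is a design choice of the sketch rather than an ArrayExpress requirement.

```python
import tarfile
from pathlib import Path


def build_flat_archive(file_paths, archive_path):
    """Bundle files into one .tar.gz with every member at the top level.

    Hypothetical helper: returns the member names written to the archive.
    """
    names = [Path(p).name for p in file_paths]
    # Two files with the same base name would collide once directories are dropped.
    if len(set(names)) != len(names):
        raise ValueError("duplicate file names would collide in a flat archive")
    with tarfile.open(archive_path, "w:gz") as archive:
        for path, name in zip(file_paths, names):
            # arcname strips the source directories from the stored member name.
            archive.add(path, arcname=name)
    return names
```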

For processed data files, make sure the file(s) are bundled in one or more compressed archives (e.g. .zip, .gz). If a compressed archive file is under 5MB in size, you can upload it using the MAGE-TAB submission tool alongside your MAGE-TAB spreadsheet. For any files larger than 5MB, please send them to ArrayExpress by FTP. Again, please email us at miamexpress@ebi.ac.uk when the transfer has finished, so that we can retrieve the files from the FTP site.

If you have both sequencing and microarray data related to the same study, you can submit them either as one single ArrayExpress experiment (especially if you used the same biological samples for both data sets) or as separate ArrayExpress experiments:

  • When creating a single submission, you will have one "IDF" (Investigation Description Format) file describing your study, contact details and all wet-lab/dry-lab protocols used. This IDF is linked to multiple "SDRF" (Sample and Data Relationship Format) files describing the samples used and the data files they're associated with, e.g. one SDRF for sequencing data and another SDRF for microarray data. See experiment E-GEOD-32120 as an example.
  • If you would prefer to submit your data as separate ArrayExpress experiments, that is fine too. Just email us at miamexpress@ebi.ac.uk and let us know that your submissions are related, and we will cross-reference them in their description fields (e.g. "Data from an accompanying microarray experiment has also been deposited at ArrayExpress under accession xxxx"). See experiments E-MTAB-1432 and E-MTAB-1433 as examples.

3. BAM file requirements

If you're submitting raw data files in BAM format, please make sure they satisfy ENA specifications as well as the following two conditions:

  1. Each file contains all reads from the sequencing machine, regardless of whether the reads mapped to the reference genome. The reason for this is that we would expect the BAM file to be used to regenerate all the sequencing reads, provided we are armed with information about the reference genome sequence used for generating the alignment in the first place.
  2. The alignment in the BAM file was generated against a reference genome accessioned in the International Nucleotide Sequence Database Collaboration (INSDC, involving DDBJ, ENA, and GenBank).

If your BAM files contain only mapped reads, then please either create "full", unfiltered BAM files (where both mapped and unmapped reads are present), or send us the original read files as raw data files (again, check ENA specifications).
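As a rough pre-submission check, you can count how many alignment records are flagged as unmapped. The sketch below assumes you have already converted the BAM file to SAM text records (for example with samtools view) and are passing those lines in; the function name count_mapped_unmapped is hypothetical. It relies only on the SAM FLAG field, where bit 0x4 marks an unmapped read.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "this read is unmapped"


def count_mapped_unmapped(sam_lines):
    """Count SAM alignment records by mapped/unmapped status.

    Hypothetical helper: `sam_lines` is any iterable of SAM text lines.
    Header lines (starting with "@") and blank lines are skipped.
    Returns a (mapped, unmapped) tuple of record counts.
    """
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        # FLAG is the second tab-separated column of an alignment record.
        flag = int(line.split("\t")[1])
        if flag & UNMAPPED_FLAG:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped
```

An unmapped count of zero does not prove that reads were filtered out, but for most experiments it is a strong hint that the file should be checked before it is submitted as raw data. Note that the counts are per record, so secondary or supplementary alignments of the same read are counted separately.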

BAM files containing only mapped reads can be included in your submission as processed files, as long as they satisfy ENA's specifications and the reference genome used for alignment has been accessioned in INSDC.

4. Example submissions

All examples are taken from public experiments on the ArrayExpress website. For each of the examples below, there is a link to the experiment's front page, where you will find all the meta-data and the curated spreadsheet (idf.txt / sdrf.txt) to download in the "Files" section. The sample-data relationship table shows you how the SDRF component of the MAGE-TAB meta-data spreadsheet was structured.

RNA-seq, Illumina Genome Analyzer IIx platform, single-end

E-MTAB-997 experiment page, Sample-data relationship table

RNA-seq, Illumina Genome Analyzer II platform, paired-end

E-MTAB-1091 experiment page, Sample-data relationship table

ChIP-seq, AB SOLiD System, single-end

E-MTAB-830 experiment page, Sample-data relationship table

DNA-seq, AB SOLiD System 3.0 platform, paired-end

E-MTAB-1082 experiment page, Sample-data relationship table

ChIP-seq, Illumina HiSeq 2000 platform, single-end

E-MTAB-1084 experiment page, Sample-data relationship table

 

5. More help on sequencing submissions

The EBI Train Online bite-size tutorial on submitting data using MAGE-TAB contains lots of useful information on how to prepare your submission. Two sub-sections deal specifically with high-throughput sequencing submissions: Submission of HTS data and HTS Submission Library Terms (a useful glossary when you are filling in the MAGE-TAB spreadsheet).

If your questions remain unanswered, drop us an email at miamexpress@ebi.ac.uk. Don't forget to include a brief description of what you're trying to do, e.g. which sequencing platform you're using, whether it's a single-end or paired-end experiment, or specific error messages you've encountered while using the MAGE-TAB submission tool or FTP.
