On this page, we provide a detailed view of the validation tests and responsive actions applied to data routed into the European Nucleotide Archive (ENA). While technical implementations are mentioned, our focus here is on validation concepts. Our intention is to provide a report of current specifications that we hope will be useful externally, both for those wishing to improve their procedures for generating and submitting data and for consumers of archived data who need to understand the nature and reliability of components of the data.
As sequencing technology advances, so must the validations and improvements applied to data. As a result, ENA maintains an ongoing programme of technology development that provides new validation steps, more sophisticated responses to validation failures and appropriate treatments of new data types. This report, then, provides a snapshot of an evolving scheme, with outlines of future functionality where appropriate.
This page has been prepared as part of work supported under the ESGI project.
A validation is defined in this text as a test against a set of pre-defined criteria or rules. Validation includes syntactic and semantic tests and, in some cases, operates across a number of data objects and structured vocabularies.
In this text, a responsive action is either a set of content-related changes to data objects that are applied automatically as a consequence of the output of a validation test, or a report fed to submitters or ENA staff for consideration and subsequent action in cases where human involvement is necessary.
Layout of validation tests and responses across ENA
Figure 1 shows the stages of validation and responsive actions laid out roughly in the order in which the stages are applied. The scheme is annotated with details of the user environment at the time of the validation. Further details of submission and update environments are available from here. There are typically three submission routings through the validation stages, according to user choice between programmatic, interactive and semi-programmatic submissions. The last of these typically represents use cases where sample information is entered by one party and data flow through an institutional centralised data submission service. For updates to existing data, there are typically only two routings: entirely interactive and entirely programmatic. It should be noted that, regardless of the submission environment chosen (interactive or programmatic), validation tests are closely related or identical, while responsive actions may differ according to environment, particularly in early stages.
Figure 1: Overall layout of validation stages with superimposed interactive, programmatic and semi-programmatic submission and update routes
The checklist concept
A checklist in ENA represents a set of additional fields used consistently across a number of objects, typically samples. Checklists have been introduced to support rich and diverse structured annotation across the objects deposited in ENA. Checklists are used to inform ENA’s submission/update applications, validation processes, indexing and presentation services. Checklists are referred to in this text in relation to their submission and validation functions. Checklists are accessioned and versioned, and are planned to be made available soon in the ENA browser.
ENA high-level read schema
The read and analysis data schema consists of a number of objects interconnected relationally according to the schema shown in Figure 2.
Figure 2: ENA high-level schema for read and analysis data. Metadata objects are shown in green and data objects in orange.
While metadata objects are communicated in XML formats, data are typically communicated in BAM, CRAM and Fastq formats. The following objects comprise this high-level schema:
Study: A study groups together data submitted to the archive. Please use the study accession number when citing data submitted into ENA.
Submission: A submission contains submission actions to be performed by the archive. A submission can add more objects to the archive, update already submitted objects or make objects publicly available. Programmatic submitters can also use the submission to validate objects before they are submitted to the archive.
Sample: A sample contains information about the sequenced samples. Samples are associated with checklists, which define the attributes used to annotate the samples and experiments or analysis objects.
Experiment: An experiment contains information about the sequencing experiments, including library and instrument details.
Run: Runs are part of experiments and contain sequencing reads submitted in data files (e.g. BAM and CRAM). Each run can contain all or part of the results for a particular experiment.
Analysis: An analysis contains secondary analysis results computed from primary sequencing results (e.g. VCF containing sequence variations or BAM containing sequence alignments).
Checklist: A checklist contains mandatory and optional attributes used to annotate other objects.
Data: Base/colour calls, per-call quality information, signals and flow information are represented here.
User requests for ENA submission accounts require the fields of information shown in Table 1 to be provided.
Table 1. Submission account fields.
|Further e-mail address(es)||Optional|
This information is currently provided by e-mail to email@example.com and the fields provided are manually checked at ENA prior to the issue of an account. Account credentials, comprising a username and password, are provided by e-mail.
Account creation requests and configuration changes, both currently requested through firstname.lastname@example.org and supported manually by ENA staff, require retention of appropriate information in the fields shown in Table 1.
Users are required to present login credentials while using our submission interfaces. For data upload authentication, username and password are required through FTP, Aspera and Webin Data Uploader clients. For interactive submissions, an e-mail address associated with the account and username, or the username and password, are accepted. Programmatic submitters using the REST service undergo authentication using account names and passwords. In cases where a user provides incorrect credentials, this validation failure is reported as appropriate for the interface, and users are required to repeat the authentication step or request help from email@example.com.
Validations for all metadata objects (carrying information relating to samples, experimental configurations and analyses) are shared between interactive, programmatic and semi-programmatic submission routes. Fewer validation errors occur, however, in interactive metadata object submissions as the Webin interface assures syntactic compliance and constrains the user to a limited set of semantic options.
The classes of validation are listed in Table 3.
Table 3. Metadata validation classes.
|Validation target||Interface||Example(s)||Further information|
|Submitter authentication credentials||Webin/REST||Password must match account password on record||See Table 1 for details of specific fields|
|XML structure||Webin/REST||Sample object complies with its schema||See XML schema documentation at *1|
|Existence of referenced objects||Webin/REST||Experiment object must refer to an existing sample object||See XML schema documentation at *1|
|Object name (alias) uniqueness across the submitter's objects||Webin/REST||Study name (alias) does not clash with existing names (aliases) within submitters’ existing data sets||See XML schema documentation at *1|
|Correct use of constrained fields||Webin/REST||Library strategy field in experiment object must be from controlled vocabulary||See XML schema documentation at *1 and constrained field vocabularies at *2|
|Existence of mandatory information||Webin/REST||Instrument platform field in Experiment object||See XML schema documentation at *1|
|Existence of mandatory attributes specified by the indicated checklist||Webin||Strain field in sample object where user has indicated compliance with pathogen checklist||See XML schema documentation at *1 and checklist documentation (to be made available from the ENA Browser shortly)|
|Taxonomic information||Webin||Scientific name must exist in ENA taxonomy||See Taxonomy portal REST URLs|
|Presence of uploaded data files referenced by run and analysis objects||Webin/REST||Data filename field in run object must match file uploaded by submitter||See *1|
*2: Library strategy
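Several of the validation classes in Table 3 can be illustrated with a minimal Python sketch. It is not the ENA implementation: the record layout (a plain dictionary), the field names, the function name `validate_experiment` and the short vocabulary are all assumptions made for illustration, and the real controlled vocabularies and schemas are those referenced in the table.

```python
# Illustrative subset only; the authoritative controlled vocabulary is
# maintained by ENA and is much longer.
LIBRARY_STRATEGIES = {"WGS", "WXS", "RNA-Seq", "AMPLICON"}

# Hypothetical field names for a simplified experiment record.
MANDATORY_FIELDS = ("alias", "sample_ref", "library_strategy", "instrument_platform")


def validate_experiment(experiment, known_sample_aliases, existing_aliases):
    """Return a list of error messages for a simplified experiment record."""
    errors = []

    # Existence of mandatory information.
    for field in MANDATORY_FIELDS:
        if not experiment.get(field):
            errors.append(f"missing mandatory field: {field}")

    # Existence of referenced objects.
    sample_ref = experiment.get("sample_ref")
    if sample_ref and sample_ref not in known_sample_aliases:
        errors.append(f"unknown sample reference: {sample_ref}")

    # Object name (alias) uniqueness across the submitter's objects.
    alias = experiment.get("alias")
    if alias and alias in existing_aliases:
        errors.append(f"alias already in use: {alias}")

    # Correct use of constrained fields.
    strategy = experiment.get("library_strategy")
    if strategy and strategy not in LIBRARY_STRATEGIES:
        errors.append(f"library strategy not in controlled vocabulary: {strategy}")

    return errors
```

Collecting all errors in a list, rather than stopping at the first failure, mirrors the idea of a report fed back to the submitter for subsequent action.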
File upload validations are required to assure that intended files have been transferred and file integrity has been retained during the transfer process.
File upload into ENA is supported under the FTP protocol (choice of many clients for the user), Aspera (Aspera ascp command line client is available from AsperaSoft) and the Webin Data Uploader application (both command line and interactive clients are available).
Authentication requires the submitter’s credentials (username and password). Once authenticated, submitters can upload files to a private data area to be submitted into the archive. If files are uploaded using the Webin Data Uploader, then the integrity of the file transfer is managed and guaranteed by the tool. If FTP or Aspera is used, then MD5 checksums must be provided with the files so that the integrity of the transfer can be checked.
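The MD5-based transfer check can be sketched in Python as follows. This is a minimal illustration using the standard `hashlib` module, not the code used by ENA; the function names `md5_of_file` and `transfer_is_intact` are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large data files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(received_path, declared_md5):
    """Compare the checksum declared by the submitter with the checksum
    of the file as received (hypothetical helper)."""
    return md5_of_file(received_path) == declared_md5.strip().lower()
```

A submitter would compute the checksum before upload and supply it alongside the file; the receiving side recomputes it on the transferred copy and treats a mismatch as a failed transfer.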
Data and analysis file headers
Currently, no systematic validation is applied between data or analysis file headers (such as in CRAM, BAM and VCF files) and appropriate fields in run or analysis objects, respectively. Limited manual checking is, however, applied to analysis files, to confirm that BAM-formatted secondary alignments and VCF-formatted variation data point to appropriate available reference sequences. Automated validations are planned to replace these manual checks. Further validation will also be applied to ensure consistency between sample objects (e.g. sample group pointers) and data and analysis files.
Data can be submitted into the archive in a large number of data file formats. These include cross-platform formats, which typically each present a number of distinct usages, and platform-specific formats. Data files are validated while they are being processed into standard data products for presentation (currently Fastq, in future CRAM), typically at a point in time after interaction with the submitter is complete for the submission in question. Cross-platform format validation classes are described in Table 4 and platform-specific validation is described in Table 5.
Table 4. Cross-platform data file validation.
|Data file format||Validation||Notes||Further information|
|Fastq||Quality score offset and scale identification; For SOLiD data only, colour space usage is confirmed||While the majority of submitted Fastq files now use Phred-scale, data are still submitted using alternative scales.||See *1|
|BAM||BAM validation rules (relaxed)||An example of a relaxed rule relates to conflicting header and read information||See http://samtools.sourceforge.net/ for BAM documentation|
|CRAM||CRAM validation rules||File must be readable by CRAMtools||See http://wwwdev.ebi.ac.uk/ena/about/cram_toolkit/|
|SRF||SRF validation rules||We expect SRF file submissions to be phased out in the near future.|
|SFF||Rules relating to extraction of biological from technical reads based on heuristics developed at ENA|
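The Fastq quality score offset and scale identification listed in Table 4 can be approximated with a widely used heuristic based on the range of characters seen in the quality strings. The sketch below is an assumption about how such a check might look, not the method used at ENA; the function names and the returned labels are hypothetical, and the thresholds follow the commonly documented ASCII ranges for Phred+33, Solexa+64 and Phred+64 encodings.

```python
def iter_quality_strings(lines):
    """Yield the quality string (fourth line) of each four-line Fastq record."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


def guess_quality_encoding(quality_strings):
    """Guess the quality encoding from the observed character range.

    Heuristic only: a small or uniformly high-quality sample may not be
    distinguishable, in which case "ambiguous" is returned.
    """
    codes = [ord(char) for text in quality_strings for char in text]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("quality characters outside printable ASCII range")
    if lowest < 59:
        # Only the +33 offset uses characters below ';' (ASCII 59).
        return "phred+33"
    if highest <= 74:
        # Could be high-quality Phred+33 data or low-quality +64 data.
        return "ambiguous"
    if lowest < 64:
        # Solexa scores can be negative, so +64 data may dip below '@'.
        return "solexa+64"
    return "phred+64"
```

In practice such a guess would be made over many reads, since the more quality values are inspected, the less likely the result is to be ambiguous.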
Table 5. Platform-specific data file format validation
|Complete Genomics||Data folder content validated against the manifest file|
|Pacific Biosciences||Existence of expected content validated