
Direct submissions of BARCODE data to EMBL Nucleotide Sequence Database

Introduction

Responsibility for presentation of BARCODE sequence and its annotation lies with the member databases of the International Nucleotide Sequence Database Collaboration (INSDC, www.insdc.org), made up of DDBJ, EMBL Nucleotide Sequence Database and GenBank. Submissions of BARCODE data can be routed through INSDC member databases and CBOL.

The EMBL Nucleotide Sequence Database (EMBL-Bank, www.ebi.ac.uk/embl) has as its remit the collection and presentation of all nucleotide sequence and annotation in the public domain. In order to achieve collection, EMBL-Bank provides a range of tools and services to facilitate submission of sequence and annotation data. The web application, Webin (www.ebi.ac.uk/embl/Submission/webin.html), serves as a portal for submission of a wide variety of data types.

Submissions received through Webin are processed rapidly by the EMBL-Bank curation team. Once all of the data required for processing have been provided, EMBL-Bank returns database accession numbers within 2 working days for small-scale submissions (fewer than 25 entries) and within 5 working days for large-scale submissions. In practice, processing is typically completed more quickly than this.

This document describes specific adaptations implemented in Webin that allow rapid and easy submission of data from BARCODING projects. Screen shots are shown as examples of BARCODE submission of mitochondrial cytochrome oxidase subunit I coding region, but submissions will be extended to other BARCODE loci as they are introduced. Multi-locus BARCODE data, where linked sequence from multiple loci is derived from each specimen, are supported.

Custom web forms and file upload

For a variety of large-scale submissions, Webin is able to distinguish fields that are common to all entries from fields that vary between entries. In most large-scale studies, such as sequencing of EST libraries, cDNA libraries and BARCODE projects, few fields vary between entries and many are common to the entire set.

On the principle that common fields need only be collected once, users are provided with forms and tools to submit a representative sample and to indicate which fields will vary between entries (figure I). From this information, a member of the curation team generates a template, which instructs the web application to present custom web forms in which the user uploads the variable field data (figure II). A web form view and a submission overview (figure III) are available for users to check and edit fields. When the user has completed submission of variable data, the curation team generates the appropriate number of complete database entries for loading and distribution.
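The template principle described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Webin implementation: the function name `expand_entries`, the dictionary layout and the sample values are all assumptions made for the example.

```python
from typing import Dict, List


def expand_entries(
    common: Dict[str, str],
    variable_fields: List[str],
    variable_rows: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Build one complete entry per row by merging the shared (common)
    fields with that row's variable fields. Hypothetical helper."""
    entries = []
    for number, row in enumerate(variable_rows, start=1):
        # Every declared variable field must be supplied for every entry.
        missing = [f for f in variable_fields if not row.get(f)]
        if missing:
            raise ValueError(f"row {number} is missing variable fields: {missing}")
        entry = dict(common)  # copy, so entries do not share state
        entry.update({f: row[f] for f in variable_fields})
        entries.append(entry)
    return entries


# Fields entered once, with the representative sample (placeholder values).
common = {"gene": "COI", "biological feature": "CDS"}

# Fields the submitter declared as varying between entries.
variable_fields = ["taxonomic name", "voucher identifier", "sequence"]

# One row of variable data per specimen (placeholder values).
variable_rows = [
    {"taxonomic name": "Genus speciesA", "voucher identifier": "V0001", "sequence": "ACGTACGT"},
    {"taxonomic name": "Genus speciesB", "voucher identifier": "V0002", "sequence": "TTGACCAA"},
]

entries = expand_entries(common, variable_fields, variable_rows)
```

Each resulting entry carries the common fields plus its own variable values, which mirrors how a single representative sample and a table of variable data can yield many complete entries.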


Figure I. Submission of representative sample: Pages are provided for submitter information, submission of sequence for representative sample, submission of sample source details, submission of sample citation information, annotation of biological features, flatfile summary and submission of list of fields that will vary between entries. The figure shows a) Submitter details, b) Sequence source, c) Literature citation and d) Flatfile summary pages.

a) Fig 1a

b) Fig 1b

c) Fig 1c

d) Fig 1d

Figure II. Custom web forms for variable data upload

Fig 2

Figure III. Submission overview

Fig 3

Pre-determined fields for BARCODE data

For CBOL approval, each BARCODE sequence record must include a number of mandatory fields and should include a number of strongly recommended fields (see tables Ia and Ib). Since this list of fields is the same for BARCODE entries from all BARCODE projects, EMBL-Bank submission procedures have been adapted so that any request for web form submission of BARCODE data alerts curators to the list of mandatory and recommended fields. The submitter therefore need not describe every field in order for it to be presented in the web forms: curators will apply the mandatory BARCODE fields and suggest the recommended BARCODE fields. BARCODE fields may be common to all entries in a submission or may vary between entries, so where this is not clear from the representative sample submission, curators will liaise with submitters before creating the custom web forms.

BARCODE fields


Table Ia. Mandatory fields
Field group          Field                         Description
specimen voucher     centre code                   museum/herbarium/stock centre identifier from controlled list
                     collection code               code for collection within centre
                     voucher identifier            specific unique identifier for specimen voucher
organism             taxonomic name                organism name including cross-reference to lineage
country              country                       isolation address of specimen
feature annotation   gene                          symbol of gene sequenced
                     biological feature            name of specific feature (e.g. CDS)
PCR primers          forward primer name(s)        name(s) of forward PCR primer(s)
                     forward primer sequence(s)    sequence(s) of forward PCR primer(s)
                     reverse primer name(s)        name(s) of reverse PCR primer(s)
                     reverse primer sequence(s)    sequence(s) of reverse PCR primer(s)


Table Ib. Recommended fields
Field group          Field                         Description
sampling details     latitude/longitude            coordinates of sampling site
                     identified by                 name of researcher who identified specimen
                     collected by                  name of researcher who collected specimen
                     collection date               date of collection of specimen
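
Submitters who prepare data programmatically may wish to check entries against tables Ia and Ib before upload. The following is a minimal sketch, assuming each entry is held as a Python dictionary keyed by the field names in the tables; the function name `check_barcode_entry` and the sample values are hypothetical, and the check is not part of Webin.

```python
# Field names taken from tables Ia and Ib.
MANDATORY_FIELDS = [
    "centre code",
    "collection code",
    "voucher identifier",
    "taxonomic name",
    "country",
    "gene",
    "biological feature",
    "forward primer name(s)",
    "forward primer sequence(s)",
    "reverse primer name(s)",
    "reverse primer sequence(s)",
]

RECOMMENDED_FIELDS = [
    "latitude/longitude",
    "identified by",
    "collected by",
    "collection date",
]


def check_barcode_entry(entry):
    """Return (missing mandatory fields, missing recommended fields).
    A field counts as missing when it is absent or blank. Hypothetical helper."""
    def is_blank(name):
        return not str(entry.get(name, "")).strip()

    missing_mandatory = [f for f in MANDATORY_FIELDS if is_blank(f)]
    missing_recommended = [f for f in RECOMMENDED_FIELDS if is_blank(f)]
    return missing_mandatory, missing_recommended


# Placeholder entry for illustration; none of these values are real data.
example_entry = {
    "centre code": "CENTRE",
    "collection code": "",  # left blank on purpose
    "voucher identifier": "V0001",
    "taxonomic name": "Genus speciesA",
    "country": "Countryland",
    "gene": "COI",
    "biological feature": "CDS",
    "forward primer name(s)": "FWD1",
    "forward primer sequence(s)": "ACGTACGTACGTACGTACGT",
    "reverse primer name(s)": "REV1",
    "reverse primer sequence(s)": "TGCATGCATGCATGCATGCA",
    "collection date": "2006-01-01",
}

missing_mandatory, missing_recommended = check_barcode_entry(example_entry)
```

An entry with a non-empty first list lacks a mandatory field and should be completed before submission; the second list is advisory only.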

Alternative variable field data entry routes

While the custom web forms are a suitable tool for entering medium-scale variable field data sets, they become progressively less practical as the number of sequences grows. For this reason, the custom web forms also allow upload of variable field information in FASTA format (figure IV). Once data have been uploaded, users can switch freely between FASTA upload, web form view and overview to make minor edits.


Figure IV. FASTA upload

Fig 4

FASTA format is supported by many tools, including a number that are open source. However, a certain degree of bioinformatic expertise may be required to generate FASTA files for upload, so EMBL-Bank is also able to accept BARCODE submissions in a number of alternative formats. Most consistently structured formats are suitable, as long as they can be converted easily to structured text. A number of users choose to submit data in Microsoft Excel spreadsheet format, for example (figure V). Users with formats other than FASTA are advised to indicate, at the time of representative sample submission, that they intend to upload variable field data in their specific format.
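
For users who would rather produce FASTA themselves, a spreadsheet exported as CSV can be converted with a short script. The sketch below is illustrative only: the column names, the function name `csv_to_fasta` and, in particular, the header layout (`>identifier field=value|field=value`) are assumptions made for the example and are not the layout that Webin expects, which should be confirmed with the curation team.

```python
import csv
import io


def csv_to_fasta(csv_text, id_column="voucher identifier",
                 seq_column="sequence", width=60):
    """Convert CSV text (one specimen per row) to FASTA text.
    The header layout used here is hypothetical."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        sequence = row[seq_column].replace(" ", "").upper()
        # Carry the remaining non-empty columns in the header line.
        extras = [
            f"{name}={value}"
            for name, value in row.items()
            if name is not None and name not in (id_column, seq_column) and value
        ]
        header = ">" + row[id_column]
        if extras:
            header += " " + "|".join(extras)
        # Wrap the sequence at a fixed line width.
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append("\n".join([header] + lines))
    return "".join(record + "\n" for record in records)


# Placeholder spreadsheet content; not real data.
csv_text = (
    "voucher identifier,taxonomic name,sequence\n"
    "V0001,Genus speciesA,acgtacgtac\n"
    "V0002,Genus speciesB,ttgacca\n"
)

fasta_text = csv_to_fasta(csv_text, width=5)
```

Keeping one specimen per spreadsheet row and one field per column makes this kind of conversion straightforward, whichever target layout is agreed.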


Figure V. Spreadsheet upload

Fig 5