
EMBL - INSDC Sequencing Project Metadata Database User Guidelines

Project definition

A project is defined as a collection of INSDC database records originating from a single organisation, or from a consortium of coordinated organisations. The collective database records from a project make up a complete genome or metagenome and may contain genomic sequence, EST libraries and any other sequences that contribute to the assembly and annotation of the genome or metagenome. Projects group records either from single organism studies or from metagenomic studies comprising communities of organisms.

Project metadata fields, described below, are collected and presented by the INSDC databases to allow tracking of genome and metagenome projects and to aid in comparative analyses. The INSDC retains editorial control over project metadata and encourages requests for update by project submitters as new information becomes available and project details change. Mandatory information is required from submitters prior to acceptance of sequence data from the project. Updates should be routed through the original INSDC database to which the project was submitted; where it is not clear which database this is, any INSDC partner will forward requests as appropriate.

Registration of new sequencing projects is available here. For further details on submission and retrieval of project metadata, please contact datasubs@ebi.ac.uk.

Field definitions

 

Fields assigned by INSDC
project ID a unique identifier, assigned at the time of submission by the INSDC database which the submitter initially approached with data. It is recommended that submitters quote the project ID in all communication with INSDC databases to allow for easier and faster tracking of issues. The project ID field provides an umbrella identifier that points to all related sequence data for a project.
locus_tag prefix a unique locus_tag prefix, assigned to each project by an INSDC member database at the time of submission. The prefix allows unambiguous tracking of loci from project genomes. Guidelines for usage of /locus_tag in sequence records are available here.


Mandatory fields
submitter contact details at least one submitter name (family and first name(s)), along with an e-mail address. The field is intended to represent points of contact for the submission, rather than a list of consortium members; the submission reference of project sequence entries provides the opportunity to credit all project contributors.
submitting organisation(s) name and URL for at least one organisation. This organisation may be a sequencing centre, a submitting centre or a consortium. Multiple centres can be indicated where appropriate.
project type classification of project as single organism or metagenomic.
project name (for metagenomic projects only) a short name for the project indicating the nature of the study.
organism name (for single organism projects only) taxonomic name of the sequenced organism.
strain/isolate/breed (for single organism projects only) strain, isolate or breed information, given to allow discrimination between multiple genome sequencing projects covering organisms of the same species.
physical source of sequenced material (for single organism projects only) physical source of the material used for sequencing. In the case of bacterial strains, a culture collection may be indicated. In the case of non-bacterial genomes, details of the clone collection are indicated. Where material has not been submitted to a central repository, the name and address of the researcher who has access to it is given.
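
The mandatory fields depend on the project type, which a submitter may find easier to see as a small check. The following is a minimal sketch in Python, not an INSDC submission format: the class name, attribute names and function name are hypothetical, and the check only tests that each required field is present, not that its content is valid.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PROJECT_TYPES = ("single organism", "metagenomic")


@dataclass
class ProjectRegistration:
    # Hypothetical record; attribute names are illustrative, not an INSDC schema.
    submitter_name: str = ""
    submitter_email: str = ""
    organisations: List[Tuple[str, str]] = field(default_factory=list)  # (name, URL)
    project_type: str = ""
    project_name: Optional[str] = None          # metagenomic projects only
    organism_name: Optional[str] = None         # single organism projects only
    strain_isolate_breed: Optional[str] = None  # single organism projects only
    physical_source: Optional[str] = None       # single organism projects only


def missing_mandatory_fields(reg: ProjectRegistration) -> List[str]:
    """Return the names of mandatory fields that have not been filled in."""
    missing = []
    if not reg.submitter_name or not reg.submitter_email:
        missing.append("submitter contact details")
    # At least one organisation with both a name and a URL is required.
    if not any(name and url for name, url in reg.organisations):
        missing.append("submitting organisation(s)")
    if reg.project_type not in PROJECT_TYPES:
        missing.append("project type")
    elif reg.project_type == "metagenomic":
        if not reg.project_name:
            missing.append("project name")
    else:
        single_organism_fields = (
            ("organism name", reg.organism_name),
            ("strain/isolate/breed", reg.strain_isolate_breed),
            ("physical source of sequenced material", reg.physical_source),
        )
        for label, value in single_organism_fields:
            if not value:
                missing.append(label)
    return missing


example = ProjectRegistration(
    submitter_name="Doe, Jane",
    submitter_email="jane.doe@example.org",
    organisations=[("Example Sequencing Centre", "https://example.org")],
    project_type="metagenomic",
)
print(missing_mandatory_fields(example))  # ['project name']
```

Keeping the conditional fields optional in the record and checking them against the project type mirrors the way the guidelines mark them as applying to one project type only.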


Optional fields
project description the genetic, medical and historical relevance of the organism and why it is being sequenced. In the case of metagenomic projects, the significance of the particular environment that is being investigated can be described.
project URL URL for the sequencing project, for cases where a project-specific website has been established.
replicon names and estimated sizes replicon names and estimated sizes for segmented genomes. Replicon types are currently limited to chromosome, organellar genome and plasmid, but it is anticipated that this list will be extended in the future.
sequencing method sequencing method used for the project. There are currently four methods available (WGS; clone-based; array resequencing; and WGS and clone-based), but it is anticipated that this list will grow as new technologies emerge.
sequencing depth average sequencing depth for the genome sequencing project.
estimated/calculated genome size estimated genome size in megabases for the project. For completed prokaryotic genomes, calculated size can be indicated.
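
Two of the optional fields draw on limited lists of values (replicon types and sequencing methods), so a submitter preparing metadata might check values against those lists before sending them. The sketch below is one way to do this in Python. The dictionary keys and the function name are assumptions made for illustration, and the value lists are copied from the field definitions above, so they may be out of date once the lists are extended.

```python
from typing import Any, Dict, List

# Values taken from the field definitions above; both lists are expected to grow.
REPLICON_TYPES = {"chromosome", "organellar genome", "plasmid"}
SEQUENCING_METHODS = {
    "WGS",
    "clone-based",
    "array resequencing",
    "WGS and clone-based",
}


def check_optional_fields(fields: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the optional fields (hypothetical keys)."""
    problems = []
    method = fields.get("sequencing_method")
    if method is not None and method not in SEQUENCING_METHODS:
        problems.append(f"unknown sequencing method: {method}")
    for replicon in fields.get("replicons", []):
        if replicon.get("type") not in REPLICON_TYPES:
            problems.append(f"unknown replicon type: {replicon.get('type')}")
    # Depth and genome size (in megabases) should be positive numbers when given.
    for key in ("sequencing_depth", "genome_size_mb"):
        value = fields.get(key)
        if value is not None and value <= 0:
            problems.append(f"{key} must be positive")
    return problems
```

Because every field here is optional, an absent key is never reported; only values that are present and fall outside the documented lists or ranges are flagged.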

