Genome Reviews User Manual

Release 4.2, December 2008

EMBL Outstation
Telephone: +44-1223-494400
Electronic mail: support@ebi.ac.uk

This manual and the database it accompanies may be copied and redistributed freely,
1) INTRODUCTION

Genome Reviews contains information about complete DNA molecules (chromosomes and plasmids), genes, transcripts and proteins, for complete genomes from bacteria, bacteriophage and selected eukaryota. Genome Reviews records are normally constructed by modifying the sequence and annotation of an entry deposited in the EMBL/GenBank/DDBJ sequence repository using data imported from other resources or calculated by sequence analysis. However, for some species, an alternative database (for example, a model organism database) may be used as the primary source of the sequence. Molecules, genes and transcript entities are assigned stable identifiers that are maintained between releases, and are mapped to the identifiers from the UniProt Knowledgebase which describe the corresponding protein product. More details about the Genome Reviews gene and transcript records can be found in section 5 and section 6 of this user manual, respectively.

Data files describing complete DNA molecules, or the set of records comprising the genes and transcripts derived from each molecule (and each complete genome), can be downloaded for all genomes in Genome Reviews. All data is available in FASTA format, and additionally in richer file formats containing more detailed, structured annotation: complete molecule records in Genome Reviews EMBL-like format; gene and transcript records in Genome Reviews EMBL CDS-like format. A complete description of EMBL format is available in the EMBL user manual (URL: http://www.ebi.ac.uk/embl), to which this document serves as a supplement. Where appropriate, reference is made to that document in this one. A description of the EMBL CDS format can be found here (URL: ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/README.txt). Complete genome records are available for download from the Genome Reviews FTP site; gene, transcript and protein records are available from the Integr8 FTP site.
For searching Genome Reviews, an Ensembl-style browser is available, which provides zoomable graphical views of all chromosomes and plasmids represented in the database (see section 7.2). For information about the MySQL relational dump, see section 7.3. This document describes mainly the (small) changes in format between the Genome Reviews and EMBL formats, and the procedure by which the annotation in the Genome Reviews database is derived from primary data sources. The main body of this User Manual describes the features of the database and file format which will remain stable. Information which applies specifically to the current release of the database is presented in the Release Notes. The Release Notes also describe changes which are foreseen in future releases. It is likely that the need to represent new kinds of information in the database will necessitate changes or additions to the presentation of data. Such changes will be made as far as possible in ways which have minimal impact on user programs and procedures, and which maximise the compliance of Genome Reviews files with EMBL format. Users of Genome Reviews should cite the following publication: Kersey P., Bower L., Morris L., Horne A., Petryszak R., Kanz C., Kanapin A., Das U., Michoud K., Phan I., Gattiker A., Kulikova T., Faruque N., Duggan K., McLaren P., Reimholz B., Duret L., Penel S., Reuter I., Apweiler R. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Research Jan 1; 33 (Database Issue): D297-D302 (2005). Users who wish to be kept informed about changes and new developments should subscribe to the Genome Reviews mailing list genomereviews-announce@ebi.ac.uk at URL: http://listserver.ebi.ac.uk/mailman/listinfo/genomereviews-announce. Previous postings to the mailing list can be viewed through the link Genomereviews-announce Archives on the same page. 
2) CONVENTIONS USED IN THE DATABASE

This section describes the general conventions which have been applied to the information in the database in order to achieve uniformity of presentation. Specific abbreviations and symbol usage are summarised in the appendices. The same conventions apply as in the EMBL Nucleotide Sequence Database.

2.2 Organism Identification and Classification

The unified taxonomy used by the collaborating databases DDBJ/EMBL/GenBank is re-used in Genome Reviews. The taxonomic information relevant to the entry is described in the OS and OC lines of the entry, and in the primary source feature (which describes the origin of the sequence). However, alternative names for individual taxonomic nodes may be used, according to the conventions used in the HAMAP project (see URL: http://www.expasy.org/sprot/hamap/ for further details). Also, some further standardisation is applied, with the node descriptors 'biotype' and 'serotype' being replaced by their synonyms 'biovar' and 'serovar'. Taxonomic information appears in three places in each Genome Reviews file:
For example, in entry AE009952_GR.dat, the organism was identified as follows:

OS Yersinia pestis (biovar Mediaevalis, strain KIM5)
OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC Enterobacteriaceae; Yersinia
XX
FT source 1..4600755
FT /chromosome="Chromosome"
FT /organism="Yersinia pestis"
FT /biovar="Mediaevalis"
FT /strain="KIM5"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:187410"

The full name is represented in the OS line; the genus and species are given as the value associated with the "organism" qualifier of the source feature; the strain and biovar are represented in separate qualifiers; and the cross-reference identifies the most precise taxonomic node available to describe this organism based on all the above information. Note that an entry may have more than one source feature: in this case, the primary source feature is distinguished by use of the /focus feature qualifier. Secondary source features may describe insertion sequences within the main sequence. Data associated with taxonomic feature qualifiers in secondary source features is not changed in Genome Reviews (compared to the original submission).

Literature references are presented for each entry. If a sequence has been submitted to EMBL/GenBank/DDBJ prior to publication, the submission itself is acknowledged. When a paper is subsequently published, such acknowledgements are usually removed.

The most significant change to EMBL format in Genome Reviews concerns the introduction of evidence tags. Evidence tags describe the source of information that has been imported into Genome Reviews files. EMBL format supports only the attachment of evidence to features (through the use of the /evidence feature qualifier), but not to feature qualifiers themselves; hence the need for the introduction of a mechanism to describe the evidence for the attachment of an individual qualifier to a feature.
This new tagging format has also been applied within the existing /evidence qualifier, where it describes the source of the information that has led to the inclusion of an additional feature in an entry. Evidence tags have been applied to feature qualifiers and features. The use of evidence tags may be extended to other data items within the entry at a later date. Additional information about how data is imported into Genome Reviews files can be found in Appendix III of this document.

A Genome Reviews file is usually derived from a primary source entry in the EMBL Nucleotide Sequence Database, although sometimes it may be derived from an alternative source (see section 4.1 for more information). The identity of the source entry is given in the CC lines of the corresponding Genome Reviews entry (see section 3.4.15). The identity of the source database entry is also indicated by the Genome Reviews entry name and accession number (which are derived from the accession number of the parent entry: see sections 3.4.1 and 3.4.2). However, a Genome Reviews file may also contain data imported from other sources. This is possible because:
and so on.

An "evidence tag", attached to an element of a Genome Reviews entry, provides a pointer to (an element of) an external resource from which the tagged data was derived. A given tagged item may have one or many evidence tags: each tag provides independent evidence for the inclusion of the data item in the Genome Reviews entry.

In example (iii) above, the concept of "evidence chaining" is introduced, whereby a series of databases is used to derive evidence that can be added to a Genome Reviews entry. The most obvious real example is that of GO annotations added to CDS features in Genome Reviews records according to the InterPro classification of a protein sequence. These are derived from the existence of cross-references from EMBL features (stored in the EMBL database) to records in the UniProt Knowledgebase (UniProtKB); the existence of cross-references between UniProt and InterPro (stored in the InterPro database); and the existence of cross-references between InterPro and GO (stored in the GOA database). Taking this information together allows terms from the GO controlled vocabulary to be propagated to Genome Reviews. In the case of Genome Reviews records where the primary source data does not come from EMBL, an extra step in the mapping procedure may be necessary (see section 4.2 for details).

The evidence tag model used for Genome Reviews does not support evidence chaining. A flat file distribution format is not suitable for describing complex chains of inference, whose structure may vary according to the data source. In Genome Reviews, for each chain of inference a single source is always identified as the most appropriate to present in the corresponding evidence tag; reference to this source should provide further information to reveal the complete chain of inference. More details about the form of the chain of inference implied by evidence tags in individual contexts can be found in Appendix IV of this document.
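The chain of cross-references described above can be pictured as a series of lookups. The following Python sketch is illustrative only: the function name `propagate_go` and the three lookup tables are hypothetical, the identifiers are borrowed from the sample gene record in section 3, and both the InterPro-to-GO link and the choice of GOA as the single reported source are assumptions made for the example rather than a description of the production pipeline.

```python
# Toy cross-reference tables (illustrative values, not real database dumps).
EMBL_TO_UNIPROT = {"BAA97874": ["P13970"]}        # EMBL protein_id -> UniProtKB
UNIPROT_TO_INTERPRO = {"P13970": ["IPR000021"]}   # UniProtKB -> InterPro
INTERPRO_TO_GO = {"IPR000021": ["GO:0016021"]}    # InterPro -> GO (assumed link)


def propagate_go(protein_id):
    """Follow EMBL -> UniProtKB -> InterPro -> GO and return (term, tag) pairs.

    Only one source is reported per inference, mirroring the rule that an
    evidence tag names a single resource rather than the whole chain.
    """
    results = []
    for accession in EMBL_TO_UNIPROT.get(protein_id, []):
        for interpro_id in UNIPROT_TO_INTERPRO.get(accession, []):
            for go_term in INTERPRO_TO_GO.get(interpro_id, []):
                # Assumption: the tag points at GOA, keyed by the UniProtKB accession.
                results.append((go_term, "{GOA:%s}" % accession))
    return results


if __name__ == "__main__":
    print(propagate_go("BAA97874"))
```

The intermediate InterPro identifier is deliberately dropped from the returned tag: a reader who wants the full chain is expected to consult the named source.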
An evidence tag may contain a reference to the particular element of the resource from which the tagged data item was derived, as well as to the resource itself. Typically this element takes the form of the identifier of an "entry" in the resource. In some cases, the concept of "entry" is not applicable in a particular resource, or the inclusion of entry-level information would be redundant, or the resource itself is a secondary repository of data and it is desirable to propagate the evidence presented in that resource to Genome Reviews. Some examples are given in the following section. A full description of what values are associated with each database that can be mentioned in a tag is given in Appendix III of this document.

2.4.2. Format of Evidence Tags

Here are some examples of evidence tags. As explained in the previous section, evidence tags have so far been applied only to feature qualifiers and features. A tag applying to a qualifier is incorporated in that qualifier and provides evidence for the addition of this qualifier to this feature. A tag applied to a feature is presented as the value of an additional '/evidence' qualifier added to the feature and provides the evidence for the inclusion of the feature in the entry. For a full description of how evidence tags are added to feature qualifier lines, see section 3.4.16. For the purposes of this section, consider only the tags themselves. All tags applied to a single data item are listed between a single pair of curly braces, separated (where necessary) by a semi-colon and a space.
FT /product="Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5}"
FT /pseudo="{UniParc:!AAD06288}"
FT /evidence="{UniProtKB/TrEMBL:Q9PHK5}"
FT /product="Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5;
FT UniProtKB/Swiss-Prot:P12345}"
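The layout shown in these examples (a trailing group in curly braces, with tags separated by a semi-colon and a space) can be separated from the qualifier value with a few lines of code. The Python sketch below is a minimal illustration, not the tag removal program distributed with the release; the function name `split_evidence` is hypothetical, and it assumes the caller has already extracted the text between the double quotes and removed any line prefixes from wrapped values.

```python
import re

# A trailing "{...}" group, optionally preceded by whitespace, at the end of the value.
_TAG_GROUP = re.compile(r"\s*\{([^{}]*)\}\s*$")


def split_evidence(value):
    """Return (plain_value, [(database, element_or_None), ...]) for a qualifier value."""
    # Wrapped values may contain line breaks; collapse all runs of whitespace.
    value = " ".join(value.split())
    match = _TAG_GROUP.search(value)
    if match is None:
        return value, []
    plain = value[:match.start()]
    tags = []
    for tag in match.group(1).split(";"):
        tag = tag.strip()
        if not tag:
            continue
        # The database name comes before the first colon; the rest identifies the element.
        database, _, element = tag.partition(":")
        tags.append((database, element or None))
    return plain, tags


if __name__ == "__main__":
    print(split_evidence("Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5; "
                         "UniProtKB/Swiss-Prot:P12345}"))
    print(split_evidence("{UniProtKB/TrEMBL:Q9PHK5}"))
```

Discarding the second element of the returned pair gives the qualifier value as it would appear in an untagged file.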
An evidence tag always identifies a source database, and may additionally identify elements of that database specifically linked to the data item. The contents of a tag are determined separately for each database referred to, in order to provide the most relevant and useful information. The contents of each existing type of reference are described in Appendix IV. In terms of format, there are two possible models for an evidence tag, representing tags in which none, one or many pieces of information from the source database are given. These are illustrated by the first three examples above.
In the second and third examples above:
FT /pseudo="{UniParc:!AAD06288}"
FT /evidence="{UniProtKB/TrEMBL:Q9PHK5}"
the evidence tag is presented as the sole value associated with a qualifier that contains no intrinsic value of its own. In the third example above:
FT /evidence="{UniProtKB/TrEMBL:Q9PHK5}"
the evidence tag is attached to the "evidence" qualifier, and there is no additional data associated with this qualifier besides the tag. This indicates that this tag provides information about the source of the feature to which this qualifier has been attached. This is the only allowable use of the "evidence" qualifier in Genome Reviews. In the fourth example above:
FT /product="Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5;
FT UniProtKB/Swiss-Prot:P12345}"
more than one entry or chain of records has been used to independently infer that a particular feature qualifier should be added. Evidence for each independent inference is given in its own tag, separated by a semi-colon and a following space. The list of all tags is collectively surrounded by a pair of curly braces. A formal description of the format of evidence tags is given in Appendix IV of this document.

For users who do not wish to filter information by source, a program is provided with this release to remove evidence tags from Genome Reviews files, resulting in the production of "normal" EMBL format files. This program is written in the Java programming language and will run on any platform on which a Java runtime environment has been installed. Such environments are available free of charge for many platforms (including Microsoft Windows, Mac OS and GNU/Linux) from either Sun Microsystems (URL: http://java.sun.com/j2se/index.html) or your hardware vendor. The tag removal program itself is available:
If you choose to download the tar archive, untar it as follows:
If you choose to download the raw source code, you will need to copy the complete directory structure uk/ac/ebi/genomeReviews/. You will then need to compile the Java class:
Run the compiled code using either:
Alternatively, the program can be run from the executable jar:
where <directory> is the path to the directory where the Genome Reviews files are located, and <file-name> is the name of a Genome Reviews file contained in this directory. If only the single parameter <directory> is used, then the program will remove the evidence tags from ALL Genome Reviews files located in that directory. Usage information can be generated by typing
3) FORMAT OF THE DATABASE

The class of each entry is indicated on the first (ID) line of the entry. For Genome Reviews, records distributed and made publicly available are of data class 'GRV':

Class Definition
----- -----------------------------------------------------------

The records which constitute the EMBL Nucleotide Sequence Database are grouped into divisions. The ID line of each entry indicates its division, using three letter codes. Currently, Genome Reviews records fall into one of four of these divisions:

Code  Division
----- ----------------------
FUN   Fungi
PHG   Bacteriophage
PLN   Plants
PRO   Prokaryotes

The structure of a Genome Reviews "component" record (describing a completely sequenced chromosome or plasmid that forms all or part of a completely sequenced genome) mirrors that of a record in the EMBL Nucleotide Sequence Database. The line types of a Genome Reviews record are all legitimate EMBL line types (although some legitimate line types in EMBL have been removed from Genome Reviews files), and, as far as possible, a Genome Reviews record uses the same features and feature qualifiers. In some cases it has been necessary to add new features or feature qualifiers, or to redefine the meaning of an existing term, in order to support the concepts of Genome Reviews.

Genome Reviews data is also available as "gene sets": sets of gene records each representing one gene present in a Genome Reviews component record. The format for a gene record is based on the format used in the EMBL CDS database (which itself is closely related to the main EMBL record format). Certain line types are missing, and an additional line type, the PA line, is added. A description of the EMBL CDS format can be found here (URL: ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/README.txt).

The following line types do not appear in Genome Reviews records.

DR - database cross-reference
Figure 1 - A sample record from the database
ID IGI00270102; SV 1; linear; genomic DNA; GRV; PRO; 207 BP.
XX
PA AP001918_GR.1
XX
DE srnB
XX
OS Escherichia coli (strain K12)
OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC Enterobacteriaceae; Escherichia.
OG Plasmid F
OX NCBI_TaxID=83333;
XX
FH Key Location/Qualifiers
FH
FT source 1..207
FT /organism="Escherichia coli"
FT /strain="K12"
FT /mol_type="genomic DNA"
FT /plasmid="Plasmid F"
FT /db_xref="taxon:83333"
FT CDS 1..207
FT /codon_start=1
FT /gene_name="srnB {UniProtKB/Swiss-Prot:P13970}"
FT /locus_tag="ECOK12F004 {UniProtKB/Swiss-Prot:P13970}"
FT /product="Protein srnB {UniProtKB/Swiss-Prot:P13970}"
FT /cellular_component="integral to membrane {GO:0016021}"
FT /protein_id="BAA97874.1 {EMBL:AP001918}"
FT /db_xref="EMBL:AAA98078.1 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="EMBL:AAA99006.1 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="EMBL:CAA32614.1 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="EcoGene:EG40018 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="GO:0016021 {GOA:P13970}"
FT /db_xref="HOGENOM:HBG270039 {HogenProt:P13970}"
FT /db_xref="InterPro:IPR000021 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="UniParc:UPI0000135F51 {EMBL:BAA97874}"
FT /db_xref="UniProtKB/Swiss-Prot:P13970 {EMBL:AP001918}"
FT /transl_table=11
FT /translation="MKYLNTTDCSLFLAERSKFMTKYALIGLLAVCATVLCFSLIFRER
FT LCELNIHRGNTVVQVTLAYEARK"
FT CDS 58..207
FT /codon_start=1
FT /gene_name="srnB {UniProtKB/Swiss-Prot:P13970}"
FT /locus_tag="ECOK12F005 {UniProtKB/Swiss-Prot:P13970}"
FT /product="Protein srnB {UniProtKB/Swiss-Prot:P13970}"
FT /cellular_component="integral to membrane {GO:0016021}"
FT /protein_id="BAA97875.1 {EMBL:AP001918}"
FT /db_xref="EMBL:AAA98078.1 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="EMBL:AAA99006.1 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="EMBL:CAA32614.1 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="EcoGene:EG40018 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="GO:0016021 {GOA:P13970}"
FT /db_xref="HOGENOM:HBG270039 {HogenProt:P13970}"
FT /db_xref="InterPro:IPR000021 {UniProtKB/Swiss-Prot:P13970}"
FT /db_xref="UniParc:UPI0000161C9D {EMBL:BAA97875}"
FT /db_xref="UniProtKB/Swiss-Prot:P13970 {EMBL:AP001918}"
FT /transl_table=11
FT /translation="MTKYALIGLLAVCATVLCFSLIFRERLCELNIHRGNTVVQVTLAY
FT EARK"
XX
SQ Sequence 207 BP; 53 A; 40 C; 55 G; 59 T; 0 other;
atgaagtacc ttaacactac tgattgtagc ctcttccttg cagagaggtc aaagtttatg 60
acgaaatatg cccttatcgg gttgctcgcc gtgtgcgcta cggtgttgtg tttttcactg 120
atattcaggg aacggttatg tgagctgaat attcacaggg gaaatacagt ggtgcaggta 180
actctggcct acgaagcacg gaagtaa 207
//
Figure 2 - A sample gene record from the database

This section describes in detail the use made by Genome Reviews flat files of each line type as defined in the EMBL format.

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID entryname; sequence version; topology; molecule; data class; division; sequence length BP.

Entryname: stable identifier, consisting of alphanumeric characters, starting with a letter. All letters should be in upper case. The entryname is provided only for reasons of compatibility with EMBL format and is redundant with the accession number (see section 3.4.2) in all Genome Reviews files.

Sequence version: The second item on the ID line indicates the sequence version, e.g. SV 2. The initial version number assigned to a Genome Reviews entry depends on the primary source of the sequence used in making that entry.
Topology: The third item on the ID line indicates the topology of the sequenced molecule, either 'linear' or 'circular'.

Molecule Type: The fourth item on the line is the type of molecule as stored. In the case of Genome Reviews files, this is always 'genomic DNA'.

Data class: The fifth item on the ID line indicates the data class of the entry, always 'GRV' for Genome Reviews files.

Taxonomic database division: This 3-letter code designates the Genome Reviews taxonomic division of the genome, currently PRO (prokaryotes), PHG (bacteriophage), PLN (plants) or FUN (fungi).

Sequence length: The last item on the ID line is the length of the sequence (the total number of bases in the sequence). This number includes base positions reported as present but undetermined (coded as "n").

An example of a complete identification line is shown below:

ID AE003850_GR; SV 3; circular; genomic DNA; GRV; PRO; 1286 BP.

The AC (ACcession number) line lists the accession numbers associated with this entry. An example of a Genome Reviews accession number line is shown below:

AC AE003850_GR;

Each accession number is terminated by a semicolon. Where necessary, additional AC lines are used. The Genome Reviews accession number comprises the characters of the accession number of the EMBL entry from which the Genome Reviews entry is derived, suffixed by '_GR'. In some cases, an EMBL entry that had represented a particular chromosome or plasmid is supplemented by a new submission (with a new accession number) that represents a re-annotation or re-sequencing of the same biological molecule. When this happens, a new Genome Reviews entry will be produced (with an AC based on the new EMBL entry), but the old Genome Reviews accession will be added as a secondary accession number, to indicate that both records describe the same molecule.
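As an illustration of the ID and AC conventions described above, the following Python sketch splits an ID line into its seven items and recovers the accession prefix from a Genome Reviews accession number. The function names `parse_id_line` and `source_accession` are hypothetical, and the sketch assumes a well-formed line with exactly the items listed in the general form; it is not part of any distributed tool.

```python
def parse_id_line(line):
    """Split a Genome Reviews ID line into its seven semicolon-separated items."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line: %r" % line)
    # Drop the line code and the terminating full stop, then split on semicolons.
    items = [item.strip() for item in line[2:].strip().rstrip(".").split(";")]
    if len(items) != 7:
        raise ValueError("expected 7 items, found %d" % len(items))
    entryname, version, topology, molecule, data_class, division, length = items
    return {
        "entryname": entryname,
        "sequence_version": int(version.split()[1]),  # "SV 3" -> 3
        "topology": topology,
        "molecule": molecule,
        "data_class": data_class,
        "division": division,
        "length": int(length.split()[0]),             # "1286 BP" -> 1286
    }


def source_accession(gr_accession):
    """Strip the '_GR' suffix to recover the accession used as the prefix."""
    suffix = "_GR"
    if not gr_accession.endswith(suffix):
        raise ValueError("not a Genome Reviews accession: %r" % gr_accession)
    return gr_accession[:-len(suffix)]


if __name__ == "__main__":
    record = parse_id_line(
        "ID AE003850_GR; SV 3; circular; genomic DNA; GRV; PRO; 1286 BP.")
    print(record["entryname"], record["sequence_version"], record["length"])
    print(source_accession(record["entryname"]))
```

Because the entryname is redundant with the accession number, the same suffix handling applies to the first accession on the AC line.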
In the case of Genome Reviews records derived from alternative data sources, an EMBL accession number is still used as the prefix in the Genome Reviews entry, if an EMBL entry exists that describes the same molecule as the Genome Reviews entry (e.g. the third chromosome of the budding yeast Saccharomyces cerevisiae is represented in EMBL by the entry whose accession number is X59720, and the accession number of the corresponding Genome Reviews entry is X59720_GR, although this particular Genome Reviews entry uses data from the Saccharomyces Genome Database as its primary data source).
chromosomes or plasmids where there is no corresponding EMBL entry

The PA (Parent Accession) line indicates the accession number (and version) of the parent Genome Reviews component record from which a gene record is derived. The format for a PA line is

PA AP001918.1;

where the accession number is given before the '.' character, and the sequence version of that accession is given afterwards.

The DT (DaTe) line shows when an entry first appeared in the database and when it was last updated. Each entry contains two DT lines, formatted as follows:

DT DD-MON-YYYY (Rel. #, Created)
DT DD-MON-YYYY (Rel. #, Last updated, Version #)

The DT lines from the above example are:

DT 18-FEB-2004 (Rel. 0.1, Created)
DT 26-SEP-2005 (Rel. 36, Last updated, Version 41)

The second line indicates the last time that the contents of an entry were changed. It also contains the entry version, which is incremented each time that an entry is modified. The rules for the incrementation of entry versions and the updating of DT lines are similar to those applied in the EMBL Nucleotide Sequence Database. Genome Reviews files are released fortnightly; successive releases are numbered 1, 2, 3, etc. Before release 1, a number of pre-releases were made, numbered 0.1, 0.2, 0.3 etc. The release number was incremented directly from 0.6 to 1 for the first full release.
Versioning of individual records was unaffected by this change in release numbering.

The DE (Description) lines contain general descriptive information about the sequence stored. The format for a DE line is:

DE description

In the case of Genome Reviews files, this comprises a full description of the organism sequenced (including genus, species and any relevant sub-levels of classification); a description of the molecule sequenced; and a declaration that the file describes the sequence of that molecule, for example:

DE Xylella fastidiosa (strain 9a5c) plasmid pXF1.3, complete sequence.

The format for a KW line is:

KW keyword[; keyword ...].

Genome Reviews files typically contain 2 keywords: "complete genome" and "genome reviews".

The OS (Organism Species) line specifies the preferred scientific name of the organism which was the source of the stored sequence, by giving the Latin genus and species designations, followed by a more specific classification where known. The complete format of the OS line is as follows:

OS Genus species ([sub-species] [serogroup] [biovar] [pathovar]
[serovar] [strain] [sub-strain])
All descriptors below the level of species are contained in brackets. Descriptors at different levels are separated by commas; alternative names at a given level are separated by a forward slash surrounded by a space on either side (' / '). An example is given in section 2.2.

The OC line describes the taxonomic lineage of the sequenced organism, down to the level of the genus, according to the NCBI taxonomy.

The OG (OrGanelle) linetype indicates the sub-cellular location of non-nuclear sequences. It is only present in entries containing non-nuclear sequences and appears after the last OC line in such entries. The OG line contains one data item, either "Mitochondrion", "Chloroplast", "Kinetoplast", "Cyanelle", "Plastid" or a plasmid name.

The OX line is used in gene records; it contains the NCBI tax ID of the species.

The OH (Organism Host) line specifies the most specific NCBI taxonomy ID and name of the host organism or host range.

OH NCBI_Taxid: 272623; Lactococcus lactis (subsp. lactis, strain IL1403)

3.4.12 The Reference (RN, RC, RP, RX, RG, RA, RT, RL) Lines

The reference lines in a Genome Reviews entry have the same format as those in the EMBL Nucleotide Sequence Database. The policy for inclusion of references is described in section 2.3. Note that the RX line at present includes cross-references only to PubMed, following the discontinuation of separate Medline identifiers. A sample RX line is shown below:

RX PUBMED; 10910347.

The FH line is used as in the EMBL Nucleotide Sequence Database.

The format of the Genome Reviews feature table is essentially the same as that used in the EMBL Nucleotide Sequence Database. For a full definition of that Feature Table, please see the document "The DDBJ/EMBL/GenBank Feature Table: Definition" (URL: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html).
However, there are some revisions to the format that have been made for the purposes of Genome Reviews, and considerable changes to the contents of individual records. These changes are discussed in this section of this document. The revised format is described in Backus-Naur form in section 4.1. As with EMBL features, the format design is based on a tabular approach and consists of the following items:

In order to maximise compliance with existing EMBL parsers, evidence tags have been introduced as additional information included as part of the value of a feature qualifier. A conventional EMBL parser will, therefore, be expected to return the value plus evidence in response to a request for the value; to separate the two components, a deeper level of parsing will be required. Evidence tags are always located at the end of the qualifier value. They are contained within curly braces (i.e. between the '{' character and the '}' character) and preceded by a space. The tag and the value it tags are both contained within a single pair of double quotes. Wrapping of feature qualifiers containing tags follows the standard rules for feature qualifiers not containing tags, i.e. the presence of evidence tags does not affect how lines are wrapped. Where the qualifier had no associated value, the definition of the qualifier has been changed for Genome Reviews such that a value has been added, consisting only of the tag. Examples of the incorporation of evidence tags into feature qualifiers are given in section 2.4.

The SQ line is used as in the EMBL Nucleotide Sequence Database.

The sequence data line is used as in the EMBL Nucleotide Sequence Database.

CC lines in EMBL records contain free text comments. CC lines are used in Genome Reviews to describe the primary source entry from which a Genome Reviews file was made, and the date on which it was produced. An example is given below.
CC This Genome Reviews entry was created from entry AE003850.3 in the
CC EMBL/GenBank/DDBJ databases on 26 September 2005.

It is possible that other forms of comment will be introduced in future.

The XX line is used as in the EMBL Nucleotide Sequence Database.

The // line is used as in the EMBL Nucleotide Sequence Database.

4) DATA IMPORT PROCEDURES

This section of the manual describes the procedures used to import data into a Genome Reviews record. This information applies to all distribution formats of Genome Reviews; although the language of the flat file (e.g. "feature", "feature qualifier") is used to explain the procedures, the same data is also available in the relational distribution and visible in the Genome Reviews Browser. Likewise, certain EMBL-specific terms (like CDS, or CoDing Sequence, to refer to an annotated region of DNA that encodes a protein) are used, but the explanation applies to equivalent data in other source databases, regardless of the naming convention used in those resources.
The primary source of sequence (and other annotation) for a Genome Reviews record is usually the corresponding submission to the EMBL Nucleotide Sequence Database. However, for certain model organisms, the latest assembled sequence has not been submitted to EMBL. In these cases, Genome Reviews directly sources the DNA sequence from an accessible alternative source. Currently there are three of these: the Saccharomyces Genome Database (SGD) (Balakrishnan R. et al. Nucleic Acids Res. 2005 Jan 1; 33 Database Issue:D374-7), used to source sequence information about the budding yeast Saccharomyces cerevisiae; The Arabidopsis Information Resource (TAIR) (Rhee S.Y. et al., Nucleic Acids Research 2003 31(1):224), used to source information about the thale cress Arabidopsis thaliana; and the Munich Information Center for Protein Sequences (MIPS), which provides Ustilago maydis data. The U. maydis chromosomes, originally sequenced and annotated by the Broad Institute, have been re-annotated at MIPS as part of their Ustilago maydis Annotation Project (Nature 2006 444, 97-101). Data is available from MUMDB.

For all three alternative primary data sources, the same essential procedure is followed. The latest assembled chromosomal DNA sequence is accessed, as is annotation associated with this sequence. The methods described in sections 4.2-4.4 are then used to improve and enhance this primary data.

4.2. Data import through identifier matching

Much of the data imported into Genome Reviews is found in external databases. The use of common identifiers by different databases, and the maintenance of specific cross-references between them, can be used to identify equivalent entities and allow the transfer of annotation.
The principle behind this is discussed in section 2.4.1 of this document, and is applied using protein identifiers to map between the features described in the EMBL Nucleotide Sequence Database and records in the UniProt Knowledgebase, and thereafter to other resources. When primary sequence data is sourced from SGD, TAIR or MIPS, a further step is added to the procedure, as cross-references between these resources and the UniProtKB are less well synchronised. In these cases, the sequence similarity approach (see section 4.3) is used to identify UniProtKB records cross-referencing to annotated features in the source database.

4.3. Data import through sequence matching

Sequence similarity comparisons are run for two reasons during Genome Reviews production. Firstly, if the database used to source the sequence was not EMBL, the blastp protein sequence similarity algorithm from the BLASTALL package (Version 2.2.6 (4/9/03); Altschul et al., Nucl. Acids Res. (1997) 25:3389-3402) is used to identify the best matching entry in the UniProtKB for each annotated CDS feature in the source database. As explained in section 4.2, if the source database is EMBL, protein identifiers can be used to map UniProtKB to EMBL and sequence matching is not required.

Secondly, sequence similarity matching may be performed to determine the accurate location of a CDS on the genome sequence when a protein sequence has been reported that does not correspond to any existing annotated CDS in the source database. At present, the only database from which such non-annotated or incorrectly annotated protein sequences are identified, and subsequently mapped to the corresponding genomic sequence, is UniProtKB/Swiss-Prot, the manually curated portion of the UniProtKB.
Reasons why a sequence in the UniProtKB may not correspond exactly to an annotated CDS feature in the source database include the following:
(i) sequence variation between individual members of one species
(ii) errors in DNA sequence
(iii) errors in gene prediction, i.e. missing predictions
(iv) errors in boundary prediction, i.e. start/stop codon incorrectly annotated
(v) errors in translation prediction, i.e. an authentic frameshift may have been missed
(vi) two CDSs are annotated, but the UniProtKB curator believes that only a single protein is actually encoded
(vii) one CDS is annotated, but the UniProtKB curator believes that two separate proteins are actually encoded

In Genome Reviews, we aim to provide a consistent picture of genomes and proteins. As such, we correct the primary source data to be consistent with the reference protein sequences provided in well-curated protein databases, wherever it is practical and meaningful to do this.

We have implemented a pipeline that, with each Genome Reviews release, takes sequences in the UniProt Knowledgebase (from all archaeal and bacterial species represented in Genome Reviews) that lack an exact sequence match to an annotated feature in the source database entry describing the corresponding genome, and maps them back onto the DNA sequence (we do not currently run this pipeline for the eukaryotic species represented in Genome Reviews). New, or adjusted, coordinates on the genome sequence defining the region encoding the "missing" protein are determined, and exceptions in the translation pattern are identified. These novel/adjusted annotations are then selectively imported into Genome Reviews. At present, we import new/revised CDS features in the following cases:
(i) Completely unannotated proteins, where there are <= 3 exceptions between the translation pattern predicted from the DNA and the protein sequence being mapped, i.e.
a maximum of 3 locations where either the reference protein sequence contains an amino acid other than the one expected from translating the corresponding codon in the DNA sequence with the standard genetic code for that species, or where the reading frame appears disrupted in some way.
(ii) Proteins that are mistakenly annotated with too many CDSs in the primary data source.
(iii) Multiple proteins that are mistakenly annotated by a single CDS in the primary data source.

Having calculated the coordinates of the new/adjusted features, and selected which cases are to be included in Genome Reviews, it is then necessary to translate the results of the mapping (i.e. a protein-DNA alignment process) into the Genome Reviews flat file format. Crucially, any discrepancy between the protein sequence associated with a (new or modified) CDS feature, and the translation of the corresponding region of DNA sequence, needs to be fully described. To do this, we have introduced a small extension to the EMBL file format.
(i) Simple translation exceptions (where amino acid X1 is found in the protein sequence, where amino acid X2 was expected from the DNA sequence) can be dealt with easily in the EMBL format using the /transl_except feature qualifier.
(ii) Apparent frameshifts (any discontinuity that disrupts the reading frame of the literal translation) are harder to deal with. These may be the result of real frameshift events, of errors in sequence or of natural variation; but given the existence of such a discrepancy, it needs to be described regardless of the cause. The join statement can be used to represent the existence of apparently unused nucleotides (i.e. a coding region is defined in two portions, excluding those nucleotides that appear not to be part of any codon actually used to encode the reference protein sequence).
(iii) The other possibility is that nucleotides are apparently absent from the DNA sequence (and a codon corresponding to a certain amino acid in the protein sequence cannot be found). A new feature qualifier /insertion has been introduced to the Genome Reviews file format to represent this.

It is always possible to use the results of a sequence alignment to place new CDS features onto the genome that translate to a corresponding protein sequence subject to the differences described through the use of /transl_except and /insertion qualifiers, and join statements. Sometimes it is possible to represent the results of an alignment in more than one way; for example, a /transl_except qualifier could always be represented by an insertion and a deletion (join). In Genome Reviews, the /transl_except feature qualifier is only used in cases where there is a single mismatching codon within a well-defined reading frame. The /transl_except qualifier is not used at the margin of deletions or insertions, where one or both neighbours of an "exceptional" codon themselves fail to match to the protein sequence. Some simple examples are given in the next section.

4.3.1. Representation of aligned sequences and their discrepancies: some examples

According to EMBL format, where the actual product of a CDS does not match the translation of the DNA according to the specified genetic code, the "real" sequence is entered under the /translation feature qualifier and any discrepancies are annotated accordingly. A join statement can be used to indicate a real or apparent frameshift, and the /exception and /transl_except feature qualifiers can also be used to indicate discrepancies. We have extended this scheme for Genome Reviews to enable us to describe the complete result of a sequence alignment using feature qualifiers and join statements in a standardised way. This enables the development of software capable of automatically "translating" from DNA to protein subject to the exceptions defined.
We can consider three types of sequence discrepancy that may disrupt an alignment between the protein sequence and the matching region of DNA: deletion (i.e. there are >=1 bases in the DNA sequence that do not seem to be part of any codon), insertion (there are amino acids in the protein sequence that do not seem to have corresponding codons in the DNA sequence) and mis-translation. Such discrepancies do not necessarily represent biological phenomena, but may result from artifacts of the procedure by which the sequences were determined.

1: Deletion

During an alignment, we may come across regions where one sequence does not match the other, and the algorithm used has decided to place a gap in the alignment rather than align two different elements, e.g. (assuming we are aligning at the amino acid level):

abc-def (Protein Sequence; '-' represents a 'gap')
|||.||| (Aligned Regions; '|' represents a match; '.' represents a 'gap')
abccdef (Genomic Sequence)

This 'Deletion' event is not explicitly labelled in the Genome Reviews files, but is instead expressed using the EMBL convention of adding an extra coordinate range to the join statement found in the CDS feature, i.e. (assuming our alignment shows amino acid sequences, and the offset of 1000 is purely for illustration):

join(1000..1008,1012..1017)

In the Genome Reviews context, this occurs when the reference protein sequence does not match the underlying genomic sequence; we therefore add the 'gap' to the reference protein sequence in order to attain a better alignment. Deletions can span one or more nucleotides, though most are a single amino acid triplet. As they are not always a whole number of triplets in length, a deletion may also imply a frameshift.

2: Insertion

This is the opposite of a deletion: the protein sequence contains regions that are not found in the underlying genomic sequence, e.g. (a reversal of the above example):

abccdef (Protein Sequence)
|||.||| (Aligned Regions)
abc-def (Genomic Sequence)

This 'Insertion' event is explicitly labelled in the Genome Reviews files. It appears as a separate qualifier of the CDS feature and is identified as "/insertion=...", i.e.

/insertion="1008^1009,seq:C"

In the Genome Reviews context, this is the reverse of the deletion event; we therefore add the 'gap' to the genomic sequence in order to attain a better alignment. Insertions are allowed to span several amino acids, which are represented by their respective single-character symbols. Where an insertion consists of multiple amino acids, these are presented according to their order in the protein. The numerical range (two sequential integers separated by a caret) indicates the nucleotides either side of the "missing" codon. Thus if there was a protein-coding DNA sequence as follows:

ATGATG

and we were to imagine the presence of two amino acids (A and B, in that order) between the two encoded methionines, this would be annotated as such:

FT CDS 1..6
FT /insertion="3^4,seq:AB"
FT /translation="MABM"

whereas if the CDS extended in the opposite direction, and we were to imagine the presence of two amino acids (A and B, in that order) between the two encoded histidines, this would be annotated in this way:

FT CDS complement(1..6)
FT /insertion="3^4,seq:AB"
FT /translation="HABH"

Special case: C-terminal insertion

If, in an alignment, the coding sequence that has been aligned with the genomic sequence shows a discrepancy at the 3' end (corresponding to the C-terminus of the encoded protein), and the underlying genomic sequence does not have a stop codon in that region, we represent the end location of the CDS feature with the position of the last nucleotide of the last matching triplet, and we add an insertion between that position and the next nucleotide in the genomic sequence. The inserted sequence consists of the C-terminal amino acid(s) after the last match.

3: Translation Exception

These occur when the protein sequence has regions which are genuinely different from the underlying sequence, but the alignment algorithm has determined that it is best to keep the mismatch rather than adding a 'gap' to the range. This would typically occur when the regions surrounding the mismatch are good matches themselves, and adding a gap would adversely affect these regions, e.g.

abccdef (Protein Sequence)
|||!||| (Aligned Regions; the '!' symbol is used to highlight the affected region)
abczdef (Genomic Sequence)
This 'Translation Exception' event is explicitly labelled in the Genome Reviews files. It appears as a separate qualifier of the CDS feature and is identified as "/transl_except=...", i.e.

/transl_except="(pos:4235116..4235118,aa:Cys)"

In a Genome Reviews context, we represent each differing amino acid as a separate element, using the standard three-letter code rather than the single-letter code.
When a UniProt Knowledgebase sequence is mapped to an EMBL feature and the original coordinates are altered as a result, in a small number of cases the genomic sequence does not contain a termination codon. In such cases, a translation exception event is introduced indicating the position of the missing stop codon, e.g. /transl_except="(pos:12233..12235,aa:TERM)"
See Appendix III for more details on the application of each feature qualifier type.
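
The three conventions described in this section (join statements, /insertion and /transl_except) can be combined in a small translation routine. The following Python sketch is illustrative only and is not the Genome Reviews production code: the function name translate_cds, its argument layout, and the choice to pass qualifier values as already-parsed dictionaries are all assumptions made for this example. It uses the standard genetic code only, and strips a trailing stop codon from the result.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def translate_cds(genome, segments, complement=False,
                  insertions=None, transl_except=None):
    """Translate a CDS subject to its annotated exceptions (hypothetical helper).

    genome        -- DNA string; genomic position 1 is genome[0]
    segments      -- (start, end) pairs of the join statement, 1-based,
                     inclusive, in ascending genomic order
    insertions    -- {(left, right): residues} for /insertion="left^right,seq:..."
    transl_except -- {(start, end): residue} one-letter residue for the codon
                     occupying start..end ("*" for a terminator)
    """
    insertions = insertions or {}
    transl_except = transl_except or {}

    # Genomic positions of the coding bases, in 5'->3' coding order.
    positions = [p for start, end in segments for p in range(start, end + 1)]
    if complement:
        positions.reverse()

    protein = []
    usable = len(positions) - len(positions) % 3
    for i in range(0, usable, 3):
        codon_pos = positions[i:i + 3]
        bases = [genome[p - 1].upper() for p in codon_pos]
        if complement:
            bases = [COMPLEMENT.get(b, "N") for b in bases]
        span = (min(codon_pos), max(codon_pos))
        default = CODON_TABLE.get("".join(bases), "X")
        protein.append(transl_except.get(span, default))

        # An insertion lies between the last base of this codon and the
        # next base in coding order; residues are given in protein order.
        last = codon_pos[-1]
        gap = (last - 1, last) if complement else (last, last + 1)
        protein.append(insertions.get(gap, ""))
    return "".join(protein).rstrip("*")


# The two /insertion examples from this section:
print(translate_cds("ATGATG", [(1, 6)], insertions={(3, 4): "AB"}))  # MABM
print(translate_cds("ATGATG", [(1, 6)], complement=True,
                    insertions={(3, 4): "AB"}))                      # HABH
```

A deletion needs no special handling in this sketch: the skipped nucleotides are simply absent from the join segments passed in.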
4.3.2. Outline of mapping procedure for unannotated CDSs

Records in the UniProt Knowledgebase representing proteins encoded by completely deciphered genomes, but which have not been annotated in the corresponding EMBL entry, are identified (such records are annotated as "unannotated CDSs"). These records are compared to the genome sequence using the program tblastn from the package BLASTALL (Version 2.2.6 (4/9/03); Altschul et al., Nucl. Acids Res. (1997) 25:3389-3402) to identify exact matches, i.e. cases where a continuous coding sequence for the entire protein can be identified on part of the corresponding DNA sequence, with no gaps or other irregularities. In such cases, the location data is included directly in the Genome Reviews data files. If, however, a location is not found by this method, a customised version of the ALIGN tool (version 2.0u by Myers and Miller, CABIOS (1989) 4:11-17) is used to try to locate a CDS for the protein, which may be non-contiguous and thus also require more annotation to describe the differences between the protein and genomic sequences.

The same procedure is also used to map sequences in the UniProt Knowledgebase that are mapped to coding sequences in the EMBL genome records (indicated in the UniProt entry through the presence of a cross-reference to the EMBL protein identifier), but where the sequence in the UniProt entry does not agree with the translation of the EMBL feature. In this case the aforementioned ALIGN program is used to find the regions of DNA that disagree with the reported protein sequence. Once identified, these regions are referenced to the corresponding amino acids, and the resulting conflicts between the UniProt protein sequence and the EMBL DNA sequence can thus be expressed through the join statements and feature qualifiers added to the Genome Reviews files, as discussed in the previous section.
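
To make the notion of an "exact match" concrete, the following Python sketch searches all six reading frames of a genome for a continuous, gap-free stretch that translates to a given protein. It is a naive stand-in written for illustration and is not tblastn: the function names are hypothetical, only the standard genetic code is used, and the coordinates returned cover the matching codons only (they do not include a stop codon).

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_exact_cds(genome, protein):
    """Return (start, end, strand) hits, 1-based inclusive genomic coordinates."""
    genome = genome.upper()
    n = len(genome)
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            translated = translate(seq[frame:])
            idx = translated.find(protein)
            while idx != -1:
                first = frame + 3 * idx              # 0-based offset on this strand
                last = first + 3 * len(protein) - 1
                if strand == "+":
                    hits.append((first + 1, last + 1, strand))
                else:
                    # Convert offsets on the reverse strand back to
                    # forward-strand genomic coordinates.
                    hits.append((n - last, n - first, strand))
                idx = translated.find(protein, idx + 1)
    return hits


print(find_exact_cds("CCATGGCCAAATAGGG", "MAK"))  # [(3, 11, '+')]
```

A hit on the "-" strand would be written with the complement operator in the feature location, e.g. complement(6..14).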
Data from the mapping process is selectively being incorporated into Genome Reviews.

4.3.3. Example of a new feature added through sequence comparison

FT CDS complement(34089..34256)
FT /evidence="{BLASTALL 2.2.6/ALIGN 2.0u}"
FT /product="Hypothetical protein yaaV
FT {UniProtKB/Swiss-Prot:P46415}"
FT /db_xref="UniParc:UPI0000139FFD {UniProtKB/Swiss-Prot:P46145}"
FT /db_xref="UniProtKB/Swiss-Prot:P46145 {UniProtKB/Swiss-Prot:P46145}"
FT /db_xref="Ecogene:EG12706 {UniProtKB/Swiss-Prot:P46145}"
FT /translation_table=11
FT /translation="MTRFRAIKQHKIVDISIVCNNFTVDKCELNPAYVIKNIDSPKDL
FT LNGQKKTVLIREPY"

4.4. Data import through sequence analysis

Annotation of non-protein-coding genes in completely sequenced genomes is erratic and in some cases wholly absent. We have addressed this by running several computational analyses of genomes in which such genes have not been annotated, and by adding new annotations for such genes according to the results. We have implemented three analysis pipelines, depending on the type of RNA gene we are aiming to detect.

4.4.1. Detection of non-protein-coding genes

Analysis of non-protein-coding genes, other than transfer RNA genes and ribosomal RNA genes, as well as RNA motifs, is performed in a general manner, using the program rfam_scan.pl. rfam_scan.pl is a Perl wrapper for searching DNA sequences against the latest Rfam Database (Griffiths-Jones S. et al. (2005)) using the INFERNAL software package (Eddy S.R. (2002)). Depending on their type, the genes detected by this procedure are annotated as:
This classification conforms to proposals currently in the process of being implemented in the EMBL Nucleotide Sequence Database, and represents a change to the previous classification used in that database and Genome Reviews. As a consequence of this new classification, the snoRNA and snRNA feature types have been replaced by the ncRNA feature type, whose subtype is defined in an ncRNA_class qualifier. The values of this qualifier are restricted to the following controlled vocabulary:
4.4.2. Detection of transfer RNA genes

Analysis is specifically performed using tRNAScan-SE. See Lowe T.M. and Eddy S.R. (1997) "tRNAScan-SE: a program for improved detection of transfer RNA genes in genomic sequence", Nucl. Acids Res., 25, 955-964. See also http://www.genetics.wustl.edu/eddy/tRNAScan-SE/

Version 1.23 of the program was used, configured for superregnum as appropriate. New tRNA-encoding genes are annotated as tRNA features in the following way:

FT tRNA complement(28316..28391)
FT /evidence="{tRNAScan-SE-1.23}"
FT /gene_name="tRNA:Ile (GAU) {tRNAScan-SE-1.23}"
FT /anticodon=(pos:28356..28358,aa:Ile)

The anticodon and the corresponding amino acid are presented under the /gene_name qualifier. Note that if the tRNA gene is found on the reverse strand, the direction is indicated through the use of "complement", as with CDS features (but unlike tRNA genes in EMBL records). We are looking into the possibility of upgrading/replacing original tRNA annotations in future releases.

4.4.3. Detection of ribosomal RNA genes

Analysis is specifically performed using RNAmmer. See Lagesen K. et al. (2007) "RNAmmer: consistent annotation of rRNA genes in genomic sequences", Nucl. Acids Res., 35, 3100-3108. See also http://www.cbs.dtu.dk/services/RNAmmer

Version 1.2 of the program was used, configured for superregnum as appropriate. New ribosomal RNA-encoding genes are annotated as rRNA features in the following way:

FT rRNA complement(28316..28391)
FT /evidence="{RNAmmer-1.2}"
FT /gene_name="16s_rRNA {RNAmmer-1.2}"
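
Feature blocks such as those shown in sections 4.3.3 and 4.4 can be read programmatically by separating each qualifier value from its trailing evidence tag. The following Python sketch is a minimal illustration under stated assumptions: the function names are hypothetical, it takes the FT lines of a single feature, it detects continuation lines by unbalanced double quotes rather than by column position, and it treats the text inside braces as a ";"-separated list of evidence items.

```python
import re

# A trailing "{...}" at the end of a qualifier value is taken as the evidence tag.
EVIDENCE_RE = re.compile(r"\s*\{([^{}]*)\}\s*$")


def split_evidence(value):
    """Split 'text {item; item}' into (text, [items])."""
    match = EVIDENCE_RE.search(value)
    if not match:
        return value, []
    items = [item.strip() for item in match.group(1).split(";") if item.strip()]
    return value[:match.start()], items


def parse_feature(lines):
    """Parse the FT lines of one feature into key, location and qualifiers."""
    feature = {"key": None, "location": None, "qualifiers": []}
    pending = None  # qualifier text whose quoted value is still open
    for line in lines:
        content = line[2:].strip() if line.startswith("FT") else line.strip()
        if pending is not None:
            pending += " " + content          # continuation of a quoted value
        elif content.startswith("/"):
            pending = content
        else:
            feature["key"], feature["location"] = content.split(None, 1)
            continue
        if pending.count('"') % 2 == 0:       # all quotes closed: value complete
            name, _, raw = pending[1:].partition("=")
            text, evidence = split_evidence(raw.strip('"'))
            if name == "translation":
                text = text.replace(" ", "")  # rejoin wrapped sequence lines
            feature["qualifiers"].append((name, text, evidence))
            pending = None
    return feature


example = [
    'FT tRNA complement(28316..28391)',
    'FT /evidence="{tRNAScan-SE-1.23}"',
    'FT /gene_name="tRNA:Ile (GAU) {tRNAScan-SE-1.23}"',
    'FT /anticodon=(pos:28356..28358,aa:Ile)',
]
for qualifier in parse_feature(example)["qualifiers"]:
    print(qualifier)
```

Each qualifier is returned as a (name, text, evidence items) tuple, so annotation that came from the original entry can be told apart from annotation added by a specific tool or database.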
5) GENOME REVIEWS GENE RECORDS

A gene record describes the DNA sequence that encodes the products of a gene. Where expression information is available about specific genes (e.g. information about promoters/UTRs), this has been used in defining the gene. However, for many genomes included in Genome Reviews, no specific information is available at the gene level; in these cases, a virtual gene is assumed to exist comprising all splicing variants, by taking the start and end coordinates of the longest coding sequence. Individual regions of known function or character are annotated as features on the sequence. Where a protein sequence in the UniProtKB does not exactly correspond with the nucleotide sequence in the gene set, differences between the conceptual translation and the actual protein sequence are represented through the use of qualifiers in the annotation, as they are in Genome Reviews component records.

Data files representing sets of records, comprising the genes derived from each genome component molecule (and each complete genome), can be downloaded for all genomes in Genome Reviews. All data are available in FASTA format, and additionally in a richer file format, EMBL CDS-like format, containing more detailed, structured annotation. Gene sets for components and complete genomes can be accessed by ftp using the appropriate link below:
6) GENOME REVIEWS TRANSCRIPT RECORDS

A transcript record describes a processed transcript after any post-transcriptional events, such as splicing, may have occurred. In the case of polycistronic transcription, all genes that are known to be transcribed together will be part of the same transcript record. Where information is available about specific transcripts (e.g. information about promoters/UTRs; co-transcription; or alternative translational information), this has been used in defining the transcripts. However, for many genomes included in Genome Reviews, no specific information is available at the transcript level; in these cases, a virtual transcript is assumed to exist for each known protein product. Information about the relationship between entities at each of these levels is summarised in this README file. As with complete molecules and gene records, individual regions of known function or character are annotated as features on the sequence.

Data files representing sets of records, comprising the transcripts derived from each genome component molecule (and each complete genome), can be downloaded for all genomes in Genome Reviews. All data are available in FASTA format, and additionally in a richer file format, EMBL CDS-like format, containing more detailed, structured annotation. Transcript sets for components and complete genomes can be accessed by ftp using the appropriate link below:
7) SEARCHING AND DOWNLOADING GENOME REVIEWS

7.1 Downloading Genome Reviews

Access to Genome Reviews flat files is described in the release notes, available at ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/ReleaseNotes.txt.

7.2 Searching Genome Reviews through SRS

Genome Reviews is available for search under the EBI's SRS server, as follows:
For more information on how to use the Query Forms, please see the SRS documentation at http://srs.ebi.ac.uk/doc/index.html

7.3 Searching Genome Reviews through Integr8

An Ensembl-style browser is now available for Genome Reviews, providing a zoomable graphical view of all chromosomes and plasmids represented in the database. The location and structure of all genes is shown and the distribution of features throughout the sequence is displayed. In the search box on either the Genome Reviews or Integr8 homepage, select your gene or protein of interest, optionally specify the species, and then click on the

7.4 Installation of a local Genome Reviews MySQL database

The file gr_mysql-release_xx.sql.gz at ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/sql/ represents an export of Genome Reviews release xx from a MySQL database. The data corresponds to that found in the flat file distribution (ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/dat/). The database schema is essentially the same as that used by Ensembl (http://www.ensembl.org/) to describe higher eukaryotic genomes.

USAGE: We recommend you use MySQL version 4 or later, though previous versions may work. If you get a 'Packet too large' error from your MySQL server, you will have to increase the value of the server variable max_allowed_packet above the default value of 1M to at least 32M. For MySQL version 4.0 or later, this can be done by adding the following line to the [mysqld] section of your my.cnf file and restarting the MySQL server daemon.

[mysqld]
max_allowed_packet = 32M

Please note that MySQL versions prior to 4.0 do not support higher values. See your MySQL documentation for further information.

Download the latest Genome Reviews schema from ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/sql/, file gr_mysql-release_xx.sql.gz, where xx is the current release. If you are using unix/linux then you can import the schema and its data into your MySQL database as follows:
The command line options are as follows:
$HOST - the host (optional if you are using localhost).
Unix/linux (command is on a single line): gunzip -c gr_mysql-release_xx.sql.gz | mysql -h $HOST -P $PORT -u $USER -p$PASS -D $DBNAME
Windows: mysql -h $HOST -P $PORT -u $USER -p$PASS $DBNAME -f < gr_mysql-release_xx.sql

8) APPENDICES

8.1 Appendix I: Feature table: Backus-Naur form

The feature table is a mandatory part of an entry. Full entry syntax is specified elsewhere. This definition is an amended version of the feature table definition given in the EMBL feature table document (URL: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html). The primary formal changes in specification derive from the introduction of evidence tags into the feature qualifiers of Genome Reviews records.

feature_table ::= <feature_table_header><feature_table_body>
feature_table_header ::= FH Key Location/Qualifiers |
FEATURES Location/Qualifiers
feature_table_body ::= <feature> | <feature_table_body><feature>
At least one feature is required.
feature ::= <feature_key><feature_details>
Key is required, location required, qualifier list optional
feature_key ::= <symbol>
feature_details ::= <location><qualifier_list> | <location>
There exists a table of legal keys.
location ::= <absolute_location> | <feature_name> |
<functional_operator>(<location_list>)
absolute_location ::= <local_location> | <path> : <local_location>
path ::= <database> :: <primary_accession> | <primary_accession>
feature_name ::= <path>:<feature_label> | <feature_label>
feature_label ::= <symbol>
local_location ::= <base_position> | <between_position> | <base_range>
location_list ::= <location> | <location_list>,<location>
functional_operator ::= <symbol>
base_position ::= <integer> | <low_base_bound> | <high_base_bound> |
<two_base_bound>
low_base_bound ::= > <integer>
high_base_bound ::= < <integer>
two_base_bound ::= <base_position>.<base_position>
between_position ::= <base_position>^<base_position>
base_range ::= <base_position>..<base_position>
database ::= <symbol>
primary_accession ::= <symbol>
sequence_character ::= a | b | c | d | g | h | k | m | n | r | s | t | u | v | w | y
qualifier_list ::= <qualifier> | <qualifier_list><qualifier>
qualifier ::= /<qualifier_name> | /<qualifier_name>=<value>
qualifier_name ::= <symbol>
value ::= <simple_value> | (<value_list>) | (<tagged_value_list>) |
simple_value ::= <integer> | <location> | <reference_number> | "<text_string>" |
"<text_string> <evidence_tag>" | <symbol>
value_list ::= <value> | <value_list>,<value>
tagged_value_list ::= <tagged_value> | <tagged_value_list>,<tagged_value>
tagged_value ::= <tag>:<value>
tag ::= <symbol>
reference_number ::= [ <unsigned_integer> ]
symbol ::= <letter> | <symbol><symbol_character> | <symbol_character><symbol>
text_string ::= <string_character>| <text_string><string_character>
evidence_tag ::= { <evidence_item_list> }
evidence_item_list ::= <evidence_item> | <evidence_item>; <evidence_item_list>
evidence_item ::= <text>:<evidence_value>
evidence_value ::= <text> | !<text>
unsigned_integer ::= <digit> | <unsigned_integer><digit>
integer ::= <unsigned_integer> | - <unsigned_integer>
string_character ::= <letter> | <digit> | <punctuation> | ""
symbol_character ::= <up_case_letter> | <low_case_letter> |<digit> | _ | - | ' | *
letter ::= <up_case_letter> | <low_case_letter>
up_case_letter ::= A | B| ... | Z
low_case_letter ::= a | b | ... | z
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
punctuation ::= <space> | ! | # | $ | % | & | ' | ( | ) | * | + | , |
- | . | / | : | ; | < | = | > | ? | @ | [ | \ | ] | ^ | _ | ` | { |
<bar> | } | ~
bar ::= |
space ::= ascii 32
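
The location part of this grammar maps naturally onto a small recursive-descent parser. The following Python sketch is illustrative only and covers a subset of the productions above: base positions (optionally prefixed with < or >), base ranges (a..b), between positions (a^b), and functional operators such as join and complement applied to a location list. Paths to other entries, feature labels and two_base_bound positions are not handled, and the function names and the nested-tuple output format are assumptions made for this example.

```python
import re

# One token: "..", an operator symbol, a (possibly bounded) integer, or punctuation.
TOKEN_RE = re.compile(r"\s*(\.\.|[A-Za-z][A-Za-z0-9_'*-]*|[<>]?\d+|[(),^])")
POSITION_RE = re.compile(r"[<>]?\d+")


def tokenize(text):
    """Split a location string into tokens, rejecting unknown characters."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _position(token):
    if not POSITION_RE.fullmatch(token):
        raise ValueError(f"not a base position: {token!r}")
    return token


def _location(tokens):
    """Parse one <location>; return (node, remaining tokens)."""
    if not tokens:
        raise ValueError("unexpected end of location")
    head = tokens[0]
    if head[0].isalpha():
        # <functional_operator>(<location_list>)
        if tokens[1:2] != ["("]:
            raise ValueError(f"expected '(' after {head!r}")
        items, rest = [], tokens[2:]
        while True:
            item, rest = _location(rest)
            items.append(item)
            if rest[:1] == [","]:
                rest = rest[1:]
            elif rest[:1] == [")"]:
                return (head, items), rest[1:]
            else:
                raise ValueError("expected ',' or ')'")
    if tokens[1:2] in ([".."], ["^"]):
        if len(tokens) < 3:
            raise ValueError("missing second position")
        kind = "range" if tokens[1] == ".." else "between"
        return (kind, _position(head), _position(tokens[2])), tokens[3:]
    return ("base", _position(head)), tokens[1:]


def parse_location(text):
    """Parse a location string into nested tuples."""
    node, rest = _location(tokenize(text))
    if rest:
        raise ValueError(f"trailing tokens: {rest}")
    return node


print(parse_location("complement(34089..34256)"))
print(parse_location("join(1000..1008,1012..1017)"))
```

Positions are kept as strings so that the < and > bound markers survive parsing; a caller that needs integers can strip the marker and convert.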
8.2 Appendix II: Feature keys reference

This appendix describes the use of features and their associated qualifiers in Genome Reviews records. The number of allowable feature keys and qualifier names has been reduced to standardise their usage, but some new feature keys and qualifier names have also been introduced. These lists are liable to revision in subsequent releases.

8.2.1 Feature key reference manual

This reference is organised according to the following format:

Feature Key the feature key name
Definition the definition of the key
Mandatory qualifiers qualifiers required with the key; if there are
no mandatory qualifiers, this field is omitted.
Optional qualifiers optional qualifiers associated with the key
Comment comments and clarifications
Abbreviations:
accnum an entry primary accession number
<evidence_tag> evidence tag, as discussed in sections 2.4,
3.16 and 4.4
<integer> unsigned integer value
Feature Key CDS
Definition coding sequence; sequence of nucleotides that
corresponds with the sequence of amino acids in a
protein (location includes stop codon);
feature includes amino acid conceptual
translation;
Optional qualifiers /biological_process="<GO_term> <evidence_tag>"
/cellular_component="<GO_term> <evidence_tag>"
/db_xref="<database>:<identifier> <evidence_tag>"
/EC_number="text <evidence_tag>"
/function="text <evidence_tag>"
/gene_name="text <evidence_tag>"
/gene_synonym="text <evidence_tag>"
/locus_tag="text <evidence_tag>"
/product="text <evidence_tag>"
/product_synonym="text <evidence_tag>"
/protein_id="<identifier> <evidence_tag>"
/pseudo="<evidence_tag>"
/translation="text"
/transl_table=<integer>
Comment /codon_start has a valid value of 1 or 2 or 3,
indicating the offset at which the first
complete codon of a coding feature can be found,
relative to the first base of that feature;
/transl_table defines the genetic code table
used if other than the universal genetic code
table; genetic code exceptions outside the range
of the specified tables are reported in /codon
or /transl_except qualifier; /protein_id consists
of a stable ID portion (3+5 format with 3
position letters and 5 numbers) plus a version
number after the decimal point; when the protein
sequence encoded by the CDS changes, only the
version number of the /protein_id value is
incremented; the stable part of the /protein_id
remains unchanged and as a result will
permanently be associated with a given protein;
/transl_table and /translation not used for
pseudogenes (i.e. not used in conjunction with
/pseudo).
Feature Key conflict
Definition independent determinations of the "same" sequence
differ at this site or region
Mandatory qualifiers /citation=[number]
Optional qualifiers /replace="text"
Feature Key gap
Definition gap in the sequence
Mandatory qualifiers /estimated_length=unknown or
8.3 Appendix III: Summary of qualifiers for feature keys

8.3.1 Qualifier List

The following is a list of available qualifiers for feature keys and their usage. It also describes the procedures by which data of each type is identified and imported. A full list of sources from which data is imported is given in Appendix 8.5. As noted in section 4.2, the number of qualifiers has been reduced in Genome Reviews compared with the EMBL Nucleotide Sequence Database. Some new qualifiers have been added and the data content of some qualifiers has been altered. The most notable change to feature qualifiers has been the introduction of evidence tags, described in section 2.4 of this document. In this Appendix, "EMBL" refers to the EMBL Nucleotide Sequence Database; "UniProt" to the UniProt Knowledgebase; "UniParc" to the UniProt Archive; and "GOA" to the Gene Ontology Annotation Database (see Appendix 8.5).

Qualifier name of qualifier; qualifier requires a value if
followed by an equal sign
Definition definition of the qualifier
Value format format of value, if required
Example example of qualifier with value
Comment comments, questions and clarifications
Data source explanation of how data of this type is sourced for Genome Reviews
Qualifier /anticodon
Definition location of the anticodon of tRNA and the amino acid for which
it codes
Value format (pos:<base_range>,aa:<amino_acid>) where base_range is the
position of the anticodon and amino_acid is the abbreviation
for the amino acid encoded
Example /anticodon=(pos:34..36,aa:Phe)
Data source may be taken from the parental EMBL entry, or derived through
sequence analysis;
Qualifier /biological_process=
Definition biological process in which the product of this CDS
takes part;
Value format "<GO_term> <evidence_tag>"
Example /biological_process="protein folding {GO:0006457}"
Comment biological_process is defined using a term from
the Gene Ontology, a controlled vocabulary for
describing gene products; GO terms are divided
between 3 primary hierarchies (function, biological_process and
cellular component);
Data source GO terms are imported into Genome Reviews files from
GOA, a database of associations between gene products
and GO terms; the gene products in GOA are identified
by UniProtKB IDs and can be mapped to CDS features via
the cross-references between EMBL and UniProtKB;
Qualifier /biovar=
Definition a sub-species level taxonomic characterisation based
on physiological characters; "biotype" is a synonym of
"biovar" but biovar is the correct term;
Value format "text"
Example /biovar="Orientalis"
Comment used only with the source feature key;
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project and is
combined with the original taxonomic annotation in the
parent EMBL record;
Qualifier /cellular_component=
Definition cellular component to which the product of this CDS
has been localised;
Value format "<GO_term> <evidence_tag>"
Example /cellular_component="cytoplasm {GO:0005737}"
Comment the cellular component is defined using a term from
the Gene Ontology, a controlled vocabulary for
describing gene products; GO terms are divided
between 3 primary hierarchies (function, biological_process and
cellular component);
Data source GO terms are imported into Genome Reviews files from
GOA, a database of associations between gene products
and GO terms; the gene products in GOA are identified
by UniProtKB IDs and can be mapped to CDS features via
the cross-references between EMBL and UniProtKB;
Qualifier /chromosome=
Definition chromosome (e.g. chromosome number) from which
the sequence was obtained;
Value format "text <evidence_tag>"
Comment used only with the source feature key;
Example /chromosome="1"
Qualifier /citation=
Definition reference to a citation listed in the entry reference field
Value format [integer-number] where integer-number is the number of the
reference as enumerated in the reference field
Example /citation=[1]
Comment used to indicate the citation providing the claim of and/or
evidence for a feature; brackets are used for conformity.
Qualifier /cultivar=
Definition a cultivated selection from a plant population that
can be propagated reliably in a prescribed manner;
Value format "text <evidence_tag>"
Comment used only with the source feature key;
Example /cultivar="Columbia"
Qualifier /db_xref=
Definition database cross-reference: pointer to related
information in another database;
Value format "<database>:<identifier> <evidence_tag>" where
database is the name of the database containing
related information, and identifier is the internal
identifier of the related information according to the
naming conventions of the cross-referenced database.
Example /db_xref="InterPro:IPR001957 {UniProtKB/Swiss-Prot:Q8PEH5}"
Comment the complete list of cross-references currently used
in Genome Reviews is given in Appendix V (section 8.5) of this
document;
Data source the original cross-references in an EMBL entry are
supplemented by additional cross references obtained
from corresponding records in the UniProtKB, UniParc and
GOA databases;
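A /db_xref value can be taken apart along the lines of the value format given above. This is only a sketch: parse_db_xref is a hypothetical helper, and it assumes that database names contain neither colons nor spaces and that the evidence tag, when present, is enclosed in braces at the end of the value.

```python
import re

# Assumed layout: "<database>:<identifier> {<evidence_tag>}"; the tag is optional.
_DB_XREF = re.compile(
    r"(?P<database>[^:\s]+):(?P<identifier>\S+)(?:\s+\{(?P<tag>[^{}]*)\})?"
)


def parse_db_xref(value):
    """Return (database, identifier, evidence_tag); evidence_tag may be None."""
    match = _DB_XREF.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised /db_xref value: {value!r}")
    return match.group("database"), match.group("identifier"), match.group("tag")
```

For the example above, the three parts would be the database InterPro, the identifier IPR001957, and the evidence tag UniProtKB/Swiss-Prot:Q8PEH5.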
Qualifier /EC_number=
Definition Enzyme Commission number for enzyme product of sequence
Value format "text <evidence_tag>"
Example /EC_number="6.3.1.1 {UniProtKB/Swiss-Prot:Q8ZJT3}"
Comment valid values for EC numbers are defined in the list
prepared by the IUPAC-IUB Commission on Biochemical
Enzyme Nomenclature
(published in Enzyme Nomenclature 1984 New York:
Academic Press (1984) or a more recent revision
thereof).
Data source EC numbers in Genome Reviews files are derived from UniProtKB,
where the original annotation in EMBL records may be
supplemented, corrected, or deleted by curators; EC_numbers
describing portions of a protein sequence may be
transferred to corresponding novel features in the Genome
Reviews entry subject to sequence agreement;
Qualifier /evidence=
Definition evidence supporting the inclusion of a feature (as opposed
to a feature qualifier) in a Genome Reviews entry;
Value format "<evidence_tag>"
Example /evidence="{UniProtKB/Swiss-Prot:Q8ZJT3}"
Comment the /evidence qualifier is used in Genome Reviews records
to hold information about the source of the information used
to attach a novel feature to an entry; the evidence takes
the form of an evidence tag (note that when tags are
attached to other qualifiers, they indicate the source of
the information used to attach that qualifier to a feature);
where no evidence qualifier is used, it can be assumed that
the feature was included in the primary source entry in the
EMBL Nucleotide Sequence Database from which this Genome
Reviews entry is derived;
Data source Whatever data source is described in the tag;
Qualifier /focus
Definition identifies the primary source of a Genome Reviews entry,
where there is more than one source feature;
Value format none
Example /focus
Comment secondary source features may exist, for example in the
case where an insertion sequence is present in a chromosome;
Data source the original EMBL entry on which the Genome Reviews entry is
based;
Qualifier /function=
Definition function attributed to a sequence;
Value format "text <evidence_tag>"
Example /function="3'-5'-exonuclease activity {GO:0008408}"
Comment the data stored under the /function qualifier is defined
using a term from the Gene Ontology, a controlled vocabulary
for describing gene products; GO terms are divided between 3
primary hierarchies (function, biological_process and cellular
component);
Data source GO terms are imported into Genome Reviews files from
GOA, a database of associations between gene products
and GO terms; the gene products in GOA are identified
by UniProtKB IDs and can be mapped to CDS features via
the cross-references between EMBL and UniProtKB;
Qualifier /gene_id=
Definition a stable Integr8/Genome Reviews gene identifier that
uniquely identifies a gene
Value format "text"
Example /gene_id="IGI00723232"
Comment gene_id qualifiers are currently only assigned to protein
coding genes and are added to CDS features; the format of the
gene identifier is 'IGI' followed by 8 digits. Please
note that gene IDs are unique for each gene, not necessarily
for each coding region; e.g. in case of alternative splicing,
splice variants of the same gene carry the same gene identifier.
Data source gene_ids are generated during the Integr8/Genome Reviews
production pipeline.
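The identifier format described in the comment ('IGI' followed by 8 digits) can be checked with a simple pattern. The helper name is_gene_id is hypothetical; the sketch assumes the identifier is supplied without surrounding quotes.

```python
import re

# 'IGI' followed by exactly 8 digits, as described in the comment above.
_GENE_ID = re.compile(r"IGI\d{8}")


def is_gene_id(text):
    """Return True if text has the documented gene identifier format."""
    return _GENE_ID.fullmatch(text) is not None
```

Because splice variants of the same gene carry the same identifier, a check of this kind validates the format only and says nothing about how many CDS features share the identifier.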
Qualifier /gene_name=
Definition symbol of the gene corresponding to a sequence region
Value format "text <evidence_tag>"
Example /gene_name="ilvE {UniProtKB/Swiss-Prot:Q8ZJT3}"
Comment a gene can be considered as a collection of functionally
related features, some of which may be CDSs (coding sequences),
and others of which may be promoters, UTRs, mRNAs, etc; in
EMBL records, the gene feature is typically used to
mark the span enclosing all such features; this can
cause problems where genes overlap; in addition, for
most EMBL records representing complete genome
sequences, the only feature belonging to each gene
that has been annotated is a single CDS feature, and
the gene feature (as used) is redundant with this; therefore,
in Genome Reviews, the gene feature has been dropped;
if several features belong to the same gene, this is
indicated by the qualification of those features with
identical /gene_name and /locus_tag qualifiers; the
/gene_name qualifier is used to indicate the primary,
biologically relevant name for a gene; where other
names are available, these are indicated using the
/gene_synonym qualifier; ordered systematic names
(which do not imply biological function) are stored
using the /locus_tag qualifier;
Data source data in gene qualifiers (applied to CDS features) in Genome
Reviews files is derived from UniProtKB, where the original
EMBL-derived names may be supplemented, corrected, or
deleted by curators; data in gene qualifiers (applied to
tRNA features) may be taken from the parental EMBL entry,
or derived through sequence analysis;
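Since the gene feature is dropped, features belonging to the same gene have to be collected by comparing their /gene_name and /locus_tag qualifiers. The sketch below shows one way to do this. The in-memory representation (a list of dictionaries with "key" and "qualifiers" entries) and the function name group_by_gene are assumptions made for illustration, and qualifier values are assumed to have had their evidence tags removed already.

```python
from collections import defaultdict


def group_by_gene(features):
    """Group feature keys by their (gene_name, locus_tag) pair.

    Each feature is assumed to be a dict of the form
    {"key": "CDS", "qualifiers": {"gene_name": ..., "locus_tag": ...}}.
    Features carrying neither qualifier are left out.
    """
    groups = defaultdict(list)
    for feature in features:
        qualifiers = feature["qualifiers"]
        gene = (qualifiers.get("gene_name"), qualifiers.get("locus_tag"))
        if gene != (None, None):
            groups[gene].append(feature["key"])
    return dict(groups)
```

Keying on the pair rather than on /gene_name alone keeps features apart when two genes share a name but have different locus tags.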
Qualifier /gene_synonym=
Definition secondary (synonymous) symbol of the gene corresponding to a sequence region
Value format "text <evidence_tag>"
Example /gene_synonym="BACA {UniProtKB/Swiss-Prot:Q8PDZ9}"
Comment where more than one gene name is available, secondary names
are stored under the /gene_synonym qualifier; /gene_synonym
qualifiers are only attached to the primary feature derived
from each gene, and not to secondary features (e.g. this
qualifier is attached to features such as CDS, rRNA, but
not features such as mat_peptide, peptide, which represent
processed versions of primary translations);
Data source data in gene_synonym qualifiers in Genome Reviews files is
derived from UniProtKB, where the original EMBL-derived
names may be supplemented, corrected, or deleted by
curators;
Qualifier /host=
Definition natural host from which the sequence was obtained;
Value format "text"
Comment added to phage records if absent in the original EMBL
parent entry and a single host is known from the scientific
literature. See also /host_range.
Example /host="Acyrthosiphon pisum"
Data source the original parent entry of this Genome Reviews entry
in the EMBL Nucleotide Sequence Database or scientific
literature;
Qualifier /host_range=
Definition (spectrum of) known natural hosts that a species/strain
can infect;
Value format "text"
Example /host_range="Prochlorococcus; Synechococcus"
Comment added to phage records if absent in the original EMBL
parent entry and multiple hosts are known from the scientific
literature. See also /host.
Data source scientific literature;
Qualifier /insertion=
Definition a special type of translational exception: comprises one or
many amino acids (indicated in single letter code) present
in a translation where the corresponding codon is not
present in the underlying nucleotide sequence
Value format (pos:location,seq:<amino_acids>), where amino_acids are the extra
residues to be inserted, represented in single-letter code
Example /insertion="531^532,seq:AV"
Comment insertion qualifiers may be used (in conjunction with join
statements) where only one nucleotide from a CDS is missing;
this can be represented as an insertion of one amino acid
(corresponding to 3 nucleotides), and a gap of 2 nucleotides
in the coding sequence; amino acids are presented in protein
coding order; the numerical range (indicated by two
sequential integers separated by a caret) indicates the
nucleotides either side of the "missing" codon; amino acids
to be inserted are given according to their order in the
protein;
Data source insertion qualifiers are derived from a mapping process applied
when comparing reference protein sequences to the genomic DNA
sequence;
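The following sketch reads an /insertion value into the two flanking nucleotide positions and the residues to be inserted. The helper name parse_insertion is hypothetical. Because the value format and the example above differ in whether the parentheses and the "pos:" prefix are present, the sketch assumes both spellings may occur and accepts either.

```python
import re

# Accepts both '531^532,seq:AV' and '(pos:531^532,seq:AV)' (assumed variants).
_INSERTION = re.compile(
    r"\(?(?:pos:)?(?P<left>\d+)\^(?P<right>\d+),seq:(?P<residues>[A-Za-z]+)\)?"
)


def parse_insertion(value):
    """Return (left, right, residues) for an /insertion qualifier value."""
    match = _INSERTION.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised /insertion value: {value!r}")
    left, right = int(match.group("left")), int(match.group("right"))
    if right != left + 1:
        # The caret notation names the two nucleotides either side of the gap.
        raise ValueError(f"positions are not adjacent: {value!r}")
    return left, right, match.group("residues")
```

The adjacency check follows the comment above, which describes the range as two sequential integers separated by a caret.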
Qualifier /isolate=
Definition individual isolate from which the sequence was obtained;
Value format "text"
Example /isolate="Porton"
Comment used only with the source feature key;
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project and is combined with
the original taxonomic annotation in the parent EMBL record;
Qualifier /locus_tag=
Definition a systematic name for a given gene, indicating its
relative position in the sequence with respect to
other genes; not indicative of biological function.
Value format "text <evidence_tag>"
Example /locus_tag="RSc0382 {UniProtKB/Swiss-Prot:Q8ZJT3}"
Comment /locus_tag can be used with any feature where /gene_name is
valid; /locus_tag values may be used more than once within
an entry, but always to indicate the same gene; in all other
circumstances the /locus_tag value must be unique
within that entry/record; together with the contents
of the /gene_name qualifier, the /locus_tag qualifier is
(where known) applied to every feature derived from
the corresponding gene (see also the discussion on the
use of /gene_name, above);
Data source data in locus_tag qualifiers in Genome Reviews files is
derived from UniProtKB, where the original EMBL-derived
locus tags may be supplemented, corrected, or deleted
by curators;
Qualifier /ncRNA_class=
Definition a structured description of the classification of the
non-coding RNA described by the ncRNA parent key
Value format "TYPE"
Example /ncRNA_class="snoRNA"
Comment where TYPE is one of the following terms: antisense_RNA, autocatalytically_spliced_intron,
hammerhead_ribozyme, RNase_P_RNA, RNase_MRP_RNA, telomerase_RNA, guide_RNA, rasiRNA, scRNA,
siRNA, miRNA, snoRNA, snRNA, SRP_RNA, stRNA, tRNA, vault_RNA, Y_RNA.
Qualifier /note=
Definition any comment or additional information
Value format "text <evidence_tag>"
Example /note="protein modification {FunCat:14.07}"
Qualifier /operon_name=
Definition name of the group of contiguous genes transcribed into a
single transcript to which that feature belongs.
Value format "text <evidence_tag>"
Example /operon_name="thrLABC {RegulonDB:ECK120014725}"
Comment valid only on Prokaryota-specific features. To accommodate
RegulonDB data, we use the extended RegulonDB definition of
operon, i.e. we allow single-gene operons.
Data source data in operon_name qualifiers is derived from RegulonDB;
Qualifier /orf_name=
Definition A name temporarily attributed by a sequencing project to an
open reading frame. This name is generally based on a cosmid
numbering system.
Value format "text <evidence_tag>"
Example /orf_name="MTV025.058 {UniProtKB/Swiss-Prot:P96420}"
Data source data in orf_name qualifiers in Genome Reviews files is
derived from UniProtKB;
Qualifier /organism=
Definition scientific name of the organism that provided the
sequenced genetic material;
Value format "text"
Example /organism="Chlamydophila caviae"
Comment used only with the source feature key; in Genome Reviews, the
content of the organism qualifier contains only the genus and
species of the relevant organism; the complete taxonomic
specification of the source organism is provided by
combining the data stored under the following qualifiers
applied to the source feature: /biovar, /organism,
/pathovar, /serovar, /strain, /sub_species, /sub_strain; a fully
descriptive name based on all these qualifiers is given in
the OS line of the entry;
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project;
Qualifier /pathovar=
Definition a strain or set of strains with similar pathogenicity,
including both host-range and symptomatology;
Value format "text"
Example /pathovar="campestris"
Comment used only with the source feature key;
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project and is
combined with the original taxonomic annotation in the
parent EMBL record;
Qualifier /plasmid=
Definition name of plasmid from which sequence was obtained
Value format "text"
Example /plasmid="C-589"
Data source EMBL or UniProtKB
Qualifier /product=
Definition primary name of a product (typically a protein name)
encoded by a sequence
Value format "text <evidence_tag>"
Example /product="DNA polymerase III beta chain {UniProtKB/TrEMBL:Q8PEH4}"
Data source product names are imported from description lines in UniProtKB
records; these may apply to complete CDSs or to derived partial
sequences (e.g. mature peptides); where more than one name is
available, secondary names are stored under the
/product_synonym qualifier;
Qualifier /product_synonym=
Definition secondary name of a product (typically a protein name)
encoded by a sequence
Value format "text <evidence_tag>"
Example /product_synonym="DNA polymerase III beta chain {UniProtKB/TrEMBL:Q8PEH4}"
Data source product names are imported from description lines in UniProtKB
records; these may apply to complete CDSs or to derived partial
sequences (e.g. mat_peptide features);
/product_synonym is used to store secondary names where more than
one name is available;
Qualifier /promoter=
Definition name of region on a DNA molecule involved in RNA polymerase
binding to initiate transcription;
Value format "<text> <evidence_tag>"
Example /promoter="thrLp {RegulonDB:ECK120014725}"
Comment in Genome Reviews, this qualifier is used to uniquely
define transcription units that are part of an operon;
Data source data in promoter qualifiers is derived from RegulonDB;
Qualifier /protein_id=
Definition protein identifier, issued by the International
Nucleotide Sequence Database collaborators EMBL, GenBank
and DDBJ; this qualifier consists of a stable ID
portion (3+5 format with 3 position letters and 5
numbers) plus a version number after the decimal point;
Value format "<identifier> <evidence_tag>"
Example /protein_id="AAA12345.1"
Comment when the protein sequence encoded by the CDS changes,
only the version number of the /protein_id value is
incremented; the stable part of the /protein_id
remains unchanged and as a result will permanently be
associated with a given protein; this qualifier is
valid only on CDS features which translate into a
valid protein; use of /protein_id in Genome Reviews is
unchanged from usage in EMBL;
Data source The original parent entry of this Genome Reviews entry
in the EMBL Nucleotide Sequence Database;
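The "3+5" layout and the version suffix described above can be separated as follows. The helper name parse_protein_id is hypothetical, and the sketch assumes the letters are upper case and that any evidence tag has already been removed from the value.

```python
import re

# Stable part: 3 letters + 5 digits; version: integer after the decimal point.
_PROTEIN_ID = re.compile(r"(?P<stable>[A-Z]{3}\d{5})\.(?P<version>\d+)")


def parse_protein_id(value):
    """Return (stable_id, version) for a value such as 'AAA12345.1'."""
    match = _PROTEIN_ID.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised /protein_id value: {value!r}")
    return match.group("stable"), int(match.group("version"))
```

Comparing only the stable part of two identifiers shows whether they refer to the same protein record; comparing the versions shows whether the encoded sequence has changed.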
Qualifier /proviral
Definition if the sequence shown is viral and integrated into another
organism's genome, this qualifier is used to denote that
Value format none
Example /proviral
Qualifier /pseudo
Definition indicates that this feature is a non-functional
version of the element named by the feature key
Example /pseudo
Comment in Genome Reviews, the pseudo qualifier is used to
indicate that a CDS is non-coding (a pseudogene).
Data source EMBL or UniProtKB
Qualifier /replace=
Definition indicates that the sequence identified in a feature's intervals
is replaced by the sequence shown in "text"; if no
sequence is contained within the qualifier, this indicates a
deletion.
Value format "text"
Example /replace="a"
/replace=""
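The effect of /replace on a sequence can be sketched as below. The function apply_replace is a hypothetical helper, and the sketch assumes a single contiguous interval given as 1-based, inclusive start and end positions on the forward strand.

```python
def apply_replace(sequence, start, end, replacement):
    """Replace sequence[start..end] (1-based, inclusive) with replacement.

    An empty replacement deletes the interval, as described in the
    definition of /replace.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("interval lies outside the sequence")
    return sequence[: start - 1] + replacement + sequence[end:]
```

With the sequence "acgt", replacing position 2 with "a" gives "aagt", and replacing positions 2 to 3 with the empty string gives "at".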
Qualifier /segment=
Definition name of viral or phage segment sequenced
Value format "text"
Example /segment="M"
Qualifier /serogroup=
Definition serological variety of a species
characterised by its antigenic properties; a variety of
different serovars may belong to a single serogroup;
Value format "text"
Example /serogroup="B"
Comment used only with the source feature key;
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project and is
combined with the original taxonomic annotation in the
parent EMBL record;
Qualifier /serovar=
Definition serological variety of a species
characterised by its antigenic properties; "serotype" is a
synonym of "serovar" but serovar is the correct term;
Value format "text"
Example /serovar="3"
Comment used only with the source feature key;
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project and is
combined with the original taxonomic annotation in the
parent EMBL record.
Qualifier /strain=
Definition strain from which sequence was obtained;
Value format "text"
Example /strain="NCTC 11168"
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project;
Qualifier /sub_species=
Definition name of sub-species of organism from which sequence was
obtained;
Value format "text"
Example /sub_species="Acyrthosiphon pisum"
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project and is
combined with the original taxonomic annotation in the
parent EMBL record;
Qualifier /sub_strain=
Definition sub-strain from which sequence was obtained;
Value format "text"
Example /sub_strain="abis"
Data source a full taxonomic definition for each species in Genome
Reviews is imported from the HAMAP project and is
combined with the original taxonomic annotation in the
parent EMBL record;
Qualifier /translation=
Definition one-letter abbreviated amino acid sequence
derived from either the universal
genetic code or the table as specified in
/transl_table and as determined by exceptions in the
/transl_except and /codon qualifiers;
Value format IUPAC one-letter amino acid abbreviation, "X" is to be
used for AA exceptions;
Example /translation="MASTFPPWYRGCASTPSLKGLIMCTW"
Comment to be used with CDS feature only; this is a mandatory
qualifier to the CDS feature key except for /pseudo
CDSs; see /transl_table for definition and location of
genetic code Tables; /translation is only included
for CDSs with valid translations (i.e. not pseudogenes;
usage is exclusive with /pseudo);
Data source at present, the translation is always imported from
the parent EMBL record of each Genome Reviews entry;
Qualifier /transl_except=
Definition translational exception: single codon the translation of which
does not conform to genetic code defined by Organism and /codon=
Value format "(pos:location,aa:<amino_acid>)" where amino_acid is the
amino acid coded by the codon at the base_range position
Example /transl_except="(pos:213..215,aa:Trp)"
/transl_except="(pos:1017,aa:TERM)"
/transl_except="(pos:2000..2001,aa:TERM)"
/transl_except="(pos:X22222:15..17,aa:Ala)"
Comment if the amino acid is not on the restricted vocabulary list use
e.g., '/transl_except="(pos:213..215,aa:OTHER)"' with
'/note="name of unusual amino acid"';
for modified amino-acid selenocysteine use three letter code
'Sec' (one letter code 'U' in amino-acid sequence)
/transl_except="(pos:1002..1004,aa:Sec)";
for partial termination codons where TAA stop codon is
completed by the addition of 3' A residues to the mRNA
either a single base_position or a base_range is used, e.g.
if partial stop codon is a single base:
/transl_except="(pos:1017,aa:TERM)"
if partial stop codon consists of two bases:
/transl_except="(pos:2000..2001,aa:TERM)".
Data source translation exceptions are either imported from the parent EMBL
record of each Genome Reviews entry, or derived from a mapping
process applied when comparing reference protein sequences to
the genomic DNA sequence.
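The four example values above cover three location forms: a base range, a single base position, and a range prefixed by an accession number. The sketch below parses all three. The helper name parse_transl_except is hypothetical, and the set of characters allowed in the accession prefix is an assumption.

```python
import re

# "(pos:[accession:]start[..end],aa:<amino_acid>)"
_TRANSL_EXCEPT = re.compile(
    r"\(pos:(?:(?P<accession>[A-Za-z0-9_.]+):)?"
    r"(?P<start>\d+)(?:\.\.(?P<end>\d+))?,aa:(?P<aa>[A-Za-z]+)\)"
)


def parse_transl_except(value):
    """Return (accession, start, end, amino_acid).

    accession is None for positions in the entry itself; end equals start
    when a single base position is given (a partial stop codon).
    """
    match = _TRANSL_EXCEPT.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised /transl_except value: {value!r}")
    start = int(match.group("start"))
    end = int(match.group("end")) if match.group("end") else start
    return match.group("accession"), start, end, match.group("aa")
```

A span shorter than three bases combined with aa:TERM corresponds to the partial termination codons described in the comment.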
Qualifier /transl_table=
Definition definition of genetic code table used if other than
universal genetic code table. Tables used are
described in appendix V of the EMBL feature table
document, section 7.5.5 (URL:
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html);
Value format integer
Example /transl_table=4
Comment genetic code exceptions outside range of specified
tables are reported in /codon or /transl_except qualifiers;
1=universal table 1; 2=non-universal
table 2; etc.; /transl_table is only included for CDSs
with valid translations (i.e. not pseudogenes; usage is
exclusive with /pseudo);
Data source the parent EMBL record of each Genome Reviews entry;
Qualifier /variety=
Definition variety (= varietas, a formal Linnaean rank) of organism
from which sequence was derived.
Value format "text"
Example /variety="neoformans"
8.3.2 Feature qualifiers - mapped to Feature keys
The following is a list of available qualifiers mapped to the list of feature keys on which each qualifier is legal.
/anticodon tRNA
/biological_process CDS
/biovar source
/cellular_component CDS
/chromosome source
/citation conflict
/cultivar source
/db_xref CDS
/db_xref mat_peptide
/db_xref pro_peptide
/db_xref peptide
/db_xref sig_peptide
/db_xref source
/db_xref transit_peptide
/EC_number CDS
/EC_number mat_peptide
/EC_number peptide
/evidence mat_peptide
/evidence pro_peptide
/evidence peptide
/evidence sig_peptide
/evidence transit_peptide
/focus source
/function CDS
/gene_id CDS
/gene_name CDS
/gene_name mat_peptide
/gene_name pro_peptide
/gene_name peptide
/gene_name sig_peptide
/gene_name transit_peptide
/gene_name rRNA
/gene_name tRNA
/gene_name tmRNA
/gene_name ncRNA
/gene_synonym CDS
/host source
/host_range source
/locus_tag CDS
/locus_tag mat_peptide
/locus_tag pro_peptide
/locus_tag peptide
/locus_tag rRNA
/locus_tag sig_peptide
/locus_tag transit_peptide
/locus_tag tRNA
/mol_type source
/ncRNA_class ncRNA
/note CDS
/note gap
/orf_name CDS
/organism source
/operon_name CDS
/operon_name prim_transcript
/operon_name operon
/pathovar source
/plasmid source
/product CDS
/product mat_peptide
/product pro_peptide
/product peptide
/product sig_peptide
/product transit_peptide
/product_synonym CDS
/product_synonym mat_peptide
/product_synonym pro_peptide
/product_synonym peptide
/product_synonym sig_peptide
/product_synonym transit_peptide
/promoter prim_transcript
/protein_id CDS
/proviral source
/pseudo CDS
/pseudo tRNA
/replace conflict
/segment source
/serogroup source
/serovar source
/strain source
/sub_species source
/sub_strain source
/transl_table CDS
/translation CDS
/variety source
8.4 APPENDIX IV. Full list of all evidence tags currently in use in Genome Reviews.
Tag BLASTALL 2.2.6/ALIGN 2.0u
Comment In CDS features added after comparing sequences from
the UniProt Knowledgebase to Genome Reviews DNA sequence,
applied to the cross-reference to the UniProtKB entry.
Tag tRNAScan-SE-1.23
Comment Applied to qualifiers of tRNA features added after running this program to
predict tRNA-encoding genes for records where these are not
available from the primary sequence source.
Tag Rfam-8.1
Comment Applied to qualifiers of all non-protein-coding RNA features,
other than tRNA and rRNA genes and to all RNA motif features
added after running the rfam_scan.pl program to predict
non-protein-coding RNA genes and RNA motifs for
records where these are not available from the primary sequence source.
Tag RNAmmer-1.2
Comment Applied to qualifiers of rRNA genes added after running
RNAmmer version 1.2 to predict ribosomal RNA genes in
records where these are not available from the primary sequence source.
Tag EMBL:accession_number
Comment Applied to data automatically imported into Genome Reviews
from a source EMBL entry.
Tag GOA:accession_number
Comment GOA is a database of associations between terms in the
Gene Ontology controlled vocabularies and records in the
UniProt Knowledgebase. Annotations are retrieved through
mapping via cross references to UniProtKB present in
CDS features in the parent EMBL entry.
Tag GO:id
Comment Applied to data that follows from the mapping made between
a feature and a particular GO term via GOA.
Tag MUMDB:id
Comment Applied to data that is imported for a feature using its
identifier in the MUMDB database.
Tag RefSeq:id
Comment Applied to data that is imported for a feature using its
identifier in the RefSeq database.
Tag SGD:id
Comment Applied to data that is imported for a feature using its
identifier in the SGD database.
Tag SGD genome:id
Comment Applied to data automatically imported into Genome Reviews
from a source SGD entry.
Tag TAIR:id
Comment Applied to data that is imported for a feature using its
identifier in the TAIR database.
Tag TAIR release: release_number
Comment Applied to data automatically imported into Genome Reviews
from a given release of TAIR.
Tag UniProtKB/Swiss-Prot:accession_number
Comment Applied to data that is imported for a feature using its
identifier in the UniProtKB/Swiss-Prot database.
Tag UniProtKB/TrEMBL:accession_number
Comment Applied to data that is imported for a feature using its
identifier in the UniProtKB/TrEMBL database.
Tag UniParc:protein_id
Comment UniParc is a database of associations between protein
sequences (identified by the use of a UniParc
identifier) and records in external databases
(including CDSs in EMBL records, identified by the use of
the protein identifier, which can be used to retrieve
UniParc IDs for each feature).
The presence of an exclamation mark (!) before the database
identifier indicates that a deduction has been made from the absence of
this identifier from the database in question.
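The tags listed in this appendix appear in braces at the end of a qualifier value. The sketch below separates a value from its tag and then splits the tag into a source and an identifier. Both function names are hypothetical. The exact position of the exclamation mark is not spelled out above, so the sketch assumes it may precede either the source name or the identifier and treats both cases as "absent". Tags without a colon (for example program names) are returned with no identifier.

```python
import re

# An evidence tag is assumed to be the final brace-enclosed part of a value.
_TAG = re.compile(r"\{([^{}]*)\}\s*$")


def split_evidence(value):
    """Split 'text {tag}' into (text, tag); tag is None if no braces are found."""
    match = _TAG.search(value)
    if match is None:
        return value.strip(), None
    return value[: match.start()].strip(), match.group(1).strip()


def parse_tag(tag):
    """Return (source, identifier, absent) for one evidence tag."""
    source, separator, identifier = tag.partition(":")
    source, identifier = source.strip(), identifier.strip()
    # Assumption: '!' may appear before the source or before the identifier.
    absent = source.startswith("!") or identifier.startswith("!")
    source = source.lstrip("!")
    identifier = identifier.lstrip("!") if separator else None
    return source, identifier or None, absent
```

For a feature-level /evidence qualifier the text part returned by split_evidence is empty, since the value consists of the tag alone.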
8.5 APPENDIX V: List of cross-references currently included in Genome Reviews