Genome Reviews User Manual


User Manual Release 4.2 December 2008

EMBL Outstation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: +44-1223-494400
Telefax : +44-1223-494468

Electronic mail: support@ebi.ac.uk
URL: http://www.ebi.ac.uk/GenomeReviews

This manual and the database it accompanies may be copied and redistributed freely,
without advance permission, provided that this statement is reproduced with each copy.

1) INTRODUCTION

Genome Reviews contains information about complete DNA molecules (chromosomes and plasmids), genes, transcripts and proteins, for complete genomes from bacteria, bacteriophage and selected eukaryota.

Genome Reviews records are normally constructed by modifying the sequence and annotation of an entry deposited in the EMBL/GenBank/DDBJ sequence repository, using data imported from other resources or calculated by sequence analysis. However, for some species, an alternative database (for example, a model organism database) may be used as the primary source of the sequence.

Molecule, gene and transcript entities are assigned stable identifiers that are maintained between releases, and are mapped to the UniProt Knowledgebase identifiers that describe the corresponding protein products. More details about the Genome Reviews gene and transcript records can be found in sections 5 and 6 of this user manual, respectively.

Data files describing complete DNA molecules, or the sets of records comprising the genes and transcripts derived from each molecule (and from each complete genome), can be downloaded for all genomes in Genome Reviews. All data is available in FASTA format, and additionally in richer file formats containing more detailed, structured annotation: complete molecule records in Genome Reviews EMBL-like format; gene and transcript records in Genome Reviews EMBL CDS-like format. A complete description of EMBL format is available in the EMBL user manual (URL: http://www.ebi.ac.uk/embl), to which this document serves as a supplement. Where appropriate, reference is made to that document in this one. A description of the EMBL CDS format can be found here (URL: ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/README.txt).

Complete genome records are available for download from the Genome Reviews FTP site; gene, transcript and protein records are available from the Integr8 FTP site.

For searching Genome Reviews, an Ensembl-style browser is available, which provides zoomable graphical views of all chromosomes and plasmids represented in the database (see section 7.2). For information about the MySQL relational dump, see section 7.3.

This document describes mainly the (small) changes in format between the Genome Reviews and EMBL formats, and the procedure by which the annotation in the Genome Reviews database is derived from primary data sources. The main body of this User Manual describes the features of the database and file format which will remain stable. Information which applies specifically to the current release of the database is presented in the Release Notes. The Release Notes also describe changes which are foreseen in future releases.

It is likely that the need to represent new kinds of information in the database will necessitate changes or additions to the presentation of data. Such changes will be made as far as possible in ways which have minimal impact on user programs and procedures, and which maximise the compliance of Genome Reviews files with EMBL format.

1.1 Citation

Users of Genome Reviews should cite the following publication:

Kersey P., Bower L., Morris L., Horne A., Petryszak R., Kanz C., Kanapin A., Das U., Michoud K., Phan I., Gattiker A., Kulikova T., Faruque N., Duggan K., McLaren P., Reimholz B., Duret L., Penel S., Reuter I., Apweiler R.

Integr8 and Genome Reviews: integrated views of complete genomes and proteomes.

Nucleic Acids Research Jan 1; 33 (Database Issue): D297-D302 (2005).

1.2 Mailing list

Users who wish to be kept informed about changes and new developments should subscribe to the Genome Reviews mailing list genomereviews-announce@ebi.ac.uk at URL: http://listserver.ebi.ac.uk/mailman/listinfo/genomereviews-announce. Previous postings to the mailing list can be viewed through the link Genomereviews-announce Archives on the same page.





2) CONVENTIONS USED IN THE DATABASE



This section describes the general conventions which have been applied to the information in the database in order to achieve uniformity of presentation.

Specific abbreviations and symbol usage are summarised in the appendices.

2.1 Sequence Data

The same conventions apply as in the EMBL Nucleotide Sequence Database.

2.2 Organism Identification and Classification

The unified taxonomy used by the collaborating databases DDBJ/EMBL/GenBank is re-used in Genome Reviews. The taxonomic information relevant to the entry is described in the OS and OC lines of the entry, and the primary source feature (which describes the origin of the sequence). However, alternative names for individual taxonomic nodes may be used, according to the conventions used in the HAMAP project (see URL: http://www.expasy.org/sprot/hamap/ for further details). Also, some further standardisation is applied, with the node descriptors 'biotype' and 'serotype' being replaced by their synonyms 'biovar' and 'serovar'. Taxonomic information appears in three places in each Genome Reviews file:

  • In the primary source feature of Genome Reviews, the different levels of taxonomic classification ("organism" (= genus + species), sub_species, strain, substrain, serogroup, biovar, cultivar, pathovar, serovar) are presented in a regular manner using separate feature qualifiers. The source feature also contains a cross-reference to the unified taxonomy database, via an identifier that corresponds to the taxonomic node described collectively by all the other relevant qualifiers of the source feature.
  • The OS line of each entry combines all the information in the relevant qualifiers of the source feature to present a full descriptive name for each organism.
  • The OC line describes the lineage of that organism in the taxonomic tree, describing the organism's parent nodes down to the level of genus.


For example, in entry AE009952_GR.dat, the organism was identified as follows:


OS   Yersinia pestis (biovar Mediaevalis, strain KIM5)
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Yersinia
XX
FT   source          1..4600755
FT                   /chromosome="Chromosome"
FT                   /organism="Yersinia pestis"
FT                   /biovar="Mediaevalis"
FT                   /strain="KIM5"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:187410"

The full name is represented in the OS line; the genus and species are given as the value associated with the "organism" qualifier of the source feature; the strain and biovar are represented in separate qualifiers; and the cross-reference identifies the most precise taxonomic node available to describe this organism based on all the above information.
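
The relationship between the source feature qualifiers and the OS line can be sketched in a few lines of Python. This is only an illustration and not the procedure used to build Genome Reviews: the function names are hypothetical, only single-line qualifiers are handled, and the rule used to order the qualifiers inside the parentheses (the order in which they appear in the source feature) is an assumption that happens to reproduce the example above.

```python
import re

# Taxonomic qualifiers named in section 2.2 (in addition to /organism).
TAXONOMIC_QUALIFIERS = (
    "sub_species", "strain", "substrain", "serogroup",
    "biovar", "cultivar", "pathovar", "serovar",
)

# Matches a single-line qualifier such as: FT   /strain="KIM5"
QUALIFIER_RE = re.compile(r'^FT\s+/(\w+)="([^"]*)"\s*$')


def source_qualifiers(lines):
    """Collect /name="value" qualifiers from single-line FT qualifier lines."""
    found = {}
    for line in lines:
        match = QUALIFIER_RE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


def descriptive_name(qualifiers):
    """Assemble an OS-style name (the ordering rule is an assumption)."""
    details = [
        f"{name} {value}"
        for name, value in qualifiers.items()
        if name in TAXONOMIC_QUALIFIERS
    ]
    organism = qualifiers["organism"]
    return f"{organism} ({', '.join(details)})" if details else organism


example = [
    'FT   source          1..4600755',
    'FT                   /chromosome="Chromosome"',
    'FT                   /organism="Yersinia pestis"',
    'FT                   /biovar="Mediaevalis"',
    'FT                   /strain="KIM5"',
    'FT                   /mol_type="genomic DNA"',
    'FT                   /db_xref="taxon:187410"',
]

print(descriptive_name(source_qualifiers(example)))
```

For the qualifiers shown, the sketch prints the same text as the OS line of the example entry.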

Note that an entry may have more than one source feature: in this case, the primary source feature is distinguished by use of the /focus feature qualifier. Secondary source features may describe insertion sequences within the main sequence. Data associated with taxonomic feature qualifiers in secondary source features is not changed in Genome Reviews (compared to the original submission).

2.3 Literature References

Literature references are presented for each entry. If a sequence has been submitted to EMBL/GenBank/DDBJ prior to publication, the submission itself is acknowledged. When a paper is subsequently published, such acknowledgements are usually removed.

2.4 Evidence tags

The most significant change to EMBL format in Genome Reviews concerns the introduction of evidence tags. Evidence tags describe the source of information that has been imported into Genome Reviews files. EMBL format supports only the attachment of evidence to features (through the use of the /evidence feature qualifier), but not to feature qualifiers themselves; hence the need for a mechanism to describe the evidence for the attachment of an individual qualifier to a feature. This new tagging format has also been applied within the existing /evidence qualifier, where it describes the source of the information that has led to the inclusion of an additional feature in an entry.

Evidence tags have been applied to feature qualifiers and features. The use of evidence tags may be extended to other data items within the entry at a later date. Additional information about how data is imported into Genome Reviews files can be found in Appendix III of this document.

2.4.1. Evidence Tag Terms and Concepts

A Genome Reviews file is usually derived from a primary source entry in the EMBL Nucleotide Sequence Database, although sometimes it may be derived from an alternative source (see section 4.1 for more information). The identity of the source entry is given in the CC lines of the corresponding Genome Reviews entry (see section 3.4.15). The identity of the source database entry is also indicated by the Genome Reviews entry name and accession number (which are derived from the accession number of the parent entry: see sections 3.4.1 and 3.4.2).

However, a Genome Reviews file may also contain data imported from other sources. This is possible because:



  • a cross-reference from the original parent entry to another data resource identifies an element of that resource that describes the same biological entity as a biological entity described in (an element of) the parent entry; and data can thereby be imported from (that element of) the external resource into the Genome Reviews entry.
  • (an element of) another data resource, describing a biological entity, contains a cross-reference to (an element of) the parent entry; and data can thereby be imported from that resource into the Genome Reviews entry.
  • (an element of) another data resource, describing a biological entity, contains a cross-reference to (an element of) the parent entry, and a cross reference to (an element of) a third data resource; and data can thereby be imported from that third resource into the Genome Reviews entry.


and so on.

An "evidence tag", attached to an element of a Genome Reviews entry, provides a pointer to (an element of) an external resource from which the tagged data was derived. A given tagged item may have one or many evidence tags: each tag provides independent evidence for the inclusion of the data item in the Genome Reviews entry.

The third case listed above introduces the concept of "evidence chaining", whereby a series of databases is used to derive evidence that can be added to a Genome Reviews entry. The most obvious real example is that of GO annotations added to CDS features in Genome Reviews records according to the InterPro classification of a protein sequence. These are derived from the existence of cross-references from EMBL features (stored in the EMBL database) to records in the UniProt Knowledgebase (UniProt KB); the existence of cross-references between UniProt and InterPro (stored in the InterPro database); and the existence of cross-references between InterPro and GO (stored in the GOA database). Taking this information together allows terms from the GO controlled vocabulary to be propagated to Genome Reviews. In the case of Genome Reviews records where the primary source data does not come from EMBL, an extra step in the mapping procedure may be necessary (see section 4.2 for details).
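
The idea of following a chain of cross-references can be sketched as a series of table look-ups. The Python below is a toy illustration only, not the Genome Reviews pipeline: the function name is hypothetical, and every identifier in the three tables is a made-up placeholder rather than a real database entry.

```python
# Toy cross-reference tables. Every identifier is a placeholder invented
# for this sketch; none is taken from a real database.
EMBL_TO_UNIPROT = {"PROTEIN_ID_1": ["UNIPROT_1"]}
UNIPROT_TO_INTERPRO = {"UNIPROT_1": ["IPR_1", "IPR_2"]}
INTERPRO_TO_GO = {"IPR_1": ["GO_TERM_1"], "IPR_2": ["GO_TERM_1", "GO_TERM_2"]}


def follow_chain(start, *tables):
    """Follow cross-references through each table in turn.

    Returns the identifiers reached in the last table, without
    duplicates, in the order in which they were first reached.
    """
    current = [start]
    for table in tables:
        following = []
        for identifier in current:
            for target in table.get(identifier, []):
                if target not in following:
                    following.append(target)
        current = following
    return current


go_terms = follow_chain(
    "PROTEIN_ID_1", EMBL_TO_UNIPROT, UNIPROT_TO_INTERPRO, INTERPRO_TO_GO
)
print(go_terms)
```

An identifier that is missing from any table along the way yields an empty result, which corresponds to the case in which no annotation can be propagated.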

The evidence tag model used for Genome Reviews does not support evidence chaining. A flat file distribution format is not suitable for describing complex chains of inference, whose structure may vary according to the data source. In Genome Reviews, for each chain of inference a single source is always identified as the most appropriate to present in the corresponding evidence tag; reference to this source should provide further information to reveal the complete chain of inference. More details about the form of the chain of inference implied by evidence tags in individual contexts can be found in Appendix IV of this document.

An evidence tag may contain a reference to the particular element of the resource from which the tagged data item was derived, as well as to the resource itself. Typically this element takes the form of the identifier of an "entry" in the resource. In some cases, the concept of "entry" is not applicable in a particular resource, or the inclusion of entry-level information would be redundant, or the resource itself is a secondary repository of data and it is desirable to propagate the evidence presented in that resource to Genome Reviews. Some examples are given in the following section. A full description of what values are associated with each database that can be mentioned in a tag is given in Appendix III of this document.

2.4.2. Format of Evidence Tags

Here are some examples of evidence tags. As explained in the previous section, evidence tags have so far been applied only to feature qualifiers and features. A tag applying to a qualifier is incorporated in that qualifier and provides evidence for the addition of this qualifier to this feature. A tag applied to a feature is presented as the value of an additional '/evidence' qualifier added to the feature and provides the evidence for the inclusion of the feature in the entry. For a full description of how evidence tags are added to feature qualifier lines, see section 3.4.16. For the purposes of this section, consider only the tags themselves. All tags applied to a single data item are listed between a single pair of curly braces, separated (where necessary) by a semi-colon and a space.

FT                   /product="Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5}"

FT                   /pseudo="{UniParc:!AAD06288}"

FT                   /evidence="{UniProtKB/TrEMBL:Q9PHK5}"

FT                   /product="Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5;
FT                   UniProtKB/Swiss-Prot:P12345}"

An evidence tag always identifies a source database; and may additionally identify elements of that database specifically linked to the data item. The contents of a tag are determined separately for each database referred to, in order to provide the most relevant and useful information. The contents of each existing type of reference are described in Appendix IV.

In terms of format, there are two possible models for an evidence tag, representing tags in which none, one or many pieces of information from the source database are given. These are illustrated by the first three examples above.

  • Database reference plus single data value, e.g.

    FT                   /product="Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5}"

    In this case, the name of the product encoded by a CDS feature has been imported from an entry in the UniProt database. The identifier of the UniProtKB/TrEMBL entry is given as a single data value after the database name.

  • Database reference plus exclamation mark plus single data value, e.g.

    FT                   /pseudo="{UniParc:!AAD06288}"

    The exclamation mark is used to indicate that information has been inferred through absence rather than presence, i.e. in this case, the absence of the protein_id (AAD06288) in the database UniParc has been used to infer that the qualifier "pseudo" should be added to the feature.

In the second and third examples above:

FT                   /pseudo="{UniParc:!AAD06288}"
FT                   /evidence="{UniProtKB/TrEMBL:Q9PHK5}"

the evidence tag is presented as the sole value associated with a qualifier that contains no intrinsic value of its own.

In the third example above:

FT                   /evidence="{UniProtKB/TrEMBL:Q9PHK5}"

the evidence tag is attached to the "evidence" qualifier, and there is no additional data associated with this qualifier besides the tag. This indicates that this tag provides information about the source of the feature to which this qualifier has been attached. This is the only allowable use of the "evidence" qualifier in Genome Reviews.

In the fourth example above:

FT                   /product="Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5;
FT                   UniProtKB/Swiss-Prot:P12345}"

more than one entry (or chain of records) has been used to infer independently that a particular feature qualifier should be added. Evidence for each independent inference is given in its own tag; tags are separated by a semi-colon and a following space. The list of all tags is collectively surrounded by a pair of curly braces.

A formal description of the format of evidence tags is given in Appendix IV of this document.
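
As an informal illustration of the examples in this section, the Python sketch below separates the intrinsic value of a qualifier from its evidence tags. It is not the formal grammar of Appendix IV: the names EvidenceTag and split_qualifier_value are hypothetical, the input is assumed to be the text found between the double quotes of a qualifier, and splitting each tag at its first colon is an assumption based only on the examples shown here.

```python
import re
from typing import NamedTuple


class EvidenceTag(NamedTuple):
    database: str
    value: str        # empty when the tag names only a database
    by_absence: bool  # True when the value is preceded by "!"


# An optional intrinsic value followed by one brace-delimited tag list.
TAG_BLOCK_RE = re.compile(r"^(?P<value>.*?)\s*\{(?P<tags>[^{}]*)\}\s*$")


def split_qualifier_value(text):
    """Return (intrinsic value, list of EvidenceTag) for a qualifier value."""
    # Rejoin a value that was wrapped over several lines.
    text = " ".join(text.split())
    match = TAG_BLOCK_RE.match(text)
    if match is None:
        return text, []
    tags = []
    for raw in match.group("tags").split(";"):
        raw = raw.strip()
        if not raw:
            continue
        database, _, value = raw.partition(":")
        tags.append(EvidenceTag(database, value.lstrip("!"), value.startswith("!")))
    return match.group("value"), tags


print(split_qualifier_value("Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5}"))
print(split_qualifier_value("{UniParc:!AAD06288}"))
print(split_qualifier_value(
    "Hypothetical protein Xfb0002 {UniProtKB/TrEMBL:Q9PHK5;\n"
    "UniProtKB/Swiss-Prot:P12345}"
))
```

A qualifier such as /pseudo or /evidence, which has no intrinsic value of its own, yields an empty string as its value and the tag list alone.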

2.4.3 Evidence Tag Removal

For users who do not wish to filter information by source, a program is provided with this release to remove evidence tags from Genome Reviews files, resulting in the production of "normal" EMBL format files. This program is written in the Java programming language and will run on any platform on which a Java runtime environment has been installed. Such environments are available free of charge for many platforms (including Microsoft Windows, Mac OS and GNU/Linux) from either Sun Microsystems (URL: http://java.sun.com/j2se/index.html) or your hardware vendor. The tag removal program itself is available:


If you choose to download the tar archive, untar it as follows:

tar -xvf RemoveEvidenceTags.tar

If you choose to download the raw source code, you will need to copy the complete directory structure uk/ac/ebi/genomeReviews/

You will then need to compile the Java class:

javac uk/ac/ebi/genomeReviews/RemoveEvidenceTags.java

Run the compiled code using either:

java -cp . uk/ac/ebi/genomeReviews/RemoveEvidenceTags <directory>

or:

java -cp . uk/ac/ebi/genomeReviews/RemoveEvidenceTags <directory> <file-name>

Alternatively the program can be run from the executable jar
(RemoveEvidenceTags.jar) as follows:

java -jar RemoveEvidenceTags.jar <directory>

or

java -jar RemoveEvidenceTags.jar <directory> <file-name>

where <directory> is the path to the directory where the Genome Reviews files are located, and <file-name> is the name of a Genome Reviews file contained in this directory. If only the single parameter <directory> is used, then the program will remove the evidence tags from ALL Genome Reviews files located in that directory.

Usage information can be generated by typing

java -jar RemoveEvidenceTags.jar

or

java -cp . uk/ac/ebi/genomeReviews/RemoveEvidenceTags






3) FORMAT OF THE DATABASE


3.1 Classes of Data

The class of each entry is indicated on the first (ID) line of the entry. For Genome Reviews, records distributed and made publicly available are of data class 'GRV':

Class  Definition
-----  -----------------------------------------------------------
GRV    Publicly available Genome Reviews Database records.


3.2 Database Divisions

The records which constitute the EMBL Nucleotide Sequence Database are grouped into divisions. The ID line of each entry indicates its division, using three-letter codes. Currently, Genome Reviews records fall into one of four of these divisions:

Code   Division
-----  ----------------------
FUN    Fungi
PHG    Bacteriophage
PLN    Plants
PRO    Prokaryotes


3.3 Structure of an Entry

The structure of a Genome Reviews "component" record (describing a completely sequenced chromosome or plasmid that forms all or part of a completely sequenced genome) mirrors that of a record in the EMBL Nucleotide Sequence Database. The line types of a Genome Reviews record are all legitimate EMBL line types (although some legitimate line types in EMBL have been removed from Genome Reviews files), and, as far as possible, Genome Reviews records use the same features and feature qualifiers. In some cases it has been necessary to add new features or feature qualifiers, or to redefine the meaning of an existing term, in order to support the concepts of Genome Reviews.

Genome Reviews data is also available as "gene sets": sets of gene records each representing one gene present in a Genome Reviews component record. The format for a gene record is based on the format used in the EMBL CDS database (which itself is closely related to the main EMBL record format). Certain line types are missing, and an additional line type, the PA line, is added. A description of the EMBL CDS format can be found here (URL: ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/README.txt).

The following line types do not appear in Genome Reviews records.

DR - database cross-reference
AH - assembly header
AS - assembly information
CO - contig/construct line


AH, AS and CO lines appear in certain types of EMBL records to describe information that is not considered relevant to Genome Reviews records. DR lines are excluded because cross-references against features of the entry are given with greater precision by applying the db_xref qualifier to the feature itself, rather than the whole entry.


ID   AE003850_GR; SV 3; circular; genomic DNA; GRV; PRO; 1286 BP.
XX
AC   AE003850_GR;
XX
DT   18-FEB-2004 (Rel. 0.1, Created)
DT   23-OCT-2007 (Rel. 82, Last updated, Version 55)
XX
DE   Xylella fastidiosa (strain 9a5c) plasmid pXF1.3, complete sequence.
XX
KW   complete genome; genome reviews.
XX
OS   Xylella fastidiosa (strain 9a5c)
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales;
OC   Xanthomonadaceae; Xylella.
OG   Plasmid pXF1.3
XX
RN   [1]
RX   PUBMED; 10910347.
RA   Simpson A.J.G., Reinach F.C., Arruda P., Abreu F.A., Acencio M.,
RA   Alvarenga R., Alves L.M.C., Araya J.E., Baia G.S., Baptista C.S., Barros
RA   M.H., Bonaccorsi E.D., Bordin S., Bove J.M., Briones M.R.S., Bueno M.R.
RA   P., Camargo A.A., Camargo L.E.A., Carraro D.M., Carrer H., Colauto N.B.,
RA   Colombo C., Costa F.F., Costa M.C.R., Costa-Neto C.M., Coutinho L.L.,
RA   Cristofani M., Dias-Neto E., Docena C., El-Dorry H., Facincani A.P.,
RA   Ferreira A.J.S., Ferreira V.C.A., Ferro J.A., Fraga J.S., Franca S.C.,
RA   Franco M.C., Frohme M., Furlan L.R., Garnier M., Goldman G.H., Goldman M.
RA   H.S., Gomes S.L., Gruber A., Ho P.L., Hoheisel J.D., Junqueira M.L.,
RA   Kemper E.L., Kitajima J.P., Krieger J.E., Kuramae E.E., Laigret F.,
RA   Lambais M.R., Leite L.C.C., Lemos E.G.M., Lemos M.V.F., Lopes S.A., Lopes
RA   C.R., Machado J.A., Machado M.A., Madeira A.M.B.N., Madeira H.M.F.,
RA   Marino C.L., Marques M.V., Martins E.A.L., Martins E.M.F., Matsukuma A.
RA   Y., Menck C.F.M., Miracca E.C., Miyaki C.Y., Monteiro-Vitorello C.B.,
RA   Moon D.H., Nagai M.A., Nascimento A.L.T.O., Netto L.E.S., Nhani A. Jr.,
RA   Nobrega F.G., Nunes L.R., Oliveira M.A., de Oliveira M.C., de Oliveira R.
RA   C., Palmieri D.A., Paris A., Peixoto B.R., Pereira G.A.G., Pereira H.A.
RA   Jr., Pesquero J.B., Quaggio R.B., Roberto P.G., Rodrigues V., de Rosa A.J.
RA   M., de Rosa V.E. Jr., de Sa R.G., Santelli R.V., Sawasaki H.E., da Silva
RA   A.C.R., da Silva A.M., da Silva F.R., Silva W.A. Jr., da Silveira J.F.,
RA   Silvestri M.L.Z., Siqueira W.J., de Souza A.A., de Souza A.P., Terenzi M.
RA   F., Truffi D., Tsai S.M., Tsuhako M.H., Vallada H., Van Sluys M.A.,
RA   Verjovski-Almeida S., Vettore A.L., Zago M.A., Zatz M., Meidanis J.,
RA   Setubal J.C.
RT   "The genome sequence of the plant pathogen Xylella fastidiosa.";
RL   Nature 406:151-159(2000).
XX
CC   This Genome Reviews entry was created from entry AE003850.3 in the
CC   EMBL/Genbank/DDBJ databases on 23 October 2007.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1286
FT                   /organism="Xylella fastidiosa"
FT                   /strain="9a5c"
FT                   /mol_type="genomic DNA"
FT                   /plasmid="Plasmid pXF1.3"
FT                   /db_xref="taxon:160492"
FT   CDS             complement(19..870)
FT                   /codon_start=1
FT                   /evidence="4: Predicted {UniProtKB/Swiss-Prot:Q9PHK6}"
FT                   /gene_id="IGI00144239"
FT                   /locus_tag="XF_b0001 {UniProtKB/Swiss-Prot:Q9PHK6}"
FT                   /product="Putative replication protein XF_b0001
FT                   {UniProtKB/Swiss-Prot:Q9PHK6}"
FT                   /biological_process="DNA replication {GO:0006260}"
FT                   /protein_id="AAF85568.1 {EMBL:AE003850}"
FT                   /db_xref="GO:0006260 {GOA:Q9PHK6}"
FT                   /db_xref="HOGENOM:HBG539871 {HogenProt:Q9PHK6}"
FT                   /db_xref="UniParc:UPI00000C223E {EMBL:AAF85568}"
FT                   /db_xref="UniProtKB/Swiss-Prot:Q9PHK6 {EMBL:AE003850}"
FT                   /transl_table=11
FT                   /translation="MPVITVYRHGGKGGVAPMNSSHIRTPRGEVQGWSPGAVRRNTEFL
FT                   MSVREDQLTGAGLALTLTVRDCPPTAQEWQKIRRAWEARMRRAGMIRVHWVTEWQRRGV
FT                   PHLHCAIWFSGTVYDVLLCVDAWLAVASSCGAGLRGQHGRIIDGVVGWFQYVSKHAARG
FT                   VRHYQRCSENLPEGWKGLTGRVWGKGGYWPVSDALRIDLQDHRERGDGGYFAYRRLVRS
FT                   WRVSDARSSGDRYRLRSARRMLTCSDTSRSRAIGFMEWVPLEVMLAFCANLAGRGYSVT
FT                   SE"
FT   CDS             complement(922..1176)
FT                   /codon_start=1
FT                   /evidence="4: Predicted {UniProtKB/TrEMBL:Q9PHK5}"
FT                   /gene_id="IGI00721559"
FT                   /locus_tag="XF_b0002 {UniProtKB/TrEMBL:Q9PHK5}"
FT                   /product="Putative uncharacterized protein
FT                   {UniProtKB/TrEMBL:Q9PHK5}"
FT                   /protein_id="AAF85569.1 {EMBL:AE003850}"
FT                   /db_xref="HOGENOM:HBG539872 {HogenProt:Q9PHK5}"
FT                   /db_xref="UniParc:UPI00000C223F {EMBL:AAF85569}"
FT                   /db_xref="UniProtKB/TrEMBL:Q9PHK5 {EMBL:AE003850}"
FT                   /transl_table=11
FT                   /translation="MSPGWKSRFPVVSNWMTFSSTPLDLTSSDCLTNCHWGIEISPAFE
FT                   RAKRVRNGGDFHPPLISGYISVRFVKVGGFSSFLHAAKK"
XX
SQ   Sequence 1286 BP; 351 A; 400 C; 314 G; 220 T; 1 other;
     ggtacccccc acacccccct actcgctcgt aactgagtac ccacgaccgg ctaggttcgc        60
     gcaaaaggcc aacatgacct ctaggggaac ccactccatg aagccaatgg cacgagaacg       120
     ggaggtatcg ctacaggtga gcatcctacg agcactacgg agccgataac gatcacccga       180
     gctgcgagcg tctgagacgc gccaggagcg caccaaacgg cgataagcga aatacccccc       240
     atcaccacgc tcacgatgat cctgtagatc gatacgaagg gcatcagaca caggccaata       300
     gccaccctta ccccaaacac ggcccgtaag ccctttccag ccttcaggga gattctcaga       360
     acaacgctgg taatggcgca cgcctcgggc ggcgtgcttg ctcacgtact gaaaccatcc       420
     gacaacccca tcaataatcc gaccatgctg cccacgcaga ccagcaccac aggaggacgc       480
     aacagccaac cacgcatcga cgcatagaag cacatcgtaa acagtgccag aaaaccagat       540
     agcacaatgc aaatgcggga cacctcgacg ctgccactcc gtcacccagt gaaccctgat       600
     cataccagca cgcctcatgc gagcttccca cgcacgcctg attttctgcc actcctgagc       660
     agtaggaggg caatcacgaa cggtaagggt caaagcgaga ccagcgcccg ttaactgatc       720
     ctcacgaacg gacatgagga actctgtatt gcgacggaca gccccaggag accacccctg       780
     aacctcgcct cgtggcgtcc tgatatgtga tgagttcatg ggagcaacac caccttttcc       840
     cccatgacgg taaactgtaa ttactggcat cggcctctcc gatagctggt cacgaccccg       900
     ggtgctcgta acaccgcggg gttatttttt tgccgcatgc aggaaggagg aaaaaccccc       960
     aaccttaaca aaacgtacag atatgtaacc actaatcaag ggaggatgga aatccccccc      1020
     gtttcgcact cgcttcgctc gctcaaaagc gggggagatt tctattcccc aatgacaatt      1080
     tgtcaagcaa tcacttgacg ttaaatccaa gggggttgaa ctgaatgtca tccaattgga      1140
     gaccactgga aacctagatt tccacccagg ggacacaggg cgtaaaaacg gttatccgtg      1200
     aaatagatca gggcttcgtg ttgggggtca tttggccccc acataacgga ccgaaggaga      1260
     gggcgtaaaa gcgcctccgc aggggn                                           1286
//


Figure 1 - A sample record from the database


ID   IGI00270102; SV 1; linear; genomic DNA; GRV; PRO; 207 BP.
XX
PA   AP001918_GR.1
XX
DE   srnB
XX
OS   Escherichia coli (strain K12)
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.
OG   Plasmid F
OX   NCBI_TaxID=83333;
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..207
FT                   /organism="Escherichia coli"
FT                   /strain="K12"
FT                   /mol_type="genomic DNA"
FT                   /plasmid="Plasmid F"
FT                   /db_xref="taxon:83333"
FT   CDS             1..207
FT                   /codon_start=1
FT                   /gene_name="srnB {UniProtKB/Swiss-Prot:P13970}"
FT                   /locus_tag="ECOK12F004 {UniProtKB/Swiss-Prot:P13970}"
FT                   /product="Protein srnB {UniProtKB/Swiss-Prot:P13970}"
FT                   /cellular_component="integral to membrane {GO:0016021}"
FT                   /protein_id="BAA97874.1 {EMBL:AP001918}"
FT                   /db_xref="EMBL:AAA98078.1 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="EMBL:AAA99006.1 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="EMBL:CAA32614.1 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="EcoGene:EG40018 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="GO:0016021 {GOA:P13970}"
FT                   /db_xref="HOGENOM:HBG270039 {HogenProt:P13970}"
FT                   /db_xref="InterPro:IPR000021 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="UniParc:UPI0000135F51 {EMBL:BAA97874}"
FT                   /db_xref="UniProtKB/Swiss-Prot:P13970 {EMBL:AP001918}"
FT                   /transl_table=11
FT                   /translation="MKYLNTTDCSLFLAERSKFMTKYALIGLLAVCATVLCFSLIFRER
FT                   LCELNIHRGNTVVQVTLAYEARK"
FT   CDS             58..207
FT                   /codon_start=1
FT                   /gene_name="srnB {UniProtKB/Swiss-Prot:P13970}"
FT                   /locus_tag="ECOK12F005 {UniProtKB/Swiss-Prot:P13970}"
FT                   /product="Protein srnB {UniProtKB/Swiss-Prot:P13970}"
FT                   /cellular_component="integral to membrane {GO:0016021}"
FT                   /protein_id="BAA97875.1 {EMBL:AP001918}"
FT                   /db_xref="EMBL:AAA98078.1 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="EMBL:AAA99006.1 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="EMBL:CAA32614.1 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="EcoGene:EG40018 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="GO:0016021 {GOA:P13970}"
FT                   /db_xref="HOGENOM:HBG270039 {HogenProt:P13970}"
FT                   /db_xref="InterPro:IPR000021 {UniProtKB/Swiss-Prot:P13970}"
FT                   /db_xref="UniParc:UPI0000161C9D {EMBL:BAA97875}"
FT                   /db_xref="UniProtKB/Swiss-Prot:P13970 {EMBL:AP001918}"
FT                   /transl_table=11
FT                   /translation="MTKYALIGLLAVCATVLCFSLIFRERLCELNIHRGNTVVQVTLAY
FT                   EARK"
XX
SQ   Sequence 207 BP; 53 A; 40 C; 55 G; 59 T; 0 other;
     atgaagtacc ttaacactac tgattgtagc ctcttccttg cagagaggtc aaagtttatg        60
     acgaaatatg cccttatcgg gttgctcgcc gtgtgcgcta cggtgttgtg tttttcactg       120
     atattcaggg aacggttatg tgagctgaat attcacaggg gaaatacagt ggtgcaggta       180
     actctggcct acgaagcacg gaagtaa                                           207
//


Figure 2 - A sample gene record from the database
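
Because every line of a record begins with a two-letter line code, a record can be read with very little code. The Python sketch below is one possible approach, not a supported parser: the name read_entries is hypothetical, the sample is an abridged copy of the gene record in Figure 2, and an entry that lacks its terminating "//" line is silently dropped. It also checks that none of the line types listed above as absent from Genome Reviews records (DR, AH, AS, CO) is present.

```python
# Line types that do not appear in Genome Reviews records (section 3.3).
EXCLUDED_CODES = {"DR", "AH", "AS", "CO"}


def read_entries(lines):
    """Yield each entry as a list of (line_code, content) pairs."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):  # entry terminator
            yield entry
            entry = []
        elif line.strip():
            # Sequence data lines begin with blanks, so their code is "".
            entry.append((line[:2].strip(), line[5:]))


# Abridged copy of the gene record shown in Figure 2.
sample = [
    "ID   IGI00270102; SV 1; linear; genomic DNA; GRV; PRO; 207 BP.",
    "XX",
    "PA   AP001918_GR.1",
    "XX",
    "DE   srnB",
    "XX",
    "SQ   Sequence 207 BP; 53 A; 40 C; 55 G; 59 T; 0 other;",
    "     actctggcct acgaagcacg gaagtaa                                           207",
    "//",
]

for entry in read_entries(sample):
    codes = {code for code, _ in entry}
    parents = [content for code, content in entry if code == "PA"]
    print("line codes:", sorted(codes))
    print("PA line   :", parents)
    print("excluded  :", sorted(codes & EXCLUDED_CODES))
```

Passing an open file object instead of the list works in the same way, since the function only iterates over lines.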





3.4 Line Structure

This section describes in detail the use made by Genome Reviews flat files of each line type as defined in the EMBL format.

3.4.1 The ID Line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID entryname; sequence version; topology; molecule; data class; division; sequence length BP.

Entryname: stable identifier, consisting of alphanumeric characters, starting with a letter. All letters should be in upper case. The entryname is provided only for reasons of compatibility with EMBL format and is redundant with the accession number (see section 3.4.2) in all Genome Reviews files.

Sequence version: The second item on the ID line indicates the sequence version, e.g. SV 2. The initial version number assigned to a Genome Reviews entry depends on the primary source of the sequence used in making that entry.

  • If the primary source of a Genome Reviews entry was an entry in the EMBL database, the Genome Reviews version number is determined by the version number of that entry, and is not incremented separately. In the case of EMBL records whose sequence version has already been incremented at the time of their first release in Genome Reviews, the initial Genome Reviews version number will be greater than 1.
  • If the primary source of the Genome Reviews entry was information derived from another database, a different rule is applied. A sequence version of 1 is assigned on creation of the Genome Reviews entry and subsequently incremented as appropriate.

Topology: The third item on the ID line indicates the topology of the sequenced molecule, either 'linear' or 'circular'.

Molecule Type: The fourth item on the line is the type of molecule as stored. In the case of Genome Reviews files, this is always 'genomic DNA'.

Data class: The fifth item on the ID line indicates the data class of the entry, always 'GRV' for Genome Reviews files.

Taxonomic database division: This 3-letter code designates the Genome Reviews taxonomic division of the genome, currently PRO (prokaryotes), PHG (bacteriophage), PLN (plants) or FUN (fungi).

Sequence length: The last item on the ID line is the length of the sequence (the total number of bases in the sequence). This number includes base positions reported as present but undetermined (coded as "n").

An example of a complete identification line is shown below:

ID   AE003850_GR; SV 3; circular; genomic DNA; GRV; PRO; 1286 BP.
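
The seven items of the ID line can be separated mechanically. The following is a minimal Python sketch, not part of the Genome Reviews distribution; the names IdLine and parse_id_line are illustrative assumptions, and the code assumes that items are separated by semicolons and that the line ends with a full stop, as in the example above.

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    entryname: str
    sequence_version: int
    topology: str
    molecule: str
    data_class: str
    division: str
    length: int


def parse_id_line(line: str) -> IdLine:
    """Split a Genome Reviews ID line into its seven items."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the line code and the trailing full stop, then split on ';'.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 items, found {len(fields)}")
    name, sv, topology, molecule, data_class, division, length = fields
    return IdLine(
        entryname=name,
        sequence_version=int(sv.split()[1]),  # "SV 3" -> 3
        topology=topology,
        molecule=molecule,
        data_class=data_class,
        division=division,
        length=int(length.split()[0]),        # "1286 BP" -> 1286
    )


record = parse_id_line(
    "ID   AE003850_GR; SV 3; circular; genomic DNA; GRV; PRO; 1286 BP."
)
```

Here record.sequence_version is the integer 3 and record.length is 1286; the remaining items are kept as text.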

3.4.2 The AC Line

The AC (ACcession number) line lists the accession numbers associated with this entry.

An example of a Genome Reviews accession number line is shown below:

AC   AE003850_GR;

Each accession number is terminated by a semicolon. Where necessary, additional AC lines are used.

The Genome Reviews accession number comprises the characters of the accession number of the EMBL entry from which the Genome Reviews entry is derived suffixed by '_GR'.

In some cases, an EMBL entry that had represented a particular chromosome or plasmid is supplemented by a new submission (with a new accession number) that represents a re-annotation or re-sequencing of the same biological molecule. When this happens, a new Genome Reviews entry will be produced (with an AC based on the new EMBL entry), but the old Genome Reviews accession will be added as a secondary accession number, to indicate that both records describe the same molecule.

In the case of Genome Reviews records derived from alternative data sources, an EMBL accession number is still used as the prefix in the Genome Reviews entry if an EMBL entry exists that describes the same molecule as the Genome Reviews entry. For example, the third chromosome of the budding yeast Saccharomyces cerevisiae is represented in EMBL by the entry whose accession number is X59720, and the accession number of the corresponding Genome Reviews entry is X59720_GR, although this particular Genome Reviews entry uses data from the Saccharomyces Genome Database as its primary data source. chromosomes or plasmids where there is no corresponding EMBL entry

3.4.3 The PA Line

The PA (Parent Accession) line indicates the accession number (and version) of the parent Genome Reviews component record from which a gene record is derived. The format for a PA line is

PA   AP001918.1;
where the accession number is given before the '.' character, and the sequence version of that accession is given afterwards.
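
As an illustration, the accession number and sequence version can be recovered from a PA line as sketched below in Python. The function name parse_pa_line is an assumption made for this example, and the sketch assumes a single parent accession per line, terminated by a semicolon.

```python
from typing import Tuple


def parse_pa_line(line: str) -> Tuple[str, int]:
    """Return (accession, sequence version) from a PA line."""
    if not line.startswith("PA"):
        raise ValueError("not a PA line")
    value = line[2:].strip().rstrip(";")
    # The version follows the last '.' character.
    accession, _, version = value.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"unrecognised PA value: {value!r}")
    return accession, int(version)


parent_accession, parent_version = parse_pa_line("PA   AP001918.1;")
```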

3.4.4 The DT Line

The DT (DaTe) line shows when an entry first appeared in the database and when it was last updated. Each entry contains two DT lines, formatted as follows:

DT   DD-MON-YYYY (Rel. #, Created)
DT   DD-MON-YYYY (Rel. #, Last updated, Version #)

The DT lines from the above example are:

DT   18-FEB-2004 (Rel. 0.1, Created)
DT   26-SEP-2005 (Rel. 36, Last updated, Version 41)
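
The two DT line forms can be read with a single pattern. The sketch below is illustrative only: the names MONTHS, DT_PATTERN and parse_dt_line are assumptions, and the release number is kept as text because pre-release numbers such as 0.1 are not integers.

```python
import re
from datetime import date

MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}

DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) "
    r"\(Rel\. ([\d.]+), (?:Created|Last updated, Version (\d+))\)$"
)


def parse_dt_line(line):
    """Return the date, release and (if present) entry version of a DT line."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised DT line: {line!r}")
    day, month, year, release, version = match.groups()
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "release": release,  # kept as text: pre-releases look like "0.1"
        "entry_version": int(version) if version else None,
    }


created = parse_dt_line("DT   18-FEB-2004 (Rel. 0.1, Created)")
updated = parse_dt_line("DT   26-SEP-2005 (Rel. 36, Last updated, Version 41)")
```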

The second line indicates the last time that the contents of an entry were changed. It also contains the entry version, which is incremented each time that an entry is modified. The rules for the incrementation of entry versions and the updating of DT lines are similar to those applied in the EMBL Nucleotide Sequence Database.

Genome Reviews files are released fortnightly; successive releases are numbered 1, 2, 3, etc. Before the first full release (release 1), a number of pre-releases were made, numbered 0.1, 0.2, 0.3, etc. The release number was incremented directly from 0.6 to 1 for the first full release. Versioning of individual records was unaffected by this change in release numbering.

3.4.5 The DE Line

The DE (Description) lines contain general descriptive information about the sequence stored. The format for a DE line is:

DE   description

In the case of Genome Reviews files, this comprises a full description of the organism sequenced (including genus, species and any relevant sub-levels of classification); a description of the molecule sequenced; and a declaration that the file describes the sequence of that molecule, for example:

DE   Xylella fastidiosa (strain 9a5c) plasmid pXF1.3, complete sequence.

3.4.6 The KW Line

The format for a KW line is:

KW   keyword[; keyword ...].

Genome Reviews files typically contain 2 keywords: "complete genome" and "genome reviews".

3.4.7 The OS Line

The OS (Organism Species) line specifies the preferred scientific name of the organism which was the source of the stored sequence, by giving the Latin genus and species designations, followed by a more specific classification where known. The complete format of the OS line is as follows:

OS   Genus species ([sub-species] [serogroup] [biovar] [pathovar]
     [serovar] [strain] [sub-strain])

All descriptors below the level of species are contained in brackets. Descriptors at different levels are separated by commas; alternative names at a given level are separated by a forward slash surrounded by a space on either side (' / '). An example is given in section 2.2.
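
The bracketing and separator rules can be applied as in the following Python sketch. The function name parse_os_line is an assumption, as are the sample lines: the first reuses an organism name that appears elsewhere in this manual, and the second is a purely hypothetical line with alternative names. An OS value wrapped over several lines would need to be joined into one line first.

```python
def parse_os_line(line):
    """Return (scientific name, descriptor levels) from an OS line.

    Each level is a list holding the alternative names given at that level.
    """
    if not line.startswith("OS"):
        raise ValueError("not an OS line")
    body = line[2:].strip()
    name, _, bracketed = body.partition(" (")
    levels = []
    if bracketed:
        # Remove an optional final full stop and the closing bracket.
        inner = bracketed.rstrip(".").rstrip(")")
        # Commas separate levels; ' / ' separates alternative names.
        levels = [level.split(" / ") for level in inner.split(", ")]
    return name, levels


simple = parse_os_line("OS   Lactococcus lactis (subsp. lactis, strain IL1403)")
alternatives = parse_os_line("OS   Genus species (strain A / B)")  # hypothetical
```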

3.4.8 The OC Line

The OC line describes the taxonomic lineage of the sequenced organism, down to the level of the genus, according to the NCBI taxonomy.

3.4.9 The OG Line

The OG (OrGanelle) linetype indicates the sub-cellular location of non-nuclear sequences. It is only present in entries containing non-nuclear sequences and appears after the last OC line in such entries. The OG line contains one data item, either "Mitochondrion", "Chloroplast", "Kinetoplast", "Cyanelle", "Plastid" or a plasmid name.

3.4.10 The OX Line

The OX line is used in gene records; it contains the NCBI tax ID of the species.

3.4.11 The OH Line

The OH (Organism Host) line specifies the most specific NCBI taxonomy ID and name of the host organism or host range.

OH   NCBI_Taxid: 272623; Lactococcus lactis (subsp. lactis, strain IL1403)

3.4.12 The Reference (RN, RC, RP, RX, RG, RA, RT, RL) Lines

The reference lines in a Genome Reviews entry have the same format as those in the EMBL Nucleotide Sequence Database. The policy for inclusion of references is described in section 2.3. Note that the RX line at present includes cross-references only to PubMed, following the discontinuation of separate Medline identifiers. A sample RX line is shown below:

RX   PUBMED; 10910347.

3.4.13 The FH Line

The FH line is used as in the EMBL Nucleotide Sequence Database.

3.4.14 The FT Line

The format of the Genome Reviews feature table is essentially the same as that used in the EMBL Nucleotide Sequence Database. For a full definition of that Feature Table, please see the document "The DDBJ/EMBL/GenBank Feature Table: Definition" (URL: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html). However, there are some revisions to the format that have been made for the purposes of Genome Reviews, and considerable changes to the contents of individual records. These changes are discussed in this section of this document. The revised format is described in Backus-Naur form in section 4.1.

As with EMBL features, the format design is based on a tabular approach and consists of the following items:

Feature key: a single word or abbreviation indicating functional group
Location: instructions for finding the feature
Qualifiers: auxiliary information about a feature

For Genome Reviews records, certain additional keys and qualifiers have been added to those allowed in the EMBL Nucleotide Sequence Database, and the format of information associated with certain qualifiers has been redefined. Notably, evidence tags have been introduced, to attribute the source of data imported into Genome Reviews records. Evidence tags (in general) are described in more detail in section 2.4, but their specific incorporation into feature qualifiers is discussed here.

In order to maximise compliance with existing EMBL parsers, evidence tags have been introduced as additional information included as part of the value of a feature qualifier. A conventional EMBL parser will, therefore, be expected to return the value plus evidence in response to a request for the value; to separate the two components, a deeper level of parsing will be required.

Evidence tags are always located at the end of the qualifier value. They are contained within curly braces (i.e. between the '{' character and the '}' character) and preceded by a space. The tag and the value it tags are both contained within a single pair of double quotes. Wrapping of feature qualifiers containing tags follows the standard rules for feature qualifiers not containing tags, i.e. the presence of evidence tags does not affect how lines are wrapped. Where the qualifier had no associated value, the definition of the qualifier has been changed for Genome Reviews such that a value has been added, consisting only of the tag.

Examples of the incorporation of evidence tags into feature qualifiers are given in section 2.4.
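
The "deeper level of parsing" mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference parser: the names EVIDENCE_TAG and split_evidence are hypothetical, the input is taken to be a qualifier value that a conventional EMBL parser has already unwrapped and stripped of its double quotes, and only a single trailing tag is handled.

```python
import re

# A tag sits at the very end of the value, inside curly braces, and is
# preceded by a space unless the value consists of the tag alone.
EVIDENCE_TAG = re.compile(r"(?:^|\s)\{([^{}]*)\}$")


def split_evidence(value):
    """Return (plain value, evidence tag or None) for a qualifier value."""
    match = EVIDENCE_TAG.search(value)
    if match is None:
        return value, None
    return value[:match.start()].rstrip(), match.group(1)


tagged = split_evidence("tRNA:Ile (GAU) {tRNAScan-SE-1.23}")
tag_only = split_evidence("{RNAmmer-1.2}")
untagged = split_evidence("complete sequence")
```

For a qualifier that originally had no value, the plain value comes back as an empty string and only the tag is populated.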

3.4.15 The SQ Line

The SQ line is used as in the EMBL Nucleotide Sequence Database.

3.4.16 The Sequence Data Line

The sequence data line is used as in the EMBL Nucleotide Sequence Database.

3.4.17 The CC Line

CC lines in EMBL records contain free text comments. CC lines are used in Genome Reviews to describe the primary source entry from which a Genome Reviews file was made, and the date on which it was produced. An example is given below.

CC   This Genome Reviews entry was created from entry AE003850.3 in the
CC   EMBL/GenBank/DDBJ databases on 26 September 2005.

It is possible that other forms of comment will be introduced in future.

3.4.18 The XX Line

The XX line is used as in the EMBL Nucleotide Sequence Database.

3.4.19 The // Line

The // line is used as in the EMBL Nucleotide Sequence Database.





4) DATA IMPORT PROCEDURES

This section of the manual describes the procedures used to import data into a Genome Reviews record. This information applies to all distribution formats of Genome Reviews; although the language of the flat file (e.g. "feature", "feature qualifier") is used to explain the procedures, the same data is also available in the relational distribution and visible in the Genome Reviews Browser. Likewise, certain EMBL-specific terms (like CDS, or CoDing Sequence, to refer to an annotated region of DNA that encodes a protein) are used, but the explanation applies to equivalent data in other source databases, regardless of the naming convention used in those resources.

4.1. Sequence source

The primary source of sequence (and other annotation) for a Genome Reviews record is usually the corresponding submission to the EMBL Nucleotide Sequence Database. However, for certain model organisms, the latest assembled sequence has not been submitted to EMBL. In these cases, Genome Reviews sources the DNA sequence directly from an accessible alternative source. Currently there are three of these: the Saccharomyces Genome Database (SGD) (Balakrishnan R. et al., Nucleic Acids Res. 2005 Jan 1; 33 Database Issue:D374-7), used to source sequence information about the budding yeast Saccharomyces cerevisiae; The Arabidopsis Information Resource (TAIR) (Rhee S.Y. et al., Nucleic Acids Research 2003 31(1):224), used to source information about the thale cress Arabidopsis thaliana; and the Munich Information Center for Protein Sequences (MIPS), which has provided data for Ustilago maydis. The U. maydis chromosomes, originally sequenced and annotated by the Broad Institute, have been re-annotated at MIPS as part of their Ustilago maydis Annotation Project (Nature 2006 444, 97-101). Data is available from MUMDB.

For all three of these primary data sources, the same essential procedure is followed. The latest assembled chromosomal DNA sequence is accessed, as is annotation associated with this sequence. The methods described in sections 4.2-4.4 are then used to improve and enhance this primary data.

4.2. Data import through identifier matching

Much of the data imported into Genome Reviews is found in external databases. The use of common identifiers by different databases, and the maintenance of specific cross-references between them, can be used to identify equivalent entities and allow the transfer of annotation. The principle is discussed in section 2.4.1 of this document, and is applied using protein identifiers to map between the features described in the EMBL Nucleotide Sequence Database and records in the UniProt Knowledgebase, and thereafter to other resources.

When primary sequence data is sourced from either SGD, TAIR, or MIPS, a further step is added to the procedure, as cross-references between these resources and the UniProtKB are less well synchronised. In these cases, the sequence similarity approach (see section 4.3) is used to identify UniProtKB records cross-referencing to annotated features in the source database.

4.3. Data import through sequence matching

Sequence similarity comparisons are run for two reasons during Genome Reviews production. Firstly, if the database used to source the sequence was not EMBL, the blastp protein sequence similarity algorithm from the BLASTALL package (Version 2.2.6 (4/9/03); Altschul et al., Nucl. Acids Res. (1997) 25:3389-3402) is used to identify the best matching entry in the UniProtKB for each annotated CDS feature in the source database. As explained in section 4.2, if the source database is EMBL, protein identifiers can be used to map UniProtKB to EMBL and sequence matching is not required.

Secondly, sequence similarity matching may be performed to determine the location of a CDS on the genome sequence that corresponds to a protein sequence which has been reported but which does not correspond to any existing annotated CDS in the source database. At present, the only database from which such non-annotated or incorrectly annotated protein sequences are identified, and subsequently mapped to the corresponding genomic sequence, is UniProtKB/Swiss-Prot, the manually curated portion of the UniProtKB.

Reasons why a sequence in the UniProtKB may not correspond exactly to an annotated CDS feature in the source database include the following:

(i) sequence variation between individual members of one species

(ii) errors in DNA sequence

(iii) errors in gene prediction i.e. missing predictions

(iv) errors in boundary prediction i.e. start/stop codon incorrectly annotated

(v) errors in translation prediction i.e. an authentic frameshift may have been missed

(vi) two CDSs are annotated, but the UniProtKB curator believes that only a single protein is actually encoded

(vii) one CDS is annotated, but the UniProtKB curator believes that two separate proteins are actually encoded

In Genome Reviews, we aim to provide a consistent picture of genomes and proteins. We therefore correct the primary source data to be consistent with the reference protein sequences provided in well-curated protein databases, wherever it is practical and meaningful to do so. We have implemented a pipeline that runs with each Genome Reviews release. It takes sequences in the UniProt Knowledgebase (from all archaeal and bacterial species represented in Genome Reviews) that have no exact sequence match to an annotated feature in the source database entry describing the corresponding genome, and maps them back onto the DNA sequence. We do not currently run this pipeline for the eukaryotic species represented in Genome Reviews. New, or adjusted, coordinates on the genome sequence defining the region encoding the "missing" protein are determined, and exceptions in the translation pattern are identified. These novel/adjusted annotations are then selectively imported into Genome Reviews.

At present, we import new/revised CDS features in the following cases:

(i) Completely unannotated proteins, where there are <= 3 exceptions between the translation pattern predicted from the DNA and the protein sequence being mapped, i.e. a maximum of 3 locations where either the reference protein sequence contains an amino acid other than the one expected from translating the corresponding codon in the DNA sequence with the standard genetic code for that species, or the reading frame appears to be disrupted in some way.

(ii) Proteins that are mistakenly annotated with too many CDSs in the primary data source.

(iii) Multiples of proteins that are mistakenly annotated by a single CDS in the primary data source.

Having calculated the co-ordinates of the new/adjusted features, and selected which cases are to be included in Genome Reviews, it is now necessary to translate the results of the mapping (i.e. a protein-DNA alignment process) into the Genome Reviews flat file format. Crucially, any discrepancy between the protein sequence associated with a (new or modified) CDS feature, and the translation of the corresponding region of DNA sequence, needs to be fully described. To do this, we have introduced a small extension to the EMBL file format.

(i) Simple translation exceptions (where amino acid X1 is found in the protein sequence, where amino acid X2 was expected from the DNA sequence) can be dealt with easily in the EMBL format using the /transl_except feature qualifier.

(ii) Apparent frameshifts (any discontinuity that disrupts the reading frame of the literal translation) are harder to deal with (these may be the result of real frameshift events, of errors in sequence or of natural variation; but given the existence of such a discrepancy, it needs to be described regardless of the cause). The join statement can be used to represent the existence of apparently unused nucleotides (i.e. a coding region is defined in two portions, excluding those nucleotides that appear not to be part of any codon actually used to encode the reference protein sequence).

(iii) The other possibility is that nucleotides are apparently absent from the DNA sequence (and a codon corresponding to a certain amino acid in the protein sequence cannot be found). A new feature qualifier /insertion has been introduced to the Genome Reviews file format to represent this.

It is always possible to use the results of a sequence alignment to place new CDS features onto the genome that translate to a corresponding protein sequence subject to the differences described through the use of /transl_except and /insertion qualifiers, and join statements. Sometimes it is possible to represent the results of an alignment in more than one way; for example, a /transl_except qualifier could always be represented by an insertion and a deletion (join). In Genome Reviews, the /transl_except feature qualifier is only used in cases where there is a single mismatching codon within a well-defined reading frame. The /transl_except qualifier is not used at the margin of deletions or insertions, where one or both neighbours of an "exceptional" codon themselves fail to match the protein sequence. Some simple examples are given in the next section.

4.3.1. Representation of aligned sequences and their discrepancies: some examples

According to EMBL format, where the actual product of a CDS does not match the translation of the DNA according to the specified genetic code, the "real" sequence is entered under the /translation feature qualifier and any discrepancies are annotated accordingly. A join statement can be used to indicate a real or apparent frameshift, and the /exception and /transl_except feature qualifiers can also be used to indicate discrepancies. We have extended this scheme for Genome Reviews to enable us to describe the complete result of a sequence alignment using feature qualifiers and join statements in a standardised way. This enables the development of software capable of automatically "translating" from DNA to protein subject to the exceptions defined.

We can consider three types of sequence discrepancies that may disrupt an alignment between the protein sequence and the matching region of DNA: deletion (i.e. there are >=1 bases in the DNA sequence that do not seem to be part of any codon), insertion (there are amino acids in the protein sequence that do not seem to have corresponding codons in the DNA sequence) and mis-translation. Such discrepancies do not necessarily represent biological phenomena, but may result from artifacts of the procedure by which the sequences were determined.

1: Deletion;

During an alignment, we may come across regions where one sequence does not match the other, and the algorithm used has decided to place a gap in the alignment rather than align two different elements.

e.g. (Assuming we are aligning at the Amino Acid level)

    abc-def         (Protein Sequence; '-' represents a 'gap')
    |||.|||         (Aligned Regions; '|' represents a match; '.' represents a 'gap')
    abccdef         (Genomic Sequence)

This 'Deletion' event will not be explicitly labelled in the Genome Reviews files, but will be expressed by using the EMBL convention of adding an extra coordinate range to the JOIN statements found in the CDS feature instead.

i.e. (Assuming our alignment shows amino acid sequences, and the offset of 1000 is purely for illustration)

                   join(1000..1008,1012..1020)

In the Genome Reviews context, this occurs when the reference protein sequence does not match the underlying genomic sequence; we therefore add the 'gap' to the reference protein sequence in order to attain a better alignment.

Deletions can span one or more nucleotides, though most span a single codon (one amino acid). Because a deletion need not be a whole number of codons in length, it may also imply a frame shift.
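
Because deletions are not labelled explicitly, software reading a CDS location has to infer them from the gaps between the ranges of the join statement. The Python sketch below illustrates one way of doing this; the function names and the sample locations are assumptions made for the example, and only simple forward-strand joins are handled. Note that this reading of a join applies to the mapped prokaryotic CDS features discussed here; in other contexts a join may represent introns.

```python
import re


def parse_join(location):
    """Return the (start, end) ranges of a simple forward-strand location."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def skipped_bases(location):
    """List (first, last, implies_frameshift) for each gap between ranges."""
    ranges = parse_join(location)
    gaps = []
    for (_, previous_end), (next_start, _) in zip(ranges, ranges[1:]):
        length = next_start - previous_end - 1
        if length > 0:
            # A gap that is not a whole number of codons shifts the frame
            # of the literal translation.
            gaps.append((previous_end + 1, next_start - 1, length % 3 != 0))
    return gaps


whole_codon = skipped_bases("join(1000..1008,1012..1020)")
single_base = skipped_bases("join(200..208,210..218)")  # hypothetical location
```

In the first case the three unused nucleotides 1009..1011 form a whole codon, so no frame shift is implied; in the second, a single unused nucleotide does imply one.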

2: Insertion;

This will occur when the opposite of a Deletion happens; when the protein sequence contains regions that are not found in the underlying genomic sequence.

e.g. (A reversal of the above example)

    abccdef     (Protein Sequence)
    |||.|||     (Aligned Regions)
    abc-def     (Genomic Sequence)

This 'Insertion' event will be explicitly labelled in the Genome Reviews files.

It will appear as a separate qualifier of the CDS feature and be identified as "/insertion=..."

i.e.

                   /insertion="1008^1009,seq:C"

In the Genome Reviews context, this is the reverse of the deletion event, thus we add the 'gap' to the genomic sequence in order to attain a better alignment.

Insertions are allowed to span several amino acids, and as such are represented by their respective single character symbols. Where an insertion consists of multiple amino acids, these are presented according to their order in the protein. The numerical range (indicated by two sequential integers separated by a caret) indicates the nucleotides either side of the "missing" codon. Thus if there was a protein-coding DNA sequence as follows:

ATGATG

and we were to imagine the presence of two amino acids (A and B, in that order) between the two encoded methionines, this would be annotated as such:

FT   CDS             1..6
FT                   /insertion="3^4,seq:AB"
FT                   /translation="MABM"
                        

whereas if the CDS extended in the opposite direction, and we were to imagine the presence of two amino acids (A and B, in that order) between the two encoded histidines, this would be annotated in this way:

FT   CDS             complement(1..6)
FT                   /insertion="3^4,seq:AB"
FT                   /translation="HABH"
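
The two examples above can be reproduced with a short Python sketch. It is illustrative only: the function and constant names are assumptions, the standard genetic code is used, and the sketch handles a single insertion that falls on a codon boundary of a CDS with one coordinate range. As in the examples, the letters A and B simply stand for the inserted residues.

```python
import re
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
INSERTION = re.compile(r"(\d+)\^(\d+),seq:([A-Za-z]+)")


def parse_insertion(value):
    """Return (left, right, residues) from an unquoted /insertion value."""
    match = INSERTION.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognised /insertion value: {value!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


def translate_with_insertion(dna, start, end, insertion=None, complement=False):
    """Translate dna[start..end] (1-based, inclusive), adding inserted residues."""
    region = dna[start - 1:end].upper()
    if complement:
        region = region.translate(COMPLEMENT)[::-1]
    residues = [CODON_TABLE[region[i:i + 3]] for i in range(0, len(region) - 2, 3)]
    if insertion is not None:
        left, right, inserted = parse_insertion(insertion)
        # Number of coding nucleotides read before the insertion point.
        bases_before = (end - left) if complement else (left - start + 1)
        if right != left + 1 or bases_before % 3:
            raise ValueError("sketch only handles insertions on a codon boundary")
        index = bases_before // 3
        # Inserted residues are already given in protein order.
        residues[index:index] = list(inserted)
    return "".join(residues)


forward = translate_with_insertion("ATGATG", 1, 6, "3^4,seq:AB")
reverse = translate_with_insertion("ATGATG", 1, 6, "3^4,seq:AB", complement=True)
```

The forward-strand call gives MABM and the complement-strand call gives HABH, matching the /translation values shown above.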

Special case: C-terminal insertion:

If in an alignment, the coding sequence that has been aligned with the genomic sequence shows a discrepancy at the 3' end (corresponding with the C-terminus of the encoded protein), and the underlying genomic sequence does not have a stop codon in that region, we represent the end location of the CDS feature with the position of the last nucleotide of the last matching triplet and we add an insertion between that position and the next nucleotide in the genomic sequence. The inserted sequence consists of the C-terminal amino acid(s) after the last match.

3: Translation Exception;

These occur when the protein sequence has regions which are genuinely different from the underlying sequence, but the alignment algorithm has determined that it is best to keep the mismatch rather than adding a 'gap' to the range.

This would typically occur when the regions surrounding the mismatch are good matches themselves, and adding a gap would adversely affect these regions.

e.g.

    abccdef     (Protein Sequence)
    |||!|||     (Aligned Regions, The '!' symbol is used to highlight the affected region)
    abczdef     (Genomic Sequence)

This 'Translation Exception' event is explicitly labelled in the Genome Reviews files. It appears as a separate qualifier of the CDS feature and is identified as "/transl_except=..."

                   /transl_except="(pos:4235116..4235118,aa:Cys)"

In a Genome Reviews context, we represent each single amino acid that differs as a separate element and hence we will use the standard three letter code to represent each, rather than just the single letter code.

 

When a UniProt Knowledgebase sequence is mapped to an EMBL feature and the original coordinates are altered as a result, in a small number of cases the genomic sequence does not contain a termination codon. In such cases, a translation exception event is introduced indicating the position of the missing stop codon, e.g.

                   /transl_except="(pos:12233..12235,aa:TERM)"
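
Both forms of the qualifier value shown in this section can be read with one pattern, as in the Python sketch below. The names TRANSL_EXCEPT and parse_transl_except are assumptions for this example, and only the simple pos:start..end form shown above is handled.

```python
import re

TRANSL_EXCEPT = re.compile(r"\(pos:(\d+)\.\.(\d+),aa:([A-Za-z]+)\)")


def parse_transl_except(value):
    """Return (start, end, amino acid code) from a /transl_except value."""
    match = TRANSL_EXCEPT.fullmatch(value.strip('"'))
    if match is None:
        raise ValueError(f"unrecognised /transl_except value: {value!r}")
    start, end, amino_acid = match.groups()
    return int(start), int(end), amino_acid


mismatch = parse_transl_except('"(pos:4235116..4235118,aa:Cys)"')
missing_stop = parse_transl_except('"(pos:12233..12235,aa:TERM)"')
```

A value of TERM in the third position marks a missing termination codon rather than an amino acid.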

                        

See Appendix III for more details on the application of each feature qualifier type.

 

 

4.3.2. Outline of mapping procedure for unannotated CDSs

Records in the UniProt Knowledgebase representing proteins encoded by completely deciphered genomes, but which have not been annotated in the corresponding EMBL entry, are identified (such records are annotated as "unannotated CDSs"). These records are compared to the genome sequence using the program tblastn from the package BLASTALL (Version 2.2.6 (4/9/03); Altschul et al., Nucl. Acids Res. (1997) 25:3389-3402) to identify exact matches, i.e. cases where a continuous coding sequence for the entire protein can be identified on part of the corresponding DNA sequence, with no gaps or other irregularities. In such cases, the location data is included directly in the Genome Reviews data files.

If however, a location is not found by this method, a customised version of the ALIGN tool (version 2.0u by Myers and Miller, CABIOS (1989) 4:11-17) is used to try and locate a CDS for the protein, which may be non-contiguous and thus also require more annotation to describe the differences between the protein and genomic sequences.

The same procedure is also used to map sequences in the UniProt Knowledgebase that are mapped to coding sequences in the EMBL genome records (indicated in the UniProt entry through the presence of a cross-reference to the EMBL protein identifier), but where the sequence in the UniProt entry does not agree with the translation of the EMBL feature. In this case the aforementioned ALIGN program is used to find the regions of DNA that disagree with the reported protein sequence. Once identified, these regions are referenced to the corresponding amino acids, and the resulting conflicts between the UniProt protein sequence and the EMBL DNA sequence can thus be expressed through the use of join statements and certain feature qualifiers to be added to the Genome Reviews files, as discussed in the previous section.

Data from the mapping process is selectively being incorporated into Genome Reviews.

4.3.3. Example of a new feature added through sequence comparison


FT   CDS             complement(34089..34256)
FT                   /evidence="{BLASTALL 2.2.6/ALIGN 2.0u}"
FT                   /product="Hypothetical protein yaaV
FT                   {UniProtKB/Swiss-Prot:P46145}"
FT                   /db_xref="UniParc:UPI0000139FFD {UniProtKB/Swiss-Prot:P46145}"
FT                   /db_xref="UniProtKB/Swiss-Prot:P46145 {UniProtKB/Swiss-Prot:P46145}"
FT                   /db_xref="Ecogene:EG12706 {UniProtKB/Swiss-Prot:P46145}"
FT                   /translation_table=11
FT                   /translation="MTRFRAIKQHKIVDISIVCNNFTVDKCELNPAYVIKNIDSPKDL
FT                   LNGQKKTVLIREPY"


4.4. Data import through sequence analysis

Annotation of non-protein-coding genes in completely sequenced genomes is erratic and in some cases wholly absent. We have addressed this by running several computational analyses of genomes in which such genes have not been annotated, and by adding new annotations for such genes according to the results. We have implemented three pipelines of analysis, depending on the type of RNA genes we are aiming to detect.

4.4.1. Detection of non-protein-coding genes

Analysis of non-protein-coding genes, other than transfer RNA genes and ribosomal RNA genes, as well as RNA motifs, is performed in a general manner, using the program rfam_scan.pl. rfam_scan.pl is a Perl wrapper for searching DNA sequences against the latest Rfam Database (Griffiths-Jones S. et al. (2005)) using the INFERNAL software package (Eddy S.R. (2002)).

Depending on their type, the genes detected by this procedure are annotated as:

  • tmRNA features for transfer-messenger RNA genes.
  • ncRNA features for all defined non-protein-coding genes other than tRNA or rRNA or tmRNA genes.
  • misc_RNA features for RNA genes with undefined classification.

RNA motifs are annotated as:

  • misc_structure features
  • stem_loop features

This classification conforms to proposals currently in the process of being implemented in the EMBL Nucleotide Sequence Database, and represents a change to the previous classification used in that database and Genome Reviews. As a consequence of this new classification, the snoRNA and snRNA feature types have been replaced by the ncRNA feature type, whose subtype is defined in an ncRNA_class qualifier. The values of this qualifier are restricted to the following controlled vocabulary:

  • antisense_RNA
  • autocatalytically_spliced_intron
  • hammerhead_ribozyme
  • RNase_P_RNA
  • RNase_MRP_RNA
  • telomerase_RNA
  • guide_RNA
  • rasiRNA
  • scRNA
  • siRNA
  • miRNA
  • snoRNA
  • snRNA
  • SRP_RNA
  • stRNA
  • vault_RNA
  • Y_RNA

4.4.2. Detection of transfer RNA genes

Analysis is specifically performed using tRNAScan-SE.

See Lowe T.M. and Eddy S.R. (1997) "tRNAScan-SE: a program for improved detection of transfer RNA genes in genomic sequence", Nucl. Acids Res., 25, 955-964

See also http://www.genetics.wustl.edu/eddy/tRNAScan-SE/

Version 1.23 of the program was used, configured for superregnum as appropriate.

New tRNA-encoding genes are annotated as tRNA features in the following way:

FT   tRNA            complement(28316..28391)
FT                   /evidence="{tRNAScan-SE-1.23}"
FT                   /gene_name="tRNA:Ile (GAU) {tRNAScan-SE-1.23}"
FT                   /anticodon=(pos:28356..28358,aa:Ile)

The anticodon and the corresponding amino acid are presented under the /gene_name qualifier. Note that if the tRNA gene is found on the reverse strand, the direction is indicated through the use of "complement", as with CDS features (but unlike tRNA genes in EMBL records).

We are looking into the possibility of upgrading/replacing original tRNA annotations in future releases.

4.4.3. Detection of ribosomal RNA genes

Analysis is specifically performed using RNAmmer.

See Lagesen K. et al. (2007) "RNAmmer: consistent and rapid annotation of ribosomal RNA genes", Nucl. Acids Res., 35, 3100-3108

See also http://www.cbs.dtu.dk/services/RNAmmer

Version 1.2 of the program was used, configured for superregnum as appropriate.

New ribosomal RNA-encoding genes are annotated as rRNA features in the following way:

FT   rRNA            complement(28316..28391)
FT                   /evidence="{RNAmmer-1.2}"
FT                   /gene_name="16s_rRNA {RNAmmer-1.2}"

5) GENOME REVIEWS GENE RECORDS

A gene record describes the DNA sequence that encodes the products of a gene. Where expression information is available about specific genes (e.g. information about promoters/UTRs), this has been used in defining the gene. However, for many genomes included in Genome Reviews, no specific information is available at the gene level; in these cases, a virtual gene is assumed to exist comprising all splicing variants, by taking the start and end coordinates of the longest coding sequence.
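
The fallback rule for virtual genes can be sketched in Python as follows. This is an illustration under stated assumptions, not the Genome Reviews implementation: the function name virtual_gene_span is hypothetical, coordinates are taken to be 1-based and inclusive, length is measured as the genomic span between start and end, and the manual does not say how ties are resolved (here the first longest span wins).

```python
def virtual_gene_span(cds_spans):
    """Return the (start, end) pair of the longest coding sequence.

    cds_spans is an iterable of (start, end) pairs with start <= end,
    one per coding sequence (splice variant) of the gene.
    """
    spans = list(cds_spans)
    if not spans:
        raise ValueError("at least one coding sequence is required")
    # Length is the inclusive genomic span; max() keeps the first of any ties.
    return max(spans, key=lambda span: span[1] - span[0] + 1)
```

For three variants at (100, 400), (100, 700) and (250, 700), the virtual gene would take the coordinates (100, 700).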

Individual regions of known function or character are annotated as features on the sequence. Where a protein sequence in the UniProtKB does not exactly correspond with the nucleotide sequence in the gene set, differences between the conceptual translation and the actual protein sequence are represented through the use of qualifiers in the annotation, as they are in Genome Reviews component records.

Data files representing sets of records, comprising the genes derived from each genome component molecule (and each complete genome), can be downloaded for all genomes in Genome Reviews. All data are available in FASTA format, and additionally in a richer file format, EMBL CDS-like format, containing more detailed, structured annotation.

Gene sets for components and complete genomes can be accessed by ftp using the appropriate link below:

In addition, gene sets can be searched in SRS as explained in section 7.2. Instead of "Genome Reviews", the library "GR Gene Sets" should be selected.

6) GENOME REVIEWS TRANSCRIPT RECORDS

A transcript record describes a processed transcript after any post-transcriptional events, such as splicing, may have occurred. In the case of polycistronic transcription, all genes that are known to be transcribed together will be part of the same transcript record. Where information is available about specific transcripts (e.g. information about promoters/UTRs; co-transcription; or alternative translational information), this has been used in defining the transcripts. However, for many genomes included in Genome Reviews, no specific information is available at the transcript level; in these cases, a virtual transcript is assumed to exist for each known protein product. Information about the relationship between entities at each of these levels is summarised in this README file.

As with complete molecules and gene records, individual regions of known function or character are annotated as features on the sequence.

Data files representing sets of records, comprising the transcripts derived from each genome component molecule (and each complete genome), can be downloaded for all genomes in Genome Reviews. All data are available in FASTA format, and additionally in a richer file format, EMBL CDS-like format, containing more detailed, structured annotation.

Transcript sets for components and complete genomes can be accessed by ftp using the appropriate link below:

7) SEARCHING AND DOWNLOADING GENOME REVIEWS

7.1 Downloading Genome Reviews

Access to Genome Reviews flat files is described in the release notes, available at ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/ReleaseNotes.txt.

7.2 Searching Genome Reviews through SRS

Genome Reviews is available for search under the EBI's SRS server, as follows:

  1. Go to http://srs.ebi.ac.uk.
  2. From the top row of tabs, click on the tab labelled "Library Page".
  3. Under the section "Nucleotide Sequence Databases" in the main body of the page, check the box marked "Genome Reviews".
  4. Either
    • Enter a search term in the "Quick Search" box at the top of the page, or
    • Click on the "Standard Query Form" button in the left hand margin, or
    • Click on the "Extended Query Form" button in the left hand margin.

For more information on how to use the Query Forms, please see the SRS documentation at http://srs.ebi.ac.uk/doc/index.html

7.3 Searching Genome Reviews through Integr8

An Ensembl-style browser is now available for Genome Reviews, providing a zoomable graphical view of all chromosomes and plasmids represented in the database. The location and structure of all genes are shown, and the distribution of features throughout the sequence is displayed. In the search box on either the Genome Reviews or Integr8 homepage, select your gene or protein of interest, optionally specify the species, and then click on the icon to enter the browser centred on the specified gene. More information about using the Integr8 search facility is available. Help is available when using the browser by clicking on the help icon that appears on every page.

7.4 Installation of a local Genome Reviews MySQL database

The file gr_mysql-release_xx.sql.gz at ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/sql/ represents an export of Genome Reviews release xx from a MySQL database. The data correspond to the data found in the flat file distribution (ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/dat/). The database schema is essentially the same as that used by Ensembl (http://www.ensembl.org/) to describe higher eukaryotic genomes.

USAGE:

We recommend that you use MySQL version 4 or later, though previous versions may work. If you get a 'Packet too large' error from your MySQL server, you will have to increase the value of the server variable max_allowed_packet above the default value of 1M to at least 32M. For MySQL version 4.0 or later, this can be done by adding the following line to the [mysqld] section of your my.cnf file and restarting the MySQL server daemon:

[mysqld]
max_allowed_packet = 32M

Please note that MySQL versions prior to 4.0 do not support higher values. See your MySQL documentation for further information.

Download the latest Genome Reviews schema file, gr_mysql-release_xx.sql.gz (where xx is the current release), from ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/sql/.

If you are using unix/linux then you can import the schema and its data into your MySQL database as follows:

  • Use an existing database or create a new one. Please note that the loaded schema and data will overwrite any existing schema and data in the database.

The command line options are as follows:

$HOST - the host (optional if you are using localhost).
$PORT - the port (optional if you are using 3306).
$USER - the user account with which you will be loading the schema. The account must have permission to populate the database.
$PASS - the password for the above account.
$DBNAME - the name of the database where the schema and data will be loaded.

  • Execute the following command, substituting the $ variables with your own values:

Unix/linux (command is on a single line):

gunzip -c gr_mysql-release_xx.sql.gz | mysql -h $HOST -P $PORT -u $USER -p$PASS -D $DBNAME

Windows:
Unzip gr_mysql-release_xx.sql.gz (using WinZip or similar) into gr_mysql-release_xx.sql.

mysql -h $HOST -P $PORT -u $USER -p$PASS $DBNAME -f < gr_mysql-release_xx.sql
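
Where a scripted, platform-independent load is preferred, the same steps can be sketched in Python using only the standard library and the mysql command line client. This is a minimal sketch under stated assumptions: the function names build_mysql_args and load_dump are hypothetical, the mysql client is assumed to be on the PATH, and the compressed file is decompressed in Python rather than with gunzip or WinZip.

```python
import gzip
import shutil
import subprocess


def build_mysql_args(user, password, dbname, host="localhost", port=3306):
    """Build the mysql client argument list used in the commands above."""
    return ["mysql", "-h", host, "-P", str(port),
            "-u", user, f"-p{password}", "-D", dbname]


def load_dump(dump_path, mysql_args):
    """Stream a gzip-compressed SQL export into the mysql client."""
    with gzip.open(dump_path, "rb") as dump:
        client = subprocess.Popen(mysql_args, stdin=subprocess.PIPE)
        # Copy the decompressed SQL to the client's standard input.
        shutil.copyfileobj(dump, client.stdin)
        client.stdin.close()
        if client.wait() != 0:
            raise RuntimeError("mysql exited with an error")
```

A call such as load_dump("gr_mysql-release_xx.sql.gz", build_mysql_args(user, password, dbname)) would then replace the shell pipeline; as with the commands above, the target database is overwritten.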

8) APPENDICES


8.1 Appendix I: Feature table: Backus-Naur form

The feature table is a mandatory part of an entry. Full entry syntax is specified elsewhere. This definition is an amended version of the feature table definition given in the EMBL feature table document (URL: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html). The primary formal changes in specification derive from the introduction of evidence tags into the feature qualifiers of Genome Reviews records.

feature_table ::= <feature_table_header><feature_table_body>

feature_table_header ::= FH Key Location/Qualifiers |
FEATURES Location/Qualifiers

feature_table_body ::= <feature> | <feature_table_body><feature>

At least one feature is required.

feature ::= <feature_key><feature_details>

Key is required, location required, qualifier list optional

feature_key ::= <symbol>

feature_details ::= <location><qualifier_list> | <location>

There exists a table of legal keys.

location ::= <absolute_location> | <feature_name> | <functional_operator>(<location_list>)

absolute_location ::= <local_location> | <path> : <local_location>

path ::= <database> :: <primary_accession> | <primary_accession>

feature_name ::= <path>:<feature_label> | <feature_label>

feature_label ::= <symbol>

local_location ::= <base_position> | <between_position> | <base_range>



location_list ::= <location> | <location_list>,<location>

functional_operator ::= <symbol>

base_position ::= <integer> | <low_base_bound> | <high_base_bound> | <two_base_bound>


low_base_bound ::= > <integer>

high_base_bound ::= < <integer>

two_base_bound ::= <base_position>.<base_position>

between_position ::= <base_position>^<base_position>

base_range ::= <base_position>..<base_position>

database  ::= <symbol>

primary_accession ::= <symbol>

sequence_character ::= a | b | c | d | g | h | k | m | n | r | s | t | u | v | w | y

qualifier_list ::= <qualifier> | <qualifier_list><qualifier>

qualifier ::= /<qualifier_name> | /<qualifier_name>=<value>

qualifier_name ::= <symbol>

value ::= <simple_value> | (<value_list>) | (<tagged_value_list>)

simple_value ::= <integer> | <location> | <reference_number> | "<text_string>" | "<text_string> <evidence_tag>" | <symbol>

value_list ::= <value> | <value_list>,<value>

tagged_value_list ::= <tagged_value> | <tagged_value_list>,<tagged_value>

tagged_value ::= <tag>:<value>

tag ::= <symbol>

reference_number ::= [ <unsigned_integer> ]

symbol  ::= <letter> | <symbol><symbol_character> | <symbol_character><symbol>

text_string ::= <string_character>| <text_string><string_character>

evidence_tag ::= { <evidence_item_list> }

evidence_item_list ::= <evidence_item> | <evidence_item>; <evidence_item_list>

evidence_item  ::= <text>:<evidence_value>

evidence_value ::= <text> | !<text>

unsigned_integer ::= <digit> |  <unsigned_integer><digit>

integer ::= <unsigned_integer> | - <unsigned_integer>

string_character ::= <letter> | <digit> | <punctuation> | ""

symbol_character ::= <up_case_letter> | <low_case_letter> | <digit> | _ | - | ' | *

letter ::= <up_case_letter> | <low_case_letter>

up_case_letter ::= A | B | ... | Z

low_case_letter ::= a | b | ... | z

digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

punctuation ::= <space> | ! | # | $ | % | & | ' | ( | ) | * | + | , |
 - | . | / | : | ; | < | = | > | ? | @ | [ | \ | ] | ^ | _ | ` | { |
 <bar> | } | ~

bar ::= |

space ::= ascii 32
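
To illustrate how the location rules compose, the following Python sketch parses a deliberately small subset of the grammar: plain integer base positions, base ranges, and functional operators applied to a location list. The function names and the tuple-based result are assumptions made for this example; bounds written with <, >, . or ^, paths and feature labels are not handled.

```python
import re

# Tokens: integers, the ".." range separator, operator symbols, ( ) and ,
TOKEN_RE = re.compile(r"\d+|\.\.|[A-Za-z_]+|[(),]")


def parse_location(text):
    """Parse a location string into nested tuples (subset of the grammar)."""
    tokens = TOKEN_RE.findall(text)
    if "".join(tokens) != text:
        raise ValueError("unsupported characters in location")
    node, pos = _location(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after location")
    return node


def _location(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("unexpected end of location")
    tok = tokens[pos]
    if tok.isdigit():
        # base_range ::= <base_position>..<base_position>
        if tokens[pos + 1:pos + 2] == [".."]:
            if pos + 2 >= len(tokens) or not tokens[pos + 2].isdigit():
                raise ValueError("incomplete base range")
            return ("range", int(tok), int(tokens[pos + 2])), pos + 3
        return ("base", int(tok)), pos + 1
    if tok[0].isalpha() or tok[0] == "_":
        # <functional_operator>(<location_list>)
        if tokens[pos + 1:pos + 2] != ["("]:
            raise ValueError("operator must be followed by '('")
        pos += 2
        items = []
        while True:
            item, pos = _location(tokens, pos)
            items.append(item)
            if pos < len(tokens) and tokens[pos] == ",":
                pos += 1
                continue
            break
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing ')'")
        return (tok, items), pos + 1
    raise ValueError(f"unexpected token {tok!r}")
```

With this sketch, a location such as complement(28316..28391) becomes an operator node wrapping a single range; nested operators are handled by the same recursive rule.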



                        


8.2 Appendix II: Feature keys reference

This appendix describes the use of features and their associated qualifiers in Genome Reviews records. The number of allowable feature keys and qualifier names has been reduced to standardise their usage, but some new feature keys and qualifier names have also been introduced. These lists are liable to revision in subsequent releases.

8.2.1 Feature key reference manual


This reference is organised according to the following format:
Feature Key             the feature key name
Definition              the definition of the key
Mandatory qualifiers    qualifiers required with the key; if there are
                        no mandatory qualifiers, this field is omitted.
Optional qualifiers     optional qualifiers associated with the key
Comment                 comments and clarifications

Abbreviations:
accnum                  an entry primary accession number

<evidence_tag>          evidence tag, as discussed in sections 2.4,
                        3.16 and 4.4
<integer>               unsigned integer value

Feature Key           CDS

Definition            coding sequence; sequence of nucleotides that
                      corresponds with the sequence of amino acids in a
                      protein (location includes stop codon);
                      feature includes amino acid conceptual
                      translation;
Optional qualifiers   /biological_process="<GO_term> <evidence_tag>"
                      /cellular_component="<GO_term> <evidence_tag>"
                      /db_xref="<database>:<identifier> <evidence_tag>"
                      /EC_number="text <evidence_tag>"
                      /function="text <evidence_tag>"
                      /gene_name="text <evidence_tag>"
                      /gene_synonym="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /product="text <evidence_tag>"
                      /product_synonym="text <evidence_tag>"
                      /protein_id="<identifier> <evidence_tag>"
                      /pseudo="<evidence_tag>"
                      /translation="text"
                      /transl_table=<integer>

Comment               /codon_start has a valid value of 1 or 2 or 3,
                      indicating the offset at which the first
                      complete codon of a coding feature can be found,
                      relative to the first base of that feature;
                      /transl_table defines the genetic code table
                      used if other than the universal genetic code
                      table; genetic code exceptions outside the range
                      of the specified tables are reported in /codon
                      or /transl_except qualifier; /protein_id consists
                      of a stable ID portion (3+5 format with 3
                      position letters and 5 numbers) plus a version
                      number after the decimal point; when the protein
                      sequence encoded by the CDS changes, only the
                      version number of the /protein_id value is
                      incremented; the stable part of the /protein_id
                      remains unchanged and as a result will
                      permanently be associated with a given protein;
                      /transl_table and /translation not used for
                      pseudogenes (i.e. not used in conjunction with
                      /pseudo).
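
The /protein_id convention described in the comment above (a stable 3+5 identifier followed by a version number) can be checked and manipulated as in the following Python sketch. The function names and the example identifier AAA12345.1 are hypothetical, and restricting the three letters to upper case is an assumption; any evidence tag attached to the qualifier value is expected to have been removed beforehand.

```python
import re

# Stable part: 3 letters + 5 digits; version: digits after the decimal point.
PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")


def split_protein_id(protein_id):
    """Split a /protein_id value into (stable_id, version)."""
    match = PROTEIN_ID_RE.match(protein_id)
    if match is None:
        raise ValueError(f"not a 3+5 protein_id with version: {protein_id!r}")
    return match.group(1), int(match.group(2))


def next_version(protein_id):
    """Return the identifier expected after a change to the protein sequence.

    Only the version number is incremented; the stable part is unchanged.
    """
    stable_id, version = split_protein_id(protein_id)
    return f"{stable_id}.{version + 1}"
```

For the hypothetical identifier AAA12345.1, a sequence change would yield AAA12345.2, so records can still be joined on the stable part AAA12345.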

Feature Key           conflict
Definition            independent determinations of the "same" sequence
                      differ at this site or region
Mandatory qualifiers  /citation=[number]
Optional qualifiers   /replace="text"

Feature Key           gap
Definition            gap in the sequence
Mandatory qualifiers  /estimated_length=unknown or <integer>
Optional qualifiers   /note="text"

Feature Key           mat_peptide
Definition            mature peptide or protein coding sequence; coding
                      sequence for the mature or final peptide or
                      protein product following post-translational
                      modification; the location does not include the
                      stop codon (unlike the corresponding CDS);
Optional qualifiers   /db_xref="<database>:<identifier> <evidence_tag>"
                      /EC_number="text <evidence_tag>"
                      /evidence="<evidence_tag>"
                      /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /product="text <evidence_tag>"
                      /product_synonym="text <evidence_tag>"

Feature Key           misc_RNA
Definition            any transcript or RNA product that cannot be defined by other RNA keys
                      (prim_transcript, precursor_RNA, mRNA, 5'clip, 3'clip, 5'UTR, 3'UTR, exon,
                      CDS, sig_peptide, transit_peptide, mat_peptide, intron, rRNA, tRNA, ncRNA
                      and polyA_site)
Optional qualifiers   /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /pseudo="<evidence_tag>"
                      /db_xref="<database>:<identifier>"
                      /experiment="text"

Feature Key           misc_structure
Definition            any secondary or tertiary nucleotide structure or conformation
                      that cannot be described by other Structure keys (stem_loop and D-loop);
Optional qualifiers   /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /db_xref="<database>:<identifier>"
                      /experiment="text"

Feature Key           ncRNA
Definition            a non-protein-coding gene, other than ribosomal RNA, transfer RNA, and transfer-messenger RNA,
                      the functional molecule of which is the RNA transcript.
Mandatory qualifiers  /ncRNA_class="TYPE"
Optional qualifiers   /allele="text"
                      /citation=[number]
                      /db_xref="<database>:<identifier>"
                      /experiment="text"
                      /function="text"
                      /gene_name="text <evidence_tag>"
                      /inference="TYPE[ (same species)][:EVIDENCE_BASIS]"
                      /label=feature_label
                      /locus_tag="text <evidence_tag>"
                      /map="text"
                      /note="text"
                      /old_locus_tag="text"
                      /product="text"
                      /pseudo="<evidence_tag>"
                      /standard_name="text"
                      /trans_splicing
                      /operon="text"
Comment               the ncRNA feature is not used for ribosomal, transfer or
                      transfer-messenger RNA annotation, for which the rRNA, tRNA
                      and tmRNA feature keys should be used, respectively.

Feature Key           operon
Definition            region containing polycistronic transcript
                      containing genes that encode enzymes that are
                      in the same metabolic pathway and regulatory sequences;
Mandatory qualifiers  /operon_name="text <evidence_tag>"
Optional qualifiers   /db_xref="<database>:<identifier> <evidence_tag>"

Feature Key           peptide
Definition            released active peptide; a peptide will
                      usually be a small region of up to 100 residues;
Optional qualifiers   /db_xref="<database>:<identifier> <evidence_tag>"
                      /EC_number="text <evidence_tag>"
                      /evidence="<evidence_tag>"
                      /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /product="text <evidence_tag>"
                      /product_synonym="text <evidence_tag>"

Feature Key           prim_transcript
Definition            primary (initial, unprocessed) transcript; in Genome Reviews, used to describe
                      transcription units (polycistronic transcripts belonging to an operon);
Optional qualifiers   /db_xref="<database>:<identifier> <evidence_tag>"
                      /gene_name="text <evidence_tag>"
                      /operon_name="text <evidence_tag>"
                      /promoter="text <evidence_tag>"

Feature Key           pro_peptide
Definition            activation peptide.
Optional qualifiers   /db_xref="<database>:<identifier> <evidence_tag>"
                      /EC_number="text <evidence_tag>"
                      /evidence="<evidence_tag>"
                      /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /product="text <evidence_tag>"
                      /product_synonym="text <evidence_tag>"

Feature Key           rRNA
Definition            mature ribosomal RNA; RNA component of the
                      ribonucleoprotein particle (ribosome) which
                      assembles amino acids into proteins.
Optional qualifiers   /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /pseudo="<evidence_tag>"
                      /db_xref="<database>:<identifier>"

Feature Key           sig_peptide
Definition            signal peptide coding sequence; coding sequence
                      for an N-terminal domain of a secreted protein;
                      this domain is involved in attaching nascent
                      polypeptide to the membrane leader sequence;
Optional qualifiers   /db_xref="<database>:<identifier> <evidence_tag>"
                      /evidence="<evidence_tag>"
                      /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /product="text <evidence_tag>"
                      /product_synonym="text <evidence_tag>"

Feature Key           source
Definition            identifies the biological source of the
                      specified span of the sequence; this key is
                      mandatory; more than one source key per sequence
                      is allowed; in Genome Reviews, every entry will
                      have a single source key spanning the entire
                      sequence;
Mandatory qualifiers  /db_xref="taxon:<identifier>"
                      /organism="text"
                      /mol_type="genomic DNA"
Optional qualifiers   /biovar="text"
                      /chromosome="text"
                      /cultivar="text"
                      /focus
                      /host="text"
                      /host_range="text"
                      /pathovar="text"
                      /plasmid="text"
                      /proviral
                      /segment="text"
                      /serovar="text"
                      /strain="text"
                      /sub_species="text"
                      /sub_strain="text"
                      /variety="text"

Feature Key           stem_loop
Definition            hairpin; a double-helical region formed by base-pairing between adjacent
                      (inverted) complementary sequences in a single strand of RNA or DNA.
Optional qualifiers   /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /db_xref="<database>:<identifier>"
                      /experiment="text"

Feature Key           transit_peptide
Definition            transit peptide coding sequence; coding sequence
                      for an N-terminal domain of a nuclear-encoded
                      organellar protein; this domain is involved in
                      post-translational import of the protein into
                      the organelle;
Optional qualifiers   /db_xref="<database>:<identifier>"
                      /evidence="<evidence_tag>"
                      /function="text"
                      /gene_name="text"
                      /locus_tag="text"
                      /product="text"
                      /product_synonym="text <evidence_tag>"

Feature Key           tRNA
Definition            mature transfer RNA, a small RNA molecule (75-85
                      bases long) that mediates the translation of a
                      nucleic acid sequence into an amino acid
                      sequence;
Optional qualifiers   /db_xref="<database>:<identifier>"
                      /evidence="<evidence_tag>"
                      /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /pseudo="<evidence_tag>"
                      /anticodon=(pos:<base_range>,aa:<amino_acid>)

Feature Key           tmRNA
Definition            transfer messenger RNA; tmRNA acts like a tRNA first,
                      and then an mRNA that encodes a peptide tag.
                      The ribosome translates this mRNA region of tmRNA and
                      attaches the encoded peptide tag to the C-terminus of the
                      unfinished protein. This attached tag targets the protein for
                      destruction or proteolysis
Optional qualifiers   /gene_name="text <evidence_tag>"
                      /locus_tag="text <evidence_tag>"
                      /allele="text"
                      /citation=[number]
                      /db_xref="<database>:<identifier>"
                      /experiment="text"

		      


8.3 Appendix III: Summary of qualifiers for feature keys

8.3.1 Qualifier List

The following is a list of available qualifiers for feature keys and their usage. It also describes the procedures by which data of each type is identified and imported. A full list of sources from which data is imported is given in Appendix 8.5.

As noted in section 4.2, the number of qualifiers has been reduced in Genome Reviews compared with the EMBL Nucleotide Sequence Database. Some new qualifiers have been added and the data content of some qualifiers has been altered. The most notable change to feature qualifiers has been the introduction of evidence tags, described in section 2.4 of this document.

In this Appendix, "EMBL" refers to the EMBL Nucleotide Sequence Database; "UniProt" to the UniProt Knowledgebase; "UniParc" to the UniProt Archive; and "GOA" to the Gene Ontology Annotation Database (see Appendix 8.5).

Qualifier       name of qualifier; qualifier requires a value if
                followed by an equal sign
Definition      definition of the qualifier
Value format    format of value, if required
Example         example of qualifier with value
Comment         comments, questions and clarifications
Data source     explanation of how data of this type is sourced for Genome Reviews


Qualifier       /anticodon
Definition      location of the anticodon of tRNA and the amino acid for which
                it codes
Value format    (pos:<base_range>,aa:<amino_acid>) where base_range is the
                position of the anticodon and amino_acid is the abbreviation
                for the amino acid encoded
Example         /anticodon=(pos:34..36,aa:Phe)
Data source     may be taken from the parental EMBL entry, or derived through
                sequence analysis;

Qualifier       /biological_process=
Definition      biological process in which the product of this CDS
                takes part;
Value format    "<GO_term> <evidence_tag>"
Example         /biological_process="protein folding {GO:0006457}"
Comment         biological_process is defined using a term from
                the Gene Ontology, a controlled vocabulary for
                describing gene products; GO terms are divided
                between 3 primary hierarchies (function, biological_process and
                cellular component);
Data source     GO terms are imported into Genome Reviews files from
                GOA, a database of associations between gene products
                and GO terms; the gene products in GOA are identified
                by UniProtKB IDs and can be mapped to CDS features via
                the cross-references between EMBL and UniProtKB;

Qualifier       /biovar=
Definition      a sub-species level taxonomic characterisation based
                on physiological characters; "biotype" is a synonym of
                "biovar" but biovar is the correct term;
Value format    "text"
Example         /biovar="Orientalis"
Comment         used only with the source feature key;
Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project and is
                combined with the original taxonomic annotation in the
                parent EMBL record;

Qualifier       /cellular_component=
Definition      cellular component to which the product of this CDS
                has been localised;
Value format    "<GO_term> <evidence_tag>"
Example         /cellular_component="cytoplasm {GO:0005737}"
Comment         the cellular component is defined using a term from
                the Gene Ontology, a controlled vocabulary for
                describing gene products; GO terms are divided
                between 3 primary hierarchies (function, biological_process and
                cellular component);
Data source     GO terms are imported into Genome Reviews files from
                GOA, a database of associations between gene products
                and GO terms; the gene products in GOA are identified
                by UniProtKB IDs and can be mapped to CDS features via
                the cross-references between EMBL and UniProtKB;

Qualifier       /chromosome=
Definition      chromosome (e.g. chromosome number) from which
                the sequence was obtained;
Value format    "text <evidence_tag>"
Example         /chromosome="1"
Comment         used only with the source feature key;

Qualifier       /citation=
Definition      reference to a citation listed in the entry reference field
Value format    [integer-number] where integer-number is the number of the
                reference as enumerated in the reference field
Example         /citation=[1]
Comment         used to indicate the citation providing the claim of and/or
                evidence for a feature; brackets are used for conformity.

Qualifier       /cultivar=
Definition      a cultivated selection from a plant population that
                can be propagated reliably in a prescribed manner;
Value format    "text <evidence_tag>"
Example         /cultivar="Columbia"
Comment         used only with the source feature key;

Qualifier       /db_xref=
Definition      database cross-reference: pointer to related
                information in another database;
Value format    "<database>:<identifier> <evidence_tag>" where
                database is the name of the database containing
                related information, and identifier is the internal
                identifier of the related information according to the
                naming conventions of the cross-referenced database.
Example         /db_xref="InterPro:IPR001957 {UniProtKB/Swiss-Prot:Q8PEH5}"
Comment         the complete list of cross-references currently used
                in Genome Reviews is given in Appendix 8.5 of this
                document;
Data source     the original cross-references in an EMBL entry are
                supplemented by additional cross references obtained
                from corresponding records in the UniProtKB, UniParc and
                GOA databases;
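
The /db_xref value format above can be split into its parts with a short regular expression. The Python sketch below is illustrative only: the function name parse_db_xref is hypothetical, the evidence tag is treated as optional, and the identifier is assumed to contain no whitespace.

```python
import re

# "<database>:<identifier> <evidence_tag>", with the tag in curly brackets.
DB_XREF_RE = re.compile(
    r"^(?P<database>[^:\s]+):(?P<identifier>\S+)"
    r"(?:\s+\{(?P<evidence>[^}]*)\})?$"
)


def parse_db_xref(value):
    """Return (database, identifier, evidence) for an unquoted /db_xref value.

    evidence is None when the value carries no evidence tag.
    """
    match = DB_XREF_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised /db_xref value: {value!r}")
    return (match.group("database"), match.group("identifier"),
            match.group("evidence"))
```

Applied to the example value shown above, the sketch separates the database name InterPro, the identifier IPR001957 and the evidence tag content UniProtKB/Swiss-Prot:Q8PEH5.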

Qualifier       /EC_number=
Definition      Enzyme Commission number for enzyme product of sequence
Value format    "text <evidence_tag>"
Example         /EC_number="6.3.1.1 {UniProtKB/Swiss-Prot:Q8ZJT3}"
Comment         valid values for EC numbers are defined in the list
                prepared by the IUPAC-IUB Commission on Biochemical
                Enzyme Nomenclature
                (published in Enzyme Nomenclature 1984  New York:
                Academic Press (1984) or a more recent revision
                thereof).
Data source     EC numbers in Genome Reviews files are derived from UniProtKB,
                where the original annotation in EMBL records may be
                supplemented, corrected, or deleted by curators; EC_numbers
                describing portions of a protein sequence may be
                transferred to corresponding novel features in the Genome
                Reviews entry subject to sequence agreement;

Qualifier       /evidence=
Definition      evidence supporting the inclusion of a feature (as opposed
                to a feature qualifier) in a Genome Reviews entry;
Value format    "<evidence_tag>"
Example         /evidence="{UniProtKB/Swiss-Prot:Q8ZJT3}"
Comment         the /evidence qualifier is used in Genome Reviews records
                to hold information about the source of the information used
                to attach a novel feature to an entry; the evidence takes
                the form of an evidence tag (note that when tags are
                attached to other qualifiers, they indicate the source of
                the information used to attach that qualifier to a feature);
                where no evidence qualifier is used, it can be assumed that
                the feature was included in the primary source entry in the
                EMBL Nucleotide Sequence Database from which this Genome
                Reviews entry is derived;
Data source     Whatever data source is described in the tag;
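
The evidence tag convention described above can be handled mechanically when
reading qualifier values. The following is a minimal Python sketch, not part of
the Genome Reviews distribution: the function name split_evidence is
hypothetical, and it assumes that a value carries at most one evidence tag,
written in curly braces at the end of the value, as in the examples in this
appendix.

```python
import re

# Assumption: a value carries at most one evidence tag, written in curly
# braces at the very end of the value, as in the examples in this appendix.
_EVIDENCE_TAG_RE = re.compile(r"\s*\{([^{}]*)\}\s*$")


def split_evidence(value):
    """Return (text, evidence_tag); evidence_tag is None if no tag is present."""
    match = _EVIDENCE_TAG_RE.search(value)
    if match is None:
        return value.strip(), None
    # Everything before the tag is the qualifier text (may be empty,
    # as for the /evidence qualifier itself).
    return value[:match.start()].strip(), match.group(1)


if __name__ == "__main__":
    print(split_evidence("6.3.1.1 {UniProtKB/Swiss-Prot:Q8ZJT3}"))
    print(split_evidence("{UniProtKB/Swiss-Prot:Q8ZJT3}"))
    print(split_evidence("IGI00723232"))
```

For an /evidence qualifier the text part is empty and only the tag is returned;
for a qualifier without a tag (such as /gene_id) the tag is None.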

Qualifier       /focus
Definition      identifies the primary source of a Genome Reviews entry,
                where there is more than one source feature;
Value format    none
Example         /focus
Comment         secondary source features may exist, for example in the
                case where an insertion sequence is present in a chromosome;
Data source     the original EMBL entry on which the Genome Reviews entry is
                based;

Qualifier       /function=
Definition      function attributed to a sequence;
Value format    "text <evidence_tag>"
Example         /function="3'-5'-exonuclease activity {GO:0008408}"
Comment         the data stored under the /function qualifier is defined
                using a term from the Gene Ontology, a controlled vocabulary
                for describing gene products; GO terms are divided between 3
                primary hierarchies (function, biological_process and cellular
                component);
Data source     GO terms are imported into Genome Reviews files from
                GOA, a database of associations between gene products
                and GO terms; the gene products in GOA are identified
                by UniProtKB IDs and can be mapped to CDS features via
                the cross-references between EMBL and UniProtKB;

Qualifier       /gene_id=
Definition      a stable Integr8/Genome Reviews gene identifier that
                uniquely identifies a gene
Value format    "text"
Example         /gene_id="IGI00723232"
Comment         gene_id qualifiers are currently only assigned to protein
                coding genes and are added to CDS features; the format of the
                gene identifier is 'IGI' followed by 8 digits. Please
                note that gene IDs are unique for each gene, not necessarily
                for each coding region; e.g. in case of alternative splicing,
                splice variants of the same gene carry the same gene identifier.
Data source     gene_ids are generated during the Integr8/Genome Reviews
                production pipeline.
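
The identifier format ('IGI' followed by 8 digits) and the sharing of one gene
identifier between splice variants can be illustrated with a short Python
sketch. The function names and the CDS labels used below are hypothetical and
are given for illustration only.

```python
import re
from collections import defaultdict

# 'IGI' followed by exactly 8 digits, as described in the Comment above.
_GENE_ID_RE = re.compile(r"IGI[0-9]{8}")


def is_valid_gene_id(gene_id):
    """Return True if gene_id has the form 'IGI' plus 8 digits."""
    return _GENE_ID_RE.fullmatch(gene_id) is not None


def group_by_gene(cds_records):
    """Group (cds_label, gene_id) pairs by gene identifier.

    CDS features that share a gene_id (e.g. splice variants) end up
    in the same list.
    """
    genes = defaultdict(list)
    for label, gene_id in cds_records:
        if not is_valid_gene_id(gene_id):
            raise ValueError(f"malformed gene_id: {gene_id!r}")
        genes[gene_id].append(label)
    return dict(genes)


if __name__ == "__main__":
    # Hypothetical CDS labels; only the identifier format comes from the manual.
    records = [("variant_1", "IGI00723232"), ("variant_2", "IGI00723232")]
    print(group_by_gene(records))
```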

Qualifier       /gene_name=
Definition      symbol of the gene corresponding to a sequence region
Value format    "text"

Example         /gene_name="ilvE {UniProtKB/Swiss-Prot:Q8ZJT3}"

Comment         a gene can be considered as a collection of functionally
                related features, some of which may be CDSs (coding sequences),
                and others of which may be promoters, UTRs, mRNAs, etc; in
                EMBL records, the gene feature is typically used to
                mark the span enclosing all such features; this can
                cause problems where genes overlap; in addition, for
                most EMBL records representing complete genome
                sequences, the only feature belonging to each gene
                that has been annotated is a single CDS feature, and
                the gene feature (as used) is redundant with this; therefore,
                in Genome Reviews, the gene feature has been dropped;
                if several features belong to the same gene, this is
                indicated by the qualification of those features with
                identical /gene_name and /locus_tag qualifiers; the
                /gene_name qualifier is used to indicate the primary,
                biologically relevant name for a gene; where other
                names are available, these are indicated using the
                /gene_synonym qualifier; ordered systematic names
                (which do not imply biological function) are stored
                using the /locus_tag qualifier;
Data source     data in gene qualifiers (applied to CDS features) in Genome
                Reviews files is derived from UniProtKB, where the original
                EMBL-derived names may be supplemented, corrected, or
                deleted by curators; data in gene qualifiers (applied to
                tRNA features) may be taken from the parental EMBL entry,
                or derived through sequence analysis;

Qualifier       /gene_synonym=
Definition      secondary symbol of the gene corresponding to a sequence region
Value format    "text"
Example         /gene_synonym="BACA {UniProtKB/Swiss-Prot:Q8PDZ9}"
Comment         where more than one gene name is available, secondary names
                are stored under the /gene_synonym qualifier; /gene_synonym
                qualifiers are only attached to the primary feature derived
                from each gene, and not to secondary features (e.g. this
                qualifier is attached to features such as CDS and rRNA, but
                not to features such as mat_peptide and peptide, which
                represent processed versions of primary translations);
Data source     data in gene_synonym qualifiers in Genome Reviews files is
                derived from UniProtKB, where the original EMBL-derived
                names may be supplemented, corrected, or deleted by
                curators;

Qualifier       /host=
Definition      natural host from which the sequence was obtained;
Value format    "text"
Comment         added to phage records if absent in the original EMBL
                parent entry and a single host is known from the scientific
                literature. See also /host_range.
Example         /host="Acyrthosiphon pisum"

Data source     the original parent entry of this Genome Reviews entry
                in the EMBL Nucleotide Sequence Database or scientific
                literature;

Qualifier       /host_range=
Definition      (spectrum of) known natural hosts that a species/strain
                can infect;
Value format    "text"
Example         /host_range="Prochlorococcus; Synechococcus"

Comment         added to phage records if absent in the original EMBL
                parent entry and multiple hosts are known from the scientific
                literature. See also /host.
Data source     scientific literature;

Qualifier       /insertion=
Definition      a special type of translational exception: comprises one or
                many amino acids (indicated in single letter code) present
                in a translation where the corresponding codon is not
                present in the underlying nucleotide sequence
Value format    "(pos:location,seq:<amino_acids>)" where amino_acids is the
                extra residues to be inserted, represented in single letter code
Example         /insertion="531^532,seq:AV"

Comment         insertion qualifiers may be used (in conjunction with join
                statements) where only one nucleotide from a CDS is missing;
                this can be represented as an insertion of one amino acid
                (corresponding to 3 nucleotides), and a gap of 2 nucleotides
                in the coding sequence; the numerical range (indicated by two
                sequential integers separated by a caret) indicates the
                nucleotides either side of the "missing" codon; amino acids
                to be inserted are given according to their order in the
                protein;
Data source     insertion qualifiers are derived from a mapping process applied
                when comparing reference protein sequences to the genomic DNA
                sequence;
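
A value of this kind can be split into its flanking positions and inserted
residues as sketched below in Python. This is an illustration only: the
function name parse_insertion is hypothetical, and because the Value format
and the Example above differ in whether the value is wrapped in "(pos:...)",
the sketch accepts both spellings.

```python
import re

# Accepts both '531^532,seq:AV' and '(pos:531^532,seq:AV)'.
_INSERTION_RE = re.compile(
    r"\(?(?:pos:)?([0-9]+)\^([0-9]+),seq:([A-Za-z]+)\)?"
)


def parse_insertion(value):
    """Return (left, right, residues) for an /insertion qualifier value.

    left and right are the nucleotide positions either side of the
    "missing" codon; residues is the inserted amino acid string in
    single letter code, in protein order.
    """
    match = _INSERTION_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised /insertion value: {value!r}")
    left, right = int(match.group(1)), int(match.group(2))
    # The manual describes the range as two sequential integers.
    if right != left + 1:
        raise ValueError(f"positions are not sequential: {value!r}")
    return left, right, match.group(3).upper()


if __name__ == "__main__":
    print(parse_insertion("531^532,seq:AV"))
```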

Qualifier       /isolate=
Definition      individual isolate from which the sequence was obtained;
Value format    "text"

Example         /isolate="Porton"
Comment         used only with the source feature key;
Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project and is combined with
                the original taxonomic annotation in the parent EMBL record;

Qualifier       /locus_tag=
Definition      a systematic name for a given gene, indicating its
                relative position in the sequence with respect to
                other genes; not indicative of biological function.
Value format    "text <evidence_tag>"

Example         /locus_tag="RSc0382 {UniProtKB/Swiss-Prot:Q8ZJT3}"

Comment         /locus_tag can be used with any feature where /gene_name is
                valid; /locus_tag values may be used more than once within
                an entry, but always to indicate the same gene; in all other
                circumstances the /locus_tag value must be unique
                within that entry/record; together with the contents
                of the /gene_name qualifier, the /locus_tag qualifier is
                (where known) applied to every feature derived from
                the corresponding gene (see also the discussion under
                /gene_name, above);
Data source     data in locus_tag qualifiers in Genome Reviews files is
                derived from UniProtKB, where the original EMBL-derived
                locus tags may be supplemented, corrected, or deleted
                by curators;

Qualifier       /ncRNA_class=
Definition      a structured description of the classification of the
                non-coding RNA described by the ncRNA parent key
Value format    "TYPE"
Example         /ncRNA_class="snoRNA"
Comment         where TYPE is one of the following terms: antisense_RNA,
                autocatalytically_spliced_intron, hammerhead_ribozyme,
                RNase_P_RNA, RNase_MRP_RNA, telomerase_RNA, guide_RNA,
                rasiRNA, scRNA, siRNA, miRNA, snoRNA, snRNA, SRP_RNA, stRNA,
                tRNA, vault_RNA, Y_RNA.
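
Because /ncRNA_class takes its value from a closed list, a value can be checked
by simple set membership. The Python sketch below copies the terms listed
above; the function name is hypothetical, and the comparison is assumed to be
case-sensitive, which the manual does not state explicitly.

```python
# Controlled vocabulary copied from the Comment above.
NCRNA_CLASSES = frozenset({
    "antisense_RNA", "autocatalytically_spliced_intron",
    "hammerhead_ribozyme", "RNase_P_RNA", "RNase_MRP_RNA",
    "telomerase_RNA", "guide_RNA", "rasiRNA", "scRNA", "siRNA",
    "miRNA", "snoRNA", "snRNA", "SRP_RNA", "stRNA", "tRNA",
    "vault_RNA", "Y_RNA",
})


def is_valid_ncrna_class(value):
    """Return True if value is one of the listed ncRNA class terms.

    Assumption: matching is case-sensitive.
    """
    return value in NCRNA_CLASSES


if __name__ == "__main__":
    print(is_valid_ncrna_class("snoRNA"))
    print(is_valid_ncrna_class("mRNA"))
```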

Qualifier       /note=
Definition      any comment or additional information
Value format    "text <evidence_tag>"

Example         /note="protein modification {FunCat:14.07}"

Qualifier       /operon_name=
Definition      name of the group of contiguous genes transcribed into a
                single transcript to which that feature belongs.
Value format    "text <evidence_tag>"

Example         /operon_name="thrLABC {RegulonDB:ECK120014725}"
Comment         valid only on Prokaryota-specific features. To accommodate
                regulonDB data, we use the extended regulonDB definition of
                operon, i.e. we allow single-gene operons.
Data source     data in operon qualifiers is derived from regulonDB;

Qualifier       /orf_name=
Definition      A name temporarily attributed by a sequencing project to an
                open reading frame. This name is generally based on a cosmid
                numbering system.
Value format    "text <evidence_tag>"

Example         /orf_name="MTV025.058 {UniProtKB/Swiss-Prot:P96420}"

Data source     data in orf_name qualifiers in Genome Reviews files is
                derived from UniProtKB;

Qualifier       /organism=
Definition      scientific name of the organism that provided the
                sequenced genetic material;
Value format    "text"
Example         /organism="Chlamydophila caviae"

Comment         used only with the source feature key; in Genome Reviews, the
                content of the organism qualifier contains only the genus and
                species of the relevant organism; the complete taxonomic
                specification of the source organism is provided by
                combining the data stored under  the following qualifiers
                applied to the source feature: /biovar, /organism,
                /pathovar, /serovar, /strain, /sub_species, /sub_strain; a fully
                descriptive name based on all these qualifiers is given in
                the OS line of the entry;
Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project;

Qualifier       /pathovar=
Definition      a strain or set of strains with similar pathogenicity,
                including both host-range and symptomatology;
Value format    "text"
Example         /pathovar="campestris"

Comment         used only with the source feature key;
Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project and is
                combined with the original taxonomic annotation in the
                parent EMBL record;

Qualifier       /plasmid=
Definition      name of plasmid from which sequence was obtained
Value format    "text"

Example         /plasmid="C-589"

Data source     EMBL or UniProtKB

Qualifier       /product=
Definition      primary name of a product (typically a protein name)
                encoded by a sequence
Value format    "text <evidence_tag>"

Example         /product="DNA polymerase III beta chain {UniProtKB/TrEMBL:Q8PEH4}"

Data source     product names are imported from description lines in UniProtKB
                records; these may apply to complete CDSs or to derived partial
                sequences (e.g. mature peptides); where more than one name is
                available, secondary names are stored under the
                /product_synonym qualifier;

Qualifier       /product_synonym=
Definition      secondary name of a product (typically a protein name)
                encoded by a sequence
Value format    "text <evidence_tag>"
Example         /product_synonym="DNA polymerase III beta chain {UniProtKB/TrEMBL:Q8PEH4}"

Data source     product names are imported from description lines in UniProtKB
                records; these may apply to complete CDSs or to derived partial
                sequences (e.g. mature peptides);
                /product_synonym is used to store secondary names where more than
                one name is available;

Qualifier       /promoter=
Definition      name of region on a DNA molecule involved in RNA polymerase
                binding to initiate transcription;
Value format    "<text> <evidence_tag>"

Example         /promoter="thrLp {RegulonDB:ECK120014725}"

Comment         in Genome Reviews, this qualifier is used to uniquely
                define transcription units that are part of an operon;
Data source     data in promoter qualifiers is derived from regulonDB;

Qualifier       /protein_id=
Definition      protein identifier, issued by the International
                Nucleotide Sequence Database collaborators EMBL, GenBank
                and DDBJ; this qualifier consists of a stable ID
                portion (3+5 format, with 3 letters followed by 5
                digits) plus a version number after the decimal point;
Value format    "<identifier> <evidence_tag>"
Example         /protein_id="AAA12345.1"

Comment         when the protein sequence encoded by the CDS changes,
                only the version number of the /protein_id value is
                incremented; the stable part of the /protein_id
                remains unchanged and as a result will permanently be
                associated with a given protein; this qualifier is
                valid only on CDS features which translate into a
                valid protein; use of /protein_id in Genome Reviews is
                unchanged from usage in EMBL;
Data source     The original parent entry of this Genome Reviews entry
                in the EMBL Nucleotide Sequence Database;
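
The split between the stable portion and the version number can be sketched in
Python as follows. The function names are hypothetical; the sketch assumes
upper-case letters in the stable portion, as in the example above, and expects
the bare identifier with any evidence tag already removed.

```python
import re

# Stable portion: 3 letters + 5 digits; then '.' and the version number.
_PROTEIN_ID_RE = re.compile(r"([A-Z]{3}[0-9]{5})\.([0-9]+)")


def parse_protein_id(value):
    """Return (stable_id, version) for a /protein_id value."""
    match = _PROTEIN_ID_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a 3+5 protein_id with version: {value!r}")
    return match.group(1), int(match.group(2))


def same_protein_record(first, second):
    """True if two protein_ids share the same stable portion.

    Only the version differs when the encoded protein sequence changes.
    """
    return parse_protein_id(first)[0] == parse_protein_id(second)[0]


if __name__ == "__main__":
    print(parse_protein_id("AAA12345.1"))
    print(same_protein_record("AAA12345.1", "AAA12345.2"))
```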

Qualifier       /proviral
Definition      if the sequence shown is viral and integrated into another
                organism's genome, this qualifier is used to denote that
Value format    none
Example         /proviral

Qualifier       /pseudo
Definition      indicates that this feature is a non-functional
                version of the element named by the feature key
Example         /pseudo
Comment         in Genome Reviews, the pseudo qualifier is used to
                indicate that a CDS is non-coding (a pseudogene).
Data source     EMBL or UniProtKB

Qualifier       /replace=
Definition      indicates that the sequence identified in a feature's intervals
                is replaced by the sequence shown in "text"; if no
                sequence is contained within the qualifier, this indicates a
                deletion.
Value format    "text"

Example         /replace="a"
                /replace=""

Qualifier       /segment=
Definition      name of viral or phage segment sequenced
Value format    "text"
Example         /segment="M"

Qualifier       /serogroup=
Definition      serological variety of a species
                characterised by its antigenic properties; a variety of
                different serovars may belong to a single serogroup;
Value format    "text"
Example         /serogroup="B"

Comment         used only with the source feature key;
Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project and is
                combined with the original taxonomic annotation in the
                parent EMBL record;

Qualifier       /serovar=
Definition      serological variety of a species
                characterised by its antigenic properties; "serotype" is a
                synonym of "serovar" but serovar is the correct term;
Value format    "text"

Example         /serovar="3"
Comment         used only with the source feature key;
Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project and is
                combined with the original taxonomic annotation in the
                parent EMBL record.

Qualifier       /strain=
Definition      strain from which sequence was obtained;
Value format    "text"
Example         /strain="NCTC 11168"
Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project;

Qualifier       /sub_species=
Definition      name of sub-species of organism from which sequence was
                obtained;
Value format    "text"

Example         /sub_species="Acyrthosiphon pisum"

Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project and is
                combined with the original taxonomic annotation in the
                parent EMBL record;

Qualifier       /sub_strain=
Definition      sub_strain from which sequence was obtained;
Value format    "text"
Example         /sub_strain="abis"

Data source     a full taxonomic definition for each species in Genome
                Reviews is imported from the HAMAP project and is
                combined with the original taxonomic annotation in the
                parent EMBL record;

Qualifier       /translation=
Definition      one-letter abbreviated amino acid sequence derived from
                either the universal genetic code or the table as
                specified in /transl_table and as determined by
                exceptions in the /transl_except and /codon qualifiers;
Value format    IUPAC one-letter amino acid abbreviation, "X" is to be
                used for AA exceptions;
Example         /translation="MASTFPPWYRGCASTPSLKGLIMCTW"
Comment         to be used with CDS feature only; this is a mandatory
                qualifier to the CDS feature key except for /pseudo
                CDSs; see /transl_table for definition and location of
                genetic code Tables; /translation is only included
                for CDSs with valid translations (i.e. not pseudogenes;
                usage is exclusive with /pseudo);
Data source     at present, the translation is always imported from
                the parent EMBL record of each Genome Reviews entry;

Qualifier       /transl_except=
Definition      translational exception: single codon the translation of which
                does not conform to genetic code defined by Organism and /codon=
Value format    "(pos:location,aa:<amino_acid>)" where amino_acid is the
                amino acid coded by the codon at the base_range position
Example         /transl_except="(pos:213..215,aa:Trp)"
                /transl_except="(pos:1017,aa:TERM)"
                /transl_except="(pos:2000..2001,aa:TERM)"
                /transl_except="(pos:X22222:15..17,aa:Ala)"
Comment         if the amino acid is not on the restricted vocabulary list use
                e.g., '/transl_except="(pos:213..215,aa:OTHER)"' with
                '/note="name of unusual amino acid"';
                for modified amino-acid selenocysteine use three letter code
                'Sec'  (one letter code 'U' in amino-acid sequence)
                /transl_except="(pos:1002..1004,aa:Sec)";
                for partial termination codons where the TAA stop codon is
                completed by the addition of 3' A residues to the mRNA,
                either a single base_position or a base_range is used, e.g.
                if the partial stop codon is a single base:
                /transl_except="(pos:1017,aa:TERM)"
                if the partial stop codon consists of two bases:
                /transl_except="(pos:2000..2001,aa:TERM)".
Data source     translation exceptions are either imported from the parent EMBL
                record of each Genome Reviews entry, or derived from a mapping
                process applied when comparing reference protein sequences to
                the genomic DNA sequence.
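
The four forms shown in the Example above (a base range, a single base
position, a two-base range and a range on another entry) can be read with one
regular expression. The Python sketch below is illustrative only: the names
TranslExcept, parse_transl_except and is_partial_stop are hypothetical, and
the pattern used for the accession number prefix is an assumption based on the
single example given.

```python
import re
from typing import NamedTuple, Optional


class TranslExcept(NamedTuple):
    accession: Optional[str]  # set only for locations on another entry
    start: int
    end: int                  # equal to start for a single base position
    amino_acid: str           # e.g. 'Trp', 'Sec', 'TERM', 'OTHER'


# Assumption: an optional accession prefix is letters, digits, '_' or '.'.
_TRANSL_EXCEPT_RE = re.compile(
    r"\(pos:(?:([A-Za-z0-9_.]+):)?([0-9]+)(?:\.\.([0-9]+))?,aa:([A-Za-z]+)\)"
)


def parse_transl_except(value):
    """Parse a /transl_except value such as '(pos:213..215,aa:Trp)'."""
    match = _TRANSL_EXCEPT_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised /transl_except value: {value!r}")
    accession, start, end, amino_acid = match.groups()
    start = int(start)
    end = int(end) if end is not None else start
    return TranslExcept(accession, start, end, amino_acid)


def is_partial_stop(exception):
    """True for a TERM exception covering fewer than three bases."""
    return exception.amino_acid == "TERM" and exception.end - exception.start + 1 < 3


if __name__ == "__main__":
    for text in ("(pos:213..215,aa:Trp)", "(pos:1017,aa:TERM)",
                 "(pos:2000..2001,aa:TERM)", "(pos:X22222:15..17,aa:Ala)"):
        parsed = parse_transl_except(text)
        print(parsed, is_partial_stop(parsed))
```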

Qualifier       /transl_table=
Definition      definition of genetic code table used if other than
                universal genetic code table. Tables used are
                described in appendix V of the EMBL feature table
                document, section 7.5.5 (URL:
                http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html);
Value format    integer
Example         /transl_table=4
Comment         genetic code exceptions outside range of specified
                tables are reported in /codon or /transl_except qualifiers;
                1=universal table 1; 2=non-universal
                table 2; etc.; /transl_table is only included for CDSs
                with valid translations (i.e. not pseudogenes; usage is
                exclusive with /pseudo);
Data source     the parent EMBL record of each Genome Reviews entry;

Qualifier       /variety=
Definition      variety (= varietas, a formal Linnaean rank) of organism
                from which sequence was derived.
Value format    "text"
Example         /variety="neoformans"


            



8.3.2 Feature qualifiers - mapped to Feature keys

The following is a list of available qualifiers mapped to the list of feature keys on which each qualifier is legal.
/anticodon                      tRNA
/biological_process             CDS
/biovar                         source
/cellular_component             CDS
/chromosome                     source
/citation                       conflict
/cultivar                       source
/db_xref                        CDS
/db_xref                        mat_peptide
/db_xref                        pro_peptide
/db_xref                        peptide
/db_xref                        sig_peptide
/db_xref                        source
/db_xref                        transit_peptide
/EC_number                      CDS
/EC_number                      mat_peptide
/EC_number                      peptide
/evidence                       mat_peptide
/evidence                       pro_peptide
/evidence                       peptide
/evidence                       sig_peptide
/evidence                       transit_peptide
/focus                          source
/function                       CDS
/gene_id                        CDS
/gene_name                      CDS
/gene_name                      mat_peptide
/gene_name                      pro_peptide
/gene_name                      peptide
/gene_name                      sig_peptide
/gene_name                      transit_peptide
/gene_name                      rRNA
/gene_name                      tRNA
/gene_name                      tmRNA
/gene_name                      ncRNA
/gene_synonym                   CDS
/host                           source
/host_range                     source
/locus_tag                      CDS
/locus_tag                      mat_peptide
/locus_tag                      pro_peptide
/locus_tag                      peptide
/locus_tag                      rRNA
/locus_tag                      sig_peptide
/locus_tag                      transit_peptide
/locus_tag                      tRNA
/mol_type                       source
/ncRNA_class                    ncRNA
/note                           CDS
/note                           gap
/orf_name                       CDS
/organism                       source
/operon_name                    CDS
/operon_name                    prim_transcript
/operon_name                    operon
/pathovar                       source
/plasmid                        source
/product                        CDS
/product                        mat_peptide
/product                        pro_peptide
/product                        peptide
/product                        sig_peptide
/product                        transit_peptide
/product_synonym                CDS
/product_synonym                mat_peptide
/product_synonym                pro_peptide
/product_synonym                peptide
/product_synonym                sig_peptide
/product_synonym                transit_peptide
/promoter                       prim_transcript
/protein_id                     CDS
/proviral                       source
/pseudo                         CDS
/pseudo                         tRNA
/replace                        conflict
/segment                        source
/serogroup                      source
/serovar                        source
/strain                         source
/sub_species                    source
/sub_strain                     source
/transl_table                   CDS
/translation                    CDS
/variety                        source
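
The two-column listing above can be loaded into a lookup table to check whether
a qualifier is legal on a given feature key. The following Python sketch is one
possible approach, not a tool supplied with Genome Reviews; the function names
are hypothetical, and only a short excerpt of the listing is embedded for
illustration.

```python
from collections import defaultdict


def parse_qualifier_map(text):
    """Build {qualifier: set of feature keys} from '/qualifier  key' lines."""
    legal = defaultdict(set)
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not a two-column entry.
        if len(parts) != 2 or not parts[0].startswith("/"):
            continue
        legal[parts[0].lstrip("/")].add(parts[1])
    return dict(legal)


def is_legal(qualifier, feature_key, legal_map):
    """True if qualifier (without the leading '/') is listed for feature_key."""
    return feature_key in legal_map.get(qualifier, set())


# Excerpt copied from the listing above.
EXCERPT = """\
/anticodon                      tRNA
/EC_number                      CDS
/EC_number                      mat_peptide
/EC_number                      peptide
/pseudo                         CDS
/pseudo                         tRNA
"""

if __name__ == "__main__":
    legal_map = parse_qualifier_map(EXCERPT)
    print(is_legal("pseudo", "tRNA", legal_map))
    print(is_legal("anticodon", "CDS", legal_map))
```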


8.4 APPENDIX IV. Full list of all evidence tags currently in use in Genome Reviews.



Tag             BLASTALL 2.2.6/ALIGN 2.0u
Comment         In CDS features added after comparing sequences from
                the UniProt Knowledgebase to Genome Reviews DNA sequence,
                applied to the cross-reference to the UniProtKB entry.

Tag             tRNAScan-SE-1.23
Comment         Applied to qualifiers of tRNA features added after running
                this program to predict tRNA-encoding genes for records
                where these are not available from the primary sequence
                source.

Tag             Rfam-8.1
Comment         Applied to qualifiers of all non-protein-coding RNA features,
                other than tRNA and rRNA genes, and to all RNA motif features
                added after running the rfam_scan.pl program to predict
                non-protein-coding RNA genes and RNA motifs for records
                where these are not available from the primary sequence
                source.

Tag             RNAmmer-1.2
Comment         Applied to qualifiers of rRNA genes added after running
                RNAmmer version 1.2 to predict ribosomal RNA genes in
                records where these are not available from the primary
                sequence source.

Tag             EMBL:accession_number
Comment         Applied to data automatically imported into Genome Reviews
                from a source EMBL entry.

Tag             GOA:accession_number
Comment         GOA is a database of associations between terms in the
                Gene Ontology controlled vocabularies and records in the
                UniProt Knowledgebase.  Annotations are retrieved through
                mapping via cross references to UniProtKB present in
                CDS features in the parent EMBL entry.

Tag             GO:id
Comment         Applied to data that follows from the mapping made between
                a feature and a particular GO term via GOA.

Tag             MUMDB:id
Comment         Applied to data that is imported for a feature using its
                identifier in the MUMDB database.

Tag             RefSeq:id
Comment         Applied to data that is imported for a feature using its
                identifier in the RefSeq database.

Tag             SGD:id
Comment         Applied to data that is imported for a feature using its
                identifier in the SGD database.

Tag             SGD genome:id
Comment         Applied to data automatically imported into Genome Reviews
                from a source SGD entry.

Tag             TAIR:id
Comment         Applied to data that is imported for a feature using its
                identifier in the TAIR database.

Tag             TAIR release: release_number
Comment         Applied to data automatically imported into Genome Reviews
                from a given release of TAIR.

Tag             UniProtKB/Swiss-Prot:accession_number
Comment         Applied to data that is imported for a feature using its
                identifier in the UniProtKB/Swiss-Prot database.

Tag             UniProtKB/TrEMBL:accession_number
Comment         Applied to data that is imported for a feature using its
                identifier in the UniProtKB/TrEMBL database.

Tag             UniParc:protein_id
Comment         UniParc is a database of associations between protein
                sequences (identified by the use of a UniParc
                identifier) and records in external databases
                (including CDSs in EMBL records, identified by the use of
                the protein identifier, which can be used to retrieve
                UniParc IDs for each feature).


The presence of an exclamation mark (!) before the database identifier indicates that a deduction has been made from the absence of this identifier from the database in question.

8.5 APPENDIX V: List of cross-references currently included in Genome Reviews



