Feature-level products

Feature-level products contain nucleotide sequence and related annotation derived from submitted ENA assembled and annotated sequences. Data are distributed in flatfile format, similar to that of parent ENA records, with each flatfile representing a single feature. ENA assembled and annotated sequence flatfile format is discussed fully here and differences between feature-level flatfile formats are detailed for each feature-level product (see below). Feature-level products are distributed by FTP and individual records (other than for the spacer product) are available through ENA search and retrieval services. The following feature-level products are available:

  1. The coding product, comprising comprehensive coding sequences,
  2. the non-coding product, comprising all rRNA, tRNA, tmRNA, ncRNA, misc_RNA and precursor_RNA features,
  3. the rRNA product, comprising comprehensive ribosomal RNA loci (i.e. a subset of the non-coding product), and
  4. the spacer product, comprising ETS, IGS and ITS features described using misc_RNA feature annotation.

Data access

Further information and access:

FTP organisation

In the above table, the FTP path is provided as an entry point. Organisation is as follows:

  • a release directory provides all data as available at the time of the most recent ENA quarterly release, while an update directory provides data for records added or modified since the most recent ENA release;
  • within the release and update directories, directories contain records according to dataclass (std, con, tsa and wgs);
  • data files are gzipped concatenations of flatfile records, partitioned according to dataclass and taxonomic division and named according to release or update content, dataclass, taxonomic division, number in series and release number;
  • each dataclass-level directory also contains a fasta directory, itself providing FASTA format representations of the relevant records, with a similar naming scheme.
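Because each data file is a gzipped concatenation of flatfile records, it can be read record by record without unpacking it to disk. The following is a minimal sketch, assuming that each record ends with the standard flatfile terminator line "//"; the function name iter_flatfile_records is hypothetical and the path argument stands for any locally downloaded data file.

```python
import gzip
from typing import Iterator


def iter_flatfile_records(path: str) -> Iterator[str]:
    """Yield one flatfile record at a time from a gzipped data file.

    Assumption: every record is terminated by a line containing only "//".
    """
    lines = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            lines.append(line)
            if line.rstrip() == "//":
                yield "".join(lines)
                lines = []
    # Yield any trailing text that lacks a terminator rather than dropping it.
    if any(line.strip() for line in lines):
        yield "".join(lines)
```

Streaming in this way keeps memory use proportional to a single record, which matters for the larger release files.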

Differences to parent ENA records

Please refer to the ENA assembled and annotated sequence flatfile format description, the Feature Table Definitions and note the differences below.

ID Line

The format of this line is identical to that of the parent ENA entry with the sequence version, molecule type, data class and division being copied from those of the parent entry. Note that all features are non-segmented as they are sub-sequences. Note also that the length is the length of the sequence of the feature and not of the entry from which it was derived.

For the coding product, the ID is the protein ID assigned to the CDS feature from which the record is derived.

For non-coding, rRNA and spacer records, where a feature-level ID has not previously existed, the ID, e.g. AB012758.1:1..40:tRNA, has a complex format to ensure that it is unique and unambiguous. The structure of the ID may be represented as: <accession>.<sequence version>:<feature location string>:<feature name>[:ordinal].

So the example AB012758.1:1..40:tRNA is derived from:

  • accession: AB012758
  • sequence version: 1
  • feature location string: 1..40
  • feature name: tRNA.

The ordinal at the end is only used if the entry has two different features of the same type and at the same location.
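The ID structure above can be split back into its components. The following is a minimal sketch, assuming that the ordinal, when present, is a plain integer and that feature names are never purely numeric; the FeatureId type and parse_feature_id function are hypothetical names used for illustration. Parsing works from the right-hand end so that a location string that itself contains a colon is kept intact.

```python
from typing import NamedTuple, Optional


class FeatureId(NamedTuple):
    accession: str
    version: int
    location: str
    feature: str
    ordinal: Optional[int]


def parse_feature_id(feature_id: str) -> FeatureId:
    """Split <accession>.<version>:<location>:<feature>[:ordinal]."""
    versioned, rest = feature_id.split(":", 1)
    accession, version = versioned.rsplit(".", 1)
    fields = rest.split(":")
    ordinal = None
    # Assumption: a trailing all-digit field is the optional ordinal.
    if len(fields) >= 3 and fields[-1].isdigit():
        ordinal = int(fields.pop())
    if len(fields) < 2:
        raise ValueError(f"not a feature-level ID: {feature_id!r}")
    feature = fields.pop()
    # Whatever remains is the location string, colons included.
    location = ":".join(fields)
    return FeatureId(accession, int(version), location, feature, ordinal)
```

For the example in the text, parse_feature_id("AB012758.1:1..40:tRNA") gives accession AB012758, version 1, location 1..40, feature tRNA and no ordinal.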

Description (DE line)

The DE line, e.g. "Homo sapiens (human) partial tRNA-Leu", is built from the taxon of the entry's primary source feature, followed by the common name of this taxon in brackets, followed by "partial" if the location of the feature is partial, and finally followed by a description of the feature. This description is derived as follows:

  1. If the feature has a product qualifier, then this is used
  2. Otherwise, if the feature has a gene qualifier, then this is used
  3. Otherwise, the name of the feature (using "miscellaneous RNA" for misc_RNA features) is followed by the processed contents of the note qualifier (if there is one)

The rules for processing notes depend on the feature type:

  1. tRNA features: if the note starts with "tRNA" or "Pseudo tRNA", then the entire contents of the note are appended
  2. rRNA features: if the note ends with "ribosomal RNA", then the entire contents of the note are appended
  3. misc_RNA: if the note starts with "ITS", "internal transcribed", "contains", "may contain" or "sequence contains", then the entire contents of the note are appended
  4. If none of the above is found but the note contains "as predicted by Rfam" then the part of the note before this string is appended
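The DE line rules above can be sketched in code. This is an illustration only, not the ENA implementation: the function names are hypothetical, string matching is assumed to be case-sensitive, the bracketed common name is assumed to be omitted when the taxon has none, and the text taken from before "as predicted by Rfam" is assumed to be trimmed of surrounding whitespace.

```python
MISC_RNA_PREFIXES = ("ITS", "internal transcribed", "contains",
                     "may contain", "sequence contains")
RFAM_MARKER = "as predicted by Rfam"


def processed_note(feature, note):
    """Return the part of the note to append, or "" if no rule applies."""
    if not note:
        return ""
    if feature == "tRNA" and note.startswith(("tRNA", "Pseudo tRNA")):
        return note
    if feature == "rRNA" and note.endswith("ribosomal RNA"):
        return note
    if feature == "misc_RNA" and note.startswith(MISC_RNA_PREFIXES):
        return note
    if RFAM_MARKER in note:
        # Keep only the text before the marker (trimming is an assumption).
        return note.split(RFAM_MARKER, 1)[0].strip()
    return ""


def feature_description(feature, product=None, gene=None, note=None):
    """Apply the product, then gene, then feature-name-plus-note rules."""
    if product:
        return product
    if gene:
        return gene
    name = "miscellaneous RNA" if feature == "misc_RNA" else feature
    extra = processed_note(feature, note)
    return f"{name} {extra}" if extra else name


def build_de_line(taxon, common_name, feature, partial=False, **qualifiers):
    """Assemble taxon, bracketed common name, "partial" and description."""
    parts = [taxon]
    if common_name:
        parts.append(f"({common_name})")
    if partial:
        parts.append("partial")
    parts.append(feature_description(feature, **qualifiers))
    return " ".join(parts)
```

With these assumptions, build_de_line("Homo sapiens", "human", "tRNA", partial=True, product="tRNA-Leu") reproduces the example "Homo sapiens (human) partial tRNA-Leu".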

Keywords (KW line)

All keywords from the parent entry that exactly match an item in the following list are included:

  • "BARCODE",
  • "CAGE (Cap Analysis Gene Expression)",
  • "CAP trapper",
  • "ENV",
  • "WGS",
  • "EST", "expressed sequence tag", "EST (expressed sequence tag)", "3'-end sequence (3'-EST)", "5'-end sequence (5'-EST)",
  • "FLI_CDNA",
  • "GSS", "genome survey sequence",
  • "HTC",
  • "oligo capping",
  • "RNAcentral",
  • "STS", "sequence tagged site", "STS (sequence tagged site)",
  • "TPA", "Third Party Annotation", "TPA:experimental", "TPA:inferential", "TPA:reassembly", "TPA:specialist_db",
  • "TSA", "Transcriptome Shotgun Assembly",
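The keyword rule amounts to an exact-match filter against a fixed list. A minimal sketch follows; the names RETAINED_KEYWORDS and retained_keywords are hypothetical, and the set simply transcribes the list above.

```python
RETAINED_KEYWORDS = frozenset({
    "BARCODE",
    "CAGE (Cap Analysis Gene Expression)",
    "CAP trapper",
    "ENV",
    "WGS",
    "EST", "expressed sequence tag", "EST (expressed sequence tag)",
    "3'-end sequence (3'-EST)", "5'-end sequence (5'-EST)",
    "FLI_CDNA",
    "GSS", "genome survey sequence",
    "HTC",
    "oligo capping",
    "RNAcentral",
    "STS", "sequence tagged site", "STS (sequence tagged site)",
    "TPA", "Third Party Annotation", "TPA:experimental",
    "TPA:inferential", "TPA:reassembly", "TPA:specialist_db",
    "TSA", "Transcriptome Shotgun Assembly",
})


def retained_keywords(parent_keywords):
    """Keep parent keywords that exactly match the list, in original order."""
    return [kw for kw in parent_keywords if kw in RETAINED_KEYWORDS]
```

Because matching is exact, a parent keyword such as "wgs" in lower case would not be carried over.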

Citations (RN, RP, RA, RT, RX and RL lines)

All citations from the entry within the range of the feature are included. If the parent entry has citations without a range, these are also included.
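The citation rule can be sketched as a simple range test. The text does not say whether "within the range of the feature" means overlap or full containment, so the sketch below assumes overlap; the function name citation_applies and the representation of ranges as 1-based inclusive (start, end) pairs are likewise assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # 1-based, inclusive (start, end)


def citation_applies(citation_ranges: Optional[List[Range]],
                     feature_range: Range) -> bool:
    """Decide whether a parent citation is copied to the feature record."""
    # Citations without a range are always included.
    if not citation_ranges:
        return True
    f_start, f_end = feature_range
    # Assumption: any overlap with the feature counts as "within the range".
    return any(start <= f_end and end >= f_start
               for start, end in citation_ranges)
```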

PA, OS, OC, OG and DT lines

These are derived directly from the entry and should be identical to those occurring in the parent entry.

Sequence (SQ line)

Only the sequence of the feature is included.
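For a simple location, the feature sequence is the corresponding slice of the parent sequence, and its length is the length reported on the ID line. The following is a minimal sketch that handles only single spans of the form start..end, optionally carrying the partial markers "<" and ">"; complement and join locations are deliberately out of scope, and the function name feature_sequence is hypothetical.

```python
import re

# Matches simple spans such as "1..40" or "<1..>40".
_SIMPLE_SPAN = re.compile(r"^<?(\d+)\.\.>?(\d+)$")


def feature_sequence(parent_sequence: str, location: str) -> str:
    """Return the sub-sequence for a simple 1-based, inclusive location."""
    match = _SIMPLE_SPAN.match(location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if not 1 <= start <= end <= len(parent_sequence):
        raise ValueError(f"location out of range: {location!r}")
    # Convert the 1-based inclusive span to a Python slice.
    return parent_sequence[start - 1:end]
```

For instance, feature_sequence("ACGTACGTAC", "3..6") returns "GTAC", a record of length 4 regardless of the length of the parent entry.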

