
Sequence data

UniProtKB sequences

Most of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the ENA/GenBank/DDBJ nucleotide sequence resources of the International Nucleotide Sequence Database Collaboration (INSDC). These CDS are either generated by gene prediction programs or are experimentally proven. The translated CDS sequences are automatically transferred to the TrEMBL section of UniProtKB. The UniProtKB/TrEMBL records may eventually be selected for manual annotation and then integrated into the UniProtKB/Swiss-Prot section.
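To make the idea of a "translated CDS" concrete, the following is a minimal Python sketch of turning a coding sequence into a protein sequence. It is an illustration only, not the pipeline used by INSDC or UniProt. It assumes the standard genetic code, a forward-strand sequence that starts in frame, and translation that stops at the first stop codon. The names `translate_cds` and `CODON_TABLE` are hypothetical and introduced here for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame CDS into a protein sequence (hypothetical helper).

    Assumptions: standard genetic code, forward strand, frame starts at
    position 0. Translation stops at the first stop codon; codons that
    contain ambiguous bases are reported as 'X'.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))
```

Real CDS records can specify a different genetic code, a reading-frame offset or a reverse-strand location, so a production translation would need to honour those annotations rather than rely on the assumptions above.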

In addition to translated CDS, UniProtKB protein sequences may come from:

  • The PDB database of protein structures
  • Sequences experimentally obtained by direct protein sequencing and submitted to UniProt
  • Sequences scanned from the literature
  • Sequences derived from gene prediction that have not been submitted to ENA/GenBank/DDBJ; these are imported from resources such as Ensembl and RefSeq

By importing and combining sequences from a range of sources, UniProt provides a complete collection of protein sequences and contributes to the consistency of protein sets across sequence resources (Figure 3).

Figure 3 UniProt imports sequences from a range of sources (INSDC databases, PDB, Ensembl, RefSeq, direct submissions and the literature) to ensure that you have access to a complete collection of protein sequences.