Current Release Statistics
UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2017_07 STATISTICS
1. INTRODUCTION
Release 2017_07 of 05-Jul-2017 of UniProtKB/TrEMBL contains 88032926 sequence entries,
comprising 29627301199 amino acids.
1882496 sequences have been added since release 2017_06, the sequence data of
183 existing entries has been updated and the annotations of
14161645 entries have been revised. This represents an increase of 2%.
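The stated increase can be approximated from the counts above. The sketch below assumes that the increase is measured against the size of the previous release, and that the previous size equals the current total minus the added sequences; entries removed between releases, if any, are not accounted for. The function name growth_percent is illustrative only.

```python
def growth_percent(current_total: int, added: int) -> float:
    """Percentage growth relative to the previous release size.

    Assumes previous size = current_total - added, i.e. that no entries
    were removed between releases (an assumption, not stated in the text).
    """
    previous_total = current_total - added
    return 100.0 * added / previous_total


# Figures quoted in the introduction for release 2017_07.
pct = growth_percent(current_total=88032926, added=1882496)
print(f"{pct:.1f}%")  # about 2%, in line with the stated increase
```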
Number of fragments: 9008702
Protein existence (PE):            entries        %
1: Evidence at protein level        129607    0.15%
2: Evidence at transcript level    1086717    1.23%
3: Inferred from homology         21458867   24.38%
4: Predicted                      65357735   74.24%
5: Uncertain                             0    0.00%
The growth of the database is summarized below.
2. TAXONOMIC ORIGIN
Total number of species represented in this release of UniProtKB/TrEMBL: 747492
The first twenty species represent 4627039 sequences: 5.3 % of the
total number of entries.
2.1 Table of the frequency of occurrence of species
Species represented   1x: 31364
                      2x: 11445
                      3x: 60802
                      4x: 43296
                      5x: 26192
                      6x: 19146
                      7x: 14289
                      8x: 11177
                      9x:  9128
                     10x: 14227
                 11- 20x: 68671
                 21- 50x: 20104
                 51-100x:  9076
                   >100x: 23286
2.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
     1    794093 Human immunodeficiency virus 1
     2    668601 marine sediment metagenome
     3    506808 Daphnia magna
     4    260048 Arundo donax (Giant reed) (Donax arundinaceus)
     5    222882 uncultured bacterium
     6    212730 Escherichia coli
     7    209114 Bacillus cereus
     8    165730 Fundulus heteroclitus (Killifish) (Mummichog)
     9    164387 Pseudomonas fluorescens
    10    153553 Streptococcus pneumoniae
    11    145216 Triticum aestivum (Wheat)
    12    139338 Homo sapiens (Human)
    13    131362 Hepatitis C virus
    14    129376 Zea mays (Maize)
    15    128576 Helicobacter pylori (Campylobacter pylori)
    16    127745 mine drainage metagenome
    17    121625 Mycobacterium abscessus subsp. abscessus
    18    118924 Oryza sativa subsp. japonica (Rice)
    19    114294 Hepatitis B virus (HBV)
    20    112637 Pseudomonas putida (Arthrobacter siderocapsulatus)
    21    111944 uncultured Clostridium sp
    22    111477 groundwater metagenome
    23    103712 Anguilla anguilla (European freshwater eel) (Muraena anguilla)
    24    100860 Brassica napus (Rape)
    25    100062 Glycine max (Soybean) (Glycine hispida)
    26     99351 Pyramidula sp. HNHM
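The figure quoted at the start of section 2 for the first twenty species can be cross-checked against the frequencies in table 2.2. The following is a minimal sketch; the variable names are illustrative, and the total entry count is taken from the introduction.

```python
# Frequencies of the first twenty rows of table 2.2, in rank order.
top_twenty = [
    794093, 668601, 506808, 260048, 222882,
    212730, 209114, 165730, 164387, 153553,
    145216, 139338, 131362, 129376, 128576,
    127745, 121625, 118924, 114294, 112637,
]
TOTAL_ENTRIES = 88032926  # total sequence entries, from the introduction

first_twenty_sum = sum(top_twenty)
share = 100.0 * first_twenty_sum / TOTAL_ENTRIES

# Prints the summed count and its share of the database in percent.
print(first_twenty_sum, f"{share:.1f} %")
```

The sum reproduces the 4627039 sequences and the 5.3 % share stated in the text.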
2.3 Taxonomic distribution of the sequences
Kingdom    sequences (% of the database)
Archaea      1693366 (  2%)
Bacteria    57908565 ( 66%)
Eukaryota   24201072 ( 27%)
Viruses      3104959 (  4%)
Other        1124964 (  1%)
Within Eukaryota:
Category          sequences (% of Eukaryota) (% of the complete database)
Human                139413 (  1%)           (  0%)
Other Mammalia      1359143 (  6%)           (  2%)
Other Vertebrata    2573332 ( 11%)           (  3%)
Viridiplantae       4743272 ( 20%)           (  5%)
Fungi               6809456 ( 28%)           (  8%)
Insecta             2716349 ( 11%)           (  3%)
Nematoda            1370849 (  6%)           (  2%)
Other               4489258 ( 19%)           (  5%)
3. SEQUENCE SIZE
Repartition of the sequences by size (excluding fragments)
From   To    Number     From   To   Number
   1-  50   1454963     1001-1100   606190
  51- 100   7456735     1101-1200   422290
 101- 150   8779492     1201-1300   289307
 151- 200   8385898     1301-1400   199868
 201- 250   8325968     1401-1500   156309
 251- 300   8231766     1501-1600   113873
 301- 350   7472580     1601-1700    85936
 351- 400   5792197     1701-1800    66838
 401- 450   4924460     1801-1900    57063
 451- 500   4007007     1901-2000    47556
 501- 550   2801460     2001-2100    38998
 551- 600   2147247     2101-2200    37961
 601- 650   1569406     2201-2300    28919
 651- 700   1233866     2301-2400    23488
 701- 750   1053835     2401-2500    20198
 751- 800    914964         >2500   157437
 801- 850    697806
 851- 900    611262
 901- 950    457487
 951-1000    353594
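The binning used in the table above (bins of 50 residues up to 1000, bins of 100 residues up to 2500, and one open bin beyond that) can be sketched as follows. The function name size_bin and the example lengths are hypothetical, and inclusive bin boundaries are assumed from the table labels. The table itself excludes fragments, which this sketch does not model.

```python
from collections import Counter


def size_bin(length: int) -> str:
    """Return the label of the size bin that a sequence length falls into."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    if length > 2500:
        return ">2500"
    # 50-residue bins up to 1000, 100-residue bins from 1001 to 2500.
    width = 50 if length <= 1000 else 100
    start = ((length - 1) // width) * width + 1
    return f"{start}-{start + width - 1}"


# Hypothetical sequence lengths, for illustration only.
example_lengths = [2, 50, 51, 336, 1000, 1001, 2500, 36991]
histogram = Counter(size_bin(n) for n in example_lengths)
for label, count in histogram.items():
    print(label, count)
```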
The average sequence length in UniProtKB/TrEMBL is 336 amino acids.
The shortest sequence is C4PYW0_SCHMA: 2 amino acids.
The longest sequence is A0A1V4K6M4_P: 36991 amino acids.
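The average length follows from the totals given in the introduction. The sketch below assumes that the average is taken over all entries, fragments included, and that the published value is truncated to a whole number; neither assumption is stated in the release notes.

```python
TOTAL_AMINO_ACIDS = 29627301199  # from the introduction
TOTAL_ENTRIES = 88032926         # from the introduction

mean_length = TOTAL_AMINO_ACIDS / TOTAL_ENTRIES

# Truncating the mean gives the whole-number figure quoted in the text.
print(int(mean_length))
```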
4. STATISTICS FOR SOME LINE TYPES
The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.
                                        Total  Number of    Average
Line type / subtype                    number    entries  per entry
---------------------------------  ----------  ---------  ---------
References (RL)                     104551643                  1.19
  Submitted to EMBL/GenBank/DDBJ     65322454   58838812       0.74
  Journal                            34302016   32261882       0.39
  Submitted to other databases        4902835    4879629       0.06
  Thesis                                13077      13018      <0.01
  Book citation                         11260      11195      <0.01
  Patent                                    1          1      <0.01
Total number of distinct authors cited in UniProtKB/TrEMBL: 635398
                                        Total  Number of    Average
Line type / subtype                    number    entries  per entry  Rank
---------------------------------  ----------  ---------  ---------  ----
Comments (CC)                       116538630                  1.32
  CATALYTIC ACTIVITY                 10019180    9184098       0.11     5
  CAUTION                            45516769   44508380       0.52     1
  COFACTOR                            4542693    4167797       0.05     8
  DOMAIN                               611509     588128       0.01     9
  ENZYME REGULATION                    193228     193226      <0.01    11
  FUNCTION                           11356368   10854197       0.13     3
  INTERACTION                            2281       2281      <0.01    12
  MISCELLANEOUS                        339978     335286      <0.01    10
  PATHWAY                             5047751    4592838       0.06     7
  SIMILARITY                         21562424   21301697       0.24     2
  SUBCELLULAR LOCATION               11331492   10394013       0.13     4
  SUBUNIT                             6014957    5939582       0.07     6
Total number of comment topics: 12
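The "Average per entry" and "Rank" columns of the comments table appear to be derived from the "Total number" column: the average is the total divided by the number of entries in the release, and the rank orders the topics by decreasing total. The sketch below reproduces both under that assumption. The two-decimal rounding rule and the "<0.01" cut-off are inferred from the printed values, not documented, and the names used are illustrative.

```python
TOTAL_ENTRIES = 88032926  # total sequence entries, from the introduction

# "Total number" column of the comments (CC) table.
comment_totals = {
    "CATALYTIC ACTIVITY": 10019180,
    "CAUTION": 45516769,
    "COFACTOR": 4542693,
    "DOMAIN": 611509,
    "ENZYME REGULATION": 193228,
    "FUNCTION": 11356368,
    "INTERACTION": 2281,
    "MISCELLANEOUS": 339978,
    "PATHWAY": 5047751,
    "SIMILARITY": 21562424,
    "SUBCELLULAR LOCATION": 11331492,
    "SUBUNIT": 6014957,
}


def average_per_entry(total_lines: int, total_entries: int = TOTAL_ENTRIES) -> str:
    """Format lines per entry to two decimals, or '<0.01' if it rounds to zero."""
    rounded = f"{total_lines / total_entries:.2f}"
    return "<0.01" if rounded == "0.00" else rounded


# Rank 1 is the topic with the largest total number of lines.
ordered = sorted(comment_totals, key=comment_totals.get, reverse=True)
ranks = {topic: position for position, topic in enumerate(ordered, start=1)}

for topic in comment_totals:
    print(f"{topic:<22}{average_per_entry(comment_totals[topic]):>6}{ranks[topic]:>4}")
```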
                                        Total  Number of    Average
Line type / subtype                    number    entries  per entry  Rank
---------------------------------  ----------  ---------  ---------  ----
Features (FT)                       214054945                  2.43
  ACT_SITE                            4770023    2907831       0.05     9
  BINDING                            10349722    2667721       0.12     4
  CARBOHYD                               1564        899      <0.01    26
  CHAIN                               5849713    5847591       0.07     7
  COILED                              5872466    3939474       0.07     6
  COMPBIAS                               3512       3512      <0.01    24
  CROSSLNK                              19358      17693      <0.01    21
  DISULFID                             907457     244632       0.01    15
  DNA_BIND                            2142993    1890070       0.02    13
  DOMAIN                             63075002   45566344       0.72     2
  INIT_MET                              21817      21817      <0.01    20
  INTRAMEM                                955        745      <0.01    27
  LIPID                                 16182      14556      <0.01    22
  METAL                               8201373    2205977       0.09     5
  MOD_RES                              735501     711283       0.01    16
  MOTIF                                515780     377200       0.01    17
  NON_STD                                2687       2496      <0.01    25
  NON_TER                            14040958    9047478       0.16     3
  NP_BIND                             4538727    2938720       0.05    10
  PEPTIDE                                  73         73      <0.01    29
  PROPEP                                10928      10928      <0.01    23
  REGION                              3102961    1624708       0.04    11
  REPEAT                              2416603     676344       0.03    12
  SIGNAL                              5845523    5845514       0.07     8
  SITE                                1142847     644752       0.01    14
  TOPO_DOM                              88877      29584      <0.01    19
  TRANSIT                                  94         94      <0.01    28
  TRANSMEM                           80102316   17682979       0.91     1
  ZN_FING                              278933     211381      <0.01    18
Total number of feature keys: 29
Total Number of Average
Line type / subtype number entries per entry Rank Category
--------------------------------- -------- --------- --------- ---- -------------------------------------------
Cross-references (DR) 1068529605 12.14
Allergome 3874 3143 <0.01 89 Protein family/group databases
ArachnoServer 203 203 <0.01 109 Organism-specific databases
Araport 19694 19610 <0.01 77 Organism-specific databases
BRENDA 9647 9356 <0.01 82 Enzyme and pathway databases
Bgee 359770 359720 <0.01 49 Gene expression databases
BindingDB 224 224 <0.01 108 Chemistry
BioCyc 3465001 3463759 0.04 29 Enzyme and pathway databases
CAZy 129629 121314 <0.01 57 Protein family/group databases
CDD 11044844 10521118 0.13 19 Family and domain databases
CGD 20816 20750 <0.01 76 Organism-specific databases
COMPLUYEAST-2DPAGE 4 4 <0.01 121 2D gel databases
CTD 777639 775758 0.01 41 Organism-specific databases
ChEMBL 871 871 <0.01 101 Chemistry
ChiTaRS 86196 86037 <0.01 61 Other
CollecTF 202 202 <0.01 110 Gene expression databases
ConoServer 160 160 <0.01 111 Organism-specific databases
DIP 3288 3282 <0.01 91 Protein-protein interaction databases
DNASU 41380 40941 <0.01 70 Protocols and materials databases
DrugBank 614 344 <0.01 102 Chemistry
EMBL 96231329 84943349 1.09 3 Sequence databases
EPD 7173 7173 <0.01 85 Proteomic databases
ESTHER 70484 70188 <0.01 63 Protein family/group databases
Ensembl 1226214 1203465 0.01 36 Genome annotation databases
EnsemblBacteria 41082366 38844489 0.47 9 Genome annotation databases
EnsemblFungi 5494382 5343811 0.06 27 Genome annotation databases
EnsemblMetazoa 1074571 1049370 0.01 38 Genome annotation databases
EnsemblPlants 1754979 1643279 0.02 33 Genome annotation databases
EnsemblProtists 1858061 1749167 0.02 32 Genome annotation databases
EuPathDB 583473 583473 0.01 44 Organism-specific databases
EvolutionaryTrace 6024 6024 <0.01 86 Other
ExpressionAtlas 279003 279003 <0.01 51 Gene expression databases
FlyBase 222759 221294 <0.01 55 Organism-specific databases
GO 161879834 57566525 1.84 2 Ontologies
Gene3D 35429612 29824332 0.40 10 Family and domain databases
GeneDB 114837 113058 <0.01 59 Genome annotation databases
GeneID 9791505 9682315 0.11 20 Genome annotation databases
GeneTree 1207053 1206921 0.01 37 Phylogenomic databases
Genevisible 16351 16351 <0.01 79 Gene expression databases
GenomeRNAi 30316 30316 <0.01 73 Other
Gramene 1714552 1608181 0.02 34 Genome annotation databases
GuidetoPHARMACOLOGY 4 4 <0.01 120 Chemistry
H-InvDB 590 443 <0.01 104 Organism-specific databases
HAMAP 8833737 8722252 0.10 21 Family and domain databases
HGNC 50604 50510 <0.01 67 Organism-specific databases
HOGENOM 3046771 3046676 0.03 30 Phylogenomic databases
HOVERGEN 300691 300679 <0.01 50 Phylogenomic databases
InParanoid 2505312 2505207 0.03 31 Phylogenomic databases
IntAct 18676 18676 <0.01 78 Protein-protein interaction databases
InterPro 198116949 69228719 2.25 1 Family and domain databases
KEGG 13362501 12973867 0.15 17 Genome annotation databases
KO 5732883 5708852 0.07 26 Phylogenomic databases
LegioList 2496 2483 <0.01 94 Organism-specific databases
Leproma 1271 1269 <0.01 98 Organism-specific databases
MEROPS 251406 251405 <0.01 53 Protein family/group databases
MGI 59940 59561 <0.01 65 Organism-specific databases
MIM 4 4 <0.01 122 Organism-specific databases
MINT 9753 9752 <0.01 81 Protein-protein interaction databases
MalaCards 9 9 <0.01 117 Organism-specific databases
MaxQB 41696 41696 <0.01 69 Proteomic databases
MoonProt 3 3 <0.01 123 Protein family/group databases
OGP 3 3 <0.01 124 2D gel databases
OMA 6514040 6514033 0.07 25 Phylogenomic databases
OpenTargets 48598 48552 <0.01 68 Organism-specific databases
OrthoDB 14613365 14613334 0.17 14 Phylogenomic databases
PANTHER 14048962 13502209 0.16 16 Family and domain databases
PATRIC 18447027 18446944 0.21 12 Genome annotation databases
PDB 33458 16598 <0.01 71 3D structure databases
PIR 163297 131045 <0.01 56 Sequence databases
PIRSF 7492793 7431056 0.09 23 Family and domain databases
PMAP-CutDB 131 131 <0.01 112 Other
PRIDE 277155 277155 <0.01 52 Proteomic databases
PRINTS 12085699 10895703 0.14 18 Family and domain databases
PRO 2256 2256 <0.01 96 Other
PROSITE 44642340 29647733 0.51 7 Family and domain databases
PaxDb 602041 602041 0.01 43 Proteomic databases
PeptideAtlas 119289 119289 <0.01 58 Proteomic databases
PeroxiBase 2482 2474 <0.01 95 Protein family/group databases
Pfam 86684308 63069635 0.98 4 Family and domain databases
PharmGKB 3154 3154 <0.01 92 Organism-specific databases
PhosphoSitePlus 2236 2236 <0.01 97 PTM databases
PhylomeDB 470686 470686 0.01 48 Phylogenomic databases
PomBase 32 32 <0.01 115 Organism-specific databases
ProDom 1354198 1290045 0.02 35 Family and domain databases
ProMEX 3060 3060 <0.01 93 Proteomic databases
ProteinModelPortal 7582600 7582600 0.09 22 3D structure databases
Proteomes 75513537 72403856 0.86 5 Other
PseudoCAP 4463 4457 <0.01 88 Organism-specific databases
REBASE 32370 32352 <0.01 72 Protein family/group databases
REPRODUCTION-2DPAGE 63 62 <0.01 114 2D gel databases
RGD 25124 23797 <0.01 75 Organism-specific databases
Reactome 241301 87877 <0.01 54 Enzyme and pathway databases
RefSeq 42378104 41415839 0.48 8 Sequence databases
SABIO-RK 605 605 <0.01 103 Enzyme and pathway databases
SFLD 572184 376458 0.01 47 Family and domain databases
SGD 7 7 <0.01 119 Organism-specific databases
SIGNOR 8 8 <0.01 118 Enzyme and pathway databases
SMART 21193511 16126437 0.24 11 Family and domain databases
SMR 1044906 1044906 0.01 39 3D structure databases
STRING 6562153 6562152 0.07 24 Protein-protein interaction databases
SUPFAM 57176943 45250905 0.65 6 Family and domain databases
SWISS-2DPAGE 1 1 <0.01 125 2D gel databases
SignaLink 3819 3819 <0.01 90 Enzyme and pathway databases
SwissLipids 76 76 <0.01 113 Chemistry
SwissPalm 1219 1219 <0.01 99 PTM databases
TAIR 15894 15816 <0.01 80 Organism-specific databases
TCDB 7735 7719 <0.01 84 Protein family/group databases
TIGRFAMs 17959275 16504903 0.20 13 Family and domain databases
TopDownProteomics 283 283 <0.01 107 Proteomic databases
TreeFam 577719 577705 0.01 45 Phylogenomic databases
TubercuList 1005 1004 <0.01 100 Organism-specific databases
UCSC 94109 93914 <0.01 60 Genome annotation databases
UniCarbKB 17 17 <0.01 116 PTM databases
UniGene 717086 617124 0.01 42 Sequence databases
UniPathway 4934836 4583273 0.06 28 Enzyme and pathway databases
VectorBase 572261 553502 0.01 46 Genome annotation databases
WBParaSite 854108 845701 0.01 40 Genome annotation databases
World-2DPAGE 317 312 <0.01 106 2D gel databases
WormBase 65802 65412 <0.01 64 Organism-specific databases
Xenbase 26629 26571 <0.01 74 Organism-specific databases
ZFIN 53008 52353 <0.01 66 Organism-specific databases
dictyBase 7988 7766 <0.01 83 Organism-specific databases
eggNOG 14243014 7138674 0.16 15 Phylogenomic databases
euHCVdb 75267 75264 <0.01 62 Organism-specific databases
iPTMnet 4970 4970 <0.01 87 PTM databases
mycoCLAP 448 448 <0.01 105 Protein family/group databases
Number of explicitly cross-referenced databases: 147
5. AMINO ACID COMPOSITION
5.1 Composition in percent for the complete database
Ala (A) 9.02   Gln (Q) 3.80   Leu (L) 9.85   Ser (S) 6.74
Arg (R) 5.70   Glu (E) 6.16   Lys (K) 5.03   Thr (T) 5.57
Asn (N) 3.91   Gly (G) 7.22   Met (M) 2.39   Trp (W) 1.29
Asp (D) 5.44   His (H) 2.20   Phe (F) 3.93   Tyr (Y) 2.94
Cys (C) 1.23   Ile (I) 5.71   Pro (P) 4.86   Val (V) 6.86
Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05
5.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Val, Ser, Glu, Ile, Arg, Thr, Asp, Lys, Pro, Phe, Asn,
Gln, Tyr, Met, His, Trp, Cys
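The ordering in section 5.2 can be reproduced by sorting the twenty standard amino acids by the percentages given in section 5.1. This is a minimal sketch; the ambiguity codes Asx, Glx and Xaa are left out, as they are in the list above.

```python
# Composition in percent, copied from section 5.1.
composition = {
    "Ala": 9.02, "Gln": 3.80, "Leu": 9.85, "Ser": 6.74,
    "Arg": 5.70, "Glu": 6.16, "Lys": 5.03, "Thr": 5.57,
    "Asn": 3.91, "Gly": 7.22, "Met": 2.39, "Trp": 1.29,
    "Asp": 5.44, "His": 2.20, "Phe": 3.93, "Tyr": 2.94,
    "Cys": 1.23, "Ile": 5.71, "Pro": 4.86, "Val": 6.86,
}

# Most frequent residue first.
by_frequency = sorted(composition, key=composition.get, reverse=True)
print(", ".join(by_frequency))
```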
6. MISCELLANEOUS STATISTICS
Total number of entries encoded on a Mitochondrion: 1545716
Total number of entries encoded on a Plasmid: 686317
Total number of entries encoded on a Plastid: 95492
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle:
Total number of entries encoded on a Plastid; Non-photosynthetic plastid:

