Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2017_06 STATISTICS


1.  INTRODUCTION

Release 2017_06 of 07-Jun-2017 of UniProtKB/TrEMBL contains 87291332 sequence entries,
comprising 29395378505 amino acids.

2641656 sequences have been added since release 2017_05, the sequence data of
3745 existing entries has been updated, and the annotations of
57249993 entries have been revised. This represents an increase of 3% in the
number of entries.
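The 3% figure can be approximated from the numbers above. The sketch below assumes that the size of the previous release equals the current total minus the added sequences; entries deleted between releases are not reported in this section and are ignored. The variable names are illustrative only.

```python
# Figures quoted in the introduction.
total_entries = 87_291_332   # entries in release 2017_06
added_entries = 2_641_656    # sequences added since release 2017_05

# Assumption: previous release size = current total minus additions
# (deletions between releases are unknown here and therefore ignored).
previous_entries = total_entries - added_entries

growth_percent = 100 * added_entries / previous_entries
print(f"Approximate growth: {growth_percent:.1f}%")
```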

Number of fragments: 8947736

Protein existence (PE):              entries      %
1: Evidence at protein level          128383     0.15%
2: Evidence at transcript level      1084733     1.24%
3: Inferred from homology           21022078    24.08%
4: Predicted                        65056138    74.53%
5: Uncertain                               0     0.00%
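The percentage column can be re-derived from the entry counts. The following is a minimal sketch, assuming each percentage is the count divided by the sum of all five levels and shown to two decimals; the helper name `percent` is hypothetical.

```python
# Entry counts per protein existence (PE) level, copied from the table above.
pe_counts = {
    "1: Evidence at protein level": 128_383,
    "2: Evidence at transcript level": 1_084_733,
    "3: Inferred from homology": 21_022_078,
    "4: Predicted": 65_056_138,
    "5: Uncertain": 0,
}

total_entries = sum(pe_counts.values())


def percent(count: int, total: int) -> str:
    """Return count as a percentage of total, formatted to two decimals."""
    return f"{100 * count / total:.2f}%"


pe_percent = {label: percent(count, total_entries)
              for label, count in pe_counts.items()}

for label, count in pe_counts.items():
    print(f"{label:<34}{count:>10}{pe_percent[label]:>10}")
```

The counts sum to the 87291332 entries stated in the introduction, so the five levels cover the whole release.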

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 741094

   The twenty most represented species account for 4708878 sequences, or 5.4%
   of the total number of entries.
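As a cross-check, the 4708878 figure can be reproduced by summing the first twenty frequencies listed in table 2.2. This is only an illustrative sketch; the list is copied by hand from that table and the variable names are assumptions.

```python
# Frequencies of the twenty most represented species (table 2.2, rows 1-20).
top20_frequencies = [
    786_240, 668_601, 506_808, 259_928, 243_097,
    220_510, 217_537, 172_555, 165_729, 164_837,
    155_504, 145_186, 139_538, 131_085, 129_712,
    129_550, 127_745, 118_981, 113_791, 111_944,
]

total_entries = 87_291_332  # entries in release 2017_06

top20_total = sum(top20_frequencies)
top20_share = 100 * top20_total / total_entries
print(f"{top20_total} sequences, {top20_share:.1f}% of all entries")
```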


   2.1 Table of the frequency of occurrence of species

        Species represented 1x:31072
                            2x:11357
                            3x:60381
                            4x:42900
                            5x:26072
                            6x:19050
                            7x:14180
                            8x:11136
                            9x: 9118
                           10x:14225
                       11- 20x:67835
                       21- 50x:19992
                       51-100x: 8900
                         >100x:23001


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     786240  Human immunodeficiency virus 1
       2     668601  marine sediment metagenome
       3     506808  Daphnia magna
       4     259928  Arundo donax (Giant reed) (Donax arundinaceus)
       5     243097  Bacillus cereus
       6     220510  uncultured bacterium
       7     217537  Escherichia coli
       8     172555  Pseudomonas fluorescens
       9     165729  Fundulus heteroclitus (Killifish) (Mummichog)
      10     164837  Mycobacterium abscessus subsp. abscessus
      11     155504  Streptococcus pneumoniae
      12     145186  Triticum aestivum (Wheat)
      13     139538  Homo sapiens (Human)
      14     131085  Hepatitis C virus
      15     129712  Helicobacter pylori (Campylobacter pylori)
      16     129550  Zea mays (Maize)
      17     127745  mine drainage metagenome
      18     118981  Oryza sativa subsp. japonica (Rice)
      19     113791  Hepatitis B virus (HBV)
      20     111944  uncultured Clostridium sp
      21     111477  groundwater metagenome
      22     103710  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      23     102662  Pseudomonas putida (Arthrobacter siderocapsulatus)
      24     100860  Brassica napus (Rape)
      25     100049  Glycine max (Soybean) (Glycine hispida)
      26      98861  Pyramidula sp. HNHM


   2.3  Taxonomic distribution of the sequences

[Figure: taxonomic distribution of the sequences]

   Kingdom        sequences (% of the database)
    Archaea         1685416 (  2%)
    Bacteria       57386566 ( 66%)
    Eukaryota      24026613 ( 28%)
    Viruses         3067930 (  4%)
    Other           1124807 (  1%)



   Within Eukaryota:

[Figure: distribution of the sequences within Eukaryota]

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 139613 (  1%)           (  0%)
     Other Mammalia       1358306 (  6%)           (  2%)
     Other Vertebrata     2565796 ( 11%)           (  3%)
     Viridiplantae        4726596 ( 20%)           (  5%)
     Fungi                6714782 ( 28%)           (  8%)
     Insecta              2709238 ( 11%)           (  3%)
     Nematoda             1370776 (  6%)           (  2%)
     Other                4441506 ( 18%)           (  5%)
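The two percentage columns use different denominators: the Eukaryota subtotal and the complete database. A minimal sketch of that calculation follows, assuming the table rounds each share to the nearest whole percent; the dictionary and variable names are illustrative.

```python
total_entries = 87_291_332  # entries in release 2017_06

# Sequence counts per category, copied from the table above.
eukaryota = {
    "Human": 139_613,
    "Other Mammalia": 1_358_306,
    "Other Vertebrata": 2_565_796,
    "Viridiplantae": 4_726_596,
    "Fungi": 6_714_782,
    "Insecta": 2_709_238,
    "Nematoda": 1_370_776,
    "Other": 4_441_506,
}

eukaryota_total = sum(eukaryota.values())

# For each category: (% of Eukaryota, % of the complete database).
shares = {
    name: (round(100 * count / eukaryota_total),
           round(100 * count / total_entries))
    for name, count in eukaryota.items()
}

for name, (of_eukaryota, of_database) in shares.items():
    print(f"{name:<18}{of_eukaryota:>4}%{of_database:>4}%")
```

The category counts add up to the 24026613 Eukaryota sequences given in the kingdom table.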



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50 1453672             1001-1100   603015
                 51- 100 7390142             1101-1200   419873
                101- 150 8697684             1201-1300   287644
                151- 200 8310125             1301-1400   198521
                201- 250 8246403             1401-1500   155571
                251- 300 8150749             1501-1600   113183
                301- 350 7404953             1601-1700    85376
                351- 400 5738429             1701-1800    66466
                401- 450 4881544             1801-1900    56579
                451- 500 3974323             1901-2000    47304
                501- 550 2776605             2001-2100    38711
                551- 600 2131050             2101-2200    37673
                601- 650 1559311             2201-2300    28658
                651- 700 1226181             2301-2400    23356
                701- 750 1046516             2401-2500    20078
                751- 800  908638             >2500       157098
                801- 850  694136
                851- 900  606957
                901- 950  455349
                951-1000  351723
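The table uses bins of 50 residues up to length 1000, bins of 100 residues from 1001 to 2500, and a single open bin above that. The sketch below shows one way to assign a sequence length to its bin; the function name `size_bin` is hypothetical, and this is not necessarily how the published counts were produced.

```python
from collections import Counter


def size_bin(length: int) -> str:
    """Return the label of the size bin that a sequence length falls into."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    if length > 2500:
        return ">2500"
    # 50-residue bins up to 1000, 100-residue bins from 1001 to 2500.
    width = 50 if length <= 1000 else 100
    lower = (length - 1) // width * width + 1
    return f"{lower}-{lower + width - 1}"


# Example: tally a few lengths (2 and 36991 are the extremes quoted in
# this section, 336 is the stated average).
histogram = Counter(size_bin(n) for n in [2, 336, 1001, 36991])
print(histogram)
```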

[Figure: distribution of the sequences by size]


   The average sequence length in UniProtKB/TrEMBL is 336 amino acids.

   The shortest sequence is C4PYW0_SCHMA: 2 amino acids.
   The longest sequence is A0A1V4K6M4_P: 36991 amino acids.
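The average length follows from the totals given in the introduction. The sketch below assumes the average is taken over all entries, fragments included, and that the published value is truncated to a whole number rather than rounded; both points are inferences from the figures, not stated in the release notes.

```python
total_amino_acids = 29_395_378_505  # amino acids in release 2017_06
total_entries = 87_291_332          # sequence entries in release 2017_06

mean_length = total_amino_acids / total_entries
truncated_mean = total_amino_acids // total_entries  # integer part only

print(f"Mean length: {mean_length:.2f} amino acids")
print(f"Truncated:   {truncated_mean} amino acids")
```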



4.  STATISTICS FOR SOME LINE TYPES

The following tables summarize, for selected UniProtKB/TrEMBL line types, the
total number of lines, the number of entries with at least one such line, and
the average number of lines per entry.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   103756496                1.19                                                    
   Submitted to EMBL/GenBank/DDBJ  64892404  58302347      0.74                                                    
   Journal                         33951144  31883135      0.39                                                    
   Submitted to other databases     4889616   4866360      0.06                                                    
   Thesis                             12071     12012     <0.01                                                    
   Book citation                      11260     11195     <0.01                                                    
   Patent                                 1         1     <0.01                                                    

Total number of distinct authors cited in UniProtKB/TrEMBL: 632414
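The "Average per entry" column appears to be the total number of lines divided by the number of entries in the release. The sketch below reproduces it for the reference lines; the rule that values rounding to 0.00 are printed as "<0.01" is inferred from the tables, and the function name is hypothetical.

```python
TOTAL_ENTRIES = 87_291_332  # entries in release 2017_06


def average_per_entry(total_lines: int, entries: int = TOTAL_ENTRIES) -> str:
    """Format lines per entry to two decimals.

    Assumption inferred from the tables: values that would round to 0.00
    are shown as "<0.01".
    """
    value = total_lines / entries
    return "<0.01" if value < 0.005 else f"{value:.2f}"


# Total line counts copied from the references table.
reference_lines = {
    "References (RL)": 103_756_496,
    "Submitted to EMBL/GenBank/DDBJ": 64_892_404,
    "Journal": 33_951_144,
    "Submitted to other databases": 4_889_616,
    "Thesis": 12_071,
}

averages = {name: average_per_entry(count)
            for name, count in reference_lines.items()}

for name, value in averages.items():
    print(f"{name:<34}{value:>6}")
```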


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     114188144                1.31                                                    
   CATALYTIC ACTIVITY               9793168   8972610      0.11     5                                              
   CAUTION                         44950512  43981235      0.51     1                                              
   COFACTOR                         4339306   3979909      0.05     8                                              
   DOMAIN                            601537    578487      0.01     9                                              
   ENZYME REGULATION                 189892    189890     <0.01    11                                              
   FUNCTION                        10950784  10593934      0.13     4                                              
   INTERACTION                         2296      2296     <0.01    12                                              
   MISCELLANEOUS                     333538    329019     <0.01    10                                              
   PATHWAY                          4950617   4504262      0.06     7                                              
   SIMILARITY                      21134363  20878921      0.24     2                                              
   SUBCELLULAR LOCATION            11092692  10206925      0.13     3                                              
   SUBUNIT                          5849439   5818126      0.07     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     210240436                2.41                                                    
   ACT_SITE                         4610518   2812547      0.05     9                                              
   BINDING                         10025438   2560596      0.11     4                                              
   CARBOHYD                             958       426     <0.01    26                                              
   CHAIN                            5848586   5846476      0.07     6                                              
   COILED                           5767228   3873631      0.07     8                                              
   COMPBIAS                            3467      3467     <0.01    24                                              
   CROSSLNK                           18972     17350     <0.01    21                                              
   DISULFID                          825613    225259      0.01    15                                              
   DNA_BIND                         2116621   1866093      0.02    13                                              
   DOMAIN                          61885769  44714565      0.71     2                                              
   INIT_MET                           21415     21415     <0.01    20                                              
   INTRAMEM                             919       721     <0.01    27                                              
   LIPID                              16014     14372     <0.01    22                                              
   METAL                            7951981   2139172      0.09     5                                              
   MOD_RES                           725608    699329      0.01    16                                              
   MOTIF                             506616    370619      0.01    17                                              
   NON_STD                             2592      2401     <0.01    25                                              
   NON_TER                         13945202   8986462      0.16     3                                              
   NP_BIND                          4444417   2877744      0.05    10                                              
   PEPTIDE                               76        76     <0.01    29                                              
   PROPEP                             10538     10538     <0.01    23                                              
   REGION                           3000006   1571621      0.03    11                                              
   REPEAT                           2351070    660762      0.03    12                                              
   SIGNAL                           5844420   5844411      0.07     7                                              
   SITE                             1095087    610914      0.01    14                                              
   TOPO_DOM                           90356     29516     <0.01    19                                              
   TRANSIT                               94        94     <0.01    28                                              
   TRANSMEM                        78857564  17404521      0.90     1                                              
   ZN_FING                           273291    206932     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------

Cross-references (DR)             1059952924               12.14
   Allergome                           3874      3143     <0.01    90   Protein family/group databases             
   ArachnoServer                        203       203     <0.01   110   Organism-specific databases                
   Araport                            19745     19661     <0.01    79   Organism-specific databases                
   BRENDA                              9652      9361     <0.01    83   Enzyme and pathway databases               
   Bgee                              359970    359919     <0.01    49   Gene expression databases                  
   BindingDB                            188       188     <0.01   111   Chemistry                                  
   BioCyc                           3484754   3483511      0.04    29   Enzyme and pathway databases               
   CAZy                              129629    121314     <0.01    57   Protein family/group databases             
   CDD                             10667821  10158127      0.12    19   Family and domain databases                
   CGD                                20816     20750     <0.01    78   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   122   2D gel databases                           
   CTD                               745329    743549      0.01    41   Organism-specific databases                
   ChEMBL                               871       871     <0.01   102   Chemistry                                  
   ChiTaRS                            86279     86120     <0.01    61   Other                                      
   CollecTF                             203       203     <0.01   109   Gene expression databases                  
   ConoServer                           160       160     <0.01   112   Organism-specific databases                
   DIP                                 3292      3286     <0.01    92   Protein-protein interaction databases      
   DNASU                              41388     40949     <0.01    69   Protocols and materials databases          
   DrugBank                             540       318     <0.01   105   Chemistry                                  
   EMBL                            98572102  84215245      1.13     3   Sequence databases                         
   EPD                                 7191      7191     <0.01    86   Proteomic databases                        
   ESTHER                             70761     70465     <0.01    63   Protein family/group databases             
   Ensembl                          1227760   1204798      0.01    36   Genome annotation databases                
   EnsemblBacteria                 41298018  39061612      0.47     9   Genome annotation databases                
   EnsemblFungi                     5494391   5343820      0.06    27   Genome annotation databases                
   EnsemblMetazoa                   1061274   1036111      0.01    38   Genome annotation databases                
   EnsemblPlants                    1758832   1644444      0.02    34   Genome annotation databases                
   EnsemblProtists                  1858063   1749169      0.02    32   Genome annotation databases                
   EuPathDB                          583482    583482      0.01    45   Organism-specific databases                
   EvolutionaryTrace                   6027      6027     <0.01    87   Other                                      
   ExpressionAtlas                   235936    235928     <0.01    54   Gene expression databases                  
   FlyBase                           222771    221306     <0.01    55   Organism-specific databases                
   GO                             160745420  57328450      1.84     2   Ontologies                                 
   Gene3D                          34654276  29170362      0.40    10   Family and domain databases                
   GeneDB                            114837    113058     <0.01    59   Genome annotation databases                
   GeneID                           9213582   9105561      0.11    20   Genome annotation databases                
   GeneTree                         1207099   1206968      0.01    37   Phylogenomic databases                     
   Genevisible                        16361     16361     <0.01    80   Gene expression databases                  
   GenomeRNAi                         30328     30328     <0.01    74   Other                                      
   Gramene                          1758869   1644477      0.02    33   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   121   Chemistry                                  
   H-InvDB                              590       443     <0.01   104   Organism-specific databases                
   HAMAP                            8657552   8548187      0.10    21   Family and domain databases                
   HGNC                               50792     50693     <0.01    67   Organism-specific databases                
   HOGENOM                          3046915   3046820      0.03    30   Phylogenomic databases                     
   HOVERGEN                          300747    300735     <0.01    50   Phylogenomic databases                     
   InParanoid                       2527421   2527316      0.03    31   Phylogenomic databases                     
   IntAct                             24412     24412     <0.01    77   Protein-protein interaction databases      
   InterPro                       194313862  67990044      2.23     1   Family and domain databases                
   KEGG                            13397794  12987412      0.15    17   Genome annotation databases                
   KO                               5740619   5716550      0.07    26   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    95   Organism-specific databases                
   Leproma                             1271      1269     <0.01    99   Organism-specific databases                
   MEROPS                            252079    252078     <0.01    52   Protein family/group databases             
   MGI                                60057     59682     <0.01    65   Organism-specific databases                
   MIM                                    4         4     <0.01   123   Organism-specific databases                
   MINT                                9762      9761     <0.01    82   Protein-protein interaction databases      
   MalaCards                              9         9     <0.01   118   Organism-specific databases                
   MaxQB                              39944     39944     <0.01    70   Proteomic databases                        
   MoonProt                               3         3     <0.01   124   Protein family/group databases             
   OGP                                    3         3     <0.01   125   2D gel databases                           
   OMA                              6524810   6524803      0.07    25   Phylogenomic databases                     
   OpenTargets                        48875     48824     <0.01    68   Organism-specific databases                
   OrthoDB                         14648061  14648030      0.17    14   Phylogenomic databases                     
   PANTHER                         13781932  13245482      0.16    16   Family and domain databases                
   PATRIC                          18556168  18556090      0.21    12   Genome annotation databases                
   PDB                                33137     16452     <0.01    71   3D structure databases                     
   PDBsum                             32767     16243     <0.01    72   3D structure databases                     
   PIR                               163350    131096     <0.01    56   Sequence databases                         
   PIRSF                            7281956   7221461      0.08    23   Family and domain databases                
   PMAP-CutDB                           131       131     <0.01   113   Other                                      
   PRIDE                             277254    277254     <0.01    51   Proteomic databases                        
   PRINTS                          11888444  10716432      0.14    18   Family and domain databases                
   PRO                                 2259      2259     <0.01    97   Other                                      
   PROSITE                         43834632  29113734      0.50     8   Family and domain databases                
   PaxDb                             602348    602348      0.01    44   Proteomic databases                        
   PeptideAtlas                      119460    119460     <0.01    58   Proteomic databases                        
   PeroxiBase                          2483      2475     <0.01    96   Protein family/group databases             
   Pfam                            84870018  61860174      0.97     4   Family and domain databases                
   PharmGKB                            3154      3154     <0.01    93   Organism-specific databases                
   PhosphoSitePlus                     2243      2243     <0.01    98   PTM databases                              
   PhylomeDB                         470750    470750      0.01    48   Phylogenomic databases                     
   PomBase                               32        32     <0.01   116   Organism-specific databases                
   ProDom                           1334682   1271160      0.02    35   Family and domain databases                
   ProMEX                              3061      3061     <0.01    94   Proteomic databases                        
   ProteinModelPortal               7649159   7649159      0.09    22   3D structure databases                     
   Proteomes                       73628689  70846731      0.84     5   Other                                      
   PseudoCAP                           4465      4459     <0.01    89   Organism-specific databases                
   REBASE                             32420     32404     <0.01    73   Protein family/group databases             
   REPRODUCTION-2DPAGE                   64        63     <0.01   115   2D gel databases                           
   RGD                                25163     23835     <0.01    76   Organism-specific databases                
   Reactome                          241471     87961     <0.01    53   Enzyme and pathway databases               
   RefSeq                          44507307  43202643      0.51     7   Sequence databases                         
   SABIO-RK                             599       599     <0.01   103   Enzyme and pathway databases               
   SFLD                              559811    368303      0.01    47   Family and domain databases                
   SGD                                    7         7     <0.01   119   Organism-specific databases                
   SIGNOR                                 7         7     <0.01   120   Enzyme and pathway databases               
   SMART                           20775713  15817210      0.24    11   Family and domain databases                
   SMR                              1039697   1039697      0.01    39   3D structure databases                     
   STRING                           7200474   7200264      0.08    24   Protein-protein interaction databases      
   SUPFAM                          56050283  44368799      0.64     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   126   2D gel databases                           
   SignaLink                           3823      3823     <0.01    91   Enzyme and pathway databases               
   SwissLipids                           74        74     <0.01   114   Chemistry                                  
   SwissPalm                           1220      1220     <0.01   100   PTM databases                              
   TAIR                               15933     15855     <0.01    81   Organism-specific databases                
   TCDB                                7725      7709     <0.01    85   Protein family/group databases             
   TIGRFAMs                        17610636  16181380      0.20    13   Family and domain databases                
   TopDownProteomics                    283       283     <0.01   108   Proteomic databases                        
   TreeFam                           577776    577762      0.01    46   Phylogenomic databases                     
   TubercuList                         1005      1004     <0.01   101   Organism-specific databases                
   UCSC                               94411     94211     <0.01    60   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   117   PTM databases                              
   UniGene                           717362    617371      0.01    42   Sequence databases                         
   UniPathway                       4604310   4276445      0.05    28   Enzyme and pathway databases               
   VectorBase                        608339    554528      0.01    43   Genome annotation databases                
   WBParaSite                        854121    845711      0.01    40   Genome annotation databases                
   World-2DPAGE                         317       312     <0.01   107   2D gel databases                           
   WormBase                           65783     65393     <0.01    64   Organism-specific databases                
   Xenbase                            26635     26577     <0.01    75   Organism-specific databases                
   ZFIN                               53011     52356     <0.01    66   Organism-specific databases                
   dictyBase                           7988      7766     <0.01    84   Organism-specific databases                
   eggNOG                          14285688   7160145      0.16    15   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    62   Organism-specific databases                
   iPTMnet                             4981      4981     <0.01    88   PTM databases                              
   mycoCLAP                             448       448     <0.01   106   Protein family/group databases             

Number of explicitly cross-referenced databases: 147


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.00   Gln (Q) 3.81   Leu (L) 9.85   Ser (S) 6.74
   Arg (R) 5.69   Glu (E) 6.16   Lys (K) 5.05   Thr (T) 5.57
   Asn (N) 3.92   Gly (G) 7.21   Met (M) 2.39   Trp (W) 1.29
   Asp (D) 5.43   His (H) 2.20   Phe (F) 3.93   Tyr (Y) 2.94
   Cys (C) 1.23   Ile (I) 5.71   Pro (P) 4.85   Val (V) 6.85

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05

[Figure: amino acid composition of the complete database]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Val, Ser, Glu, Ile, Arg, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
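The ordering in 5.2 follows directly from the percentages in 5.1. As an illustration, sorting the composition values in descending order reproduces the list; the dictionary below is copied from 5.1 and the variable names are assumptions.

```python
# Composition in percent for the complete database (section 5.1).
composition = {
    "Ala": 9.00, "Gln": 3.81, "Leu": 9.85, "Ser": 6.74,
    "Arg": 5.69, "Glu": 6.16, "Lys": 5.05, "Thr": 5.57,
    "Asn": 3.92, "Gly": 7.21, "Met": 2.39, "Trp": 1.29,
    "Asp": 5.43, "His": 2.20, "Phe": 3.93, "Tyr": 2.94,
    "Cys": 1.23, "Ile": 5.71, "Pro": 4.85, "Val": 6.85,
}

# Most frequent amino acid first; no two values are equal, so the
# ordering is unambiguous.
ranked = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranked))
```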



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1532998
Total number of entries encoded on a Plasmid: 670397
Total number of entries encoded on a Plastid: 92741
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: