Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2018_01 STATISTICS


1.  INTRODUCTION

Release 2018_01 of 31-Jan-2018 of UniProtKB/TrEMBL contains 107627435 sequence entries,
comprising 36161263380 amino acids.

Since release 2017_12, 6281793 sequences have been added, the sequence data of
53 existing entries have been updated, and the annotations of
20601072 entries have been revised. This represents an increase of 6% in the
number of sequence entries.

Number of fragments: 10122293

Protein existence (PE):              entries      %
1: Evidence at protein level          141573     0.13%
2: Evidence at transcript level      1139668     1.06%
3: Inferred from homology           25877510    24.04%
4: Predicted                        80468684    74.77%
5: Uncertain                               0     0.00%
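
The percentages in the table above can be re-derived from the entry counts,
and the counts themselves add up to the total number of entries given at the
start of this section. The following is a minimal Python sketch; the names
pe_counts and pe_percent are illustrative and not part of any UniProt tool.

```python
# Entry counts per protein existence (PE) level, copied from the table above.
pe_counts = {
    "1: Evidence at protein level": 141573,
    "2: Evidence at transcript level": 1139668,
    "3: Inferred from homology": 25877510,
    "4: Predicted": 80468684,
    "5: Uncertain": 0,
}

# The PE levels partition the database, so their sum is the entry total.
total_entries = sum(pe_counts.values())

# Share of each PE level as a percentage, rounded to two decimals.
pe_percent = {
    level: round(100 * count / total_entries, 2)
    for level, count in pe_counts.items()
}

for level, count in pe_counts.items():
    print(f"{level:<34}{count:>10}  {pe_percent[level]:6.2f}%")
```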

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 792809

   The first twenty species represent 5480427 sequences, or 5.1% of the
   total number of entries.
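
Both figures can be reproduced from the frequencies of ranks 1 to 20 in
Table 2.2. This is a small Python sketch under that assumption; the variable
names are illustrative only.

```python
# Frequencies of the twenty most represented species (Table 2.2, ranks 1-20).
top20 = [
    833429, 668601, 506809, 452292, 266409,
    261883, 260019, 245203, 218391, 217885,
    207272, 177107, 165735, 156465, 145554,
    145301, 144554, 141308, 136815, 129395,
]
total_entries = 107627435  # total number of entries in release 2018_01

top20_sum = sum(top20)
top20_share = 100 * top20_sum / total_entries

print(top20_sum)               # 5480427
print(f"{top20_share:.1f}%")   # 5.1%
```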


   2.1  Table of the frequency of occurrence of species

        Species represented 1x: 338467
                            2x: 117983
                            3x:  62751
                            4x:  44636
                            5x:  27074
                            6x:  19584
                            7x:  14683
                            8x:  11515
                            9x:   9412
                           10x:  14659
                       11- 20x:  73964
                       21- 50x:  20667
                       51-100x:  10002
                         >100x:  27412


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     833429  Human immunodeficiency virus 1
       2     668601  marine sediment metagenome
       3     506809  Daphnia magna
       4     452292  Bacillus cereus
       5     266409  Escherichia coli
       6     261883  Euryarchaeota archaeon
       7     260019  Arundo donax (Giant reed) (Donax arundinaceus)
       8     245203  uncultured bacterium
       9     218391  Helicobacter pylori (Campylobacter pylori)
      10     217885  Gammaproteobacteria bacterium
      11     207272  Hordeum vulgare subsp. vulgare (Domesticated barley)
      12     177107  Pseudomonas fluorescens
      13     165735  Fundulus heteroclitus (Killifish) (Mummichog)
      14     156465  Streptococcus pneumoniae
      15     145554  Triticum aestivum (Wheat)
      16     145301  Rhodospirillaceae bacterium
      17     144554  Hepatitis C virus
      18     141308  Homo sapiens (Human)
      19     136815  Pseudomonas putida (Arthrobacter siderocapsulatus)
      20     129395  Zea mays (Maize)
      21     127745  mine drainage metagenome
      22     121960  Bacillus thuringiensis
      23     121109  Mycobacterium abscessus subsp. abscessus
      24     118740  Oryza sativa subsp. japonica (Rice)
      25     118197  Enterobacter cloacae
      26     118047  Hepatitis B virus (HBV)
      27     111944  uncultured Clostridium sp
      28     111477  groundwater metagenome
      29     103726  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      30     100110  Glycine max (Soybean) (Glycine hispida)
      31      98860  Pyramidula sp. HNHM


   2.3  Taxonomic distribution of the sequences

   [Figure: taxonomic distribution of the sequences]

   Kingdom        sequences (% of the database)
    Archaea         2366974 (  2%)
    Bacteria       74230112 ( 69%)
    Eukaryota      26561971 ( 25%)
    Viruses         3336039 (  3%)
    Other           1132339 ( <1%)



   Within Eukaryota:

   [Figure: distribution of the sequences within Eukaryota]

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 141383 (  1%)           (  0%)
     Other Mammalia       1430287 (  5%)           (  1%)
     Other Vertebrata     2828113 ( 11%)           (  3%)
     Viridiplantae        5276554 ( 20%)           (  5%)
     Fungi                7605936 ( 29%)           (  7%)
     Insecta              2874250 ( 11%)           (  3%)
     Nematoda             1536712 (  6%)           (  1%)
     Other                4868736 ( 18%)           (  5%)



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  1593863             1001-1100    732157
                 51- 100  9025958             1101-1200    507761
                101- 150 10810278             1201-1300    342490
                151- 200 10421284             1301-1400    233387
                201- 250 10371674             1401-1500    182619
                251- 300 10310917             1501-1600    133198
                301- 350  9358386             1601-1700     99642
                351- 400  7249379             1701-1800     77121
                401- 450  6160350             1801-1900     66131
                451- 500  4963478             1901-2000     54537
                501- 550  3444682             2001-2100     44586
                551- 600  2627235             2101-2200     43272
                601- 650  1923073             2201-2300     33037
                651- 700  1506927             2301-2400     27344
                701- 750  1283327             2401-2500     23346
                751- 800  1108008             >2500        180141
                801- 850   848118
                851- 900   735359
                901- 950   555681
                951-1000   426396
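
Because fragments are excluded, the bins above should add up to the total
number of entries minus the number of fragments reported in the introduction.
The Python sketch below checks this and finds the most populated bin; the
dictionary layout and names are illustrative assumptions.

```python
# Number of non-fragment sequences per length bin, copied from the table above.
size_bins = {
    "1-50": 1593863, "51-100": 9025958, "101-150": 10810278,
    "151-200": 10421284, "201-250": 10371674, "251-300": 10310917,
    "301-350": 9358386, "351-400": 7249379, "401-450": 6160350,
    "451-500": 4963478, "501-550": 3444682, "551-600": 2627235,
    "601-650": 1923073, "651-700": 1506927, "701-750": 1283327,
    "751-800": 1108008, "801-850": 848118, "851-900": 735359,
    "901-950": 555681, "951-1000": 426396, "1001-1100": 732157,
    "1101-1200": 507761, "1201-1300": 342490, "1301-1400": 233387,
    "1401-1500": 182619, "1501-1600": 133198, "1601-1700": 99642,
    "1701-1800": 77121, "1801-1900": 66131, "1901-2000": 54537,
    "2001-2100": 44586, "2101-2200": 43272, "2201-2300": 33037,
    "2301-2400": 27344, "2401-2500": 23346, ">2500": 180141,
}

total_entries = 107627435  # all entries in release 2018_01
fragments = 10122293       # "Number of fragments" from the introduction

binned = sum(size_bins.values())
largest_bin = max(size_bins, key=size_bins.get)

print(binned == total_entries - fragments)  # True
print(largest_bin)                          # 101-150
```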

   [Figure: distribution of the sequences by size]


   The average sequence length in UniProtKB/TrEMBL is   335 amino acids.

   The shortest sequence is     C4PYW0_SCHMA:     2 amino acids.
   The longest sequence is  A0A1V4K6M4_PATFA: 36991 amino acids.
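
The average length follows from the totals in the introduction. The quotient
is just under 336, so the reported value of 335 is consistent with truncation
rather than rounding; that reading is an inference from the numbers, not
something this report states. A minimal Python sketch:

```python
total_amino_acids = 36161263380  # amino acids in release 2018_01
total_entries = 107627435        # sequence entries in release 2018_01

mean_length = total_amino_acids / total_entries        # exact quotient
truncated_mean = total_amino_acids // total_entries    # integer part only

print(f"{mean_length:.2f}")  # 335.99
print(truncated_mean)        # 335
```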



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   126125847                1.17                                                    
   Submitted to EMBL/GenBank/DDBJ  84374670  76549278      0.78                                                    
   Journal                         36567815  34292474      0.34                                                    
   Submitted to other databases     5159013   5134750      0.05                                                    
   Thesis                             13040     12981     <0.01                                                    
   Book citation                      11308     11243     <0.01                                                    
   Patent                                 1         1     <0.01                                                    
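
The "Average per entry" column appears to be the total number of lines
divided by the total number of entries in the release (107627435), shown with
two decimals, with values that would round to 0.00 printed as "<0.01". The
Python sketch below works under that assumption; average_per_entry is a
hypothetical helper, not a UniProt function.

```python
TOTAL_ENTRIES = 107627435  # sequence entries in release 2018_01


def average_per_entry(total_lines: int, total_entries: int = TOTAL_ENTRIES) -> str:
    """Format lines per entry with two decimals, or '<0.01' for tiny values."""
    average = total_lines / total_entries
    # Assumed convention: anything that would round to 0.00 is shown as '<0.01'.
    return "<0.01" if average < 0.005 else f"{average:.2f}"


# Total number of reference lines per subtype, copied from the table above.
reference_totals = {
    "References (RL)": 126125847,
    "Submitted to EMBL/GenBank/DDBJ": 84374670,
    "Journal": 36567815,
    "Submitted to other databases": 5159013,
    "Thesis": 13040,
    "Book citation": 11308,
    "Patent": 1,
}

for subtype, total in reference_totals.items():
    print(f"{subtype:<32}{total:>10}  {average_per_entry(total):>6}")
```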

Total number of distinct authors cited in UniProtKB/TrEMBL: 660779


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     143817070                1.34                                                    
   CATALYTIC ACTIVITY              12056025  10985273      0.11     5                                              
   CAUTION                         58189743  56965337      0.54     1                                              
   COFACTOR                         5551703   5071167      0.05     8                                              
   DOMAIN                           1096284    824067      0.01     9                                              
   ENZYME REGULATION                 298508    298506     <0.01    11                                              
   FUNCTION                        14121316  13305760      0.13     3                                              
   INTERACTION                         2959      2959     <0.01    12                                              
   MISCELLANEOUS                     637308    570088      0.01    10                                              
   PATHWAY                          6162828   5558049      0.06     7                                              
   SIMILARITY                      26020914  25681633      0.24     2                                              
   SUBCELLULAR LOCATION            12392055  12282943      0.12     4                                              
   SUBUNIT                          7287427   7189984      0.07     6                                              

Total number of comment topics: 12
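
The Rank column orders the subtypes by their total number of lines, largest
first. The Python sketch below rebuilds the ranks of the comment topics from
the totals in the table above; the names are illustrative.

```python
# Total number of comment lines per topic, copied from the table above.
comment_totals = {
    "CATALYTIC ACTIVITY": 12056025,
    "CAUTION": 58189743,
    "COFACTOR": 5551703,
    "DOMAIN": 1096284,
    "ENZYME REGULATION": 298508,
    "FUNCTION": 14121316,
    "INTERACTION": 2959,
    "MISCELLANEOUS": 637308,
    "PATHWAY": 6162828,
    "SIMILARITY": 26020914,
    "SUBCELLULAR LOCATION": 12392055,
    "SUBUNIT": 7287427,
}

# Sort topics by descending total, then number them from 1.
ordered = sorted(comment_totals, key=comment_totals.get, reverse=True)
ranks = {topic: position for position, topic in enumerate(ordered, start=1)}

for topic in sorted(comment_totals):
    print(f"{topic:<22}{comment_totals[topic]:>10}  {ranks[topic]:>3}")
```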


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     270468508                2.51                                                    
   ACT_SITE                         6141541   3772011      0.06     9                                              
   BINDING                         13205159   3405419      0.12     5                                              
   CARBOHYD                           19187     18151     <0.01    23                                              
   CHAIN                            7678071   7675641      0.07     7                                              
   COILED                          15582184  10390945      0.14     3                                              
   COMPBIAS                            4254      4254     <0.01    25                                              
   CROSSLNK                           28077     26290     <0.01    22                                              
   DISULFID                         1924164    511284      0.02    15                                              
   DNA_BIND                         2741934   2428658      0.03    13                                              
   DOMAIN                          74060284  53510597      0.69     2                                              
   INIT_MET                           48895     48895     <0.01    21                                              
   INTRAMEM                            1028       818     <0.01    27                                              
   LIPID                             339135    194665     <0.01    19                                              
   METAL                           10727554   2859689      0.10     6                                              
   MOD_RES                          2155353   1872733      0.02    14                                              
   MOTIF                            1260375    878619      0.01    17                                              
   NON_STD                             3962      3770     <0.01    26                                              
   NON_TER                         15475425  10165082      0.14     4                                              
   NP_BIND                          5778060   3675879      0.05    10                                              
   PEPTIDE                              628       381     <0.01    28                                              
   PROPEP                             14489     14489     <0.01    24                                              
   REGION                           4639296   2349492      0.04    11                                              
   REPEAT                           3982681    959687      0.04    12                                              
   SIGNAL                           7673279   7673269      0.07     8                                              
   SITE                             1746728   1068887      0.02    16                                              
   TOPO_DOM                          298985    139441     <0.01    20                                              
   TRANSIT                              102       102     <0.01    29                                              
   TRANSMEM                        94574414  20731862      0.88     1                                              
   ZN_FING                           363264    286180     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             1280065050               11.89                                                    
   Allergome                           3889      3146     <0.01    90   Protein family/group databases             
   ArachnoServer                        201       201     <0.01   110   Organism-specific databases                
   Araport                            19473     19389     <0.01    79   Organism-specific databases                
   BRENDA                              9616      9323     <0.01    84   Enzyme and pathway databases               
   Bgee                              546752    546598      0.01    48   Gene expression databases                  
   BindingDB                            200       200     <0.01   111   Chemistry                                  
   BioCyc                           3444977   3443740      0.03    29   Enzyme and pathway databases               
   CAZy                              129409    121103     <0.01    57   Protein family/group databases             
   CDD                             18417037  16219717      0.17    14   Family and domain databases                
   CGD                                20814     20748     <0.01    78   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   127   2D gel databases                           
   CORUM                                117       117     <0.01   115   Protein-protein interaction databases      
   CTD                               902285    900365      0.01    41   Organism-specific databases                
   ChEMBL                               886       886     <0.01   103   Chemistry                                  
   ChiTaRS                            86083     85924     <0.01    61   Other                                      
   CollecTF                             200       200     <0.01   112   Gene expression databases                  
   ConoServer                           160       160     <0.01   113   Organism-specific databases                
   DIP                                 3231      3230     <0.01    92   Protein-protein interaction databases      
   DNASU                              41344     40905     <0.01    69   Protocols and materials databases          
   DisProt                               95        95     <0.01   117   3D structure databases                     
   DrugBank                             640       355     <0.01   104   Chemistry                                  
   ELM                                  116       116     <0.01   116   Protein-protein interaction databases      
   EMBL                           118005380 104344252      1.10     3   Sequence databases                         
   EPD                                14177     14177     <0.01    82   Proteomic databases                        
   ESTHER                             75343     75061     <0.01    62   Protein family/group databases             
   Ensembl                          1285811   1246197      0.01    36   Genome annotation databases                
   EnsemblBacteria                 40464972  38198886      0.38    10   Genome annotation databases                
   EnsemblFungi                     6221203   6114889      0.06    27   Genome annotation databases                
   EnsemblMetazoa                   1098664   1071190      0.01    39   Genome annotation databases                
   EnsemblPlants                    1977743   1810923      0.02    33   Genome annotation databases                
   EnsemblProtists                  1893586   1780545      0.02    34   Genome annotation databases                
   EuPathDB                          648381    648231      0.01    44   Organism-specific databases                
   EvolutionaryTrace                   5996      5996     <0.01    87   Other                                      
   ExpressionAtlas                   628813    628811      0.01    45   Gene expression databases                  
   FlyBase                           222651    221278     <0.01    55   Organism-specific databases                
   GO                             189560296  67760658      1.76     2   Ontologies                                 
   Gene3D                          46307258  38832085      0.43     8   Family and domain databases                
   GeneCards                           1524      1504     <0.01    99   Organism-specific databases                
   GeneDB                            114830    113050     <0.01    59   Genome annotation databases                
   GeneID                          10487140  10378940      0.10    21   Genome annotation databases                
   GeneTree                         1233479   1233351      0.01    37   Phylogenomic databases                     
   Genevisible                        15901     15894     <0.01    80   Gene expression databases                  
   GenomeRNAi                         30227     30227     <0.01    75   Other                                      
   Gramene                          1977743   1810923      0.02    32   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   125   Chemistry                                  
   H-InvDB                              590       443     <0.01   106   Organism-specific databases                
   HAMAP                           11508822  11378178      0.11    20   Family and domain databases                
   HGNC                               50779     50684     <0.01    67   Organism-specific databases                
   HOGENOM                          3036137   3036047      0.03    30   Phylogenomic databases                     
   HOVERGEN                          300580    300568     <0.01    51   Phylogenomic databases                     
   InParanoid                       2445961   2445961      0.02    31   Phylogenomic databases                     
   IntAct                             26076     26076     <0.01    76   Protein-protein interaction databases      
   InterPro                       264295760  82080351      2.46     1   Family and domain databases                
   KEGG                            14727590  14341066      0.14    16   Genome annotation databases                
   KO                               6364225   6338100      0.06    26   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    95   Organism-specific databases                
   Leproma                             1271      1269     <0.01   100   Organism-specific databases                
   MEROPS                            247629    247628     <0.01    53   Protein family/group databases             
   MGI                                60947     60571     <0.01    65   Organism-specific databases                
   MIM                                    4         4     <0.01   126   Organism-specific databases                
   MINT                                9708      9707     <0.01    83   Protein-protein interaction databases      
   MalaCards                              9         9     <0.01   122   Organism-specific databases                
   MaxQB                              40582     40582     <0.01    70   Proteomic databases                        
   MoonProt                               3         3     <0.01   128   Protein family/group databases             
   OGP                                    3         3     <0.01   129   2D gel databases                           
   OMA                              6429196   6429083      0.06    25   Phylogenomic databases                     
   OpenTargets                        48804     48755     <0.01    68   Organism-specific databases                
   OrthoDB                         14467111  14467021      0.13    17   Phylogenomic databases                     
   PANTHER                         20128910  19427875      0.19    13   Family and domain databases                
   PATRIC                          17985992  17983958      0.17    15   Genome annotation databases                
   PDB                                35292     17529     <0.01    71   3D structure databases                     
   PDBsum                             34047     16816     <0.01    73   3D structure databases                     
   PIR                               162978    130736     <0.01    56   Sequence databases                         
   PIRSF                            9162511   9087078      0.09    22   Family and domain databases                
   PMAP-CutDB                           131       131     <0.01   114   Other                                      
   PRIDE                             274389    274389     <0.01    52   Proteomic databases                        
   PRINTS                          13937615  12581255      0.13    19   Family and domain databases                
   PRO                                 2206      2206     <0.01    98   Other                                      
   PROSITE                         52659209  35121102      0.49     7   Family and domain databases                
   PaxDb                             372831    372831     <0.01    50   Proteomic databases                        
   PeptideAtlas                      117050    117050     <0.01    58   Proteomic databases                        
   PeroxiBase                          2481      2473     <0.01    96   Protein family/group databases             
   Pfam                           102427256  74430412      0.95     4   Family and domain databases                
   PharmGKB                            3154      3154     <0.01    93   Organism-specific databases                
   PhosphoSitePlus                     2282      2282     <0.01    97   PTM databases                              
   PhylomeDB                         469434    469434     <0.01    49   Phylogenomic databases                     
   PomBase                               31        31     <0.01   120   Organism-specific databases                
   ProDom                           1536856   1468268      0.01    35   Family and domain databases                
   ProMEX                              2636      2636     <0.01    94   Proteomic databases                        
   ProteinModelPortal               7378967   7378967      0.07    23   3D structure databases                     
   Proteomes                       91690730  87430687      0.85     5   Other                                      
   PseudoCAP                           4449      4445     <0.01    89   Organism-specific databases                
   REBASE                             31960     31945     <0.01    74   Protein family/group databases             
   REPRODUCTION-2DPAGE                   63        62     <0.01   119   2D gel databases                           
   RGD                                25067     23773     <0.01    77   Organism-specific databases                
   Reactome                          241138     86451     <0.01    54   Enzyme and pathway databases               
   RefSeq                          43939446  42959263      0.41     9   Sequence databases                         
   SABIO-RK                             602       602     <0.01   105   Enzyme and pathway databases               
   SFLD                              931551    486621      0.01    40   Family and domain databases                
   SGD                                    7         7     <0.01   124   Organism-specific databases                
   SIGNOR                                 8         8     <0.01   123   Enzyme and pathway databases               
   SMART                           24810672  18877390      0.23    11   Family and domain databases                
   SMR                              1134159   1134159      0.01    38   3D structure databases                     
   STRING                           6509273   6509164      0.06    24   Protein-protein interaction databases      
   SUPFAM                          69060245  54368045      0.64     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   130   2D gel databases                           
   SignaLink                           3804      3804     <0.01    91   Enzyme and pathway databases               
   SwissLipids                           82        82     <0.01   118   Chemistry                                  
   SwissPalm                           1218      1218     <0.01   101   PTM databases                              
   TAIR                               15706     15628     <0.01    81   Organism-specific databases                
   TCDB                                8074      8064     <0.01    85   Protein family/group databases             
   TIGRFAMs                        21696495  19937163      0.20    12   Family and domain databases                
   TopDownProteomics                    281       281     <0.01   109   Proteomic databases                        
   TreeFam                           568149    568116      0.01    47   Phylogenomic databases                     
   TubercuList                         1003      1002     <0.01   102   Organism-specific databases                
   UCSC                               93664     93468     <0.01    60   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   121   PTM databases                              
   UniGene                           848714    719551      0.01    43   Sequence databases                         
   UniPathway                       5995358   5539957      0.06    28   Enzyme and pathway databases               
   VectorBase                        592666    571657      0.01    46   Genome annotation databases                
   WBParaSite                        854112    845705      0.01    42   Genome annotation databases                
   World-2DPAGE                         316       311     <0.01   108   2D gel databases                           
   WormBase                           65638     65244     <0.01    64   Organism-specific databases                
   Xenbase                            34306     34246     <0.01    72   Organism-specific databases                
   ZFIN                               53578     53216     <0.01    66   Organism-specific databases                
   dictyBase                           7987      7765     <0.01    86   Organism-specific databases                
   eggNOG                          14104352   7069253      0.13    18   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    63   Organism-specific databases                
   iPTMnet                             5253      5253     <0.01    88   PTM databases                              
   mycoCLAP                             447       447     <0.01   107   Protein family/group databases             

Number of explicitly cross-referenced databases: 149


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.09   Gln (Q) 3.78   Leu (L) 9.87   Ser (S) 6.68
   Arg (R) 5.70   Glu (E) 6.17   Lys (K) 5.00   Thr (T) 5.56
   Asn (N) 3.89   Gly (G) 7.27   Met (M) 2.38   Trp (W) 1.29
   Asp (D) 5.47   His (H) 2.19   Phe (F) 3.93   Tyr (Y) 2.93
   Cys (C) 1.20   Ile (I) 5.73   Pro (P) 4.83   Val (V) 6.88

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05

   [Figure: amino acid composition of the complete database]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Val, Ser, Glu, Ile, Arg, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
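
This ordering is the composition table of section 5.1 sorted by decreasing
percentage. A short Python sketch that reproduces it (the dictionary is
transcribed from 5.1; the names are illustrative):

```python
# Composition in percent for the complete database (section 5.1).
composition = {
    "Ala": 9.09, "Gln": 3.78, "Leu": 9.87, "Ser": 6.68,
    "Arg": 5.70, "Glu": 6.17, "Lys": 5.00, "Thr": 5.56,
    "Asn": 3.89, "Gly": 7.27, "Met": 2.38, "Trp": 1.29,
    "Asp": 5.47, "His": 2.19, "Phe": 3.93, "Tyr": 2.93,
    "Cys": 1.20, "Ile": 5.73, "Pro": 4.83, "Val": 6.88,
}

# Most frequent amino acid first.
by_frequency = sorted(composition, key=composition.get, reverse=True)

print(", ".join(by_frequency))
```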



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1609569
Total number of entries encoded on a Plasmid: 799118
Total number of entries encoded on a Plastid: 112020
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: