Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2018_08 STATISTICS


1.  INTRODUCTION

Release 2018_08 of 12-Sep-2018 of UniProtKB/TrEMBL contains 124797108 sequence entries,
comprising 42025199451 amino acids.

5177933 sequences have been added since release 2018_07, the sequence data of
670 existing entries has been updated, and the annotations of
16500700 entries have been revised. This represents an increase of 4% in the
number of sequence entries.
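
The growth figure can be reproduced from the counts above. The following is a
minimal sketch, assuming the percentage refers to the number of sequence
entries relative to release 2018_07 and that no entries were removed between
the two releases (neither assumption is stated in this report).

```python
# Entry counts quoted in the introduction.
current_entries = 124_797_108   # release 2018_08
added_entries = 5_177_933       # added since release 2018_07

# Assumption: no entries were deleted, so the previous total is the difference.
previous_entries = current_entries - added_entries

# Relative growth in percent, measured against the previous release.
growth_percent = 100 * added_entries / previous_entries
print(f"{growth_percent:.1f}%")
```

Under these assumptions the result rounds to the 4% quoted above.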

Number of fragments: 11623697

Protein existence (PE):              entries      %
1: Evidence at protein level          144459     0.12%
2: Evidence at transcript level      1162753     0.93%
3: Inferred from homology           30704463    24.60%
4: Predicted                        92785433    74.35%
5: Uncertain                               0     0.00%
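
The percentage column can be recomputed from the entry counts. The sketch below
copies the counts from the table above; the variable names are illustrative
only.

```python
# Entry counts per protein existence (PE) level, copied from the table above.
pe_counts = {
    "1: Evidence at protein level": 144_459,
    "2: Evidence at transcript level": 1_162_753,
    "3: Inferred from homology": 30_704_463,
    "4: Predicted": 92_785_433,
    "5: Uncertain": 0,
}

# The five levels partition the release, so their sum is the entry total.
total_entries = sum(pe_counts.values())

# Share of each level in percent, rounded to two decimals as in the table.
pe_percent = {
    label: round(100 * count / total_entries, 2)
    for label, count in pe_counts.items()
}

for label, percent in pe_percent.items():
    print(f"{label:<34}{pe_counts[label]:>10}{percent:>9.2f}%")
```

The counts sum to the 124797108 entries reported in the introduction.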

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 905548

   The first twenty species represent 6964648 sequences, or 5.6% of the
   total number of entries.


   2.1 Table of the frequency of occurrence of species

        Species represented 1x: 430270
                            2x: 123245
                            3x:  65147
                            4x:  46191
                            5x:  28120
                            6x:  20198
                            7x:  15103
                            8x:  11863
                            9x:   9768
                           10x:  15016
                       11- 20x:  76792
                       21- 50x:  21529
                       51-100x:  11872
                         >100x:  30434


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     879856  Human immunodeficiency virus 1
       2     720688  Acidobacteria bacterium
       3     668601  marine sediment metagenome
       4     506839  Daphnia magna
       5     487552  Escherichia coli
       6     444710  Bacillus cereus
       7     341096  Verrucomicrobia bacterium
       8     324490  Gammaproteobacteria bacterium
       9     294957  Helicobacter pylori (Campylobacter pylori)
      10     262298  Euryarchaeota archaeon
      11     260021  Arundo donax (Giant reed) (Donax arundinaceus)
      12     259455  uncultured bacterium
      13     210183  Chloroflexi bacterium
      14     207459  Hordeum vulgare subsp. vulgare (Domesticated barley)
      15     206413  Gemmatimonadetes bacterium
      16     197143  Candidatus Rokubacteria bacterium
      17     185765  Pseudomonas fluorescens
      18     174449  Rhodospirillaceae bacterium
      19     169324  Stenotrophomonas maltophilia (Pseudomonas maltophilia) (Xanthomonas maltophilia)
      20     163349  Fundulus heteroclitus (Killifish) (Mummichog)
      21     161392  Pseudomonas putida (Arthrobacter siderocapsulatus)
      22     160329  Flavobacteriaceae bacterium
      23     153844  Homo sapiens (Human)
      24     151059  Hepacivirus C
      25     149405  Triticum aestivum (Wheat)
      26     146950  Streptococcus pneumoniae
      27     130733  Klebsiella pneumoniae
      28     129559  Candidatus Marinimicrobia bacterium
      29     129399  Zea mays (Maize)
      30     127745  mine drainage metagenome
      31     124874  Bacillus thuringiensis
      32     123924  Hepatitis B virus (HBV)
      33     121355  Enterobacter cloacae
      34     121109  Mycobacteroides abscessus subsp. abscessus
      35     118638  Oryza sativa subsp. japonica (Rice)
      36     116737  Acidimicrobiaceae bacterium
      37     116687  Flavobacteriales bacterium
      38     114537  Planctomycetaceae bacterium
      39     113860  Deltaproteobacteria bacterium
      40     111944  uncultured Clostridium sp
      41     111477  groundwater metagenome
      42     108767  Pan troglodytes (Chimpanzee)
      43     104820  Pseudomonas aeruginosa
      44     103975  Rhizophagus irregularis
      45     103727  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      46     103119  Staphylococcus aureus
      47      99351  Pyramidula sp. HNHM


   2.3  Taxonomic distribution of the sequences

[Figure: taxonomic distribution of the sequences]

   Kingdom        sequences (% of the database)
    Archaea         2764548 (  2%)
    Bacteria       87051270 ( 70%)
    Eukaryota      30239194 ( 24%)
    Viruses         3596915 (  3%)
    Other           1145181 ( <1%)



   Within Eukaryota:

[Figure: distribution of the sequences within Eukaryota]

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 153919 (  1%)           ( <1%)
     Other Mammalia       2396453 (  8%)           (  2%)
     Other Vertebrata     2930198 ( 10%)           (  2%)
     Viridiplantae        6370936 ( 21%)           (  5%)
     Fungi                8477371 ( 28%)           (  7%)
     Insecta              3216363 ( 11%)           (  3%)
     Nematoda             1558753 (  5%)           (  1%)
     Other                5135201 ( 17%)           (  4%)



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  1742814             1001-1100    858871
                 51- 100 10447839             1101-1200    596513
                101- 150 12551846             1201-1300    402533
                151- 200 12106267             1301-1400    272616
                201- 250 12029555             1401-1500    213064
                251- 300 11959607             1501-1600    155903
                301- 350 10872876             1601-1700    117348
                351- 400  8432334             1701-1800     90183
                401- 450  7172263             1801-1900     78187
                451- 500  5765505             1901-2000     64756
                501- 550  4006026             2001-2100     52484
                551- 600  3046780             2101-2200     50387
                601- 650  2239933             2201-2300     39034
                651- 700  1756528             2301-2400     32182
                701- 750  1497271             2401-2500     27507
                751- 800  1283238             >2500        210761
                801- 850   992527
                851- 900   855321
                901- 950   652416
                951-1000   500136

[Figure: distribution of the sequences by size]


   The average sequence length in UniProtKB/TrEMBL is 336 amino acids.

   The shortest sequence is     C4PYW0_SCHMA:     2 amino acids.
   The longest sequence is  A0A1V4K6M4_PATFA: 36991 amino acids.



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   146157937                1.17                                                    
   Submitted to EMBL/GenBank/DDBJ  99319203  90495826      0.80                                                    
   Journal                         40661326  38308117      0.33                                                    
   Submitted to other databases     6151082   6124696      0.05                                                    
   Thesis                             14951     14892     <0.01                                                    
   Book citation                      11374     11309     <0.01                                                    
   Patent                                 1         1     <0.01                                                    

Total number of distinct authors cited in UniProtKB/TrEMBL: 685454


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     170539484                1.37                                                    
   ACTIVITY REGULATION               339244    339242     <0.01    11                                              
   CATALYTIC ACTIVITY              14381512  13091072      0.12     5                                              
   CAUTION                         70151363  68574919      0.56     1                                              
   COFACTOR                         6258508   5724172      0.05     8                                              
   DOMAIN                            955264    901760      0.01     9                                              
   FUNCTION                        16578674  15797912      0.13     3                                              
   INTERACTION                         3467      3467     <0.01    12                                              
   MISCELLANEOUS                     566341    557689     <0.01    10                                              
   PATHWAY                          7332557   6600394      0.06     7                                              
   SIMILARITY                      30758219  30336990      0.25     2                                              
   SUBCELLULAR LOCATION            14540735  14505565      0.12     4                                              
   SUBUNIT                          8673600   8580451      0.07     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     318230511                2.55                                                    
   ACT_SITE                         7757911   4731702      0.06     9                                              
   BINDING                         16581714   4203990      0.13     5                                              
   CARBOHYD                           21350     20202     <0.01    23                                              
   CHAIN                            9060261   9048384      0.07     7                                              
   COILED                          18211431  12132671      0.15     3                                              
   COMPBIAS                            4682      4682     <0.01    26                                              
   CROSSLNK                           29523     27492     <0.01    22                                              
   DISULFID                         1914045    513206      0.02    16                                              
   DNA_BIND                         3158468   2799079      0.03    13                                              
   DOMAIN                          87687613  63302921      0.70     2                                              
   INIT_MET                           39076     39076     <0.01    21                                              
   INTRAMEM                            1172       956     <0.01    27                                              
   LIPID                             294637    154470     <0.01    19                                              
   METAL                           13136672   3495881      0.11     6                                              
   MOD_RES                          2355129   2116658      0.02    14                                              
   MOTIF                            1425739    994209      0.01    17                                              
   NON_STD                             5634      5441     <0.01    25                                              
   NON_TER                         17423045  11676371      0.14     4                                              
   NP_BIND                          7126807   4503539      0.06    10                                              
   PEPTIDE                              715       450     <0.01    28                                              
   PROPEP                             17692     17692     <0.01    24                                              
   REGION                           5235851   2785307      0.04    11                                              
   REPEAT                           4821710   1152766      0.04    12                                              
   SIGNAL                           9035741   9035731      0.07     8                                              
   SITE                             2039460   1212136      0.02    15                                              
   TOPO_DOM                          263694    101919     <0.01    20                                              
   TRANSIT                              137       137     <0.01    29                                              
   TRANSMEM                       110147613  24205755      0.88     1                                              
   ZN_FING                           432989    341631     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             1471684995               11.79                                                    
   Allergome                           3949      3184     <0.01    90   Protein family/group databases             
   ArachnoServer                        200       200     <0.01   113   Organism-specific databases                
   Araport                            15228     15161     <0.01    81   Organism-specific databases                
   BRENDA                              9568      9278     <0.01    84   Enzyme and pathway databases               
   Bgee                              531282    531273     <0.01    48   Gene expression databases                  
   BindingDB                            271       271     <0.01   111   Chemistry                                  
   BioCyc                           6070700   6052419      0.05    29   Enzyme and pathway databases               
   CAZy                              129045    120758     <0.01    58   Protein family/group databases             
   CDD                             22325546  19613375      0.18    14   Family and domain databases                
   CGD                                20801     20735     <0.01    79   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   132   2D gel databases                           
   CORUM                                114       114     <0.01   119   Protein-protein interaction databases      
   CTD                              1139402   1137430      0.01    40   Organism-specific databases                
   CarbonylDB                           265       265     <0.01   112   PTM databases                              
   ChEMBL                               965       965     <0.01   104   Chemistry                                  
   ChiTaRS                           131460    131459     <0.01    57   Other                                      
   CollecTF                             199       199     <0.01   114   Gene expression databases                  
   ComplexPortal                        182       133     <0.01   115   Protein-protein interaction databases      
   ConoServer                           159       159     <0.01   116   Organism-specific databases                
   DIP                                 3216      3215     <0.01    93   Protein-protein interaction databases      
   DNASU                              41279     40840     <0.01    71   Protocols and materials databases          
   DisProt                               96        96     <0.01   121   3D structure databases                     
   DrugBank                             742       449     <0.01   105   Chemistry                                  
   ELM                                  107       107     <0.01   120   Protein-protein interaction databases      
   EMBL                           137140385 120773326      1.10     3   Sequence databases                         
   EPD                                14085     14085     <0.01    82   Proteomic databases                        
   ESTHER                             74490     74192     <0.01    64   Protein family/group databases             
   Ensembl                          1908003   1864078      0.02    34   Genome annotation databases                
   EnsemblBacteria                 39072240  36859779      0.31    10   Genome annotation databases                
   EnsemblFungi                     6182788   6076488      0.05    28   Genome annotation databases                
   EnsemblMetazoa                   1179113   1126925      0.01    39   Genome annotation databases                
   EnsemblPlants                    2164410   1964895      0.02    32   Genome annotation databases                
   EnsemblProtists                  1872785   1760840      0.02    35   Genome annotation databases                
   EuPathDB                          671599    670904      0.01    44   Organism-specific databases                
   EvolutionaryTrace                   5943      5943     <0.01    87   Other                                      
   ExpressionAtlas                   638053    637794      0.01    45   Gene expression databases                  
   FlyBase                           208332    207035     <0.01    55   Organism-specific databases                
   GO                             221906014  80032879      1.78     2   Ontologies                                 
   Gene3D                          53820161  44806380      0.43     8   Family and domain databases                
   GeneCards                           1310      1291     <0.01   101   Organism-specific databases                
   GeneDB                            114675    112895     <0.01    60   Genome annotation databases                
   GeneID                          10640022  10533434      0.09    22   Genome annotation databases                
   GeneTree                         1831524   1831446      0.01    36   Phylogenomic databases                     
   Genevisible                        15842     15835     <0.01    80   Gene expression databases                  
   GenomeRNAi                         29997     29997     <0.01    76   Other                                      
   GlyConnect                            13        13     <0.01   126   PTM databases                              
   Gramene                          2164410   1964895      0.02    33   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   130   Chemistry                                  
   H-InvDB                              587       440     <0.01   107   Organism-specific databases                
   HAMAP                           13735490  13580999      0.11    20   Family and domain databases                
   HGNC                               51986     51890     <0.01    68   Organism-specific databases                
   HOGENOM                          2996720   2996639      0.02    30   Phylogenomic databases                     
   HOVERGEN                          300367    300354     <0.01    53   Phylogenomic databases                     
   InParanoid                       2347029   2347029      0.02    31   Phylogenomic databases                     
   IntAct                             26640     26640     <0.01    77   Protein-protein interaction databases      
   InterPro                       315189003  96517965      2.53     1   Family and domain databases                
   KEGG                            16276301  15844081      0.13    17   Genome annotation databases                
   KO                               7159396   7130103      0.06    24   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    96   Organism-specific databases                
   Leproma                             1271      1269     <0.01   102   Organism-specific databases                
   MEROPS                            243185    243184     <0.01    54   Protein family/group databases             
   MGI                                61793     61358     <0.01    65   Organism-specific databases                
   MIM                                    4         4     <0.01   131   Organism-specific databases                
   MINT                                2829      2829     <0.01    95   Protein-protein interaction databases      
   MalaCards                             12        12     <0.01   127   Organism-specific databases                
   MaxQB                              42229     42229     <0.01    70   Proteomic databases                        
   MoonDB                                 1         1     <0.01   135   Protein family/group databases             
   MoonProt                              64        64     <0.01   123   Protein family/group databases             
   OGP                                    3         3     <0.01   133   2D gel databases                           
   OMA                              6851650   6851573      0.05    26   Phylogenomic databases                     
   OpenTargets                        49923     49874     <0.01    69   Organism-specific databases                
   OrthoDB                         14233319  14233199      0.11    18   Phylogenomic databases                     
   PANTHER                         25728130  24835119      0.21    13   Family and domain databases                
   PATRIC                          17342003  17331375      0.14    15   Genome annotation databases                
   PDB                                37489     18336     <0.01    72   3D structure databases                     
   PDBsum                             37023     18029     <0.01    73   3D structure databases                     
   PIR                               162677    130437     <0.01    56   Sequence databases                         
   PIRSF                           10831819  10741297      0.09    21   Family and domain databases                
   PMAP-CutDB                           130       130     <0.01   118   Other                                      
   PRIDE                             336418    336418     <0.01    50   Proteomic databases                        
   PRINTS                          16287824  14696318      0.13    16   Family and domain databases                
   PRO                                 2259      2259     <0.01    98   Other                                      
   PROSITE                         62184467  41476922      0.50     7   Family and domain databases                
   PaxDb                             328104    328104     <0.01    51   Proteomic databases                        
   PeptideAtlas                      128755    128755     <0.01    59   Proteomic databases                        
   PeroxiBase                          2475      2467     <0.01    97   Protein family/group databases             
   Pfam                           121298027  88142439      0.97     4   Family and domain databases                
   PharmGKB                            3134      3134     <0.01    94   Organism-specific databases                
   PhosphoSitePlus                     2240      2240     <0.01    99   PTM databases                              
   PhylomeDB                         461341    461341     <0.01    49   Phylogenomic databases                     
   PomBase                                2         2     <0.01   134   Organism-specific databases                
   ProDom                           1747109   1674847      0.01    37   Family and domain databases                
   ProMEX                              3230      3230     <0.01    92   Proteomic databases                        
   ProteinModelPortal               7191950   7191950      0.06    23   3D structure databases                     
   Proteomes                      101631905  96303018      0.81     5   Other                                      
   PseudoCAP                           4449      4445     <0.01    89   Organism-specific databases                
   REBASE                             31413     31401     <0.01    75   Protein family/group databases             
   REPRODUCTION-2DPAGE                   62        61     <0.01   124   2D gel databases                           
   RGD                                21620     20727     <0.01    78   Organism-specific databases                
   Reactome                          327177    116140     <0.01    52   Enzyme and pathway databases               
   RefSeq                          44968918  43861624      0.36     9   Sequence databases                         
   SABIO-RK                             644       644     <0.01   106   Enzyme and pathway databases               
   SFLD                             1120242    580497      0.01    41   Family and domain databases                
   SGD                                    7         7     <0.01   128   Organism-specific databases                
   SIGNOR                                 7         7     <0.01   129   Enzyme and pathway databases               
   SMART                           29400221  22336824      0.24    11   Family and domain databases                
   SMR                              1239713   1239713      0.01    38   3D structure databases                     
   STRING                           6441418   6441169      0.05    27   Protein-protein interaction databases      
   SUPFAM                          80659935  63852314      0.65     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   136   2D gel databases                           
   SignaLink                           3795      3795     <0.01    91   Enzyme and pathway databases               
   SwissLipids                           82        82     <0.01   122   Chemistry                                  
   SwissPalm                           1904      1904     <0.01   100   PTM databases                              
   TAIR                               11892     11830     <0.01    83   Organism-specific databases                
   TCDB                                8186      8175     <0.01    85   Protein family/group databases             
   TIGRFAMs                        25843448  23771886      0.21    12   Family and domain databases                
   TopDownProteomics                    280       280     <0.01   110   Proteomic databases                        
   TreeFam                           558538    558502     <0.01    47   Phylogenomic databases                     
   TubercuList                         1000       999     <0.01   103   Organism-specific databases                
   UCSC                               93123     92918     <0.01    61   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   125   PTM databases                              
   UniGene                           865999    732249      0.01    42   Sequence databases                         
   UniLectin                            156       156     <0.01   117   Protein family/group databases             
   UniPathway                       7125425   6578194      0.06    25   Enzyme and pathway databases               
   VGNC                               79929     79929     <0.01    62   Organism-specific databases                
   VectorBase                        580334    561628     <0.01    46   Genome annotation databases                
   WBParaSite                        854112    845705      0.01    43   Genome annotation databases                
   World-2DPAGE                         316       311     <0.01   109   2D gel databases                           
   WormBase                           55884     55500     <0.01    66   Organism-specific databases                
   Xenbase                            34432     34353     <0.01    74   Organism-specific databases                
   ZFIN                               52649     52520     <0.01    67   Organism-specific databases                
   dictyBase                           7987      7765     <0.01    86   Organism-specific databases                
   eggNOG                          13807043   6920731      0.11    19   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    63   Organism-specific databases                
   iPTMnet                             5136      5136     <0.01    88   PTM databases                              
   mycoCLAP                             447       447     <0.01   108   Protein family/group databases             

Number of explicitly cross-referenced databases: 156
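
The "Average per entry" column in the tables of this section appears to be the
total number of lines divided by the number of entries in the whole release,
not by the number of entries that carry such a line. The sketch below checks
that reading against the four line-type totals; it is an interpretation of the
tables, not a documented definition.

```python
# Total number of entries in release 2018_08 (from the introduction).
total_entries = 124_797_108

# Total number of lines per line type, copied from the tables above.
line_totals = {
    "References (RL)": 146_157_937,
    "Comments (CC)": 170_539_484,
    "Features (FT)": 318_230_511,
    "Cross-references (DR)": 1_471_684_995,
}

# Average per entry, rounded to two decimals as in the tables.
averages = {
    line_type: round(total / total_entries, 2)
    for line_type, total in line_totals.items()
}

for line_type, average in averages.items():
    print(f"{line_type:<24}{average:>6.2f}")
```

The same division reproduces the subtype rows: for example, 99319203
"Submitted to EMBL/GenBank/DDBJ" references over 124797108 entries gives 0.80.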


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.15   Gln (Q) 3.78   Leu (L) 9.89   Ser (S) 6.67
   Arg (R) 5.74   Glu (E) 6.17   Lys (K) 4.96   Thr (T) 5.55
   Asn (N) 3.86   Gly (G) 7.31   Met (M) 2.37   Trp (W) 1.30
   Asp (D) 5.47   His (H) 2.19   Phe (F) 3.92   Tyr (Y) 2.91
   Cys (C) 1.19   Ile (I) 5.69   Pro (P) 4.86   Val (V) 6.90

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05

[Figure: amino acid composition]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Val, Ser, Glu, Arg, Ile, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
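
The ordering above follows directly from the percentages in 5.1. A short
sketch that sorts the twenty standard amino acids by frequency (the dictionary
is copied from the composition table; the ambiguity codes B, Z and X are left
out):

```python
# Composition in percent for the complete database, copied from section 5.1.
composition = {
    "Ala": 9.15, "Gln": 3.78, "Leu": 9.89, "Ser": 6.67,
    "Arg": 5.74, "Glu": 6.17, "Lys": 4.96, "Thr": 5.55,
    "Asn": 3.86, "Gly": 7.31, "Met": 2.37, "Trp": 1.30,
    "Asp": 5.47, "His": 2.19, "Phe": 3.92, "Tyr": 2.91,
    "Cys": 1.19, "Ile": 5.69, "Pro": 4.86, "Val": 6.90,
}

# Sort residue names from most to least frequent.
ranked = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranked))
```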



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1776043
Total number of entries encoded on a Plasmid: 870796
Total number of entries encoded on a Plastid: 121853
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: