Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2018_05 STATISTICS


1.  INTRODUCTION

Release 2018_05 of 23-May-2018 of UniProtKB/TrEMBL contains 115678811 sequence entries,
comprising 38982680753 amino acids.

Since release 2018_04, 1479097 sequences have been added, the sequence data of
4598 existing entries has been updated, and the annotations of
19802872 entries have been revised. This represents an increase of about 1%
in the number of entries.

Number of fragments: 10797121

Protein existence (PE):              entries      %
1: Evidence at protein level          142928     0.12%
2: Evidence at transcript level      1180723     1.02%
3: Inferred from homology           29227113    25.27%
4: Predicted                        85128047    73.59%
5: Uncertain                               0     0.00%
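The percentages in the protein existence table follow directly from the entry counts. The sketch below recomputes them in Python; the names `pe_counts` and `pe_percentages` are illustrative and not part of any UniProt tooling.

```python
# Entry counts per protein existence (PE) level, copied from the table above.
pe_counts = {
    "1: Evidence at protein level": 142928,
    "2: Evidence at transcript level": 1180723,
    "3: Inferred from homology": 29227113,
    "4: Predicted": 85128047,
    "5: Uncertain": 0,
}

# The five levels together should cover every entry in the release.
total_entries = sum(pe_counts.values())


def pe_percentages(counts):
    """Return each level's share of all entries as a percentage, two decimals."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 2) for level, n in counts.items()}


shares = pe_percentages(pe_counts)
for level, pct in shares.items():
    print(f"{level:<34}{pe_counts[level]:>10}{pct:>9.2f}%")
```

The counts add up to the 115678811 entries reported in the introduction, so every entry carries exactly one PE level.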

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 841703

   The twenty most represented species (see 2.2) account for 5746684 sequences,
   about 5% of the total number of entries.
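As a quick check of this figure, the sketch below sums the first twenty frequencies from the table in 2.2 and expresses them as a share of all entries. The variable names are illustrative only.

```python
# Frequencies of the twenty most represented species, copied from table 2.2.
top_twenty = [
    857550, 668601, 506839, 448045, 324490,
    281178, 261883, 260021, 252258, 220461,
    207273, 190111, 182247, 174449, 163349,
    154302, 151161, 150901, 145866, 145699,
]

# Total number of entries in release 2018_05 (from the introduction).
TOTAL_ENTRIES = 115678811

top_twenty_sum = sum(top_twenty)
top_twenty_share = 100 * top_twenty_sum / TOTAL_ENTRIES

print(f"{top_twenty_sum} sequences, {top_twenty_share:.2f}% of all entries")
```

The share comes out just under 5%, which matches the rounded value quoted above.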


   2.1 Table of the frequency of occurrence of species

        Species represented 1x: 375980
                            2x: 119856
                            3x:  63949
                            4x:  45467
                            5x:  27590
                            6x:  19891
                            7x:  14902
                            8x:  11729
                            9x:   9638
                           10x:  14872
                       11- 20x:  76048
                       21- 50x:  21206
                       51-100x:  11392
                         >100x:  29183
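The occurrence classes above partition the species, so their counts should add up to the total number of species. A minimal Python sketch of that check follows; the dictionary name is illustrative.

```python
# Number of species per occurrence class, copied from table 2.1.
species_by_occurrence = {
    "1x": 375980, "2x": 119856, "3x": 63949, "4x": 45467, "5x": 27590,
    "6x": 19891, "7x": 14902, "8x": 11729, "9x": 9638, "10x": 14872,
    "11-20x": 76048, "21-50x": 21206, "51-100x": 11392, ">100x": 29183,
}

total_species = sum(species_by_occurrence.values())

# Share of species that are represented by a single sequence only.
singleton_share = 100 * species_by_occurrence["1x"] / total_species

print(f"{total_species} species, {singleton_share:.1f}% represented once")
```

The classes sum to the 841703 species reported at the start of this section, and a little under 45% of those species are represented by a single sequence.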


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     857550  Human immunodeficiency virus 1
       2     668601  marine sediment metagenome
       3     506839  Daphnia magna
       4     448045  Bacillus cereus
       5     324490  Gammaproteobacteria bacterium
       6     281178  Escherichia coli
       7     261883  Euryarchaeota archaeon
       8     260021  Arundo donax (Giant reed) (Donax arundinaceus)
       9     252258  uncultured bacterium
      10     220461  Helicobacter pylori (Campylobacter pylori)
      11     207273  Hordeum vulgare subsp. vulgare (Domesticated barley)
      12     190111  Pseudomonas fluorescens
      13     182247  Chloroflexi bacterium
      14     174449  Rhodospirillaceae bacterium
      15     163349  Fundulus heteroclitus (Killifish) (Mummichog)
      16     154302  Flavobacteriaceae bacterium
      17     151161  Homo sapiens (Human)
      18     150901  Hepacivirus C
      19     145866  Streptococcus pneumoniae
      20     145699  Triticum aestivum (Wheat)
      21     139671  Pseudomonas putida (Arthrobacter siderocapsulatus)
      22     129559  Candidatus Marinimicrobia bacterium
      23     129406  Zea mays (Maize)
      24     127745  mine drainage metagenome
      25     121627  Bacillus thuringiensis
      26     121581  Hepatitis B virus (HBV)
      27     121109  Mycobacterium abscessus subsp. abscessus
      28     118719  Oryza sativa subsp. japonica (Rice)
      29     116737  Acidimicrobiaceae bacterium
      30     116687  Flavobacteriales bacterium
      31     114537  Planctomycetaceae bacterium
      32     113860  Deltaproteobacteria bacterium
      33     113656  Enterobacter cloacae
      34     111944  uncultured Clostridium sp
      35     111477  groundwater metagenome
      36     108765  Pan troglodytes (Chimpanzee)
      37     106855  Stenotrophomonas maltophilia (Pseudomonas maltophilia) (Xanthomonas maltophilia)
      38     103975  Rhizophagus irregularis
      39     103727  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      40      99347  Pyramidula sp. HNHM


   2.3  Taxonomic distribution of the sequences

[Figure: taxonomic distribution of the sequences]

   Kingdom        sequences (% of the database)
    Archaea         2540807 (  2%)
    Bacteria       79450882 ( 69%)
    Eukaryota      29026907 ( 25%)
    Viruses         3515397 (  3%)
    Other           1144818 ( <1%)



   Within Eukaryota:

[Figure: distribution of the sequences within Eukaryota]

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 151236 (  1%)           ( <1%)
     Other Mammalia       2102912 (  7%)           (  2%)
     Other Vertebrata     2893666 ( 10%)           (  3%)
     Viridiplantae        6144730 ( 21%)           (  5%)
     Fungi                8106491 ( 28%)           (  7%)
     Insecta              3068175 ( 11%)           (  3%)
     Nematoda             1558581 (  5%)           (  1%)
     Other                5001116 ( 17%)           (  4%)
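Each category in this table is reported against two denominators: the Eukaryota total and the complete database. The sketch below recomputes both shares; `eukaryota` and `category_shares` are illustrative names.

```python
# Total number of entries in release 2018_05 (from the introduction).
TOTAL_ENTRIES = 115678811

# Sequence counts per category within Eukaryota, copied from the table above.
eukaryota = {
    "Human": 151236,
    "Other Mammalia": 2102912,
    "Other Vertebrata": 2893666,
    "Viridiplantae": 6144730,
    "Fungi": 8106491,
    "Insecta": 3068175,
    "Nematoda": 1558581,
    "Other": 5001116,
}

eukaryota_total = sum(eukaryota.values())


def category_shares(name):
    """Return (% of Eukaryota, % of the complete database) for one category."""
    count = eukaryota[name]
    return 100 * count / eukaryota_total, 100 * count / TOTAL_ENTRIES


for name in eukaryota:
    of_euk, of_db = category_shares(name)
    print(f"{name:<18}{eukaryota[name]:>9} {of_euk:6.2f}% {of_db:6.2f}%")
```

The categories sum to the 29026907 Eukaryota sequences listed in the kingdom table. Human sequences make up roughly 0.13% of the complete database, which is below the resolution of a whole-percent column.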



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  1680409             1001-1100    797621
                 51- 100  9722307             1101-1200    554009
                101- 150 11632664             1201-1300    375118
                151- 200 11201291             1301-1400    255161
                201- 250 11128349             1401-1500    199658
                251- 300 11056486             1501-1600    145992
                301- 350 10037608             1601-1700    109907
                351- 400  7794944             1701-1800     84806
                401- 450  6627234             1801-1900     73439
                451- 500  5338878             1901-2000     60674
                501- 550  3711486             2001-2100     49131
                551- 600  2829619             2101-2200     47493
                601- 650  2079451             2201-2300     36719
                651- 700  1629714             2301-2400     30127
                701- 750  1389210             2401-2500     25757
                751- 800  1195320             >2500        198053
                801- 850   919107
                851- 900   794706
                901- 950   604583
                951-1000   464659
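Because the table excludes fragments, its bins should add up to the number of entries minus the number of fragments given in the introduction. The Python sketch below performs that check; the tuple layout `(from, to, number)` and the use of `None` for the open-ended last bin are choices made for this example.

```python
# Totals from the introduction of this report.
TOTAL_ENTRIES = 115678811
FRAGMENTS = 10797121

# (from, to, number) for every size bin in the table; None marks ">2500".
size_bins = [
    (1, 50, 1680409), (51, 100, 9722307), (101, 150, 11632664),
    (151, 200, 11201291), (201, 250, 11128349), (251, 300, 11056486),
    (301, 350, 10037608), (351, 400, 7794944), (401, 450, 6627234),
    (451, 500, 5338878), (501, 550, 3711486), (551, 600, 2829619),
    (601, 650, 2079451), (651, 700, 1629714), (701, 750, 1389210),
    (751, 800, 1195320), (801, 850, 919107), (851, 900, 794706),
    (901, 950, 604583), (951, 1000, 464659),
    (1001, 1100, 797621), (1101, 1200, 554009), (1201, 1300, 375118),
    (1301, 1400, 255161), (1401, 1500, 199658), (1501, 1600, 145992),
    (1601, 1700, 109907), (1701, 1800, 84806), (1801, 1900, 73439),
    (1901, 2000, 60674), (2001, 2100, 49131), (2101, 2200, 47493),
    (2201, 2300, 36719), (2301, 2400, 30127), (2401, 2500, 25757),
    (2501, None, 198053),
]

binned_total = sum(number for _, _, number in size_bins)
non_fragments = TOTAL_ENTRIES - FRAGMENTS

# The most populated bin, i.e. the mode of the binned distribution.
modal_bin = max(size_bins, key=lambda b: b[2])

print(binned_total, non_fragments, modal_bin)
```

The bins sum to 104881690, which equals 115678811 entries minus 10797121 fragments. The most populated bin is 101-150 amino acids.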

[Figure: distribution of the sequences by size]


   The average sequence length in UniProtKB/TrEMBL is   336 amino acids.

   The shortest sequence is     C4PYW0_SCHMA:     2 amino acids.
   The longest sequence is  A0A1V4K6M4_PATFA: 36991 amino acids.
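The average length can be reproduced from the totals given in the introduction. The sketch below assumes the average is taken over all entries, fragments included; that is an assumption of this example, since the report does not say which entries are counted.

```python
# Totals from the introduction of this report.
TOTAL_AMINO_ACIDS = 38982680753
TOTAL_ENTRIES = 115678811

# Exact quotient and its integer part.
mean_length = TOTAL_AMINO_ACIDS / TOTAL_ENTRIES
truncated_length = TOTAL_AMINO_ACIDS // TOTAL_ENTRIES

print(f"mean = {mean_length:.2f}, truncated = {truncated_length}")
```

Under that assumption the quotient is about 336.99, so the reported value of 336 is consistent with truncation to an integer rather than rounding. That the report truncates is an inference from the numbers, not something the source states.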



4.  STATISTICS FOR SOME LINE TYPES

The following tables summarize the total number of selected UniProtKB/TrEMBL
line types, the number of entries with at least one such line, and the
average number of such lines per entry.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   135768967                1.17                                                    
   Submitted to EMBL/GenBank/DDBJ  91476195  83419752      0.79                                                    
   Journal                         38415100  36090908      0.33                                                    
   Submitted to other databases     5851346   5827273      0.05                                                    
   Thesis                             14961     14902     <0.01                                                    
   Book citation                      11364     11299     <0.01                                                    
   Patent                                 1         1     <0.01                                                    
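The "Average per entry" column appears to be the total number of lines divided by the total number of entries in the release, with values that would round to 0.00 shown as "<0.01". The sketch below reproduces the column under that assumption; `average_per_entry` is a hypothetical helper, not part of any UniProt tooling.

```python
# Total number of entries in release 2018_05 (from the introduction).
TOTAL_ENTRIES = 115678811


def average_per_entry(total_lines, entries=TOTAL_ENTRIES):
    """Format lines-per-entry with two decimals; tiny values become '<0.01'.

    Assumption: the report prints '<0.01' whenever the rounded value
    would otherwise read 0.00.
    """
    text = f"{total_lines / entries:.2f}"
    return "<0.01" if text == "0.00" else text


# Totals copied from the References (RL) table above.
reference_totals = {
    "References (RL)": 135768967,
    "Submitted to EMBL/GenBank/DDBJ": 91476195,
    "Journal": 38415100,
    "Submitted to other databases": 5851346,
    "Thesis": 14961,
    "Book citation": 11364,
    "Patent": 1,
}

for name, total in reference_totals.items():
    print(f"{name:<32}{total:>10}  {average_per_entry(total):>6}")
```

Note that the denominator is the whole release, not the "Number of entries" column, which is why a subtype present in most entries can still average below 1.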

Total number of distinct authors cited in UniProtKB/TrEMBL: 671877


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     160246765                1.39                                                    
   CATALYTIC ACTIVITY              13707102  12455861      0.12     5                                              
   CAUTION                         63215657  61738658      0.55     1                                              
   COFACTOR                         6439456   5877696      0.06     8                                              
   DOMAIN                           1199457    917269      0.01     9                                              
   ENZYME REGULATION                 330948    330946     <0.01    11                                              
   FUNCTION                        16024944  15126438      0.14     3                                              
   INTERACTION                         3500      3500     <0.01    12                                              
   MISCELLANEOUS                     701792    632250      0.01    10                                              
   PATHWAY                          6969173   6268730      0.06     7                                              
   SIMILARITY                      29375998  28972865      0.25     2                                              
   SUBCELLULAR LOCATION            13927050  13812981      0.12     4                                              
   SUBUNIT                          8351688   8245142      0.07     6                                              

Total number of comment topics: 12
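The Rank column orders the comment topics by their total number of lines, largest first. A minimal Python sketch of that ordering, using the totals from the table above (the variable names are illustrative):

```python
# Total number of lines per comment topic, copied from the table above.
comment_totals = {
    "CATALYTIC ACTIVITY": 13707102,
    "CAUTION": 63215657,
    "COFACTOR": 6439456,
    "DOMAIN": 1199457,
    "ENZYME REGULATION": 330948,
    "FUNCTION": 16024944,
    "INTERACTION": 3500,
    "MISCELLANEOUS": 701792,
    "PATHWAY": 6969173,
    "SIMILARITY": 29375998,
    "SUBCELLULAR LOCATION": 13927050,
    "SUBUNIT": 8351688,
}

# Sort topics by decreasing total and number them from 1.
ranked_topics = sorted(comment_totals, key=comment_totals.get, reverse=True)
ranks = {topic: position for position, topic in enumerate(ranked_topics, start=1)}

for topic in ranked_topics:
    print(f"{ranks[topic]:>2}  {topic:<22}{comment_totals[topic]:>10}")
```

The same ordering rule reproduces the ranks shown for the twelve topics, and the topic totals add up to the Comments (CC) total of 160246765.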


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     304253811                2.63                                                    
   ACT_SITE                         7248099   4355583      0.06     9                                              
   BINDING                         15400592   3975197      0.13     5                                              
   CARBOHYD                           21133     20028     <0.01    23                                              
   CHAIN                            8630770   8619536      0.07     7                                              
   COILED                          17475498  11657162      0.15     3                                              
   COMPBIAS                            4516      4516     <0.01    26                                              
   CROSSLNK                           31671     29680     <0.01    22                                              
   DISULFID                         2087605    551667      0.02    15                                              
   DNA_BIND                         3002420   2660736      0.03    13                                              
   DOMAIN                          83811578  60454686      0.72     2                                              
   INIT_MET                           53114     53114     <0.01    21                                              
   INTRAMEM                            1117       913     <0.01    27                                              
   LIPID                             355038    203595     <0.01    19                                              
   METAL                           12507727   3329990      0.11     6                                              
   MOD_RES                          2418297   2112182      0.02    14                                              
   MOTIF                            1531174   1037683      0.01    17                                              
   NON_STD                             4698      4505     <0.01    25                                              
   NON_TER                         16360471  10844864      0.14     4                                              
   NP_BIND                          6708646   4248048      0.06    10                                              
   PEPTIDE                              694       429     <0.01    28                                              
   PROPEP                             16705     16705     <0.01    24                                              
   REGION                           5342775   2742637      0.05    11                                              
   REPEAT                           4582786   1098429      0.04    12                                              
   SIGNAL                           8607528   8607518      0.07     8                                              
   SITE                             2014822   1218133      0.02    16                                              
   TOPO_DOM                          318157    148431     <0.01    20                                              
   TRANSIT                              129       129     <0.01    29                                              
   TRANSMEM                       105303032  23132017      0.91     1                                              
   ZN_FING                           413019    326152     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------

Cross-references (DR)             1402131913               12.12
   Allergome                           3910      3156     <0.01    90   Protein family/group databases             
   ArachnoServer                        201       201     <0.01   113   Organism-specific databases                
   Araport                            15260     15193     <0.01    81   Organism-specific databases                
   BRENDA                              9577      9285     <0.01    84   Enzyme and pathway databases               
   Bgee                              537458    537235     <0.01    48   Gene expression databases                  
   BindingDB                            264       264     <0.01   112   Chemistry                                  
   BioCyc                           6085926   6067531      0.05    29   Enzyme and pathway databases               
   CAZy                              129273    120976     <0.01    59   Protein family/group databases             
   CDD                             20990154  18459573      0.18    14   Family and domain databases                
   CGD                                20805     20739     <0.01    79   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   131   2D gel databases                           
   CORUM                                114       114     <0.01   117   Protein-protein interaction databases      
   CTD                               981349    979432      0.01    41   Organism-specific databases                
   CarbonylDB                           266       266     <0.01   111   PTM databases                              
   ChEMBL                               882       882     <0.01   104   Chemistry                                  
   ChiTaRS                           131701    131701     <0.01    57   Other                                      
   CollecTF                             200       200     <0.01   114   Gene expression databases                  
   ConoServer                           160       160     <0.01   115   Organism-specific databases                
   DIP                                 3220      3219     <0.01    92   Protein-protein interaction databases      
   DNASU                              41320     40881     <0.01    71   Protocols and materials databases          
   DisProt                               96        96     <0.01   119   3D structure databases                     
   DrugBank                             742       449     <0.01   105   Chemistry                                  
   ELM                                  110       110     <0.01   118   Protein-protein interaction databases      
   EMBL                           125905528 111935062      1.09     3   Sequence databases                         
   EPD                                13095     13095     <0.01    82   Proteomic databases                        
   ESTHER                             75045     74750     <0.01    64   Protein family/group databases             
   Ensembl                          1863012   1811222      0.02    35   Genome annotation databases                
   EnsemblBacteria                 39665879  37469046      0.34    10   Genome annotation databases                
   EnsemblFungi                     6190906   6084607      0.05    28   Genome annotation databases                
   EnsemblMetazoa                   1152362   1100202      0.01    39   Genome annotation databases                
   EnsemblPlants                    2184158   1984173      0.02    32   Genome annotation databases                
   EnsemblProtists                  1866272   1754316      0.02    34   Genome annotation databases                
   EuPathDB                          679055    678264      0.01    45   Organism-specific databases                
   EvolutionaryTrace                   5948      5948     <0.01    87   Other                                      
   ExpressionAtlas                   679897    679897      0.01    44   Gene expression databases                  
   FlyBase                           212763    211437     <0.01    55   Organism-specific databases                
   GO                             208598690  74568749      1.80     2   Ontologies                                 
   Gene3D                          51245179  42700717      0.44     8   Family and domain databases                
   GeneCards                           1373      1352     <0.01   101   Organism-specific databases                
   GeneDB                            114676    112896     <0.01    60   Genome annotation databases                
   GeneID                          10677010  10569201      0.09    21   Genome annotation databases                
   GeneTree                         1790136   1790061      0.02    36   Phylogenomic databases                     
   Genevisible                        15872     15865     <0.01    80   Gene expression databases                  
   GenomeRNAi                         30067     30067     <0.01    76   Other                                      
   GlyConnect                            13        13     <0.01   125   PTM databases                              
   Gramene                          2184158   1984173      0.02    33   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   129   Chemistry                                  
   H-InvDB                              588       441     <0.01   107   Organism-specific databases                
   HAMAP                           13158023  13010271      0.11    20   Family and domain databases                
   HGNC                               50576     50479     <0.01    68   Organism-specific databases                
   HOGENOM                          3024164   3024074      0.03    30   Phylogenomic databases                     
   HOVERGEN                          300496    300484     <0.01    52   Phylogenomic databases                     
   InParanoid                       2377791   2377791      0.02    31   Phylogenomic databases                     
   IntAct                             26148     26148     <0.01    77   Protein-protein interaction databases      
   InterPro                       299715133  92053642      2.59     1   Family and domain databases                
   KEGG                            15904742  15477418      0.14    16   Genome annotation databases                
   KO                               6934726   6906283      0.06    24   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    95   Organism-specific databases                
   Leproma                             1271      1269     <0.01   102   Organism-specific databases                
   MEROPS                            245685    245684     <0.01    54   Protein family/group databases             
   MGI                                61520     61083     <0.01    65   Organism-specific databases                
   MIM                                    4         4     <0.01   130   Organism-specific databases                
   MINT                                2463      2463     <0.01    97   Protein-protein interaction databases      
   MalaCards                             12        12     <0.01   126   Organism-specific databases                
   MaxQB                              41701     41701     <0.01    70   Proteomic databases                        
   MoonProt                              67        67     <0.01   121   Protein family/group databases             
   OGP                                    3         3     <0.01   132   2D gel databases                           
   OMA                              6897332   6897318      0.06    25   Phylogenomic databases                     
   OpenTargets                        48592     48542     <0.01    69   Organism-specific databases                
   OrthoDB                         14352316  14352207      0.12    18   Phylogenomic databases                     
   PANTHER                         23431346  22605250      0.20    13   Family and domain databases                
   PATRIC                          17721902  17719854      0.15    15   Genome annotation databases                
   PDB                                36622     18031     <0.01    72   3D structure databases                     
   PDBsum                             33923     16797     <0.01    74   3D structure databases                     
   PIR                               162722    130479     <0.01    56   Sequence databases                         
   PIRSF                           10360205  10273455      0.09    22   Family and domain databases                
   PMAP-CutDB                           130       130     <0.01   116   Other                                      
   PRIDE                             333174    333174     <0.01    50   Proteomic databases                        
   PRINTS                          15546756  14021983      0.13    17   Family and domain databases                
   PRO                                 2196      2196     <0.01    99   Other                                      
   PROSITE                         59382447  39578538      0.51     7   Family and domain databases                
   PaxDb                             329928    329928     <0.01    51   Proteomic databases                        
   PeptideAtlas                      131317    131317     <0.01    58   Proteomic databases                        
   PeroxiBase                          2476      2468     <0.01    96   Protein family/group databases             
   Pfam                           115740265  84071820      1.00     4   Family and domain databases                
   PharmGKB                            3146      3146     <0.01    93   Organism-specific databases                
   PhosphoSitePlus                     2254      2254     <0.01    98   PTM databases                              
   PhylomeDB                         461627    461627     <0.01    49   Phylogenomic databases                     
   PomBase                               31        31     <0.01   123   Organism-specific databases                
   ProDom                           1689480   1617658      0.01    37   Family and domain databases                
   ProMEX                              2526      2526     <0.01    94   Proteomic databases                        
   ProteinModelPortal               7292512   7292512      0.06    23   3D structure databases                     
   Proteomes                       95012662  90340316      0.82     5   Other                                      
   PseudoCAP                           4449      4445     <0.01    89   Organism-specific databases                
   REBASE                             31686     31675     <0.01    75   Protein family/group databases             
   REPRODUCTION-2DPAGE                   63        62     <0.01   122   2D gel databases                           
   RGD                                22886     21725     <0.01    78   Organism-specific databases                
   Reactome                          281296    101809     <0.01    53   Enzyme and pathway databases               
   RefSeq                          44691342  43622632      0.39     9   Sequence databases                         
   SABIO-RK                             614       614     <0.01   106   Enzyme and pathway databases               
   SFLD                             1074589    557642      0.01    40   Family and domain databases                
   SGD                                    7         7     <0.01   128   Organism-specific databases                
   SIGNOR                                 8         8     <0.01   127   Enzyme and pathway databases               
   SMART                           28073819  21329379      0.24    11   Family and domain databases                
   SMR                              1219907   1219907      0.01    38   3D structure databases                     
   STRING                           6487326   6487134      0.06    27   Protein-protein interaction databases      
   SUPFAM                          76568067  60673144      0.66     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   133   2D gel databases                           
   SignaLink                           3799      3799     <0.01    91   Enzyme and pathway databases               
   SwissLipids                           81        81     <0.01   120   Chemistry                                  
   SwissPalm                           2058      2058     <0.01   100   PTM databases                              
   TAIR                               11921     11859     <0.01    83   Organism-specific databases                
   TCDB                                8151      8140     <0.01    85   Protein family/group databases             
   TIGRFAMs                        24707269  22726322      0.21    12   Family and domain databases                
   TopDownProteomics                    280       280     <0.01   110   Proteomic databases                        
   TreeFam                           563543    563509     <0.01    47   Phylogenomic databases                     
   TubercuList                         1002      1001     <0.01   103   Organism-specific databases                
   UCSC                               93367     93170     <0.01    61   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   124   PTM databases                              
   UniGene                           881259    749059      0.01    42   Sequence databases                         
   UniPathway                       6771263   6247467      0.06    26   Enzyme and pathway databases               
   VGNC                               77336     77336     <0.01    62   Organism-specific databases                
   VectorBase                        578282    559616     <0.01    46   Genome annotation databases                
   WBParaSite                        854112    845705      0.01    43   Genome annotation databases                
   World-2DPAGE                         316       311     <0.01   109   2D gel databases                           
   WormBase                           55888     55504     <0.01    66   Organism-specific databases                
   Xenbase                            34335     34259     <0.01    73   Organism-specific databases                
   ZFIN                               53195     53081     <0.01    67   Organism-specific databases                
   dictyBase                           7987      7765     <0.01    86   Organism-specific databases                
   eggNOG                          13959932   6997151      0.12    19   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    63   Organism-specific databases                
   iPTMnet                             5160      5160     <0.01    88   PTM databases                              
   mycoCLAP                             447       447     <0.01   108   Protein family/group databases             

Number of explicitly cross-referenced databases: 152


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.06   Gln (Q) 3.78   Leu (L) 9.87   Ser (S) 6.69
   Arg (R) 5.69   Glu (E) 6.18   Lys (K) 5.01   Thr (T) 5.55
   Asn (N) 3.90   Gly (G) 7.27   Met (M) 2.38   Trp (W) 1.29
   Asp (D) 5.47   His (H) 2.19   Phe (F) 3.94   Tyr (Y) 2.93
   Cys (C) 1.20   Ile (I) 5.73   Pro (P) 4.84   Val (V) 6.87

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05

[Figure: amino acid composition of the complete database]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Amino acids ranked by decreasing frequency

   Leu, Ala, Gly, Val, Ser, Glu, Ile, Arg, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
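This ordering can be derived from the composition table in 5.1 by sorting the twenty standard amino acids on their percentage. The sketch below does so in Python; the dictionary name is illustrative.

```python
# Composition in percent for the twenty standard amino acids, from 5.1.
composition = {
    "Ala": 9.06, "Gln": 3.78, "Leu": 9.87, "Ser": 6.69,
    "Arg": 5.69, "Glu": 6.18, "Lys": 5.01, "Thr": 5.55,
    "Asn": 3.90, "Gly": 7.27, "Met": 2.38, "Trp": 1.29,
    "Asp": 5.47, "His": 2.19, "Phe": 3.94, "Tyr": 2.93,
    "Cys": 1.20, "Ile": 5.73, "Pro": 4.84, "Val": 6.87,
}

# Sort by decreasing percentage; no two values are equal, so order is unambiguous.
frequency_ranking = sorted(composition, key=composition.get, reverse=True)

print(", ".join(frequency_ranking))
```

The result matches the list above, from Leu (9.87%) down to Cys (1.20%).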



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1683626
Total number of entries encoded on a Plasmid: 835488
Total number of entries encoded on a Plastid: 120297
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: