Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2018_09 STATISTICS


1.  INTRODUCTION

Release 2018_09 of 10-Oct-2018 of UniProtKB/TrEMBL contains 126780198 sequence entries,
comprising 42722085878 amino acids.

Since release 2018_08, 3251059 sequences have been added, the sequence data of
12922 existing entries has been updated, and the annotations of
21908337 entries have been revised. This represents an increase of 3% in the
number of entries.

Number of fragments: 11713556

Protein existence (PE):              entries      %
1: Evidence at protein level          144073     0.11%
2: Evidence at transcript level      1196091     0.94%
3: Inferred from homology           31620095    24.94%
4: Predicted                        93819939    74.00%
5: Uncertain                               0     0.00%
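
The percentage column of the protein existence table can be reproduced from the
entry counts alone. The following is a minimal sketch, assuming each percentage
is the count divided by the sum of all five counts and rounded to two decimals;
the names pe_counts and pe_percentages are illustrative and not part of any
UniProt tool.

```python
# Counts copied from the protein existence (PE) table above.
pe_counts = {
    "1: Evidence at protein level": 144073,
    "2: Evidence at transcript level": 1196091,
    "3: Inferred from homology": 31620095,
    "4: Predicted": 93819939,
    "5: Uncertain": 0,
}

# The five PE levels partition the database, so their sum is the entry total.
total_entries = sum(pe_counts.values())


def pe_percentages(counts):
    """Return each count as a percentage of the total, rounded to 2 decimals."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 2) for level, n in counts.items()}


percentages = pe_percentages(pe_counts)
for level, pct in percentages.items():
    print(f"{level:<34}{pe_counts[level]:>10}{pct:>9.2f}%")
```

The sum of the five counts equals the 126780198 entries quoted in the
introduction, which confirms that every entry carries exactly one PE level.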

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 935957

   The twenty most represented species account for 7029219 sequences, or 5.5%
   of the total number of entries.
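
   The figure quoted above can be cross-checked against the first twenty rows
   of the table in section 2.2. This is a small sketch under the assumption
   that the share is computed against the total of 126780198 entries given in
   the introduction; the variable names are illustrative.

```python
# Frequencies of the twenty most represented species (table 2.2, ranks 1-20).
top20 = [
    882141, 703927, 668601, 506841, 449949,
    421089, 338525, 324490, 295997, 272569,
    264167, 260913, 260021, 212662, 210810,
    207466, 202740, 190686, 181176, 174449,
]

TOTAL_ENTRIES = 126780198  # total number of entries in release 2018_09

top20_total = sum(top20)
top20_share = 100 * top20_total / TOTAL_ENTRIES

print(top20_total)            # 7029219
print(f"{top20_share:.1f}%")  # 5.5%
```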


   2.1 Table of the frequency of occurrence of species

        Species represented 1x: 456908
                            2x: 123919
                            3x:  65483
                            4x:  46454
                            5x:  28359
                            6x:  20365
                            7x:  15284
                            8x:  12034
                            9x:   9850
                           10x:  15130
                       11- 20x:  77371
                       21- 50x:  21766
                       51-100x:  12134
                         >100x:  30900
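
   As a consistency check, the occurrence classes above should add up to the
   total number of species reported at the start of this section. The sketch
   below assumes that every species falls into exactly one class; the
   dictionary name is illustrative.

```python
# Number of species per occurrence class, copied from table 2.1.
species_by_occurrence = {
    "1x": 456908, "2x": 123919, "3x": 65483, "4x": 46454, "5x": 28359,
    "6x": 20365, "7x": 15284, "8x": 12034, "9x": 9850, "10x": 15130,
    "11-20x": 77371, "21-50x": 21766, "51-100x": 12134, ">100x": 30900,
}

total_species = sum(species_by_occurrence.values())

# Share of species that are represented by a single sequence only.
singleton_share = 100 * species_by_occurrence["1x"] / total_species

print(total_species)  # 935957
print(f"{singleton_share:.1f}% of species occur once")
```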


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     882141  Human immunodeficiency virus 1
       2     703927  Acidobacteria bacterium
       3     668601  marine sediment metagenome
       4     506841  Daphnia magna
       5     449949  Bacillus cereus
       6     421089  Escherichia coli
       7     338525  Verrucomicrobia bacterium
       8     324490  Gammaproteobacteria bacterium
       9     295997  Helicobacter pylori (Campylobacter pylori)
      10     272569  Euryarchaeota archaeon
      11     264167  Klebsiella pneumoniae
      12     260913  uncultured bacterium
      13     260021  Arundo donax (Giant reed) (Donax arundinaceus)
      14     212662  Chloroflexi bacterium
      15     210810  Triticum aestivum (Wheat)
      16     207466  Hordeum vulgare subsp. vulgare (Domesticated barley)
      17     202740  Gemmatimonadetes bacterium
      18     190686  Candidatus Rokubacteria bacterium
      19     181176  Pseudomonas fluorescens
      20     174449  Rhodospirillaceae bacterium
      21     170628  Pseudomonas putida (Arthrobacter siderocapsulatus)
      22     163349  Fundulus heteroclitus (Killifish) (Mummichog)
      23     160329  Flavobacteriaceae bacterium
      24     155142  Acinetobacter baumannii
      25     153945  Homo sapiens (Human)
      26     151500  Hepacivirus C
      27     145818  Streptococcus pneumoniae
      28     134513  Enterobacter cloacae
      29     134152  Zea mays (Maize)
      30     131468  Stenotrophomonas maltophilia (Pseudomonas maltophilia) (Xanthomonas maltophilia)
      31     129559  Candidatus Marinimicrobia bacterium
      32     127745  mine drainage metagenome
      33     124330  Hepatitis B virus (HBV)
      34     123197  Bacillus thuringiensis
      35     119324  Deltaproteobacteria bacterium
      36     118604  Oryza sativa subsp. japonica (Rice)
      37     118383  Mycobacteroides abscessus subsp. abscessus
      38     118369  Flavobacteriales bacterium
      39     116737  Acidimicrobiaceae bacterium
      40     114537  Planctomycetaceae bacterium
      41     111944  uncultured Clostridium sp
      42     111477  groundwater metagenome
      43     108767  Pan troglodytes (Chimpanzee)
      44     103984  Rhizophagus irregularis
      45     103729  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      46      99349  Pyramidula sp. HNHM


   2.3  Taxonomic distribution of the sequences

[Figure: taxonomic distribution of the sequences]

   Kingdom        sequences (% of the database)
    Archaea         2819920 (  2%)
    Bacteria       88305959 ( 70%)
    Eukaryota      30858905 ( 24%)
    Viruses         3644827 (  3%)
    Other           1150587 ( <1%)



   Within Eukaryota:

[Figure: distribution of the sequences within Eukaryota]

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 154020 ( <1%)           ( <1%)
     Other Mammalia       2467795 (  8%)           (  2%)
     Other Vertebrata     2957390 ( 10%)           (  2%)
     Viridiplantae        6602731 ( 21%)           (  5%)
     Fungi                8692756 ( 28%)           (  7%)
     Insecta              3265195 ( 11%)           (  3%)
     Nematoda             1558868 (  5%)           (  1%)
     Other                5160150 ( 17%)           (  4%)
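
   The two percentage columns of the table above use different denominators:
   the Eukaryota total and the complete database. A minimal sketch of that
   calculation follows, assuming whole-number rounding; the function name
   shares is hypothetical.

```python
TOTAL_ENTRIES = 126780198  # complete database, release 2018_09

# Sequence counts per category, copied from the "Within Eukaryota" table.
eukaryota = {
    "Human": 154020,
    "Other Mammalia": 2467795,
    "Other Vertebrata": 2957390,
    "Viridiplantae": 6602731,
    "Fungi": 8692756,
    "Insecta": 3265195,
    "Nematoda": 1558868,
    "Other": 5160150,
}

# The categories add up to the Eukaryota row of the kingdom table.
eukaryota_total = sum(eukaryota.values())


def shares(count):
    """Return (% of Eukaryota, % of the complete database) for one category."""
    return 100 * count / eukaryota_total, 100 * count / TOTAL_ENTRIES


for category, count in eukaryota.items():
    of_euk, of_db = shares(count)
    print(f"{category:<18}{count:>9}  {of_euk:5.1f}%  {of_db:5.1f}%")
```

   Printing one decimal place shows that the Human category is close to 0.5%
   of Eukaryota rather than exactly zero.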



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  1759470             1001-1100    875708
                 51- 100 10617073             1101-1200    608093
                101- 150 12759290             1201-1300    409806
                151- 200 12307224             1301-1400    277998
                201- 250 12224350             1401-1500    217164
                251- 300 12159400             1501-1600    158848
                301- 350 11054924             1601-1700    119644
                351- 400  8576822             1701-1800     91884
                401- 450  7291589             1801-1900     79652
                451- 500  5864089             1901-2000     66042
                501- 550  4078522             2001-2100     53565
                551- 600  3099685             2101-2200     51290
                601- 650  2279243             2201-2300     39774
                651- 700  1786838             2301-2400     32789
                701- 750  1523917             2401-2500     27981
                751- 800  1306208             >2500        214129
                801- 850  1010352
                851- 900   869293
                901- 950   664269
                951-1000   509717

[Figure: distribution of the sequences by size]
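
   Because the size table excludes fragments, its counts should add up to the
   number of entries minus the number of fragments given in the introduction.
   The sketch below checks this; the list name size_bins is illustrative. Note
   that the bins are 50 residues wide up to 1000 and 100 residues wide beyond.

```python
TOTAL_ENTRIES = 126780198  # all entries in release 2018_09
FRAGMENTS = 11713556       # entries flagged as fragments

# (size range, number of sequences) pairs copied from the size table.
size_bins = [
    ("1-50", 1759470), ("51-100", 10617073), ("101-150", 12759290),
    ("151-200", 12307224), ("201-250", 12224350), ("251-300", 12159400),
    ("301-350", 11054924), ("351-400", 8576822), ("401-450", 7291589),
    ("451-500", 5864089), ("501-550", 4078522), ("551-600", 3099685),
    ("601-650", 2279243), ("651-700", 1786838), ("701-750", 1523917),
    ("751-800", 1306208), ("801-850", 1010352), ("851-900", 869293),
    ("901-950", 664269), ("951-1000", 509717), ("1001-1100", 875708),
    ("1101-1200", 608093), ("1201-1300", 409806), ("1301-1400", 277998),
    ("1401-1500", 217164), ("1501-1600", 158848), ("1601-1700", 119644),
    ("1701-1800", 91884), ("1801-1900", 79652), ("1901-2000", 66042),
    ("2001-2100", 53565), ("2101-2200", 51290), ("2201-2300", 39774),
    ("2301-2400", 32789), ("2401-2500", 27981), (">2500", 214129),
]

binned_total = sum(count for _, count in size_bins)
largest_bin = max(size_bins, key=lambda item: item[1])

print(binned_total == TOTAL_ENTRIES - FRAGMENTS)  # True
print(largest_bin)                                # ('101-150', 12759290)
```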


   The average sequence length in UniProtKB/TrEMBL is 336 amino acids.

   The shortest sequence is     C4PYW0_SCHMA:     2 amino acids.
   The longest sequence is  A0A316Q3J5_9FIRM: 74488 amino acids.
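
   The average length can be recomputed from the totals in the introduction.
   This sketch assumes the average is taken over all entries, fragments
   included, which the release notes do not state explicitly.

```python
TOTAL_AMINO_ACIDS = 42722085878  # amino acids in release 2018_09
TOTAL_ENTRIES = 126780198        # sequence entries in release 2018_09

mean_length = TOTAL_AMINO_ACIDS / TOTAL_ENTRIES
truncated_length = TOTAL_AMINO_ACIDS // TOTAL_ENTRIES  # integer part only

print(f"{mean_length:.2f}")  # 336.98
print(truncated_length)      # 336
```

   Under that assumption the reported value of 336 matches the integer part
   of the quotient, which suggests the figure is truncated rather than
   rounded.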



4.  STATISTICS FOR SOME LINE TYPES

The following tables summarize, for selected UniProtKB/TrEMBL line types, the
total number of lines, the number of entries with at least one such line, and
the average number of lines per entry.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   148613644                1.17                                                    
   Submitted to EMBL/GenBank/DDBJ  98275879  89211502      0.78                                                    
   Journal                         43872043  41454049      0.35                                                    
   Submitted to other databases     6439400   6285132      0.05                                                    
   Thesis                             14950     14891     <0.01                                                    
   Book citation                      11371     11306     <0.01                                                    
   Patent                                 1         1     <0.01                                                    

Total number of distinct authors cited in UniProtKB/TrEMBL: 693515
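
The "Average per entry" column appears to be the total number of lines divided
by the number of all entries in the release, not by the number of entries that
carry such a line. The sketch below reproduces the reference table under that
assumption; the function average_per_entry and its "<0.01" threshold are
inferred from the printed values and are not taken from UniProt documentation.

```python
TOTAL_ENTRIES = 126780198  # all entries in release 2018_09

# Total line counts copied from the References (RL) table.
reference_lines = {
    "References (RL)": 148613644,
    "Submitted to EMBL/GenBank/DDBJ": 98275879,
    "Journal": 43872043,
    "Submitted to other databases": 6439400,
    "Thesis": 14950,
    "Book citation": 11371,
    "Patent": 1,
}


def average_per_entry(total_lines, total_entries=TOTAL_ENTRIES):
    """Format lines per entry the way the tables print it (assumed rule)."""
    average = total_lines / total_entries
    # Values that would round to 0.00 are shown as "<0.01" in the tables.
    return "<0.01" if average < 0.005 else f"{average:.2f}"


for line_type, total in reference_lines.items():
    print(f"{line_type:<32}{total:>10}  {average_per_entry(total):>6}")
```

Dividing by all entries explains why the Journal subtype shows 0.35 even though
the entries that do cite a journal carry slightly more than one such line each.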


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     176259741                1.39                                                    
   ACTIVITY REGULATION               387792    387790     <0.01    11                                              
   CATALYTIC ACTIVITY              14833892  13505238      0.12     5                                              
   CAUTION                         71332501  69695382      0.56     1                                              
   COFACTOR                         6998667   6373422      0.06     8                                              
   DOMAIN                           1274127    972770      0.01     9                                              
   FUNCTION                        17294171  16342771      0.14     3                                              
   INTERACTION                         3068      3068     <0.01    12                                              
   MISCELLANEOUS                     745620    671882      0.01    10                                              
   PATHWAY                          7582732   6826112      0.06     7                                              
   SIMILARITY                      31770617  31335178      0.25     2                                              
   SUBCELLULAR LOCATION            15057689  14938074      0.12     4                                              
   SUBUNIT                          8978865   8867716      0.07     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     329754698                2.60                                                    
   ACT_SITE                         8007774   4895593      0.06     9                                              
   BINDING                         17203747   4364483      0.14     5                                              
   CARBOHYD                           21396     20244     <0.01    23                                              
   CHAIN                            9390714   9378300      0.07     7                                              
   COILED                          18748655  12501493      0.15     3                                              
   COMPBIAS                            4776      4776     <0.01    26                                              
   CROSSLNK                           36267     34154     <0.01    22                                              
   DISULFID                         2135728    566372      0.02    16                                              
   DNA_BIND                         3257642   2886309      0.03    13                                              
   DOMAIN                          90887553  65656276      0.72     2                                              
   INIT_MET                           56326     56326     <0.01    21                                              
   INTRAMEM                            1206       990     <0.01    27                                              
   LIPID                             365407    209921     <0.01    19                                              
   METAL                           13639870   3625932      0.11     6                                              
   MOD_RES                          2577541   2248918      0.02    14                                              
   MOTIF                            1621837   1098607      0.01    17                                              
   NON_STD                             5729      5536     <0.01    25                                              
   NON_TER                         17586174  11767816      0.14     4                                              
   NP_BIND                          7372304   4659529      0.06    10                                              
   PEPTIDE                              733       467     <0.01    28                                              
   PROPEP                             18014     18014     <0.01    24                                              
   REGION                           5749744   2961851      0.05    11                                              
   REPEAT                           5000076   1194737      0.04    12                                              
   SIGNAL                           9365094   9365084      0.07     8                                              
   SITE                             2187547   1324582      0.02    15                                              
   TOPO_DOM                          331518    154863     <0.01    20                                              
   TRANSIT                              142       142     <0.01    29                                              
   TRANSMEM                       113736215  24985358      0.90     1                                              
   ZN_FING                           444969    351284     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             1510652756               11.92                                                    
   Allergome                           3947      3182     <0.01    90   Protein family/group databases             
   ArachnoServer                        200       200     <0.01   113   Organism-specific databases                
   Araport                            15185     15119     <0.01    81   Organism-specific databases                
   BRENDA                              9540      9251     <0.01    84   Enzyme and pathway databases               
   Bgee                              531762    531632     <0.01    48   Gene expression databases                  
   BindingDB                            233       233     <0.01   112   Chemistry                                  
   BioCyc                           6051966   6032988      0.05    29   Enzyme and pathway databases               
   CAZy                              128838    120573     <0.01    58   Protein family/group databases             
   CDD                             23038975  20226673      0.18    14   Family and domain databases                
   CGD                                20795     20729     <0.01    79   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   132   2D gel databases                           
   CORUM                                114       114     <0.01   119   Protein-protein interaction databases      
   CTD                              1139644   1137670      0.01    41   Organism-specific databases                
   CarbonylDB                           265       265     <0.01   111   PTM databases                              
   ChEMBL                               965       965     <0.01   104   Chemistry                                  
   ChiTaRS                           131432    131431     <0.01    57   Other                                      
   CollecTF                             195       195     <0.01   114   Gene expression databases                  
   ComplexPortal                        180       131     <0.01   115   Protein-protein interaction databases      
   ConoServer                           158       158     <0.01   116   Organism-specific databases                
   DIP                                 3209      3208     <0.01    92   Protein-protein interaction databases      
   DNASU                              39646     39298     <0.01    71   Protocols and materials databases          
   DisProt                               95        95     <0.01   121   3D structure databases                     
   DrugBank                             769       461     <0.01   105   Chemistry                                  
   ELM                                  101       101     <0.01   120   Protein-protein interaction databases      
   EMBL                           138605977 122636349      1.09     3   Sequence databases                         
   EPD                                13933     13933     <0.01    82   Proteomic databases                        
   ESTHER                             76919     76615     <0.01    63   Protein family/group databases             
   Ensembl                          1911082   1866522      0.02    34   Genome annotation databases                
   EnsemblBacteria                 38881450  36680547      0.31    10   Genome annotation databases                
   EnsemblFungi                     6182712   6076412      0.05    28   Genome annotation databases                
   EnsemblMetazoa                   1150803   1109292      0.01    40   Genome annotation databases                
   EnsemblPlants                    2599339   2263515      0.02    31   Genome annotation databases                
   EnsemblProtists                  1872797   1760853      0.01    35   Genome annotation databases                
   EuPathDB                          678334    677639      0.01    45   Organism-specific databases                
   EvolutionaryTrace                   5934      5934     <0.01    87   Other                                      
   ExpressionAtlas                   681324    680500      0.01    44   Gene expression databases                  
   FlyBase                           208313    207016     <0.01    55   Organism-specific databases                
   GO                             227580579  81654141      1.80     2   Ontologies                                 
   Gene3D                          55718968  46382752      0.44     8   Family and domain databases                
   GeneCards                           1308      1290     <0.01   101   Organism-specific databases                
   GeneDB                            114674    112894     <0.01    60   Genome annotation databases                
   GeneID                          10676994  10570702      0.08    22   Genome annotation databases                
   GeneTree                         1831785   1831704      0.01    36   Phylogenomic databases                     
   Genevisible                        15835     15828     <0.01    80   Gene expression databases                  
   GenomeRNAi                         29979     29979     <0.01    76   Other                                      
   GlyConnect                            13        13     <0.01   126   PTM databases                              
   Gramene                          2205244   2028501      0.02    33   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   131   Chemistry                                  
   H-InvDB                              587       440     <0.01   107   Organism-specific databases                
   HAMAP                           14137467  13978138      0.11    19   Family and domain databases                
   HGNC                               51991     51889     <0.01    68   Organism-specific databases                
   HOGENOM                          2990430   2990337      0.02    30   Phylogenomic databases                     
   HOVERGEN                          300316    300303     <0.01    53   Phylogenomic databases                     
   InParanoid                       2343109   2343109      0.02    32   Phylogenomic databases                     
   IntAct                             25605     25605     <0.01    77   Protein-protein interaction databases      
   InterPro                       325781275  99752067      2.57     1   Family and domain databases                
   KEGG                            16545667  16110364      0.13    17   Genome annotation databases                
   KO                               7283351   7253656      0.06    24   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    96   Organism-specific databases                
   Leproma                             1271      1269     <0.01   102   Organism-specific databases                
   MEROPS                            242202    242200     <0.01    54   Protein family/group databases             
   MGI                                62204     61727     <0.01    65   Organism-specific databases                
   MIM                                    4         4     <0.01   130   Organism-specific databases                
   MINT                                2654      2654     <0.01    95   Protein-protein interaction databases      
   MalaCards                             12        12     <0.01   127   Organism-specific databases                
   MaxQB                              40604     40604     <0.01    70   Proteomic databases                        
   MoonDB                                 1         1     <0.01   136   Protein family/group databases             
   MoonProt                              62        62     <0.01   124   Protein family/group databases             
   OGP                                    3         3     <0.01   133   2D gel databases                           
   OMA                              6835202   6835112      0.05    26   Phylogenomic databases                     
   OpenTargets                        49922     49873     <0.01    69   Organism-specific databases                
   OrthoDB                         14211740  14211618      0.11    18   Phylogenomic databases                     
   PANTHER                         26865943  25937617      0.21    12   Family and domain databases                
   PATRIC                          17243257  17232568      0.14    15   Genome annotation databases                
   PDB                                37683     18390     <0.01    72   3D structure databases                     
   PDBsum                             37142     18061     <0.01    73   3D structure databases                     
   PIR                               160881    128716     <0.01    56   Sequence databases                         
   PIRSF                           11157054  11063218      0.09    21   Family and domain databases                
   PMAP-CutDB                           130       130     <0.01   118   Other                                      
   PRIDE                             323093    323093     <0.01    52   Proteomic databases                        
   PRINTS                          16782185  15144586      0.13    16   Family and domain databases                
   PRO                                 2257      2257     <0.01    98   Other                                      
   PROSITE                         64099901  42764870      0.51     7   Family and domain databases                
   PaxDb                             324660    324660     <0.01    51   Proteomic databases                        
   PeptideAtlas                      128710    128710     <0.01    59   Proteomic databases                        
   PeroxiBase                          2472      2464     <0.01    97   Protein family/group databases             
   Pfam                           125787613  91405537      0.99     4   Family and domain databases                
   PharmGKB                            3132      3132     <0.01    94   Organism-specific databases                
   PhosphoSitePlus                     2237      2237     <0.01    99   PTM databases                              
   PhylomeDB                         461222    461222     <0.01    49   Phylogenomic databases                     
   PomBase                                2         2     <0.01   134   Organism-specific databases                
   ProDom                           1785436   1712752      0.01    37   Family and domain databases                
   ProMEX                              3202      3202     <0.01    93   Proteomic databases                        
   ProteinModelPortal               7166666   7166666      0.06    25   3D structure databases                     
   Proteomes                      105958657  99699821      0.84     5   Other                                      
   PseudoCAP                           4447      4443     <0.01    89   Organism-specific databases                
   REBASE                             31238     31217     <0.01    75   Protein family/group databases             
   REPRODUCTION-2DPAGE                   62        61     <0.01   123   2D gel databases                           
   RGD                                21619     20710     <0.01    78   Organism-specific databases                
   Reactome                          326852    115999     <0.01    50   Enzyme and pathway databases               
   RefSeq                          45142804  44028862      0.36     9   Sequence databases                         
   SABIO-RK                             616       616     <0.01   106   Enzyme and pathway databases               
   SFLD                             1159397    599767      0.01    39   Family and domain databases                
   SGD                                    7         7     <0.01   129   Organism-specific databases                
   SIGNOR                                 7         7     <0.01   128   Enzyme and pathway databases               
   SMART                           30336910  23036150      0.24    11   Family and domain databases                
   SMR                              1345940   1345940      0.01    38   3D structure databases                     
   STRING                           6428594   6428337      0.05    27   Protein-protein interaction databases      
   SUPFAM                          83341188  65986872      0.66     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   135   2D gel databases                           
   SignaLink                           3794      3794     <0.01    91   Enzyme and pathway databases               
   SwissLipids                           82        82     <0.01   122   Chemistry                                  
   SwissPalm                           1888      1888     <0.01   100   PTM databases                              
   TAIR                               11857     11796     <0.01    83   Organism-specific databases                
   TCDB                                8199      8188     <0.01    85   Protein family/group databases             
   TIGRFAMs                        26677899  24542468      0.21    13   Family and domain databases                
   TopDownProteomics                    280       280     <0.01   110   Proteomic databases                        
   TreeFam                           558505    558469     <0.01    47   Phylogenomic databases                     
   TubercuList                         1000       999     <0.01   103   Organism-specific databases                
   UCSC                               93079     92873     <0.01    61   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   125   PTM databases                              
   UniGene                           864328    730201      0.01    42   Sequence databases                         
   UniLectin                            152       152     <0.01   117   Protein family/group databases             
   UniPathway                       7368550   6803294      0.06    23   Enzyme and pathway databases               
   VGNC                               80709     80709     <0.01    62   Organism-specific databases                
   VectorBase                        580495    561771     <0.01    46   Genome annotation databases                
   WBParaSite                        854113    845706      0.01    43   Genome annotation databases                
   World-2DPAGE                         315       310     <0.01   109   2D gel databases                           
   WormBase                           55866     55482     <0.01    66   Organism-specific databases                
   Xenbase                            34592     34511     <0.01    74   Organism-specific databases                
   ZFIN                               54776     54084     <0.01    67   Organism-specific databases                
   dictyBase                           7984      7762     <0.01    86   Organism-specific databases                
   eggNOG                          13769526   6901741      0.11    20   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    64   Organism-specific databases                
   iPTMnet                             5126      5126     <0.01    88   PTM databases                              
   mycoCLAP                             447       447     <0.01   108   Protein family/group databases             

Number of explicitly cross-referenced databases: 156


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.15   Gln (Q) 3.77   Leu (L) 9.89   Ser (S) 6.67
   Arg (R) 5.74   Glu (E) 6.16   Lys (K) 4.95   Thr (T) 5.55
   Asn (N) 3.86   Gly (G) 7.31   Met (M) 2.37   Trp (W) 1.30
   Asp (D) 5.47   His (H) 2.19   Phe (F) 3.92   Tyr (Y) 2.91
   Cys (C) 1.19   Ile (I) 5.69   Pro (P) 4.86   Val (V) 6.89

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05

[Figure: amino acid composition]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Ranking of the amino acids by decreasing frequency

   Leu, Ala, Gly, Val, Ser, Glu, Arg, Ile, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
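
   The ordering above follows directly from the percentages in section 5.1.
   As an illustration, sorting the twenty standard amino acids by decreasing
   percentage reproduces the list; the dictionary name composition is
   illustrative, and the ambiguity codes B, Z and X are left out.

```python
# Composition in percent for the complete database (section 5.1).
composition = {
    "Ala": 9.15, "Arg": 5.74, "Asn": 3.86, "Asp": 5.47, "Cys": 1.19,
    "Gln": 3.77, "Glu": 6.16, "Gly": 7.31, "His": 2.19, "Ile": 5.69,
    "Leu": 9.89, "Lys": 4.95, "Met": 2.37, "Phe": 3.92, "Pro": 4.86,
    "Ser": 6.67, "Thr": 5.55, "Trp": 1.30, "Tyr": 2.91, "Val": 6.89,
}

# Sort the residue names by their percentage, most frequent first.
ranking = sorted(composition, key=composition.get, reverse=True)

print(", ".join(ranking))
```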



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1815900
Total number of entries encoded on a Plasmid: 901246
Total number of entries encoded on a Plastid: 123977
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: