Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2018_07 STATISTICS


1.  INTRODUCTION

Release 2018_07 of 18-Jul-2018 of UniProtKB/TrEMBL contains 120243849 sequence entries,
comprising 40506871635 amino acids.

5117258 sequences have been added since release 2018_06, the sequence data of
971 existing entries has been updated and the annotations of
11294535 entries have been revised. This represents an increase of 4%.

Number of fragments: 11171747

Protein existence (PE):              entries      %
1: Evidence at protein level          144063     0.12%
2: Evidence at transcript level      1158492     0.96%
3: Inferred from homology           29451155    24.49%
4: Predicted                        89490139    74.42%
5: Uncertain                               0     0.00%
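
The percentage column above can be recomputed from the entry counts and the release total. The following is a minimal sketch, not the tool used to produce these statistics; the names pe_counts and pe_percentages are illustrative only.

```python
# Counts copied from the protein existence (PE) table above.
pe_counts = {
    "1: Evidence at protein level": 144063,
    "2: Evidence at transcript level": 1158492,
    "3: Inferred from homology": 29451155,
    "4: Predicted": 89490139,
    "5: Uncertain": 0,
}
total_entries = 120243849  # sequence entries in release 2018_07


def pe_percentages(counts, total):
    """Return each count as a percentage of total, formatted to two decimals."""
    return {level: f"{100 * n / total:.2f}" for level, n in counts.items()}


pe_share = pe_percentages(pe_counts, total_entries)
for level, pct in pe_share.items():
    print(f"{level:<34}{pe_counts[level]:>10}{pct:>9}%")
```

The five counts add up to the total number of entries, so every entry carries exactly one PE level.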

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 878350

   The first twenty species represent 5875974 sequences:   4.9 % of the
   total number of entries.
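
The figure for the first twenty species can be reproduced from the frequencies listed in table 2.2. This is a small sketch for checking the arithmetic; the variable names are illustrative.

```python
# Frequencies of the twenty most represented species (ranks 1 to 20 in table 2.2).
top_twenty = [
    873908, 668601, 506839, 442216, 324490,
    307730, 294620, 261883, 260021, 258967,
    207459, 185739, 182247, 174449, 163349,
    156415, 154302, 152938, 151058, 148743,
]
total_entries = 120243849  # sequence entries in release 2018_07

top_sum = sum(top_twenty)
top_share = 100 * top_sum / total_entries
print(f"{top_sum} sequences: {top_share:.1f} % of the total number of entries")
```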


   2.1  Table of the frequency of occurrence of species

        Species represented 1x: 408607
                            2x: 120763
                            3x:  64377
                            4x:  45757
                            5x:  27814
                            6x:  20013
                            7x:  15007
                            8x:  11802
                            9x:   9719
                           10x:  14969
                       11- 20x:  76332
                       21- 50x:  21413
                       51-100x:  11725
                         >100x:  30052
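
As a consistency check, the bins of table 2.1 should add up to the total number of species given at the start of this section. A minimal sketch, with illustrative names:

```python
# Number of species per frequency-of-occurrence bin, copied from table 2.1.
species_bins = {
    "1x": 408607, "2x": 120763, "3x": 64377, "4x": 45757, "5x": 27814,
    "6x": 20013, "7x": 15007, "8x": 11802, "9x": 9719, "10x": 14969,
    "11-20x": 76332, "21-50x": 21413, "51-100x": 11725, ">100x": 30052,
}
total_species = 878350  # species represented in this release

species_in_bins = sum(species_bins.values())
print(species_in_bins == total_species)

# Share of species that are represented by a single sequence.
singleton_share = 100 * species_bins["1x"] / total_species
print(f"Species represented 1x: {singleton_share:.1f} % of all species")
```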


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     873908  Human immunodeficiency virus 1
       2     668601  marine sediment metagenome
       3     506839  Daphnia magna
       4     442216  Bacillus cereus
       5     324490  Gammaproteobacteria bacterium
       6     307730  Escherichia coli
       7     294620  Helicobacter pylori (Campylobacter pylori)
       8     261883  Euryarchaeota archaeon
       9     260021  Arundo donax (Giant reed) (Donax arundinaceus)
      10     258967  uncultured bacterium
      11     207459  Hordeum vulgare subsp. vulgare (Domesticated barley)
      12     185739  Pseudomonas fluorescens
      13     182247  Chloroflexi bacterium
      14     174449  Rhodospirillaceae bacterium
      15     163349  Fundulus heteroclitus (Killifish) (Mummichog)
      16     156415  Pseudomonas putida (Arthrobacter siderocapsulatus)
      17     154302  Flavobacteriaceae bacterium
      18     152938  Homo sapiens (Human)
      19     151058  Hepacivirus C
      20     148743  Streptococcus pneumoniae
      21     145728  Triticum aestivum (Wheat)
      22     129559  Candidatus Marinimicrobia bacterium
      23     129495  Bacillus thuringiensis
      24     129404  Zea mays (Maize)
      25     127745  mine drainage metagenome
      26     123319  Klebsiella pneumoniae
      27     122903  Hepatitis B virus (HBV)
      28     121109  Mycobacteroides abscessus subsp. abscessus
      29     119529  Stenotrophomonas maltophilia (Pseudomonas maltophilia) (Xanthomonas maltophilia)
      30     118659  Oryza sativa subsp. japonica (Rice)
      31     116737  Acidimicrobiaceae bacterium
      32     116687  Flavobacteriales bacterium
      33     115850  Enterobacter cloacae
      34     114537  Planctomycetaceae bacterium
      35     113860  Deltaproteobacteria bacterium
      36     111944  uncultured Clostridium sp
      37     111477  groundwater metagenome
      38     108765  Pan troglodytes (Chimpanzee)
      39     103975  Rhizophagus irregularis
      40     103727  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      41      99351  Pyramidula sp. HNHM


   2.3  Taxonomic distribution of the sequences


   Kingdom        sequences (% of the database)
    Archaea         2732740 (  2%)
    Bacteria       83006730 ( 69%)
    Eukaryota      29793833 ( 25%)
    Viruses         3567437 (  3%)
    Other           1143109 ( <1%)



   Within Eukaryota:


    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 153013 (  1%)           ( <1%)
     Other Mammalia       2231488 (  7%)           (  2%)
     Other Vertebrata     2903379 ( 10%)           (  2%)
     Viridiplantae        6346684 ( 21%)           (  5%)
     Fungi                8359099 ( 28%)           (  7%)
     Insecta              3165932 ( 11%)           (  3%)
     Nematoda             1558713 (  5%)           (  1%)
     Other                5075525 ( 17%)           (  4%)



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  1710527             1001-1100    828586
                 51- 100 10080564             1101-1200    574686
                101- 150 12093887             1201-1300    388732
                151- 200 11659805             1301-1400    264273
                201- 250 11587328             1401-1500    206259
                251- 300 11517310             1501-1600    150962
                301- 350 10461287             1601-1700    113754
                351- 400  8118009             1701-1800     87544
                401- 450  6902145             1801-1900     75861
                451- 500  5555638             1901-2000     62625
                501- 550  3862320             2001-2100     50750
                551- 600  2938724             2101-2200     48896
                601- 650  2160966             2201-2300     37925
                651- 700  1693914             2301-2400     31162
                701- 750  1444168             2401-2500     26642
                751- 800  1239634             >2500        204239
                801- 850   955733
                851- 900   825421
                901- 950   628821
                951-1000   483005



   The average sequence length in UniProtKB/TrEMBL is   336 amino acids.

   The shortest sequence is     C4PYW0_SCHMA:     2 amino acids.
   The longest sequence is  A0A1V4K6M4_PATFA: 36991 amino acids.
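
Two figures in this section can be cross-checked against the introduction. The size table excludes fragments, so its bins should add up to the number of entries minus the number of fragments. The mean length can be estimated as amino acids divided by entries. The sketch below assumes the mean is taken over all entries, which the report does not state explicitly; the variable names are illustrative.

```python
# Counts from the size table, in order: 1-50, 51-100, ..., 951-1000,
# then 1001-1100, ..., 2401-2500, and finally >2500.
size_bins = [
    1710527, 10080564, 12093887, 11659805, 11587328,
    11517310, 10461287, 8118009, 6902145, 5555638,
    3862320, 2938724, 2160966, 1693914, 1444168,
    1239634, 955733, 825421, 628821, 483005,
    828586, 574686, 388732, 264273, 206259,
    150962, 113754, 87544, 75861, 62625,
    50750, 48896, 37925, 31162, 26642,
    204239,
]
total_entries = 120243849        # sequence entries in release 2018_07
fragments = 11171747             # number of fragments
total_amino_acids = 40506871635  # amino acids in the release

non_fragments = total_entries - fragments
bins_total = sum(size_bins)
print(bins_total == non_fragments)

# Assumption: the mean is computed over all entries, fragments included.
mean_length = total_amino_acids / total_entries
print(f"mean length: {mean_length:.2f} amino acids")
```

Under that assumption the integer part of the ratio agrees with the average of 336 amino acids stated above.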



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   141136126                1.17                                                    
   Submitted to EMBL/GenBank/DDBJ  95883749  87398181      0.80                                                    
   Journal                         39239705  36890822      0.33                                                    
   Submitted to other databases     5986342   5959978      0.05                                                    
   Thesis                             14951     14892     <0.01                                                    
   Book citation                      11378     11313     <0.01                                                    
   Patent                                 1         1     <0.01                                                    
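
The "Average per entry" column in the tables of this section is consistent with dividing the total number of lines by the total number of entries in the release (not by the number of entries carrying such a line). The sketch below reproduces the column for the reference table; the function name average_per_entry and the "<0.01" rule are inferred from the printed values and are not a documented specification.

```python
TOTAL_ENTRIES = 120243849  # sequence entries in release 2018_07


def average_per_entry(total_lines, total_entries=TOTAL_ENTRIES):
    """Format lines per entry to two decimals; show "<0.01" if it rounds to 0.00."""
    text = f"{total_lines / total_entries:.2f}"
    return "<0.01" if text == "0.00" else text


# Total number of lines per reference subtype, copied from the table above.
reference_totals = {
    "References (RL)": 141136126,
    "Submitted to EMBL/GenBank/DDBJ": 95883749,
    "Journal": 39239705,
    "Submitted to other databases": 5986342,
    "Thesis": 14951,
    "Book citation": 11378,
    "Patent": 1,
}

reference_averages = {
    name: average_per_entry(total) for name, total in reference_totals.items()
}
for name, avg in reference_averages.items():
    print(f"{name:<34}{reference_totals[name]:>10}  {avg:>6}")
```

The same function applied to the totals of the comment, feature and cross-reference tables gives the averages printed there, for example 1.36 for Comments (CC).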

Total number of distinct authors cited in UniProtKB/TrEMBL: 680947


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     163807021                1.36                                                    
   CATALYTIC ACTIVITY              13785224  12536308      0.11     5                                              
   CAUTION                         66979763  65495692      0.56     1                                              
   COFACTOR                         6480157   5912314      0.05     8                                              
   DOMAIN                            918156    869065      0.01     9                                              
   ENZYME REGULATION                 330204    330202     <0.01    11                                              
   FUNCTION                        15921833  15159638      0.13     3                                              
   INTERACTION                         3080      3080     <0.01    12                                              
   MISCELLANEOUS                     546370    537982     <0.01    10                                              
   PATHWAY                          7011048   6313502      0.06     7                                              
   SIMILARITY                      29515102  29113756      0.25     2                                              
   SUBCELLULAR LOCATION            13987834  13952950      0.12     4                                              
   SUBUNIT                          8328250   8237115      0.07     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     305436236                2.54                                                    
   ACT_SITE                         7147764   4346467      0.06     9                                              
   BINDING                         15675913   4026691      0.13     5                                              
   CARBOHYD                           21296     20153     <0.01    23                                              
   CHAIN                            8669599   8657761      0.07     7                                              
   COILED                          17624213  11719738      0.15     3                                              
   COMPBIAS                            4470      4470     <0.01    26                                              
   CROSSLNK                           28217     26267     <0.01    22                                              
   DISULFID                         1892583    506517      0.02    16                                              
   DNA_BIND                         3001289   2659607      0.02    13                                              
   DOMAIN                          84311796  60813376      0.70     2                                              
   INIT_MET                           37770     37770     <0.01    21                                              
   INTRAMEM                            1132       928     <0.01    27                                              
   LIPID                             293700    154070     <0.01    19                                              
   METAL                           12624102   3358137      0.10     6                                              
   MOD_RES                          2263476   2033902      0.02    14                                              
   MOTIF                            1383766    965010      0.01    17                                              
   NON_STD                             5009      4816     <0.01    25                                              
   NON_TER                         16880148  11222669      0.14     4                                              
   NP_BIND                          6758001   4267490      0.06    10                                              
   PEPTIDE                              682       417     <0.01    28                                              
   PROPEP                             17089     17089     <0.01    24                                              
   REGION                           5025770   2675109      0.04    11                                              
   REPEAT                           4684076   1119371      0.04    12                                              
   SIGNAL                           8645150   8645140      0.07     8                                              
   SITE                             1951657   1156875      0.02    15                                              
   TOPO_DOM                          257614     99875     <0.01    20                                              
   TRANSIT                              133       133     <0.01    29                                              
   TRANSMEM                       105810969  23302915      0.88     1                                              
   ZN_FING                           418852    330372     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             1427957991               11.88                                                    
   Allergome                           3949      3184     <0.01    90   Protein family/group databases             
   ArachnoServer                        200       200     <0.01   113   Organism-specific databases                
   Araport                            15238     15171     <0.01    81   Organism-specific databases                
   BRENDA                              9568      9278     <0.01    84   Enzyme and pathway databases               
   Bgee                              534182    533882     <0.01    48   Gene expression databases                  
   BindingDB                            260       260     <0.01   112   Chemistry                                  
   BioCyc                           6073629   6055343      0.05    29   Enzyme and pathway databases               
   CAZy                              129092    120803     <0.01    58   Protein family/group databases             
   CDD                             21099329  18559937      0.18    14   Family and domain databases                
   CGD                                20805     20739     <0.01    79   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   130   2D gel databases                           
   CORUM                                114       114     <0.01   119   Protein-protein interaction databases      
   CTD                              1137852   1135895      0.01    40   Organism-specific databases                
   CarbonylDB                           265       265     <0.01   111   PTM databases                              
   ChEMBL                               965       965     <0.01   104   Chemistry                                  
   ChiTaRS                           131488    131487     <0.01    57   Other                                      
   CollecTF                             199       199     <0.01   114   Gene expression databases                  
   ComplexPortal                        157       121     <0.01   117   Protein-protein interaction databases      
   ConoServer                           159       159     <0.01   115   Organism-specific databases                
   DIP                                 3218      3217     <0.01    93   Protein-protein interaction databases      
   DNASU                              41306     40867     <0.01    71   Protocols and materials databases          
   DisProt                               96        96     <0.01   121   3D structure databases                     
   DrugBank                             742       449     <0.01   105   Chemistry                                  
   ELM                                  107       107     <0.01   120   Protein-protein interaction databases      
   EMBL                           130901137 116382363      1.09     3   Sequence databases                         
   EPD                                14152     14152     <0.01    82   Proteomic databases                        
   ESTHER                             74552     74254     <0.01    64   Protein family/group databases             
   Ensembl                          1908039   1864114      0.02    34   Genome annotation databases                
   EnsemblBacteria                 39180104  36966740      0.33    10   Genome annotation databases                
   EnsemblFungi                     6182800   6076500      0.05    28   Genome annotation databases                
   EnsemblMetazoa                   1178955   1126765      0.01    39   Genome annotation databases                
   EnsemblPlants                    2164456   1964933      0.02    33   Genome annotation databases                
   EnsemblProtists                  1872786   1760841      0.02    35   Genome annotation databases                
   EuPathDB                          671630    670935      0.01    44   Organism-specific databases                
   EvolutionaryTrace                   5945      5945     <0.01    87   Other                                      
   ExpressionAtlas                   643201    642942      0.01    45   Gene expression databases                  
   FlyBase                           208353    207056     <0.01    55   Organism-specific databases                
   GO                             219143245  78578233      1.82     2   Ontologies                                 
   Gene3D                          51722212  43061762      0.43     8   Family and domain databases                
   GeneCards                           1315      1296     <0.01   101   Organism-specific databases                
   GeneDB                            114676    112896     <0.01    60   Genome annotation databases                
   GeneID                          10485780  10379434      0.09    21   Genome annotation databases                
   GeneTree                         1831584   1831506      0.02    36   Phylogenomic databases                     
   Genevisible                        15848     15841     <0.01    80   Gene expression databases                  
   GenomeRNAi                         30028     30028     <0.01    76   Other                                      
   GlyConnect                            13        13     <0.01   126   PTM databases                              
   Gramene                          2164456   1964933      0.02    32   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   131   Chemistry                                  
   H-InvDB                              587       440     <0.01   107   Organism-specific databases                
   HAMAP                           13159293  13011552      0.11    20   Family and domain databases                
   HGNC                               52006     51910     <0.01    68   Organism-specific databases                
   HOGENOM                          2998504   2998423      0.02    30   Phylogenomic databases                     
   HOVERGEN                          300387    300374     <0.01    52   Phylogenomic databases                     
   InParanoid                       2347050   2347050      0.02    31   Phylogenomic databases                     
   IntAct                             28727     28727     <0.01    77   Protein-protein interaction databases      
   InterPro                       302546206  92727589      2.52     1   Family and domain databases                
   KEGG                            16130860  15708492      0.13    16   Genome annotation databases                
   KO                               7077362   7048433      0.06    24   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    96   Organism-specific databases                
   Leproma                             1271      1269     <0.01   102   Organism-specific databases                
   MEROPS                            243563    243562     <0.01    54   Protein family/group databases             
   MGI                                61803     61368     <0.01    65   Organism-specific databases                
   MIM                                    4         4     <0.01   132   Organism-specific databases                
   MINT                                2630      2630     <0.01    95   Protein-protein interaction databases      
   MalaCards                             12        12     <0.01   127   Organism-specific databases                
   MaxQB                              42428     42428     <0.01    70   Proteomic databases                        
   MoonDB                                 1         1     <0.01   136   Protein family/group databases             
   MoonProt                              64        64     <0.01   123   Protein family/group databases             
   OGP                                    3         3     <0.01   133   2D gel databases                           
   OMA                              6855057   6854980      0.06    25   Phylogenomic databases                     
   OpenTargets                        49941     49892     <0.01    69   Organism-specific databases                
   OrthoDB                         14257002  14256882      0.12    18   Phylogenomic databases                     
   PANTHER                         24589279  23734061      0.20    13   Family and domain databases                
   PATRIC                          17407631  17400486      0.14    15   Genome annotation databases                
   PDB                                37078     18219     <0.01    72   3D structure databases                     
   PDBsum                             36527     17874     <0.01    73   3D structure databases                     
   PIR                               162693    130450     <0.01    56   Sequence databases                         
   PIRSF                           10366576  10279751      0.09    22   Family and domain databases                
   PMAP-CutDB                           130       130     <0.01   118   Other                                      
   PRIDE                             341047    341047     <0.01    50   Proteomic databases                        
   PRINTS                          15677508  14138235      0.13    17   Family and domain databases                
   PRO                                 2260      2260     <0.01    98   Other                                      
   PROSITE                         59822228  39872218      0.50     7   Family and domain databases                
   PaxDb                             328130    328130     <0.01    51   Proteomic databases                        
   PeptideAtlas                      129018    129018     <0.01    59   Proteomic databases                        
   PeroxiBase                          2475      2467     <0.01    97   Protein family/group databases             
   Pfam                           116412442  84576904      0.97     4   Family and domain databases                
   PharmGKB                            3136      3136     <0.01    94   Organism-specific databases                
   PhosphoSitePlus                     2242      2242     <0.01    99   PTM databases                              
   PhylomeDB                         461358    461358     <0.01    49   Phylogenomic databases                     
   PomBase                                2         2     <0.01   134   Organism-specific databases                
   ProDom                           1696586   1624401      0.01    37   Family and domain databases                
   ProMEX                              3275      3275     <0.01    92   Proteomic databases                        
   ProteinModelPortal               7207227   7207227      0.06    23   3D structure databases                     
   Proteomes                       99793454  94698954      0.83     5   Other                                      
   PseudoCAP                           4449      4445     <0.01    89   Organism-specific databases                
   REBASE                             31478     31467     <0.01    75   Protein family/group databases             
   REPRODUCTION-2DPAGE                   62        61     <0.01   124   2D gel databases                           
   RGD                                21625     20732     <0.01    78   Organism-specific databases                
   Reactome                          276229    100066     <0.01    53   Enzyme and pathway databases               
   RefSeq                          44180454  43100764      0.37     9   Sequence databases                         
   SABIO-RK                             633       633     <0.01   106   Enzyme and pathway databases               
   SFLD                             1075775    559125      0.01    41   Family and domain databases                
   SGD                                    7         7     <0.01   128   Organism-specific databases                
   SIGNOR                                 7         7     <0.01   129   Enzyme and pathway databases               
   SMART                           28297826  21502817      0.24    11   Family and domain databases                
   SMR                              1203395   1203395      0.01    38   3D structure databases                     
   STRING                           6443483   6443234      0.05    27   Protein-protein interaction databases      
   SUPFAM                          77449650  61319406      0.64     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   135   2D gel databases                           
   SignaLink                           3798      3798     <0.01    91   Enzyme and pathway databases               
   SwissLipids                           82        82     <0.01   122   Chemistry                                  
   SwissPalm                           1973      1973     <0.01   100   PTM databases                              
   TAIR                               11899     11837     <0.01    83   Organism-specific databases                
   TCDB                                8168      8157     <0.01    85   Protein family/group databases             
   TIGRFAMs                        24698286  22720987      0.21    12   Family and domain databases                
   TopDownProteomics                    280       280     <0.01   110   Proteomic databases                        
   TreeFam                           558543    558507     <0.01    47   Phylogenomic databases                     
   TubercuList                         1000       999     <0.01   103   Organism-specific databases                
   UCSC                               93162     92957     <0.01    61   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   125   PTM databases                              
   UniGene                           866109    732349      0.01    42   Sequence databases                         
   UniLectin                            158       158     <0.01   116   Protein family/group databases             
   UniPathway                       6814400   6292474      0.06    26   Enzyme and pathway databases               
   VGNC                               78529     78529     <0.01    62   Organism-specific databases                
   VectorBase                        578280    559614     <0.01    46   Genome annotation databases                
   WBParaSite                        854112    845705      0.01    43   Genome annotation databases                
   World-2DPAGE                         316       311     <0.01   109   2D gel databases                           
   WormBase                           55874     55490     <0.01    66   Organism-specific databases                
   Xenbase                            34432     34353     <0.01    74   Organism-specific databases                
   ZFIN                               53083     52461     <0.01    67   Organism-specific databases                
   dictyBase                           7987      7765     <0.01    86   Organism-specific databases                
   eggNOG                          13813818   6924169      0.11    19   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    63   Organism-specific databases                
   iPTMnet                             5143      5143     <0.01    88   PTM databases                              
   mycoCLAP                             447       447     <0.01   108   Protein family/group databases             

Number of explicitly cross-referenced databases: 156


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.10   Gln (Q) 3.78   Leu (L) 9.88   Ser (S) 6.68
   Arg (R) 5.71   Glu (E) 6.17   Lys (K) 4.99   Thr (T) 5.55
   Asn (N) 3.88   Gly (G) 7.29   Met (M) 2.38   Trp (W) 1.29
   Asp (D) 5.47   His (H) 2.19   Phe (F) 3.93   Tyr (Y) 2.92
   Cys (C) 1.20   Ile (I) 5.71   Pro (P) 4.85   Val (V) 6.88

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05

[Figure: amino acid composition of the complete database]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Val, Ser, Glu, Ile, Arg, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
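
The classification in 5.2 follows from sorting the percentages of table 5.1 in descending order. A minimal sketch with illustrative names is given below. Note that Ile and Arg are both listed as 5.71, so their relative order cannot be recovered from the two-decimal values; the order printed above presumably reflects the unrounded figures.

```python
# Composition in percent, copied from table 5.1 (standard amino acids only).
composition = {
    "Ala": 9.10, "Arg": 5.71, "Asn": 3.88, "Asp": 5.47, "Cys": 1.20,
    "Gln": 3.78, "Glu": 6.17, "Gly": 7.29, "His": 2.19, "Ile": 5.71,
    "Leu": 9.88, "Lys": 4.99, "Met": 2.38, "Phe": 3.93, "Pro": 4.85,
    "Ser": 6.68, "Thr": 5.55, "Trp": 1.29, "Tyr": 2.92, "Val": 6.88,
}

# Sort residue names by decreasing frequency; ties keep their input order.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```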



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1734055
Total number of entries encoded on a Plasmid: 866093
Total number of entries encoded on a Plastid: 121387
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: