Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2017_07 STATISTICS


1.  INTRODUCTION

UniProtKB/TrEMBL release 2017_07 of 05-Jul-2017 contains 88032926 sequence entries,
comprising 29627301199 amino acids.

1882496 sequences have been added since release 2017_06, the sequence data of
183 existing entries has been updated and the annotations of
14161645 entries have been revised. This represents an increase of 2%.

Number of fragments: 9008702

Protein existence (PE):              entries      %
1: Evidence at protein level          129607     0.15%
2: Evidence at transcript level      1086717     1.23%
3: Inferred from homology           21458867    24.38%
4: Predicted                        65357735    74.24%
5: Uncertain                               0     0.00%
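
The percentages in the protein existence table follow directly from the entry counts. A minimal Python sketch, assuming the counts are copied by hand from the table above (the names `pe_counts` and `pe_percentages` are illustrative and not part of any UniProt tool):

```python
# Entry counts per protein existence (PE) level, copied from the table above.
pe_counts = {
    "1: Evidence at protein level": 129607,
    "2: Evidence at transcript level": 1086717,
    "3: Inferred from homology": 21458867,
    "4: Predicted": 65357735,
    "5: Uncertain": 0,
}


def pe_percentages(counts):
    """Return each level's share of all entries, in percent, rounded to 2 decimals."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 2) for level, n in counts.items()}


total_entries = sum(pe_counts.values())
percentages = pe_percentages(pe_counts)
for level, pct in percentages.items():
    print(f"{level:<34}{pe_counts[level]:>10}{pct:>9.2f}%")
```

The five counts add up to the 88032926 entries stated in the introduction, so every entry carries exactly one PE level.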

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 747492

   The first twenty species represent 4627039 sequences:   5.3 % of the
   total number of entries.
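
The figure for the first twenty species can be checked against the frequencies in Table 2.2. The sketch below assumes the twenty frequencies and the release total are typed in by hand; the variable names are illustrative only:

```python
# Frequencies of the twenty most represented species, copied from Table 2.2.
top20_frequencies = [
    794093, 668601, 506808, 260048, 222882,
    212730, 209114, 165730, 164387, 153553,
    145216, 139338, 131362, 129376, 128576,
    127745, 121625, 118924, 114294, 112637,
]
total_entries = 88032926  # entries in release 2017_07 (from the introduction)

top20_total = sum(top20_frequencies)
top20_share = 100 * top20_total / total_entries
print(f"{top20_total} sequences: {top20_share:.1f} % of all entries")
```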


   2.1 Table of the frequency of occurrence of species

        Species represented 1x:31364
                            2x:11445
                            3x:60802
                            4x:43296
                            5x:26192
                            6x:19146
                            7x:14289
                            8x:11177
                            9x: 9128
                           10x:14227
                       11- 20x:68671
                       21- 50x:20104
                       51-100x: 9076
                         >100x:23286


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     794093  Human immunodeficiency virus 1
       2     668601  marine sediment metagenome
       3     506808  Daphnia magna
       4     260048  Arundo donax (Giant reed) (Donax arundinaceus)
       5     222882  uncultured bacterium
       6     212730  Escherichia coli
       7     209114  Bacillus cereus
       8     165730  Fundulus heteroclitus (Killifish) (Mummichog)
       9     164387  Pseudomonas fluorescens
      10     153553  Streptococcus pneumoniae
      11     145216  Triticum aestivum (Wheat)
      12     139338  Homo sapiens (Human)
      13     131362  Hepatitis C virus
      14     129376  Zea mays (Maize)
      15     128576  Helicobacter pylori (Campylobacter pylori)
      16     127745  mine drainage metagenome
      17     121625  Mycobacterium abscessus subsp. abscessus
      18     118924  Oryza sativa subsp. japonica (Rice)
      19     114294  Hepatitis B virus (HBV)
      20     112637  Pseudomonas putida (Arthrobacter siderocapsulatus)
      21     111944  uncultured Clostridium sp
      22     111477  groundwater metagenome
      23     103712  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      24     100860  Brassica napus (Rape)
      25     100062  Glycine max (Soybean) (Glycine hispida)
      26      99351  Pyramidula sp. HNHM


   2.3  Taxonomic distribution of the sequences

[Figure: taxonomic distribution of the sequences]

   Kingdom        sequences (% of the database)
    Archaea         1693366 (  2%)
    Bacteria       57908565 ( 66%)
    Eukaryota      24201072 ( 27%)
    Viruses         3104959 (  4%)
    Other           1124964 (  1%)



   Within Eukaryota:

[Figure: distribution of the sequences within Eukaryota]

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 139413 (  1%)           (  0%)
     Other Mammalia       1359143 (  6%)           (  2%)
     Other Vertebrata     2573332 ( 11%)           (  3%)
     Viridiplantae        4743272 ( 20%)           (  5%)
     Fungi                6809456 ( 28%)           (  8%)
     Insecta              2716349 ( 11%)           (  3%)
     Nematoda             1370849 (  6%)           (  2%)
     Other                4489258 ( 19%)           (  5%)
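
Both percentage columns of the Eukaryota table can be recomputed from the raw counts. This is a minimal sketch that assumes the percentages are rounded to the nearest integer, which is consistent with the values printed above; `shares` is a hypothetical helper name:

```python
# Sequence counts per category within Eukaryota, copied from the table above.
eukaryota_counts = {
    "Human": 139413,
    "Other Mammalia": 1359143,
    "Other Vertebrata": 2573332,
    "Viridiplantae": 4743272,
    "Fungi": 6809456,
    "Insecta": 2716349,
    "Nematoda": 1370849,
    "Other": 4489258,
}
total_entries = 88032926  # all entries in release 2017_07
eukaryota_total = sum(eukaryota_counts.values())


def shares(count):
    """Return (% of Eukaryota, % of the complete database) as rounded integers."""
    return (
        round(100 * count / eukaryota_total),
        round(100 * count / total_entries),
    )


for category, count in eukaryota_counts.items():
    of_eukaryota, of_database = shares(count)
    print(f"{category:<18}{count:>9} ({of_eukaryota:>3}%) ({of_database:>3}%)")
```

The category counts sum to the Eukaryota total given in the kingdom table (24201072), so the categories partition the eukaryotic sequences without overlap.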



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50 1454963             1001-1100   606190
                 51- 100 7456735             1101-1200   422290
                101- 150 8779492             1201-1300   289307
                151- 200 8385898             1301-1400   199868
                201- 250 8325968             1401-1500   156309
                251- 300 8231766             1501-1600   113873
                301- 350 7472580             1601-1700    85936
                351- 400 5792197             1701-1800    66838
                401- 450 4924460             1801-1900    57063
                451- 500 4007007             1901-2000    47556
                501- 550 2801460             2001-2100    38998
                551- 600 2147247             2101-2200    37961
                601- 650 1569406             2201-2300    28919
                651- 700 1233866             2301-2400    23488
                701- 750 1053835             2401-2500    20198
                751- 800  914964             >2500       157437
                801- 850  697806
                851- 900  611262
                901- 950  457487
                951-1000  353594
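
The size table can be treated as a histogram. The sketch below, with the bin counts copied by hand from the table above, checks that the bins add up to the number of non-fragment entries and computes the share of sequences of at most 500 residues. The names `size_bins` and `cumulative_count` are illustrative assumptions:

```python
total_entries = 88032926  # from the introduction
fragments = 9008702       # "Number of fragments" in the introduction

# (upper bound of bin, count) pairs copied from the table above.
# None marks the open-ended ">2500" bin.
size_bins = [
    (50, 1454963), (100, 7456735), (150, 8779492), (200, 8385898),
    (250, 8325968), (300, 8231766), (350, 7472580), (400, 5792197),
    (450, 4924460), (500, 4007007), (550, 2801460), (600, 2147247),
    (650, 1569406), (700, 1233866), (750, 1053835), (800, 914964),
    (850, 697806), (900, 611262), (950, 457487), (1000, 353594),
    (1100, 606190), (1200, 422290), (1300, 289307), (1400, 199868),
    (1500, 156309), (1600, 113873), (1700, 85936), (1800, 66838),
    (1900, 57063), (2000, 47556), (2100, 38998), (2200, 37961),
    (2300, 28919), (2400, 23488), (2500, 20198), (None, 157437),
]


def cumulative_count(bins, max_length):
    """Count sequences in bins whose upper bound does not exceed max_length."""
    return sum(n for upper, n in bins if upper is not None and upper <= max_length)


non_fragments = sum(n for _, n in size_bins)
up_to_500 = cumulative_count(size_bins, 500)
print(non_fragments == total_entries - fragments)
print(f"{100 * up_to_500 / non_fragments:.1f}% have at most 500 residues")
```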

[Figure: sequence size distribution]


   The average sequence length in UniProtKB/TrEMBL is   336 amino acids.
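
The average can be related to the totals given in the introduction. The sketch below divides the total number of amino acids by the number of entries; that the printed figure is obtained by truncating this ratio is an assumption, not something the release notes state:

```python
total_amino_acids = 29627301199  # from the introduction
total_entries = 88032926         # from the introduction

exact_average = total_amino_acids / total_entries       # floating-point ratio
truncated_average = total_amino_acids // total_entries  # integer division

print(f"exact: {exact_average:.2f}, truncated: {truncated_average}")
```

Integer division reproduces the value printed above, while the unrounded ratio is roughly half a residue higher.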

   The shortest sequence is C4PYW0_SCHMA:     2 amino acids.
   The longest sequence is  A0A1V4K6M4_P: 36991 amino acids.



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   104551643                1.19                                                    
   Submitted to EMBL/GenBank/DDBJ  65322454  58838812      0.74                                                    
   Journal                         34302016  32261882      0.39                                                    
   Submitted to other databases     4902835   4879629      0.06                                                    
   Thesis                             13077     13018     <0.01                                                    
   Book citation                      11260     11195     <0.01                                                    
   Patent                                 1         1     <0.01                                                    

Total number of distinct authors cited in UniProtKB/TrEMBL: 635398


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     116538630                1.32                                                    
   CATALYTIC ACTIVITY              10019180   9184098      0.11     5                                              
   CAUTION                         45516769  44508380      0.52     1                                              
   COFACTOR                         4542693   4167797      0.05     8                                              
   DOMAIN                            611509    588128      0.01     9                                              
   ENZYME REGULATION                 193228    193226     <0.01    11                                              
   FUNCTION                        11356368  10854197      0.13     3                                              
   INTERACTION                         2281      2281     <0.01    12                                              
   MISCELLANEOUS                     339978    335286     <0.01    10                                              
   PATHWAY                          5047751   4592838      0.06     7                                              
   SIMILARITY                      21562424  21301697      0.24     2                                              
   SUBCELLULAR LOCATION            11331492  10394013      0.13     4                                              
   SUBUNIT                          6014957   5939582      0.07     6                                              

Total number of comment topics: 12
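
The "Average per entry" and "Rank" columns can be derived from the "Total number" column alone. In this sketch the average is the line total divided by all entries in the release, and values that would round to 0.00 are shown as `<0.01`; that formatting rule is an assumption inferred from the table, and the helper names are hypothetical:

```python
total_entries = 88032926  # entries in release 2017_07

# Total number of lines per comment topic, copied from the table above.
comment_totals = {
    "CATALYTIC ACTIVITY": 10019180,
    "CAUTION": 45516769,
    "COFACTOR": 4542693,
    "DOMAIN": 611509,
    "ENZYME REGULATION": 193228,
    "FUNCTION": 11356368,
    "INTERACTION": 2281,
    "MISCELLANEOUS": 339978,
    "PATHWAY": 5047751,
    "SIMILARITY": 21562424,
    "SUBCELLULAR LOCATION": 11331492,
    "SUBUNIT": 6014957,
}


def format_average(total, entries=total_entries):
    """Format lines per entry with two decimals, or '<0.01' for tiny values."""
    average = total / entries
    return "<0.01" if average < 0.005 else f"{average:.2f}"


# Rank 1 is the topic with the most lines.
ranked = sorted(comment_totals, key=comment_totals.get, reverse=True)
ranks = {topic: position for position, topic in enumerate(ranked, start=1)}

for topic, total in comment_totals.items():
    print(f"{topic:<22}{total:>10}  {format_average(total):>6}  {ranks[topic]:>3}")
```

The same calculation applies to the feature and cross-reference tables, since they use the same columns.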


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     214054945                2.43                                                    
   ACT_SITE                         4770023   2907831      0.05     9                                              
   BINDING                         10349722   2667721      0.12     4                                              
   CARBOHYD                            1564       899     <0.01    26                                              
   CHAIN                            5849713   5847591      0.07     7                                              
   COILED                           5872466   3939474      0.07     6                                              
   COMPBIAS                            3512      3512     <0.01    24                                              
   CROSSLNK                           19358     17693     <0.01    21                                              
   DISULFID                          907457    244632      0.01    15                                              
   DNA_BIND                         2142993   1890070      0.02    13                                              
   DOMAIN                          63075002  45566344      0.72     2                                              
   INIT_MET                           21817     21817     <0.01    20                                              
   INTRAMEM                             955       745     <0.01    27                                              
   LIPID                              16182     14556     <0.01    22                                              
   METAL                            8201373   2205977      0.09     5                                              
   MOD_RES                           735501    711283      0.01    16                                              
   MOTIF                             515780    377200      0.01    17                                              
   NON_STD                             2687      2496     <0.01    25                                              
   NON_TER                         14040958   9047478      0.16     3                                              
   NP_BIND                          4538727   2938720      0.05    10                                              
   PEPTIDE                               73        73     <0.01    29                                              
   PROPEP                             10928     10928     <0.01    23                                              
   REGION                           3102961   1624708      0.04    11                                              
   REPEAT                           2416603    676344      0.03    12                                              
   SIGNAL                           5845523   5845514      0.07     8                                              
   SITE                             1142847    644752      0.01    14                                              
   TOPO_DOM                           88877     29584     <0.01    19                                              
   TRANSIT                               94        94     <0.01    28                                              
   TRANSMEM                        80102316  17682979      0.91     1                                              
   ZN_FING                           278933    211381     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             1068529605               12.14                                                    
   Allergome                           3874      3143     <0.01    89   Protein family/group databases             
   ArachnoServer                        203       203     <0.01   109   Organism-specific databases                
   Araport                            19694     19610     <0.01    77   Organism-specific databases                
   BRENDA                              9647      9356     <0.01    82   Enzyme and pathway databases               
   Bgee                              359770    359720     <0.01    49   Gene expression databases                  
   BindingDB                            224       224     <0.01   108   Chemistry                                  
   BioCyc                           3465001   3463759      0.04    29   Enzyme and pathway databases               
   CAZy                              129629    121314     <0.01    57   Protein family/group databases             
   CDD                             11044844  10521118      0.13    19   Family and domain databases                
   CGD                                20816     20750     <0.01    76   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   121   2D gel databases                           
   CTD                               777639    775758      0.01    41   Organism-specific databases                
   ChEMBL                               871       871     <0.01   101   Chemistry                                  
   ChiTaRS                            86196     86037     <0.01    61   Other                                      
   CollecTF                             202       202     <0.01   110   Gene expression databases                  
   ConoServer                           160       160     <0.01   111   Organism-specific databases                
   DIP                                 3288      3282     <0.01    91   Protein-protein interaction databases      
   DNASU                              41380     40941     <0.01    70   Protocols and materials databases          
   DrugBank                             614       344     <0.01   102   Chemistry                                  
   EMBL                            96231329  84943349      1.09     3   Sequence databases                         
   EPD                                 7173      7173     <0.01    85   Proteomic databases                        
   ESTHER                             70484     70188     <0.01    63   Protein family/group databases             
   Ensembl                          1226214   1203465      0.01    36   Genome annotation databases                
   EnsemblBacteria                 41082366  38844489      0.47     9   Genome annotation databases                
   EnsemblFungi                     5494382   5343811      0.06    27   Genome annotation databases                
   EnsemblMetazoa                   1074571   1049370      0.01    38   Genome annotation databases                
   EnsemblPlants                    1754979   1643279      0.02    33   Genome annotation databases                
   EnsemblProtists                  1858061   1749167      0.02    32   Genome annotation databases                
   EuPathDB                          583473    583473      0.01    44   Organism-specific databases                
   EvolutionaryTrace                   6024      6024     <0.01    86   Other                                      
   ExpressionAtlas                   279003    279003     <0.01    51   Gene expression databases                  
   FlyBase                           222759    221294     <0.01    55   Organism-specific databases                
   GO                             161879834  57566525      1.84     2   Ontologies                                 
   Gene3D                          35429612  29824332      0.40    10   Family and domain databases                
   GeneDB                            114837    113058     <0.01    59   Genome annotation databases                
   GeneID                           9791505   9682315      0.11    20   Genome annotation databases                
   GeneTree                         1207053   1206921      0.01    37   Phylogenomic databases                     
   Genevisible                        16351     16351     <0.01    79   Gene expression databases                  
   GenomeRNAi                         30316     30316     <0.01    73   Other                                      
   Gramene                          1714552   1608181      0.02    34   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   120   Chemistry                                  
   H-InvDB                              590       443     <0.01   104   Organism-specific databases                
   HAMAP                            8833737   8722252      0.10    21   Family and domain databases                
   HGNC                               50604     50510     <0.01    67   Organism-specific databases                
   HOGENOM                          3046771   3046676      0.03    30   Phylogenomic databases                     
   HOVERGEN                          300691    300679     <0.01    50   Phylogenomic databases                     
   InParanoid                       2505312   2505207      0.03    31   Phylogenomic databases                     
   IntAct                             18676     18676     <0.01    78   Protein-protein interaction databases      
   InterPro                       198116949  69228719      2.25     1   Family and domain databases                
   KEGG                            13362501  12973867      0.15    17   Genome annotation databases                
   KO                               5732883   5708852      0.07    26   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    94   Organism-specific databases                
   Leproma                             1271      1269     <0.01    98   Organism-specific databases                
   MEROPS                            251406    251405     <0.01    53   Protein family/group databases             
   MGI                                59940     59561     <0.01    65   Organism-specific databases                
   MIM                                    4         4     <0.01   122   Organism-specific databases                
   MINT                                9753      9752     <0.01    81   Protein-protein interaction databases      
   MalaCards                              9         9     <0.01   117   Organism-specific databases                
   MaxQB                              41696     41696     <0.01    69   Proteomic databases                        
   MoonProt                               3         3     <0.01   123   Protein family/group databases             
   OGP                                    3         3     <0.01   124   2D gel databases                           
   OMA                              6514040   6514033      0.07    25   Phylogenomic databases                     
   OpenTargets                        48598     48552     <0.01    68   Organism-specific databases                
   OrthoDB                         14613365  14613334      0.17    14   Phylogenomic databases                     
   PANTHER                         14048962  13502209      0.16    16   Family and domain databases                
   PATRIC                          18447027  18446944      0.21    12   Genome annotation databases                
   PDB                                33458     16598     <0.01    71   3D structure databases                     
   PIR                               163297    131045     <0.01    56   Sequence databases                         
   PIRSF                            7492793   7431056      0.09    23   Family and domain databases                
   PMAP-CutDB                           131       131     <0.01   112   Other                                      
   PRIDE                             277155    277155     <0.01    52   Proteomic databases                        
   PRINTS                          12085699  10895703      0.14    18   Family and domain databases                
   PRO                                 2256      2256     <0.01    96   Other                                      
   PROSITE                         44642340  29647733      0.51     7   Family and domain databases                
   PaxDb                             602041    602041      0.01    43   Proteomic databases                        
   PeptideAtlas                      119289    119289     <0.01    58   Proteomic databases                        
   PeroxiBase                          2482      2474     <0.01    95   Protein family/group databases             
   Pfam                            86684308  63069635      0.98     4   Family and domain databases                
   PharmGKB                            3154      3154     <0.01    92   Organism-specific databases                
   PhosphoSitePlus                     2236      2236     <0.01    97   PTM databases                              
   PhylomeDB                         470686    470686      0.01    48   Phylogenomic databases                     
   PomBase                               32        32     <0.01   115   Organism-specific databases                
   ProDom                           1354198   1290045      0.02    35   Family and domain databases                
   ProMEX                              3060      3060     <0.01    93   Proteomic databases                        
   ProteinModelPortal               7582600   7582600      0.09    22   3D structure databases                     
   Proteomes                       75513537  72403856      0.86     5   Other                                      
   PseudoCAP                           4463      4457     <0.01    88   Organism-specific databases                
   REBASE                             32370     32352     <0.01    72   Protein family/group databases             
   REPRODUCTION-2DPAGE                   63        62     <0.01   114   2D gel databases                           
   RGD                                25124     23797     <0.01    75   Organism-specific databases                
   Reactome                          241301     87877     <0.01    54   Enzyme and pathway databases               
   RefSeq                          42378104  41415839      0.48     8   Sequence databases                         
   SABIO-RK                             605       605     <0.01   103   Enzyme and pathway databases               
   SFLD                              572184    376458      0.01    47   Family and domain databases                
   SGD                                    7         7     <0.01   119   Organism-specific databases                
   SIGNOR                                 8         8     <0.01   118   Enzyme and pathway databases               
   SMART                           21193511  16126437      0.24    11   Family and domain databases                
   SMR                              1044906   1044906      0.01    39   3D structure databases                     
   STRING                           6562153   6562152      0.07    24   Protein-protein interaction databases      
   SUPFAM                          57176943  45250905      0.65     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   125   2D gel databases                           
   SignaLink                           3819      3819     <0.01    90   Enzyme and pathway databases               
   SwissLipids                           76        76     <0.01   113   Chemistry                                  
   SwissPalm                           1219      1219     <0.01    99   PTM databases                              
   TAIR                               15894     15816     <0.01    80   Organism-specific databases                
   TCDB                                7735      7719     <0.01    84   Protein family/group databases             
   TIGRFAMs                        17959275  16504903      0.20    13   Family and domain databases                
   TopDownProteomics                    283       283     <0.01   107   Proteomic databases                        
   TreeFam                           577719    577705      0.01    45   Phylogenomic databases                     
   TubercuList                         1005      1004     <0.01   100   Organism-specific databases                
   UCSC                               94109     93914     <0.01    60   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   116   PTM databases                              
   UniGene                           717086    617124      0.01    42   Sequence databases                         
   UniPathway                       4934836   4583273      0.06    28   Enzyme and pathway databases               
   VectorBase                        572261    553502      0.01    46   Genome annotation databases                
   WBParaSite                        854108    845701      0.01    40   Genome annotation databases                
   World-2DPAGE                         317       312     <0.01   106   2D gel databases                           
   WormBase                           65802     65412     <0.01    64   Organism-specific databases                
   Xenbase                            26629     26571     <0.01    74   Organism-specific databases                
   ZFIN                               53008     52353     <0.01    66   Organism-specific databases                
   dictyBase                           7988      7766     <0.01    83   Organism-specific databases                
   eggNOG                          14243014   7138674      0.16    15   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    62   Organism-specific databases                
   iPTMnet                             4970      4970     <0.01    87   PTM databases                              
   mycoCLAP                             448       448     <0.01   105   Protein family/group databases             

Number of explicitly cross-referenced databases: 147
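
Because the cross-reference table is fixed-width text, its rows can be parsed with a regular expression. The following is a sketch under stated assumptions: database names contain no spaces, the category is everything after the rank, and the pattern and the `parse_row` helper are illustrative rather than an official parser. The sample rows are copied from the table above.

```python
import re
from collections import defaultdict

# name, total lines, entries, average ("<0.01" or a decimal), rank, category
ROW = re.compile(
    r"^\s+(?P<name>\S+)\s+(?P<total>\d+)\s+(?P<entries>\d+)"
    r"\s+(?P<average><?\d+\.\d+)\s+(?P<rank>\d+)\s+(?P<category>.+?)\s*$"
)


def parse_row(line):
    """Parse one table row into a dict, or return None if it is not a row."""
    match = ROW.match(line)
    if match is None:
        return None
    row = match.groupdict()
    for key in ("total", "entries", "rank"):
        row[key] = int(row[key])
    return row


sample = [
    "Cross-references (DR)             1068529605               12.14",
    "   InterPro                       198116949  69228719      2.25     1   Family and domain databases",
    "   PDB                                33458     16598     <0.01    71   3D structure databases",
    "   Pfam                            86684308  63069635      0.98     4   Family and domain databases",
]

rows = [row for row in map(parse_row, sample) if row is not None]

# Sum the line totals per database category.
totals_by_category = defaultdict(int)
for row in rows:
    totals_by_category[row["category"]] += row["total"]

for category, total in totals_by_category.items():
    print(f"{category}: {total}")
```

The summary line for all cross-references has no entries or rank column, so the pattern deliberately skips it.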


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.02   Gln (Q) 3.80   Leu (L) 9.85   Ser (S) 6.74
   Arg (R) 5.70   Glu (E) 6.16   Lys (K) 5.03   Thr (T) 5.57
   Asn (N) 3.91   Gly (G) 7.22   Met (M) 2.39   Trp (W) 1.29
   Asp (D) 5.44   His (H) 2.20   Phe (F) 3.93   Tyr (Y) 2.94
   Cys (C) 1.23   Ile (I) 5.71   Pro (P) 4.86   Val (V) 6.86

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05

[Figure: amino acid composition]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Val, Ser, Glu, Ile, Arg, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
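
The ordering in 5.2 is the composition table of 5.1 sorted by decreasing percentage. A minimal sketch, assuming the twenty standard residues and their percentages are copied by hand from 5.1 (the dictionary name is illustrative):

```python
# Composition in percent for the complete database, copied from section 5.1.
composition = {
    "Ala": 9.02, "Gln": 3.80, "Leu": 9.85, "Ser": 6.74,
    "Arg": 5.70, "Glu": 6.16, "Lys": 5.03, "Thr": 5.57,
    "Asn": 3.91, "Gly": 7.22, "Met": 2.39, "Trp": 1.29,
    "Asp": 5.44, "His": 2.20, "Phe": 3.93, "Tyr": 2.94,
    "Cys": 1.23, "Ile": 5.71, "Pro": 4.86, "Val": 6.86,
}

# Most frequent residue first.
by_frequency = sorted(composition, key=composition.get, reverse=True)
print(", ".join(by_frequency))
```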



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1545716
Total number of entries encoded on a Plasmid: 686317
Total number of entries encoded on a Plastid: 95492
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: