Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2017_11 STATISTICS


1.  INTRODUCTION

Release 2017_11 of UniProtKB/TrEMBL, dated 22-Nov-2017, contains 98705220 sequence
entries, comprising 33221844964 amino acids.

5772810 sequences have been added since release 2017_10, the sequence data of
35356 existing entries has been updated, and the annotations of
25972373 entries have been revised. The added sequences represent an increase
of about 6% in the number of entries.
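
The 6% figure can be reproduced approximately from the counts above. This is a minimal sketch, assuming that the size of release 2017_10 equals the current entry count minus the added sequences; that assumption ignores any entries deleted between releases, which the text does not report. The variable names are illustrative only.

```python
# Counts quoted in the introduction.
entries_2017_11 = 98_705_220
added_since_2017_10 = 5_772_810

# Assumption: no entries were removed between releases, so the previous
# release size is the current size minus the additions.
entries_2017_10_estimate = entries_2017_11 - added_since_2017_10

# Relative growth with respect to the estimated previous release size.
growth_percent = 100 * added_since_2017_10 / entries_2017_10_estimate
print(f"Estimated growth: {growth_percent:.1f}%")
```

Under this assumption the result rounds to the 6% stated in the text.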

Number of fragments: 9485379

Protein existence (PE):              entries      %
1: Evidence at protein level          134075     0.14%
2: Evidence at transcript level      1102156     1.12%
3: Inferred from homology           23578044    23.89%
4: Predicted                        73890945    74.86%
5: Uncertain                               0     0.00%
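
The percentage column of the protein existence table follows directly from the entry counts. A short sketch that recomputes it is given here; the dictionary name and the output layout are illustrative, not part of the release.

```python
# Entry counts per protein existence (PE) level, copied from the table.
pe_counts = {
    "1: Evidence at protein level": 134_075,
    "2: Evidence at transcript level": 1_102_156,
    "3: Inferred from homology": 23_578_044,
    "4: Predicted": 73_890_945,
    "5: Uncertain": 0,
}

# The five levels partition the database, so their sum is the entry total.
total_entries = sum(pe_counts.values())

# Share of each level, rounded to two decimals as in the table.
pe_percent = {
    label: round(100 * count / total_entries, 2)
    for label, count in pe_counts.items()
}

for label, count in pe_counts.items():
    print(f"{label:<34}{count:>10}{pe_percent[label]:>9.2f}%")
```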

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 782892

   The first twenty species represent 4891040 sequences: 5% of the
   total number of entries.
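
The figure for the first twenty species can be checked by summing the twenty largest frequencies listed in table 2.2 and dividing by the total number of entries given in the introduction. A small sketch, with illustrative variable names:

```python
# Frequencies of the twenty most represented species (table 2.2, rows 1-20).
top_twenty = [
    818730, 668601, 506809, 287830, 260016,
    241743, 235380, 207270, 169503, 165735,
    157810, 145442, 140799, 133690, 132089,
    129385, 127745, 122437, 121257, 118769,
]

# Total number of entries in the release, from the introduction.
total_entries = 98_705_220

top_twenty_sum = sum(top_twenty)
top_twenty_share = 100 * top_twenty_sum / total_entries

print(f"{top_twenty_sum} sequences, {top_twenty_share:.2f}% of all entries")
```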


   2.1 Table of the frequency of occurrence of species

        Species represented 1x: 335833
                            2x: 116738
                            3x:  62113
                            4x:  44295
                            5x:  26790
                            6x:  19470
                            7x:  14522
                            8x:  11387
                            9x:   9339
                           10x:  14473
                       11- 20x:  71860
                       21- 50x:  20483
                       51-100x:   9723
                         >100x:  25866


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     818730  Human immunodeficiency virus 1
       2     668601  marine sediment metagenome
       3     506809  Daphnia magna
       4     287830  Escherichia coli
       5     260016  Arundo donax (Giant reed) (Donax arundinaceus)
       6     241743  uncultured bacterium
       7     235380  Bacillus cereus
       8     207270  Hordeum vulgare subsp. vulgare (Domesticated barley)
       9     169503  Pseudomonas fluorescens
      10     165735  Fundulus heteroclitus (Killifish) (Mummichog)
      11     157810  Streptococcus pneumoniae
      12     145442  Triticum aestivum (Wheat)
      13     140799  Homo sapiens (Human)
      14     133690  Hepatitis C virus
      15     132089  Helicobacter pylori (Campylobacter pylori)
      16     129385  Zea mays (Maize)
      17     127745  mine drainage metagenome
      18     122437  Pseudomonas putida (Arthrobacter siderocapsulatus)
      19     121257  Mycobacterium abscessus subsp. abscessus
      20     118769  Oryza sativa subsp. japonica (Rice)
      21     116858  Hepatitis B virus (HBV)
      22     111944  uncultured Clostridium sp
      23     111477  groundwater metagenome
      24     103721  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      25     101239  Enterobacter cloacae
      26     100085  Glycine max (Soybean) (Glycine hispida)
      27      99347  Pyramidula sp. HNHM


   2.3  Taxonomic distribution of the sequences

   [Figure: taxonomic distribution of the sequences]

   Kingdom        sequences (% of the database)
    Archaea         1943400 (  2%)
    Bacteria       66780477 ( 68%)
    Eukaryota      25602119 ( 26%)
    Viruses         3246927 (  3%)
    Other           1132297 (  1%)



   Within Eukaryota:

   [Figure: distribution of the sequences within Eukaryota]

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 140874 (  1%)           (  0%)
     Other Mammalia       1417384 (  6%)           (  1%)
     Other Vertebrata     2637295 ( 10%)           (  3%)
     Viridiplantae        5019167 ( 20%)           (  5%)
     Fungi                7443677 ( 29%)           (  8%)
     Insecta              2814352 ( 11%)           (  3%)
     Nematoda             1434433 (  6%)           (  1%)
     Other                4694937 ( 18%)           (  5%)



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50  1563003             1001-1100    680451
                 51- 100  8337325             1101-1200    472439
                101- 150  9880806             1201-1300    321624
                151- 200  9484329             1301-1400    220062
                201- 250  9440993             1401-1500    172544
                251- 300  9367261             1501-1600    125693
                301- 350  8506111             1601-1700     94377
                351- 400  6583923             1701-1800     73157
                401- 450  5597581             1801-1900     62881
                451- 500  4534121             1901-2000     51711
                501- 550  3160939             2001-2100     42542
                551- 600  2415085             2101-2200     41158
                601- 650  1766517             2201-2300     31444
                651- 700  1388185             2301-2400     25882
                701- 750  1184137             2401-2500     22053
                751- 800  1026162             >2500        170290
                801- 850   783913
                851- 900   682637
                901- 950   513791
                951-1000   394714

   [Figure: distribution of the sequences by size]


   The average sequence length in UniProtKB/TrEMBL is 336 amino acids.

   The shortest sequence is C4PYW0_SCHMA: 2 amino acids.
   The longest sequence is A0A1V4K6M4_PATFA: 36991 amino acids.
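
The reported average can be checked against the totals given in the introduction. This is a minimal sketch, assuming the average is taken over all entries, fragments included; the text does not say which entries are counted. The quotient is about 336.58, so the reported 336 corresponds to its integer part.

```python
# Totals quoted in the introduction.
total_amino_acids = 33_221_844_964
total_entries = 98_705_220

# Mean sequence length over all entries (assumption: fragments included).
mean_length = total_amino_acids / total_entries

# Integer part, which matches the value reported in this section.
mean_length_truncated = total_amino_acids // total_entries

print(f"mean length: {mean_length:.2f} amino acids")
print(f"integer part: {mean_length_truncated}")
```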



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                   116268893                1.18                                                    
   Submitted to EMBL/GenBank/DDBJ  75897624  68659688      0.77                                                    
   Journal                         35291859  33181264      0.36                                                    
   Submitted to other databases     5055068   5030911      0.05                                                    
   Thesis                             13035     12976     <0.01                                                    
   Book citation                      11306     11241     <0.01                                                    
   Patent                                 1         1     <0.01                                                    

Total number of distinct authors cited in UniProtKB/TrEMBL: 650276
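
The "Average per entry" column in the tables of this section is consistent with dividing the total number of lines by the total number of entries in the release (98705220). The sketch below reproduces the column for the reference lines. The `<0.01` rule is inferred from the tabulated values and is an assumption, as is the helper name `average_per_entry`.

```python
# Total number of entries in the release, from the introduction.
TOTAL_ENTRIES = 98_705_220


def average_per_entry(total_lines: int, total_entries: int = TOTAL_ENTRIES) -> str:
    """Format the mean number of lines per entry as shown in the tables."""
    average = total_lines / total_entries
    # Assumption inferred from the tables: averages that would round to 0.00
    # are displayed as "<0.01" instead.
    if average < 0.005:
        return "<0.01"
    return f"{average:.2f}"


# Total line counts for the reference (RL) subtypes, copied from the table.
reference_lines = {
    "References (RL)": 116_268_893,
    "Submitted to EMBL/GenBank/DDBJ": 75_897_624,
    "Journal": 35_291_859,
    "Submitted to other databases": 5_055_068,
    "Thesis": 13_035,
    "Book citation": 11_306,
    "Patent": 1,
}

for subtype, total in reference_lines.items():
    print(f"{subtype:<32}{total:>11}  {average_per_entry(total):>6}")
```

The same helper can be applied to the comment, feature and cross-reference totals listed in the following tables.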


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     128010375                1.30                                                    
   CATALYTIC ACTIVITY              11027698  10038574      0.11     5                                              
   CAUTION                         49803073  48621435      0.50     1                                              
   COFACTOR                         5066143   4639424      0.05     8                                              
   DOMAIN                            681604    653391      0.01     9                                              
   ENZYME REGULATION                 208341    208339     <0.01    11                                              
   FUNCTION                        12570582  11982998      0.13     3                                              
   INTERACTION                         2922      2922     <0.01    12                                              
   MISCELLANEOUS                     378306    372717     <0.01    10                                              
   PATHWAY                          5584741   5054071      0.06     7                                              
   SIMILARITY                      23694240  23391950      0.24     2                                              
   SUBCELLULAR LOCATION            12410092  11323785      0.13     4                                              
   SUBUNIT                          6582633   6502676      0.07     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     233109831                2.36                                                    
   ACT_SITE                         5415437   3346498      0.05     9                                              
   BINDING                         11494407   2985989      0.12     4                                              
   CARBOHYD                            2886      1855     <0.01    26                                              
   CHAIN                            5969426   5967165      0.06     7                                              
   COILED                           6252081   4214261      0.06     6                                              
   COMPBIAS                            3838      3838     <0.01    24                                              
   CROSSLNK                           21199     19384     <0.01    21                                              
   DISULFID                          953525    265109      0.01    16                                              
   DNA_BIND                         2470432   2188856      0.03    13                                              
   DOMAIN                          67879154  49008485      0.69     2                                              
   INIT_MET                           24073     24073     <0.01    20                                              
   INTRAMEM                            1028       812     <0.01    27                                              
   LIPID                              17049     15246     <0.01    22                                              
   METAL                            9106415   2454381      0.09     5                                              
   MOD_RES                          1570090   1487180      0.02    14                                              
   MOTIF                             606426    454846      0.01    17                                              
   NON_STD                             3211      3020     <0.01    25                                              
   NON_TER                         14698902   9526412      0.15     3                                              
   NP_BIND                          5086351   3259964      0.05    10                                              
   PEPTIDE                              126       126     <0.01    28                                              
   PROPEP                             12719     12719     <0.01    23                                              
   REGION                           3492546   1842347      0.04    12                                              
   REPEAT                           3738192    905066      0.04    11                                              
   SIGNAL                           5964963   5964954      0.06     8                                              
   SITE                             1279701    732845      0.01    15                                              
   TOPO_DOM                          100957     32774     <0.01    19                                              
   TRANSIT                               93        93     <0.01    29                                              
   TRANSMEM                        86607361  19047399      0.88     1                                              
   ZN_FING                           337243    265391     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------
Cross-references (DR)             1188306932               12.04                                                    
   Allergome                           3881      3142     <0.01    90   Protein family/group databases             
   ArachnoServer                        201       201     <0.01   111   Organism-specific databases                
   Araport                            19505     19421     <0.01    79   Organism-specific databases                
   BRENDA                              9623      9330     <0.01    83   Enzyme and pathway databases               
   Bgee                              546890    546737      0.01    48   Gene expression databases                  
   BindingDB                            202       202     <0.01   110   Chemistry                                  
   BioCyc                           3452587   3451350      0.03    29   Enzyme and pathway databases               
   CAZy                              129382    121081     <0.01    57   Protein family/group databases             
   CDD                             16629761  14658973      0.17    15   Family and domain databases                
   CGD                                20814     20748     <0.01    77   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   125   2D gel databases                           
   CTD                               899437    897516      0.01    40   Organism-specific databases                
   ChEMBL                               885       885     <0.01   103   Chemistry                                  
   ChiTaRS                            86104     85945     <0.01    61   Other                                      
   CollecTF                             200       200     <0.01   112   Gene expression databases                  
   ConoServer                           160       160     <0.01   113   Organism-specific databases                
   DIP                                 3237      3236     <0.01    92   Protein-protein interaction databases      
   DNASU                              41351     40912     <0.01    70   Protocols and materials databases          
   DisProt                               96        96     <0.01   115   3D structure databases                     
   DrugBank                             640       355     <0.01   104   Chemistry                                  
   EMBL                           108182916  95500163      1.10     3   Sequence databases                         
   EPD                                 9398      9398     <0.01    84   Proteomic databases                        
   ESTHER                             74016     73735     <0.01    63   Protein family/group databases             
   Ensembl                          1285699   1246100      0.01    36   Genome annotation databases                
   EnsemblBacteria                 40803436  38536530      0.41    10   Genome annotation databases                
   EnsemblFungi                     6510181   6122894      0.07    25   Genome annotation databases                
   EnsemblMetazoa                   1099921   1071535      0.01    39   Genome annotation databases                
   EnsemblPlants                    1977846   1811007      0.02    32   Genome annotation databases                
   EnsemblProtists                  1893587   1780546      0.02    34   Genome annotation databases                
   EuPathDB                          634831    634681      0.01    44   Organism-specific databases                
   EvolutionaryTrace                   6004      6004     <0.01    88   Other                                      
   ExpressionAtlas                   369245    369113     <0.01    50   Gene expression databases                  
   FlyBase                           222653    221280     <0.01    55   Organism-specific databases                
   GO                             174875558  62346956      1.77     2   Ontologies                                 
   Gene3D                          42241554  35434292      0.43     9   Family and domain databases                
   GeneCards                           1537      1517     <0.01    99   Organism-specific databases                
   GeneDB                            114834    113054     <0.01    59   Genome annotation databases                
   GeneID                          10316716  10208235      0.10    20   Genome annotation databases                
   GeneTree                         1233658   1233507      0.01    37   Phylogenomic databases                     
   Genevisible                        15915     15908     <0.01    80   Gene expression databases                  
   GenomeRNAi                         30254     30254     <0.01    75   Other                                      
   Gramene                          1942808   1810508      0.02    33   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   123   Chemistry                                  
   H-InvDB                              590       443     <0.01   106   Organism-specific databases                
   HAMAP                            9672990   9550372      0.10    21   Family and domain databases                
   HGNC                               50779     50684     <0.01    67   Organism-specific databases                
   HOGENOM                          3036286   3036196      0.03    30   Phylogenomic databases                     
   HOVERGEN                          300625    300613     <0.01    51   Phylogenomic databases                     
   InParanoid                       2461067   2461067      0.02    31   Phylogenomic databases                     
   IntAct                             19732     19732     <0.01    78   Protein-protein interaction databases      
   InterPro                       240757270  74983348      2.44     1   Family and domain databases                
   KEGG                            14665712  14262393      0.15    16   Genome annotation databases                
   KO                               6314488   6288695      0.06    27   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    95   Organism-specific databases                
   Leproma                             1271      1269     <0.01   100   Organism-specific databases                
   MEROPS                            248838    248837     <0.01    53   Protein family/group databases             
   MGI                                60975     60597     <0.01    65   Organism-specific databases                
   MIM                                    4         4     <0.01   124   Organism-specific databases                
   MINT                                9722      9721     <0.01    82   Protein-protein interaction databases      
   MalaCards                              9         9     <0.01   120   Organism-specific databases                
   MaxQB                              43142     43142     <0.01    69   Proteomic databases                        
   MoonProt                               3         3     <0.01   126   Protein family/group databases             
   OGP                                    3         3     <0.01   127   2D gel databases                           
   OMA                              6432647   6432534      0.07    26   Phylogenomic databases                     
   OpenTargets                        48812     48763     <0.01    68   Organism-specific databases                
   OrthoDB                         14508498  14508408      0.15    17   Phylogenomic databases                     
   PANTHER                         16809958  16225611      0.17    14   Family and domain databases                
   PATRIC                          18167704  18167621      0.18    13   Genome annotation databases                
   PDB                                34891     17297     <0.01    71   3D structure databases                     
   PDBsum                             34004     16759     <0.01    73   3D structure databases                     
   PIR                               163044    130801     <0.01    56   Sequence databases                         
   PIRSF                            8311614   8243562      0.08    22   Family and domain databases                
   PMAP-CutDB                           131       131     <0.01   114   Other                                      
   PRIDE                             274534    274534     <0.01    52   Proteomic databases                        
   PRINTS                          12909728  11646521      0.13    19   Family and domain databases                
   PRO                                 2209      2209     <0.01    98   Other                                      
   PROSITE                         48074458  31967602      0.49     7   Family and domain databases                
   PaxDb                             593499    593499      0.01    45   Proteomic databases                        
   PeptideAtlas                      117090    117090     <0.01    58   Proteomic databases                        
   PeroxiBase                          2481      2473     <0.01    96   Protein family/group databases             
   Pfam                            93671520  68031243      0.95     4   Family and domain databases                
   PharmGKB                            3155      3155     <0.01    93   Organism-specific databases                
   PhosphoSitePlus                     2284      2284     <0.01    97   PTM databases                              
   PhylomeDB                         469528    469528     <0.01    49   Phylogenomic databases                     
   PomBase                               31        31     <0.01   118   Organism-specific databases                
   ProDom                           1456940   1389322      0.01    35   Family and domain databases                
   ProMEX                              2678      2678     <0.01    94   Proteomic databases                        
   ProteinModelPortal               7466650   7466650      0.08    23   3D structure databases                     
   Proteomes                       84838943  81556180      0.86     5   Other                                      
   PseudoCAP                           4452      4448     <0.01    89   Organism-specific databases                
   REBASE                             32037     32022     <0.01    74   Protein family/group databases             
   REPRODUCTION-2DPAGE                   63        62     <0.01   117   2D gel databases                           
   RGD                                25120     23777     <0.01    76   Organism-specific databases                
   Reactome                          241237     86490     <0.01    54   Enzyme and pathway databases               
   RefSeq                          43605107  42651688      0.44     8   Sequence databases                         
   SABIO-RK                             624       624     <0.01   105   Enzyme and pathway databases               
   SFLD                              845285    443211      0.01    43   Family and domain databases                
   SGD                                    7         7     <0.01   122   Organism-specific databases                
   SIGNOR                                 8         8     <0.01   121   Enzyme and pathway databases               
   SMART                           22782208  17339859      0.23    11   Family and domain databases                
   SMR                              1113788   1113788      0.01    38   3D structure databases                     
   STRING                           6520471   6520362      0.07    24   Protein-protein interaction databases      
   SUPFAM                          62992936  49616924      0.64     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   128   2D gel databases                           
   SignaLink                           3806      3806     <0.01    91   Enzyme and pathway databases               
   SwissLipids                           82        82     <0.01   116   Chemistry                                  
   SwissPalm                           1218      1218     <0.01   101   PTM databases                              
   TAIR                               15736     15658     <0.01    81   Organism-specific databases                
   TCDB                                7953      7938     <0.01    86   Protein family/group databases             
   TIGRFAMs                        19612038  18018179      0.20    12   Family and domain databases                
   TopDownProteomics                    281       281     <0.01   109   Proteomic databases                        
   TreeFam                           568181    568148      0.01    46   Phylogenomic databases                     
   TubercuList                         1004      1003     <0.01   102   Organism-specific databases                
   UCSC                               93751     93553     <0.01    60   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   119   PTM databases                              
   UniGene                           846411    717778      0.01    42   Sequence databases                         
   UniPathway                       5445612   5043516      0.06    28   Enzyme and pathway databases               
   VectorBase                        557258    540626      0.01    47   Genome annotation databases                
   WBParaSite                        854112    845705      0.01    41   Genome annotation databases                
   World-2DPAGE                         316       311     <0.01   108   2D gel databases                           
   WormBase                           65522     65133     <0.01    64   Organism-specific databases                
   Xenbase                            34316     34256     <0.01    72   Organism-specific databases                
   ZFIN                               53579     53220     <0.01    66   Organism-specific databases                
   dictyBase                           7987      7765     <0.01    85   Organism-specific databases                
   eggNOG                          14161822   7098150      0.14    18   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    62   Organism-specific databases                
   iPTMnet                             6308      6308     <0.01    87   PTM databases                              
   mycoCLAP                             447       447     <0.01   107   Protein family/group databases             

Number of explicitly cross-referenced databases: 149


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 9.10   Gln (Q) 3.79   Leu (L) 9.87   Ser (S) 6.69
   Arg (R) 5.71   Glu (E) 6.16   Lys (K) 4.99   Thr (T) 5.57
   Asn (N) 3.88   Gly (G) 7.26   Met (M) 2.38   Trp (W) 1.29
   Asp (D) 5.45   His (H) 2.19   Phe (F) 3.92   Tyr (Y) 2.93
   Cys (C) 1.21   Ile (I) 5.70   Pro (P) 4.85   Val (V) 6.88

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.04

   [Figure: amino acid composition of the complete database]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Amino acids ranked by decreasing frequency

   Leu, Ala, Gly, Val, Ser, Glu, Arg, Ile, Thr, Asp, Lys, Pro, Phe, Asn,
   Gln, Tyr, Met, His, Trp, Cys
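
The order given in 5.2 can be derived from the percentages in 5.1 by sorting them from most to least frequent. A short sketch with an illustrative dictionary name; only the twenty standard amino acids are included, as in the list above.

```python
# Composition in percent for the complete database (section 5.1).
composition = {
    "Ala": 9.10, "Gln": 3.79, "Leu": 9.87, "Ser": 6.69,
    "Arg": 5.71, "Glu": 6.16, "Lys": 4.99, "Thr": 5.57,
    "Asn": 3.88, "Gly": 7.26, "Met": 2.38, "Trp": 1.29,
    "Asp": 5.45, "His": 2.19, "Phe": 3.92, "Tyr": 2.93,
    "Cys": 1.21, "Ile": 5.70, "Pro": 4.85, "Val": 6.88,
}

# Sort the three-letter codes by their percentage, highest first.
ranking = sorted(composition, key=composition.get, reverse=True)

print(", ".join(ranking))
```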



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1596745
Total number of entries encoded on a Plasmid: 776489
Total number of entries encoded on a Plastid: 108197
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: