Current Release Statistics


         UniProtKB/TrEMBL PROTEIN DATABASE RELEASE 2017_01 STATISTICS


1.  INTRODUCTION

Release 2017_01 of 18-Jan-2017 of UniProtKB/TrEMBL contains 73711881 sequence entries,
comprising 24751112664 amino acids.

3031100 sequences have been added since release 2016_11, the sequence data of
746 existing entries has been updated, and the annotations of
20906527 entries have been revised. The added sequences represent an increase
of 4% in the number of entries.
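
The 4% figure can be reproduced from the counts above. The following Python
sketch assumes the increase is measured relative to the size of the previous
release, estimated here as the current total minus the added sequences.
Entries deleted between releases are not accounted for, and the variable
names are illustrative only.

```python
# Figures quoted in the introduction.
total_entries = 73_711_881   # entries in release 2017_01
added_entries = 3_031_100    # sequences added since release 2016_11

# Assumption: the previous release size is the current total minus the
# newly added sequences (deletions between releases are ignored).
previous_entries = total_entries - added_entries

growth_percent = 100 * added_entries / previous_entries
print(f"Growth since previous release: {growth_percent:.2f}%")
```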

Number of fragments: 8492670

Protein existence (PE):              entries      %
1: Evidence at protein level          125544     0.17%
2: Evidence at transcript level      1063419     1.44%
3: Inferred from homology           17236726    23.38%
4: Predicted                        55286192    75.00%
5: Uncertain                               0     0.00%
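
As a consistency check, the percentages in the protein existence table can be
recomputed from the entry counts. This is a minimal sketch; the dictionary
keys simply repeat the table labels.

```python
# Entry counts per protein existence (PE) level, copied from the table.
pe_counts = {
    "1: Evidence at protein level": 125_544,
    "2: Evidence at transcript level": 1_063_419,
    "3: Inferred from homology": 17_236_726,
    "4: Predicted": 55_286_192,
    "5: Uncertain": 0,
}

# The PE levels partition the database, so the counts add up to the
# total number of entries given in the introduction.
total = sum(pe_counts.values())

pe_percent = {level: 100 * count / total for level, count in pe_counts.items()}
for level, count in pe_counts.items():
    print(f"{level:<34}{count:>10}{pe_percent[level]:>9.2f}%")
```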

The growth of the database is summarized below.
[Figure: growth of the database]



2.  TAXONOMIC ORIGIN

   Total number of species represented in this release of UniProtKB/TrEMBL: 703505

   The first twenty species represent 4428729 sequences: 6% of the
   total number of entries.


   2.1 Table of the frequency of occurrence of species

        Species represented 1x:29375
                            2x:10917
                            3x:58419
                            4x:42078
                            5x:25330
                            6x:18490
                            7x:13715
                            8x:10719
                            9x: 8755
                           10x:13716
                       11- 20x:62763
                       21- 50x:19287
                       51-100x: 8259
                         >100x:19050


   2.2  Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1     743808  Human immunodeficiency virus 1
       2     668601  marine sediment metagenome
       3     506807  Daphnia magna
       4     259928  Arundo donax (Giant reed) (Donax arundinaceus)
       5     201351  uncultured bacterium
       6     191907  Escherichia coli
       7     179540  Bacillus cereus
       8     165724  Fundulus heteroclitus (Killifish) (Mummichog)
       9     156305  Streptococcus pneumoniae
      10     146583  Pseudomonas fluorescens
      11     145614  Triticum aestivum (Wheat)
      12     139437  Zea mays (Maize)
      13     136539  Homo sapiens (Human)
      14     128762  Hepatitis C virus
      15     119110  Oryza sativa subsp. japonica (Rice)
      16     111944  uncultured Clostridium sp
      17     111477  groundwater metagenome
      18     110681  Hepatitis B virus (HBV)
      19     103700  Anguilla anguilla (European freshwater eel) (Muraena anguilla)
      20     100911  Klebsiella pneumoniae
      21     100850  Brassica napus (Rape)
      22     100018  Glycine max (Soybean) (Glycine hispida)
      23      99349  Pyramidula sp. HNHM
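
   The share quoted in section 2 for the first twenty species can be
   recomputed from the frequencies in table 2.2. A minimal sketch, with the
   counts copied from ranks 1 to 20 above and the release total taken from
   the introduction:

```python
TOTAL_ENTRIES = 73_711_881  # entries in release 2017_01

# Frequencies of the twenty most represented species (ranks 1 to 20).
top_twenty = [
    743_808, 668_601, 506_807, 259_928, 201_351,
    191_907, 179_540, 165_724, 156_305, 146_583,
    145_614, 139_437, 136_539, 128_762, 119_110,
    111_944, 111_477, 110_681, 103_700, 100_911,
]

top_twenty_total = sum(top_twenty)
top_twenty_share = 100 * top_twenty_total / TOTAL_ENTRIES
print(f"{top_twenty_total} sequences, {top_twenty_share:.0f}% of all entries")
```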


   2.3  Taxonomic distribution of the sequences

[Figure: taxonomic distribution of the sequences]

   Kingdom        sequences (% of the database)
    Archaea         1432418 (  2%)
    Bacteria       46458949 ( 63%)
    Eukaryota      21908067 ( 30%)
    Viruses         2884012 (  4%)
    Other           1028435 (  1%)



   Within Eukaryota:

[Figure: distribution of the sequences within Eukaryota]

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                 136614 (  1%)           ( <1%)
     Other Mammalia       1239074 (  6%)           (  2%)
     Other Vertebrata     2340820 ( 11%)           (  3%)
     Viridiplantae        4086417 ( 19%)           (  6%)
     Fungi                6015696 ( 27%)           (  8%)
     Insecta              2551923 ( 12%)           (  3%)
     Nematoda             1364919 (  6%)           (  2%)
     Other                4172604 ( 19%)           (  6%)
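
   The table above uses two different denominators: the Eukaryota total and
   the size of the complete database. The following sketch recomputes both
   columns, assuming the percentages are rounded to the nearest integer (a
   value of 0 therefore means a share below 0.5%). Names are illustrative.

```python
TOTAL_ENTRIES = 73_711_881  # entries in release 2017_01

# Sequence counts per category within Eukaryota, copied from the table.
eukaryota_counts = {
    "Human": 136_614,
    "Other Mammalia": 1_239_074,
    "Other Vertebrata": 2_340_820,
    "Viridiplantae": 4_086_417,
    "Fungi": 6_015_696,
    "Insecta": 2_551_923,
    "Nematoda": 1_364_919,
    "Other": 4_172_604,
}

# The categories add up to the Eukaryota row of the kingdom table.
eukaryota_total = sum(eukaryota_counts.values())

# (percent of Eukaryota, percent of the complete database) per category.
shares = {
    name: (round(100 * count / eukaryota_total),
           round(100 * count / TOTAL_ENTRIES))
    for name, count in eukaryota_counts.items()
}

for name, (of_eukaryota, of_database) in shares.items():
    print(f"{name:<18}{of_eukaryota:>4}%{of_database:>6}%")
```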



3.  SEQUENCE SIZE

   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50 1353932             1001-1100   503957
                 51- 100 6223584             1101-1200   352596
                101- 150 7218674             1201-1300   244351
                151- 200 6867902             1301-1400   169357
                201- 250 6818788             1401-1500   132885
                251- 300 6735015             1501-1600    96889
                301- 350 6120514             1601-1700    73190
                351- 400 4736486             1701-1800    57078
                401- 450 4027942             1801-1900    48511
                451- 500 3309923             1901-2000    40266
                501- 550 2313610             2001-2100    33117
                551- 600 1778374             2101-2200    32398
                601- 650 1299783             2201-2300    24550
                651- 700 1022540             2301-2400    19752
                701- 750  876752             2401-2500    16994
                751- 800  765524             >2500       132964
                801- 850  581076
                851- 900  512927
                901- 950  381776
                951-1000  295234

[Figure: distribution of the sequences by size]


   The average sequence length in UniProtKB/TrEMBL is 335 amino acids.

   The shortest sequence is C4PYW0_SCHMA: 2 amino acids.
   The longest sequence is Q3ASY8_CHLCH: 36805 amino acids.
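
   The average length is consistent with the totals given in the
   introduction. The sketch below divides the total number of amino acids by
   the number of entries; that the reported value is the truncated (not
   rounded) quotient is an assumption inferred from the numbers.

```python
# Totals quoted in the introduction of this release.
total_residues = 24_751_112_664  # amino acids in all entries
total_entries = 73_711_881       # sequence entries

# Exact mean length, and the integer part as it appears in the report
# (assumption: the report truncates rather than rounds).
mean_length = total_residues / total_entries
reported_length = total_residues // total_entries

print(f"mean = {mean_length:.2f}, truncated = {reported_length}")
```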



4.  STATISTICS FOR SOME LINE TYPES

The following table summarizes the total number of some UniProtKB/TrEMBL lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                    88497697                1.20                                                    
   Submitted to EMBL/GenBank/DDBJ  54751378  48913874      0.74                                                    
   Journal                         29844613  27899152      0.40                                                    
   Submitted to other databases     3878756   3841896      0.05                                                    
   Thesis                             11690     11631     <0.01                                                    
   Book citation                      11259     11194     <0.01                                                    
   Patent                                 1         1     <0.01                                                    
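
The "Average per entry" column can be recomputed as the total number of lines
divided by the number of entries in the release. The following sketch assumes
that averages which round to 0.00 are printed as "<0.01"; the function name
is hypothetical.

```python
TOTAL_ENTRIES = 73_711_881  # entries in release 2017_01


def average_per_entry(total_lines: int, total_entries: int = TOTAL_ENTRIES) -> str:
    """Format the lines-per-entry average with two decimals.

    Assumption: values that would print as 0.00 are shown as "<0.01".
    """
    text = f"{total_lines / total_entries:.2f}"
    return "<0.01" if text == "0.00" else text


# Totals copied from the references table.
reference_totals = {
    "References (RL)": 88_497_697,
    "Submitted to EMBL/GenBank/DDBJ": 54_751_378,
    "Journal": 29_844_613,
    "Submitted to other databases": 3_878_756,
    "Thesis": 11_690,
    "Book citation": 11_259,
    "Patent": 1,
}

for label, total in reference_totals.items():
    print(f"{label:<32}{total:>10}  {average_per_entry(total):>6}")
```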

Total number of distinct authors cited in UniProtKB/TrEMBL: 611902


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Comments (CC)                     100669058                1.37                                                    
   CATALYTIC ACTIVITY               7963620   7342142      0.11     5                                              
   CAUTION                         36956222  36173747      0.50     1                                              
   COFACTOR                         3396905   3109019      0.05     8                                              
   DOMAIN                            499783    480492      0.01     9                                              
   ENZYME REGULATION                 159595    159595     <0.01    11                                              
   FUNCTION                         8835022   8556278      0.12     4                                              
   INTERACTION                         1850      1850     <0.01    12                                              
   MISCELLANEOUS                     269908    265888     <0.01    10                                              
   PATHWAY                          4095806   3733819      0.06     7                                              
   SIMILARITY                      24481629  21687260      0.33     2                                              
   SUBCELLULAR LOCATION             9258008   8409483      0.13     3                                              
   SUBUNIT                          4750710   4726486      0.06     6                                              

Total number of comment topics: 12


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank
---------------------------------  -------- ---------  ---------  ----

Features (FT)                     176692242                2.40                                                    
   ACT_SITE                         3662638   2259169      0.05     9                                              
   BINDING                          7729126   1987211      0.10     4                                              
   CARBOHYD                            3790      2031     <0.01    24                                              
   CHAIN                            5149227   5147440      0.07     6                                              
   COILED                           4904980   3268831      0.07     8                                              
   COMPBIAS                            3043      3043     <0.01    25                                              
   CROSSLNK                           15444     14367     <0.01    21                                              
   DISULFID                          725931    193664      0.01    16                                              
   DNA_BIND                         1772120   1567918      0.02    13                                              
   DOMAIN                          51317401  36952417      0.70     2                                              
   INIT_MET                           17612     17612     <0.01    20                                              
   INTRAMEM                             316       118     <0.01    27                                              
   LIPID                              13821     12125     <0.01    22                                              
   METAL                            6247400   1682094      0.08     5                                              
   MOD_RES                          1183801   1082035      0.02    14                                              
   MOTIF                             415981    319510      0.01    17                                              
   NON_STD                             2391      2200     <0.01    26                                              
   NON_TER                         13268977   8501601      0.18     3                                              
   NP_BIND                          3605578   2353132      0.05    10                                              
   PEPTIDE                               56        56     <0.01    29                                              
   PROPEP                              9174      9174     <0.01    23                                              
   REGION                           2274962   1208745      0.03    12                                              
   REPEAT                           2871356    695718      0.04    11                                              
   SIGNAL                           5145660   5145659      0.07     7                                              
   SITE                              855301    457743      0.01    15                                              
   TOPO_DOM                           75295     23355     <0.01    19                                              
   TRANSIT                               85        85     <0.01    28                                              
   TRANSMEM                        65161730  14451033      0.88     1                                              
   ZN_FING                           259046    201772     <0.01    18                                              

Total number of feature keys: 29


                                   Total    Number of  Average
Line type / subtype                number   entries    per entry  Rank  Category
---------------------------------  -------- ---------  ---------  ----  -------------------------------------------

Cross-references (DR)             904776184               12.27
   Allergome                           3846      3128     <0.01    89   Protein family/group databases             
   ArachnoServer                        204       204     <0.01   108   Organism-specific databases                
   BRENDA                              9632      9343     <0.01    82   Enzyme and pathway databases               
   Bgee                              360459    360417     <0.01    48   Gene expression databases                  
   BindingDB                            509       509     <0.01   104   Chemistry                                  
   BioCyc                           3591862   3586786      0.05    29   Enzyme and pathway databases               
   CAZy                              129532    121229     <0.01    56   Protein family/group databases             
   CDD                              7690788   7400886      0.10    20   Family and domain databases                
   CGD                                16327     16270     <0.01    80   Organism-specific databases                
   COMPLUYEAST-2DPAGE                     4         4     <0.01   122   2D gel databases                           
   CTD                               734266    732544      0.01    41   Organism-specific databases                
   ChEMBL                               858       858     <0.01   101   Chemistry                                  
   ChiTaRS                            86370     86210     <0.01    59   Other                                      
   CollecTF                             199       199     <0.01   109   Gene expression databases                  
   ConoServer                           159       159     <0.01   111   Organism-specific databases                
   DIP                                 3277      3272     <0.01    93   Protein-protein interaction databases      
   DNASU                              39717     39395     <0.01    69   Protocols and materials databases          
   DrugBank                             160        70     <0.01   110   Chemistry                                  
   EMBL                            81228037  71254314      1.10     3   Sequence databases                         
   EPD                                 7662      7662     <0.01    84   Proteomic databases                        
   ESTHER                             54863     54746     <0.01    65   Protein family/group databases             
   Ensembl                          1219252   1198985      0.02    35   Genome annotation databases                
   EnsemblBacteria                 36265571  32095708      0.49    10   Genome annotation databases                
   EnsemblFungi                     4527353   4352889      0.06    27   Genome annotation databases                
   EnsemblMetazoa                   1151448   1041909      0.02    38   Genome annotation databases                
   EnsemblPlants                    1759538   1645108      0.02    33   Genome annotation databases                
   EnsemblProtists                  1838825   1704749      0.02    32   Genome annotation databases                
   EuPathDB                          564062    564062      0.01    45   Organism-specific databases                
   EvolutionaryTrace                   6051      6051     <0.01    86   Other                                      
   ExpressionAtlas                   225587    225587     <0.01    51   Gene expression databases                  
   FlyBase                           222957    221489     <0.01    52   Organism-specific databases                
   GO                             136719697  47973855      1.85     2   Ontologies                                 
   Gene3D                          43987440  34756019      0.60     7   Family and domain databases                
   GeneDB                             62119     61155     <0.01    63   Genome annotation databases                
   GeneID                           8056905   7965084      0.11    18   Genome annotation databases                
   GeneTree                         1202164   1202025      0.02    36   Phylogenomic databases                     
   Genevisible                        16405     16405     <0.01    79   Gene expression databases                  
   GenomeRNAi                         30429     30429     <0.01    74   Other                                      
   Gramene                          1758073   1643761      0.02    34   Genome annotation databases                
   GuidetoPHARMACOLOGY                    4         4     <0.01   120   Chemistry                                  
   H-InvDB                              591       444     <0.01   102   Organism-specific databases                
   HAMAP                            6888353   6799032      0.09    22   Family and domain databases                
   HGNC                               49893     49802     <0.01    68   Organism-specific databases                
   HOGENOM                          3054417   3054303      0.04    30   Phylogenomic databases                     
   HOVERGEN                          300932    300921     <0.01    49   Phylogenomic databases                     
   InParanoid                       2539825   2539712      0.03    31   Phylogenomic databases                     
   IntAct                             19619     19619     <0.01    77   Protein-protein interaction databases      
   InterPro                       163068065  56322615      2.21     1   Family and domain databases                
   KEGG                            12900791  12525579      0.18    15   Genome annotation databases                
   KO                               5493568   5470520      0.07    26   Phylogenomic databases                     
   LegioList                           2496      2483     <0.01    95   Organism-specific databases                
   Leproma                             1271      1269     <0.01    98   Organism-specific databases                
   MEROPS                            202826    202825     <0.01    54   Protein family/group databases             
   MGI                                58977     58574     <0.01    64   Organism-specific databases                
   MIM                                    4         4     <0.01   123   Organism-specific databases                
   MINT                                9810      9809     <0.01    81   Protein-protein interaction databases      
   MalaCards                              9         9     <0.01   117   Organism-specific databases                
   MaxQB                              39147     39147     <0.01    70   Proteomic databases                        
   MoonProt                               4         4     <0.01   121   Protein family/group databases             
   OGP                                    3         3     <0.01   124   2D gel databases                           
   OMA                              6496986   6496947      0.09    23   Phylogenomic databases                     
   OpenTargets                        53729     50787     <0.01    66   Organism-specific databases                
   OrthoDB                         13836518  13836450      0.19    14   Phylogenomic databases                     
   PANTHER                         10981961  10612450      0.15    16   Family and domain databases                
   PATRIC                           5580801   5580697      0.08    25   Genome annotation databases                
   PDB                                30681     15618     <0.01    73   3D structure databases                     
   PDBsum                             30752     15625     <0.01    72   3D structure databases                     
   PIR                               161807    129623     <0.01    55   Sequence databases                         
   PIRSF                            5815314   5763039      0.08    24   Family and domain databases                
   PMAP-CutDB                           131       131     <0.01   112   Other                                      
   PRIDE                             290700    290694     <0.01    50   Proteomic databases                        
   PRINTS                          10163641   9139796      0.14    17   Family and domain databases                
   PRO                                 2281      2281     <0.01    97   Other                                      
   PROSITE                         36647079  24262204      0.50     9   Family and domain databases                
   PaxDb                             605478    605075      0.01    43   Proteomic databases                        
   PeptideAtlas                      127314    127314     <0.01    57   Proteomic databases                        
   PeroxiBase                          2471      2463     <0.01    96   Protein family/group databases             
   Pfam                            70959645  51662315      0.96     4   Family and domain databases                
   PharmGKB                            3168      3168     <0.01    94   Organism-specific databases                
   PhosphoSitePlus                     3589      3589     <0.01    91   PTM databases                              
   PhylomeDB                         500422    500422      0.01    47   Phylogenomic databases                     
   PomBase                               33        33     <0.01   115   Organism-specific databases                
   ProDom                           1167566   1108830      0.02    37   Family and domain databases                
   ProMEX                              3314      3314     <0.01    92   Proteomic databases                        
   ProteinModelPortal               7780057   7779467      0.11    19   3D structure databases                     
   Proteomes                       60463237  58138047      0.82     5   Other                                      
   PseudoCAP                           4473      4467     <0.01    88   Organism-specific databases                
   REBASE                             32795     32781     <0.01    71   Protein family/group databases             
   REPRODUCTION-2DPAGE                   64        63     <0.01   114   2D gel databases                           
   RGD                                24851     23587     <0.01    76   Organism-specific databases                
   Reactome                          218640     80818     <0.01    53   Enzyme and pathway databases               
   RefSeq                          37652581  36860149      0.51     8   Sequence databases                         
   SABIO-RK                             557       557     <0.01   103   Enzyme and pathway databases               
   SFLD                               66845     66790     <0.01    61   Family and domain databases                
   SGD                                    7         7     <0.01   118   Organism-specific databases                
   SIGNOR                                 5         5     <0.01   119   Enzyme and pathway databases               
   SMART                           17252798  13148770      0.23    11   Family and domain databases                
   SMR                               552575    552575      0.01    46   3D structure databases                     
   STRING                           7230032   7225679      0.10    21   Protein-protein interaction databases      
   SUPFAM                          45652474  36331723      0.62     6   Family and domain databases                
   SWISS-2DPAGE                           1         1     <0.01   125   2D gel databases                           
   SignaLink                           3837      3837     <0.01    90   Enzyme and pathway databases               
   SwissLipids                           69        69     <0.01   113   Chemistry                                  
   SwissPalm                           1222      1221     <0.01    99   PTM databases                              
   TAIR                               18976     18859     <0.01    78   Organism-specific databases                
   TCDB                                7587      7571     <0.01    85   Protein family/group databases             
   TIGRFAMs                        14324151  13148657      0.19    13   Family and domain databases                
   TopDownProteomics                    283       283     <0.01   107   Proteomic databases                        
   TreeFam                           577949    577936      0.01    44   Phylogenomic databases                     
   TubercuList                         1008      1007     <0.01   100   Organism-specific databases                
   UCSC                               94627     94434     <0.01    58   Genome annotation databases                
   UniCarbKB                             17        17     <0.01   116   PTM databases                              
   UniGene                           685716    590308      0.01    42   Sequence databases                         
   UniPathway                       4007366   3726131      0.05    28   Enzyme and pathway databases               
   VectorBase                       1015599    551435      0.01    39   Genome annotation databases                
   WBParaSite                        866109    856978      0.01    40   Genome annotation databases                
   World-2DPAGE                         318       313     <0.01   106   2D gel databases                           
   WormBase                           66315     65781     <0.01    62   Organism-specific databases                
   Xenbase                            25538     25477     <0.01    75   Organism-specific databases                
   ZFIN                               52780     52112     <0.01    67   Organism-specific databases                
   dictyBase                           7990      7768     <0.01    83   Organism-specific databases                
   eggNOG                          14337406   7185617      0.19    12   Phylogenomic databases                     
   euHCVdb                            75267     75264     <0.01    60   Organism-specific databases                
   iPTMnet                             5021      5021     <0.01    87   PTM databases                              
   mycoCLAP                             448       448     <0.01   105   Protein family/group databases             

Number of explicitly cross-referenced databases: 145


5.  AMINO ACID COMPOSITION

   5.1  Composition in percent for the complete database

   Ala (A) 8.94   Gln (Q) 3.84   Leu (L) 9.84   Ser (S) 6.80
   Arg (R) 5.67   Glu (E) 6.15   Lys (K) 5.05   Thr (T) 5.58
   Asn (N) 3.94   Gly (G) 7.17   Met (M) 2.40   Trp (W) 1.29
   Asp (D) 5.43   His (H) 2.22   Phe (F) 3.93   Tyr (Y) 2.94
   Cys (C) 1.26   Ile (I) 5.70   Pro (P) 4.86   Val (V) 6.82

   Asx (B) 0      Glx (Z) 0      Xaa (X) 0.05

[Figure: amino acid composition]

   Legend: gray = aliphatic, red = acidic, green = small hydroxy,
           blue = basic, black = aromatic, white = amide, yellow = sulfur


   5.2  Ranking of the amino acids by frequency (most frequent first)

   Leu, Ala, Gly, Val, Ser, Glu, Ile, Arg, Thr, Asp, Lys, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Trp, Cys
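
   The order above follows from sorting the percentages in section 5.1 in
   decreasing order. A minimal sketch, using only the twenty standard amino
   acids from that table:

```python
# Composition in percent for the complete database (section 5.1).
composition = {
    "Ala": 8.94, "Gln": 3.84, "Leu": 9.84, "Ser": 6.80,
    "Arg": 5.67, "Glu": 6.15, "Lys": 5.05, "Thr": 5.58,
    "Asn": 3.94, "Gly": 7.17, "Met": 2.40, "Trp": 1.29,
    "Asp": 5.43, "His": 2.22, "Phe": 3.93, "Tyr": 2.94,
    "Cys": 1.26, "Ile": 5.70, "Pro": 4.86, "Val": 6.82,
}

# Sort the three-letter codes from most to least frequent.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```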



6.  MISCELLANEOUS STATISTICS

Total number of entries encoded on a Mitochondrion: 1470054
Total number of entries encoded on a Plasmid: 599393
Total number of entries encoded on a Plastid: 74934
Total number of entries encoded on a Plastid; Apicoplast: 30
Total number of entries encoded on a Plastid; Chloroplast: 63
Total number of entries encoded on a Plastid; Cyanelle: 
Total number of entries encoded on a Plastid; Non-photosynthetic plastid: