InterPro documentation

Release 0.2, October 1999


Acknowledgments

InterPro has been prepared by:



R.Apweiler (1), T.K.Attwood (4), A.Bairoch (2), A.Bateman (5),

E.Birney (5), M.Biswas (1), P.Bucher (3), M.D.R.Croning (1,4),

W.Fleischmann (1), A.Kanapin (1), Y.Karavidopoulou (1), B.Marx

(1), N.Mulder (1).



(1) EMBL Outstation - European Bioinformatics Institute,

Wellcome Trust Genome Campus, Hinxton, Cambridge, UK;

(2) Swiss Institute for Bioinformatics, Geneva, Switzerland;

(3) Swiss Institute for Experimental Cancer Research,

Lausanne, Switzerland;

(4) School of Biological Sciences, The University of

Manchester, Manchester, UK;

(5) The Sanger Institute, Wellcome Trust Genome Campus, Hinxton,

Cambridge, UK;

(6) CNRS/INRA, Toulouse, France.





1. Introduction



The databases SWISS-PROT, TrEMBL, PROSITE, PRINTS, Pfam, and

ProDom joined forces to launch a new Integrated Resource of

Protein Domains and Functional Sites, abbreviated InterPro. A

detailed description of the project can be found in the

InterPro user manual.





2. Contents of current release



InterPro release 0.1 beta (October 1999) contains 2423

entries, representing 615 domains, 1776 families, 27 repeats,

and 8 post-translational modification sites. 612 entries are

flagged as subtypes of other InterPro entries. Overall,

InterPro matches 186255 of 279794 SWISS-PROT + TrEMBL protein

sequences. A complete list is available from the FTP site.



The release was built using the following database versions:



   Database          Version  Entries    Date

   ---------         -------  -------    ------

   SWISS-PROT          38      80000   24-JUL-1999

   TrEMBL              11     199794   03-AUG-1999

   PROSITE             16.0     1370   01-APR-1999

   prelim. profiles     -        241   10-FEB-1999

   Pfam                 4.0     1465   11-MAY-1999

   PRINTS              23.1     1157   15-AUG-1999
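
As a quick consistency check on the figures above, the following minimal sketch adds the SWISS-PROT and TrEMBL entry counts from the table and derives the fraction of sequences matched by InterPro. The variable names are illustrative only; the numbers are those quoted in this section.

```python
# Figures quoted in this release note (section 2 text and version table).
swissprot_entries = 80000
trembl_entries = 199794
matched_sequences = 186255

# The combined total should equal the 279794 sequences quoted in the text.
total_sequences = swissprot_entries + trembl_entries

# Fraction of SWISS-PROT + TrEMBL sequences with at least one InterPro match.
coverage = matched_sequences / total_sequences

print(f"Total sequences: {total_sequences}")
print(f"InterPro coverage: {coverage:.1%}")
```

Under these figures, roughly two thirds of the SWISS-PROT + TrEMBL sequences carry an InterPro match.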





3. Changes since the alpha release



InterPro entries describe protein families, domains, repeats,

or post-translational modification sites, whereas the member

databases PROSITE, PRINTS, and Pfam provide methods to recognise

these biological objects. To make sure that InterPro entries

and linked methods are pointing to related information on the

same biological object, all links have been checked manually

and corrected where necessary.



We also introduced the concept of type/subtype. If one entry

describes a family and another entry describes a subfamily,

the first one is flagged as parent and the second one as

child. Details can be found in the user manual.
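
The parent/child flagging described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name, field names, and accession strings below are hypothetical and are not taken from InterPro; the actual representation is described in the user manual.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """Hypothetical in-memory model of an InterPro entry (illustrative only)."""
    accession: str                      # placeholder identifier, not a real accession
    name: str
    entry_type: str                     # e.g. "family", "domain", "repeat"
    parent: Optional["Entry"] = None    # set when this entry is flagged as a child
    children: list = field(default_factory=list)


def flag_subtype(parent: Entry, child: Entry) -> None:
    """Record that `child` describes a subtype of `parent`."""
    child.parent = parent
    parent.children.append(child)


# One entry describes a family, another describes a subfamily of it.
family = Entry("EXAMPLE_0001", "example family", "family")
subfamily = Entry("EXAMPLE_0002", "example subfamily", "family")
flag_subtype(family, subfamily)

print(subfamily.parent.accession)
print([c.accession for c in family.children])
```

Storing the link in both directions is a design choice of this sketch; it makes it equally cheap to walk from a subfamily up to its family and from a family down to its subtypes.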



InterPro entries now contain the concatenated annotation taken

from the member databases. This was done automatically without

removing, rephrasing, or merging redundant or conflicting

information.





4. Forthcoming changes



The first production release 1.0 is scheduled for winter 1999.

In Release 1.0, we will replace the concatenated annotation with

manually curated annotation, checked for conflicts and

redundancies. It will also contain a one-line description of

all InterPro entries. A list of example proteins indicating

diversity within the group will be included.





5. Future directions



While the initial InterPro release was created around PRINTS,

PROSITE and Pfam, ProDom will shortly also be included.

Various factors rendered a step-wise approach to the

development of InterPro desirable. First, the scale of the

task of amalgamating just the first three databases was

immense. The rational merging of apparently equivalent

database entries that in fact simultaneously define a specific

family, domains within that family, or even repeats within

those domains, presented an enormous challenge. Thus, the

immediate goal for InterPro was to limit the problem only to

databases that offered annotation. A second important

consideration was that while Pfam, PRINTS and PROSITE are true

pattern databases, ProDom is based solely on automatic

clustering of sequences by similarity (i.e., discriminators

are not derived). Resulting clusters need not have precise

biological correlations and some family designations have

changed between database versions. It was therefore necessary

that ProDom should adopt stable accession numbers before its

entries could be meaningfully considered for inclusion in

InterPro. The full integration of ProDom into InterPro will be

achieved in release 2 (May 2000).



Once the founder members of the InterPro consortium have been

assimilated into the unified resource, other pattern databases

will also be included. First, scheduled for release 3

(November 2000), will be the SMART resource. In addition, the

Blocks database is planning to use InterPro as the basis for

the creation of Blocks. As Blocks does not include annotation

and will be based on families already in InterPro, the process

of cross-referencing between Blocks and InterPro, and even the

full integration of Blocks within InterPro, should be

relatively straightforward. Ultimately, we hope to include

many other protein family databases to give a more

comprehensive view of the resources available.



After 18 months, the ProDom graphical interface will be made

available on the

WWW server to browse and search InterPro interactively. It

will provide domain arrangement schemes for all proteins

belonging to a given InterPro entry, or for all proteins

sharing a homologous domain with a given protein.  This

interface, based on XDOM, will also be distributed as a stand-

alone program for Unix platforms. It will facilitate

visualisation and interpretation of protein families. This

will lead to faster annotation of new genomic sequences as

well as sequence entries in SWISS-PROT + TrEMBL.





6. We need your help



We welcome any feedback. If you find errors or omissions,

please let us know. You can contact us at Interhelp@ebi.ac.uk.





7. Copyright Notice



InterPro - Integrated Resource Of Protein Domains And

Functional Sites

Copyright (C) 1999 The InterPro Consortium.

This manual and the accompanying database may be copied and

redistributed freely, without advance permission, provided

that this Copyright statement is reproduced with each copy.