
SIFTS initiative

What is the SIFTS initiative?

The "Structure integration with function, taxonomy and sequence (SIFTS) initiative" aims to integrate various bioinformatics resources. A major obstacle to better integration of structural databases such as PDBe with sequence databases such as UniProt, the primary archival databases for structure and sequence data, is the absence of an up-to-date, well-maintained mapping between corresponding entries. We have worked closely with the UniProt group at the EBI to clean up the taxonomy and sequence cross-reference information in the PDBe and UniProt databases. The project started in 2001 and has produced a robust mechanism for exchanging data between the two primary data resources.

This has dramatically improved the quality of annotation in both databases and is aiding the continuing improvement of legacy data. In the longer term this project will not only allow better and closer integration of derived-data resources but will also continue to improve the quality of all data in the primary resources.


This mapping information is vital for the reliable integration of sequence family databases such as Pfam and InterPro with the structure-oriented databases SCOP and CATH. It has been made available to the eFamily group and now forms the basis of the regular interchange of information between the member databases (PDBe, UniProt, Pfam, InterPro, SCOP and CATH). Figure 2 shows the database schema designed to store the information.


Figure 2
Data Distribution

These mapping data are available in XML format from the FTP site. The complete schema is available here. Complete documentation of the schema is available here.

The XML schema was developed under the auspices of the eFamily project, which is working to facilitate the distribution of domain-specific sequence data and improve the integration of sequence and structure data resources. The aim of the eFamily schema is to allow the different domain definitions and mappings (between sequence and structure) to be exchanged using the same basic file format. The schema is designed to wrap up single database entries that can be downloaded from an FTP site or, better still, exported using web services. As well as the domain boundary definitions, the schema also allows any associated sequence or structural alignments to be encapsulated in XML.
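
As a rough illustration of how a client might read residue-level mappings from an XML file of this kind, the sketch below parses a small inline document with Python's standard library. The element and attribute names (entry, residue, crossRefDb, dbSource, dbAccessionId, dbResNum, dbResName), the function name and the sample data are all assumptions made for this example; the published schema and its documentation remain the authoritative reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Element and attribute names are illustrative
# assumptions, not copied from the published schema; the identifiers and
# residue numbers are invented.
SAMPLE_XML = """
<entry dbSource="PDBe" dbAccessionId="1abc">
  <residue dbResNum="1" dbResName="MET">
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="5" dbResName="M"/>
  </residue>
  <residue dbResNum="2" dbResName="ALA">
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="6" dbResName="A"/>
  </residue>
  <residue dbResNum="3" dbResName="GLY"/>
</entry>
"""


def read_residue_mappings(xml_text, target_db="UniProt"):
    """Return (structure_res_num, accession, sequence_res_num) tuples.

    Residues without a cross-reference to target_db are skipped.
    """
    root = ET.fromstring(xml_text.strip())
    mappings = []
    for residue in root.iter("residue"):
        for xref in residue.findall("crossRefDb"):
            if xref.get("dbSource") == target_db:
                mappings.append(
                    (
                        residue.get("dbResNum"),
                        xref.get("dbAccessionId"),
                        xref.get("dbResNum"),
                    )
                )
    return mappings


if __name__ == "__main__":
    for row in read_residue_mappings(SAMPLE_XML):
        print(row)
```

A real file may declare an XML namespace, in which case the tag names passed to iter() and findall() would need the namespace prefix; the sample omits one to keep the sketch short.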


Future plans

In collaboration with our partners in the eFamily project we plan to develop a Perl interface to the data, which will be made available under the BioPerl project.

We also plan to develop web services that will be integrated with those developed by the other partners in the eFamily project. In future these web services will allow clients to build workflows that help integrate different bioinformatics resources, based on the residue-level mapping and annotation provided by the PDBe.

We also plan to improve the annotation of conflicts between the sequence of a macromolecular structure and the sequence of its UniProt cross-reference.
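
To make the idea of a sequence conflict concrete, the sketch below shows one simple check that could be run over residue-level mappings: it compares the three-letter residue name from the structure with the one-letter code at the mapped sequence position. This is only an illustration under stated assumptions, not the annotation procedure used by the PDBe; the function name, the input tuple layout and the two conflict labels are hypothetical.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def find_conflicts(pairs):
    """Flag mapped residue pairs whose identities disagree.

    pairs is an iterable of hypothetical tuples:
    (structure_res_num, three_letter_name, sequence_res_num, one_letter_code).
    Returns the disagreeing tuples, each extended with a label.
    """
    conflicts = []
    for s_num, s_name, q_num, q_code in pairs:
        observed = THREE_TO_ONE.get(s_name.upper())
        if observed is None:
            # Modified or otherwise non-standard residue; it cannot be
            # compared directly, so it is reported separately.
            label = "non-standard residue"
        elif observed != q_code.upper():
            label = "mismatch"
        else:
            continue  # residue identities agree
        conflicts.append((s_num, s_name, q_num, q_code, label))
    return conflicts


if __name__ == "__main__":
    example = [
        (1, "MET", 5, "M"),  # agrees
        (2, "ALA", 6, "V"),  # plain mismatch
        (3, "MSE", 7, "M"),  # selenomethionine, not in the 20-residue table
    ]
    for conflict in find_conflicts(example):
        print(conflict)
```

Reporting non-standard residues separately from plain mismatches is a design choice of this sketch: a modified residue such as selenomethionine is not necessarily a disagreement with the sequence record, so it is left for closer inspection instead of being counted as a conflict.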

As part of this project, a Perl API for macromolecular structure data and UniProt sequence data has been developed. We are currently developing it further so that it can be made publicly available.


Primary developers: Sameer Velankar, Harry Boutselakis, Phil McNeil, Antonio Suarez (PDBe group) and Virginie Mittard, Daniel Barrell, Julius Jacobsen (Sequence database group).
Last modified: Tue Nov 3 09:31:48 GMT 2009
Document maintained by: Gaurav Sahni
TEMBLOR - European Community Contract No. QLRI-CT-2001-00015