SIFTS initiative
|
What is the SIFTS initiative?
|
|
|
The "Structure integration with function, taxonomy and sequence (SIFTS) initiative" aims
to integrate various bioinformatics resources. One of the major
obstacles to better integration of structural databases such as
PDBe and sequence databases such as
UniProt, which are the primary archival databases for structure and
sequence data, is the absence of an up-to-date,
well-maintained mapping between corresponding entries. We
have worked closely with the UniProt group at the EBI to
clean up the taxonomy and sequence cross-reference
information in the PDBe and UniProt databases. The project
started in 2001 and has produced robust mechanisms
for exchanging data between the two primary data resources.
This has dramatically improved the quality of annotation in both
databases and is aiding the continuing improvement of legacy data.
In the longer term, this project will not only allow better
and closer integration of derived-data resources but will also continue
to improve the quality of all data in the primary resources.
|
|
This mapping information is vital for the reliable integration of
sequence family databases such as Pfam and InterPro with
the structure-oriented databases SCOP and CATH.
It has been made available to the eFamily group
and now forms the basis of the
regular interchange of information between the member
databases (PDBe, UniProt, Pfam, InterPro, SCOP and CATH). Figure 2 shows
the database schema designed to store the information.
|
Figure 2
|
|
|
Data Distribution
|
|
These mapping data are available in XML format from the
FTP site.
The complete schema is available
here.
Complete documentation of the schema is available here.
The XML schema was developed under the auspices of the eFamily project,
which is working to facilitate the distribution of domain-specific sequence
data and improve the integration of sequence and structure data resources.
The aim of the eFamily schema is to allow the different domain definitions and
mappings (between sequence and structure) to be exchanged using the same basic
file format. The schema is designed to wrap up single database entries that can
be downloaded from an FTP site or, better still, exported using web services. As
well as the domain boundary definitions, the schema also allows any associated
sequence or structural alignments to be encapsulated in XML.
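
As a rough illustration of how a client might read residue-level mappings from one such XML entry, the following Python sketch parses a small inline sample. The element and attribute names (entry, residue, crossRefDb, dbSource, dbAccessionId, dbResNum, dbResName), the accession P00000 and the entry 1abc are assumptions made for this example only; the real names should be taken from the schema documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified mapping entry. Element and attribute names are
# illustrative assumptions, not the published eFamily schema.
SAMPLE_XML = """\
<entry dbSource="PDBe" dbAccessionId="1abc">
  <residue dbResNum="1" dbResName="MET">
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="5" dbResName="M"/>
  </residue>
  <residue dbResNum="2" dbResName="ALA">
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="6" dbResName="A"/>
  </residue>
</entry>
"""


def residue_mappings(xml_text, target_db="UniProt"):
    """Return (structure_res_num, accession, target_res_num) tuples.

    Only cross-references whose dbSource equals target_db are returned.
    """
    root = ET.fromstring(xml_text)
    mappings = []
    for residue in root.iter("residue"):
        for xref in residue.findall("crossRefDb"):
            if xref.get("dbSource") == target_db:
                mappings.append(
                    (
                        int(residue.get("dbResNum")),
                        xref.get("dbAccessionId"),
                        int(xref.get("dbResNum")),
                    )
                )
    return mappings


if __name__ == "__main__":
    for pdb_num, accession, uniprot_num in residue_mappings(SAMPLE_XML):
        print(pdb_num, accession, uniprot_num)
```

Because each entry is wrapped as a single XML document, a client can process the files one at a time without loading the whole archive.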
|
|
|
Future plans
|
|
In collaboration with our partners in the eFamily project, we plan to
develop a Perl interface to the data, which will be made available
under the
BioPerl project.
We also plan to develop web services that will be integrated with the
web services developed by the other partners in the eFamily
project. These web services will allow clients to build
workflows that help integrate different
bioinformatics resources, based on the residue-level mapping and
annotation provided by the PDBe.
We also plan to improve the annotation of conflicts between the
sequence of a macromolecular structure and the sequence of its
UniProt cross-reference.
As part of this project, a Perl API for macromolecular structure
data and UniProt sequence data has been developed. At present we are
developing it further so that it can be made
available in the public domain.
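
The kind of conflict annotation mentioned above can be pictured with a minimal Python sketch that compares the residues observed in a structure with the residues at the mapped positions of the cross-referenced sequence. The function name find_conflicts, the input data structures and the sample data are hypothetical and chosen only for illustration; this is not the PDBe implementation.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def find_conflicts(structure_residues, uniprot_sequence, mapping):
    """List positions where the structure and the sequence disagree.

    structure_residues: dict of structure residue number -> three-letter name
    uniprot_sequence:   one-letter sequence string
    mapping:            dict of structure residue number -> 1-based
                        position in uniprot_sequence
    Returns (structure_num, observed, sequence_pos, expected) tuples.
    Residue names outside the 20 common amino acids are reported as "X".
    """
    conflicts = []
    for structure_num, seq_pos in sorted(mapping.items()):
        observed = THREE_TO_ONE.get(structure_residues[structure_num], "X")
        expected = uniprot_sequence[seq_pos - 1]
        if observed != expected:
            conflicts.append((structure_num, observed, seq_pos, expected))
    return conflicts


if __name__ == "__main__":
    # Made-up data: residue 3 is SER in the structure but T in the sequence.
    structure = {1: "MET", 2: "ALA", 3: "SER"}
    sequence = "GGGGMAT"
    mapping = {1: 5, 2: 6, 3: 7}
    print(find_conflicts(structure, sequence, mapping))
```

A real annotation pipeline would also need to handle modified residues, insertion codes and unobserved residues, which this sketch ignores.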
|
|
Primary developers: Sameer Velankar, Harry Boutselakis, Phil McNeil, Antonio Suarez (PDBe group) and Virginie Mittard, Daniel Barrell, Julius Jacobsen (Sequence database group).
Last modified: Tue Nov 3 09:31:48 GMT 2009
Document maintained by: Gaurav Sahni