Documentation for the eFamily Schema
Schema Version: 2004-08-14
Last Modification: 2010-09-21
Contributors: Sameer Velankar, Jose M. Dana, Rob Finn, Dave Howorth, Andreas Prlic
The eFamily project is designed to integrate the information contained in five of the major protein databases (CATH, Interpro, MSD, Pfam and SCOP). The databases CATH, SCOP, Interpro, and Pfam contain information describing protein domains. The domain definitions of the former two databases are based on protein structure, while the latter two domain databases are based on protein sequence. The MSD database is the primary data warehouse for data exchange and integration, containing the fundamental mapping from protein sequence (UniProt) to structure (PDB).
Although the different domain databases offer related views of proteins, it is often difficult for biologists to navigate from protein sequence to protein structure and back again. The aim of this project is to provide the scientific community with a coherent and rich view of protein families that allows users to navigate seamlessly between the worlds of protein structure and protein sequence. In this project we are developing data exchange mechanisms and services that exploit grid utilities.
The aim of the eFamily schema is to allow the different domain definitions and mappings (between sequence and structure) to be exchanged using the same basic file format. This schema is designed to wrap up single database entries that can be downloaded from an FTP site or, better still, exported using Web services (see the eFamilyService schema documentation for more details). As well as the domain boundary definitions, the schema also allows any associated sequence or structural alignments to be encapsulated in XML.
Below we describe the eFamily schema and the theory behind its design. The schema is complex, but if you are reading this document you hopefully already have some understanding of XML and schemas. If you are unsure of anything, look at w3schools or send us an email.
The eFamily schema imports the following two schemas:
- dataTypes, imported under the namespace data
- RDF, imported under the namespace rdf.
The eFamily schema includes the alignment schema.
How does the documentation work? We walk through the schema from the root element to the leaves of the schema tree. When an element is of simple type, or of complex type but unbranched, it is described where it occurs in its parent. When an element is of complex type and branched, it is introduced in its parent and then described in detail, as a parent in its own right, under a separate heading. After the walk-through, the summary section gives a full view of the schema. Finally, there are links to examples from each eFamily member database.
The Root: Entry element
Structure of the Entry element
<entry>
  <rdf:RDF>see below</rdf:RDF>
  <entryDetail dbSource="The information source" property="The property of the CDATA">
    entryDetail information
  </entryDetail>
  <entity>see below</entity>
  <alignment>see below</alignment>
</entry>
(required; once only) The root element is entry. This represents any database entry. As such, the attributes for this element define the database source.
(required) Must be in the standard xml date format, and is the date that the document was produced.
(optional, one or more). Allows additional information about the entry to be included.
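As a rough illustration of how a consumer might read the root element, the sketch below parses a small hand-written entry with Python's standard xml.etree.ElementTree module and collects the entryDetail items. The sample document, its attribute values and the function name read_entry_details are invented for this example. The sample also omits the eFamily namespace and the root attributes, so a real document would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document: all values are invented, and the eFamily
# namespace and the root attributes are omitted to keep the sketch short.
SAMPLE_ENTRY = """\
<entry xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF/>
  <entryDetail dbSource="Pfam" property="comment">An invented note</entryDetail>
  <entryDetail dbSource="Pfam" property="type">Domain</entryDetail>
  <entity type="domain" entityId="e1">
    <segment segId="s1" start="1" end="50"/>
  </entity>
</entry>
"""


def read_entry_details(xml_text):
    """Return (dbSource, property, text) for each entryDetail under the root."""
    root = ET.fromstring(xml_text)
    if root.tag != "entry":
        raise ValueError(f"expected root element 'entry', got {root.tag!r}")
    return [
        (d.get("dbSource"), d.get("property"), (d.text or "").strip())
        for d in root.findall("entryDetail")
    ]


if __name__ == "__main__":
    for source, prop, text in read_entry_details(SAMPLE_ENTRY):
        print(f"{source} {prop}: {text}")
```

Because entryDetail carries free text that falls outside the schema, the sketch returns it as plain strings and leaves interpretation of the property attribute to the caller.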
The rdf:RDF element
The Resource Description Framework (RDF) is an imported schema that allows the encapsulation of metadata. It is beyond the scope of this documentation to describe the imported schema, but more information can be found in the W3C RDF documentation.
The entity element
The entity element encapsulates one instance of a domain or, in the case of a mapping, a PDB chain or UniProt sequence.
<entity type="definition type" entityId="A unique identifier for the entity in the document">
  <entityDetail dbSource="the information source" property="the property of the CDATA">
    entity information that falls outside the schema.
  </entityDetail>
  <segment>see below</segment>
</entity>
(required, one or more). The entity element represents two very different classes of data. The first is one or more domain definitions for a database entry. Alternatively, an entity can represent one or more chains in a PDB file.
(required). The type of the entity. For entries from the domain databases the type is domain. Note that a domain may be heterogeneous in its composition (e.g. RNA and protein). The other types are protein, RNA and DNA. These are used for defining the chain types in a PDB file.
(required). This should be a unique identifier for the entity in the entry.
(optional, one or more).
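The constraints described above (at least one entity, a known type, and an entityId that is unique within the entry) can be checked with a few lines of code. The following is only a sketch: the function name check_entities and the sample data are invented, the namespace is again omitted, and it assumes the type values are spelled exactly as in the text (domain, protein, RNA, DNA).

```python
import xml.etree.ElementTree as ET

# The four entity types named in the text; the exact spelling used by the
# schema is an assumption.
ENTITY_TYPES = {"domain", "protein", "RNA", "DNA"}

# Hypothetical entry with a repeated entityId and an unknown type.
SAMPLE_ENTRY = """\
<entry>
  <entity type="domain" entityId="e1"/>
  <entity type="protein" entityId="e1"/>
  <entity type="lipid" entityId="e2"/>
</entry>
"""


def check_entities(entry):
    """Return a list of human-readable problems found in an entry's entities."""
    problems = []
    seen_ids = set()
    entities = entry.findall("entity")
    if not entities:
        problems.append("entry has no entity element")
    for entity in entities:
        entity_id = entity.get("entityId")
        entity_type = entity.get("type")
        if entity_type not in ENTITY_TYPES:
            problems.append(f"entity {entity_id!r}: unexpected type {entity_type!r}")
        if entity_id is None:
            problems.append("entity without entityId")
        elif entity_id in seen_ids:
            problems.append(f"duplicate entityId {entity_id!r}")
        else:
            seen_ids.add(entity_id)
    return problems


if __name__ == "__main__":
    for problem in check_entities(ET.fromstring(SAMPLE_ENTRY)):
        print(problem)
```

A validating parser working against the schema itself would catch some of these problems. The sketch is meant for quick checks in code that does not load the schema.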
The /entry/entity/segment element
The segment defines a continuous region of an entity. Two or more segments can be used to model discontinuous domains. In the mapping section, multiple segments are also used where a chain in a PDB file maps to more than one UniProt sequence (e.g. chimeras). Note that segments do not reflect disordered PDB regions, as these are continuous but not observed. Problems may arise from the PDB numbering system: a start Resnum of -1 and an end Resnum of 10 does not mean that 12 residues are involved. We therefore strongly recommend that you cross map structural domains to MSD numbering.
<segment segId="an identifier for the segment" start="co-ordinate system start" end="co-ordinate system end">
  <listResidue>see below</listResidue>
  <listMapRegion>see below</listMapRegion>
  <segmentDetail dbSource="the information source" property="the property that the CDATA refers to">
    information about the segment
  </segmentDetail>
</segment>
(required, one or more).
(required) Identifier for the segment. Should be unique within the list of segments of an entity.
(optional, one or more).
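The warning about PDB numbering can be made concrete with a small sketch. It builds a hypothetical segment running from -1 to 10 for a chain that has no residue numbered 0, then compares the length obtained by arithmetic on start and end with the number of residue elements actually listed. The sample data and the function names are invented, and the residue elements are abridged to the dbResNum attribute only.

```python
import xml.etree.ElementTree as ET

# Hypothetical PDB-numbered segment whose chain has no residue numbered 0.
_numbers = [-1] + list(range(1, 11))
_residues = "".join(f'<residue dbResNum="{n}"/>' for n in _numbers)
SAMPLE_SEGMENT = (
    '<segment segId="s1" start="-1" end="10">'
    f"<listResidue>{_residues}</listResidue>"
    "</segment>"
)


def naive_length(segment):
    """Length by arithmetic on start/end; unreliable for PDB numbering."""
    return int(segment.get("end")) - int(segment.get("start")) + 1


def listed_length(segment):
    """Length by counting the residue elements actually listed."""
    return len(segment.findall("listResidue/residue"))


if __name__ == "__main__":
    segment = ET.fromstring(SAMPLE_SEGMENT)
    print("naive:", naive_length(segment))    # 12
    print("listed:", listed_length(segment))  # 11
```

The arithmetic also fails outright when a PDB residue number carries an insertion code, since int() cannot parse it. This is one more reason to prefer a sequential co-ordinate system such as MSD numbering when lengths matter.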
The /entry/entity/segment/listResidue element
A container for a set of residue elements
The /entry/entity/segment/listResidue/residue element
The residue element conveys descriptive information about a single residue, together with cross mappings of that residue to other databases.
<residue dbResNum="The residue number" dbResName="The residue name">
  <crossRefDb dbSource="the database being cross referenced" dbVersion="The cross referenced database version" dbCoordSys="The cross referenced database co-ordinate system" dbAccessionId="The cross reference database identifier" dbResNum="cross referenced residue number" dbResName="cross referenced residue name" dbChainId="cross referenced chain id"/>
  <residueDetail dbSource="the information source" property="the property that the CDATA refers to">
    information about the residue
  </residueDetail>
</residue>
The residue element describes information about the residue.
Allows the defined residue to be cross referenced to another database.
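One way to use the cross references is to build a lookup from each residue number to its counterpart in another database. The sketch below does this for a hand-written listResidue. The accession identifiers, the dbSource spellings ("UniProt", "PDB") and the function name cross_references are invented for illustration, and the sketch assumes a residue may carry more than one crossRefDb element.

```python
import xml.etree.ElementTree as ET

# Hypothetical residue list; identifiers and numbers are invented.
SAMPLE_LIST = """\
<listResidue>
  <residue dbResNum="1">
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="24"/>
  </residue>
  <residue dbResNum="2">
    <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="25"/>
    <crossRefDb dbSource="PDB" dbAccessionId="0xyz" dbResNum="2" dbChainId="A"/>
  </residue>
</listResidue>
"""


def cross_references(list_residue, db_source):
    """Map each residue's dbResNum to (accession, residue number) in db_source.

    Residue numbers are kept as strings because PDB numbers may carry
    insertion codes and so are not always plain integers.
    """
    mapping = {}
    for residue in list_residue.findall("residue"):
        for ref in residue.findall("crossRefDb"):
            if ref.get("dbSource") == db_source:
                mapping[residue.get("dbResNum")] = (
                    ref.get("dbAccessionId"),
                    ref.get("dbResNum"),
                )
    return mapping


if __name__ == "__main__":
    residues = ET.fromstring(SAMPLE_LIST)
    print(cross_references(residues, "UniProt"))
```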
The /entry/entity/segment/listMapRegion element
Allows a part of the segment to be mapped to another database.
<listMapRegion>
  <mapRegion start="The start point being mapped" end="The end point being mapped">
    <db dbSource="the database being mapped" dbVersion="The mapped database version" dbCoordSys="The mapped database co-ordinate system" dbAccessionId="The mapped database identifier" dbChainId="The co-ordinate system chain id" start="The start point being mapped" end="The end point being mapped">
      <dbDetail dbSource="the information source" property="the property that the CDATA refers to">
        information about the mapping or mapped database
      </dbDetail>
    </db>
  </mapRegion>
</listMapRegion>
(optional, once only when used). The element contains a list of regions from within the segment that map to other databases.
(required, one or more). Defines a region within the segment that one wishes to map.
(required, once only). Provides the cross reference to the defined mapRegion.
(optional). This is used to specify a chain id when using a PDB based co-ordinate system. e.g. dbChainId="A". In the case of unlabelled chains, the standard representation expected is " ".
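The sketch below shows one possible way to read the mapped regions of a segment, including the single-space convention for unlabelled chains. The sample segment, the PDB identifier and the function name map_regions are invented, and only a subset of the db attributes is shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical segment mapped onto two regions of an invented PDB entry.
SAMPLE_SEGMENT = """\
<segment segId="s1" start="1" end="120">
  <listMapRegion>
    <mapRegion start="1" end="60">
      <db dbSource="PDB" dbAccessionId="0xyz" dbChainId=" " start="3" end="62"/>
    </mapRegion>
    <mapRegion start="61" end="120">
      <db dbSource="PDB" dbAccessionId="0xyz" dbChainId="A" start="1" end="60"/>
    </mapRegion>
  </listMapRegion>
</segment>
"""


def map_regions(segment):
    """Return one dict per mapRegion describing where it maps to."""
    regions = []
    for region in segment.findall("listMapRegion/mapRegion"):
        db = region.find("db")
        if db is None:
            raise ValueError("mapRegion without the required db element")
        chain = db.get("dbChainId")  # None when the optional attribute is absent
        regions.append(
            {
                "region": (region.get("start"), region.get("end")),
                "target": db.get("dbAccessionId"),
                "chain": chain,
                "unlabelled_chain": chain == " ",
                "target_range": (db.get("start"), db.get("end")),
            }
        )
    return regions


if __name__ == "__main__":
    for mapped in map_regions(ET.fromstring(SAMPLE_SEGMENT)):
        print(mapped)
```

Note that the chain id must not be stripped of whitespace before comparison, otherwise the " " marker for an unlabelled chain would be lost.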
The /entry/alignment element
This part of the schema allows the modelling of alignments, whether they are structural or sequence based. The objects that are aligned should be defined in the entity section of the alignment. However, there is no cross validation built into the schema. Admittedly, there is some redundancy in this part of the schema, as the alignment schema is imported into both the eFamily schema and the dasalignment schema. Although slightly different methods underlie the way alignments are accessed, there is little point in reinventing the wheel to produce an alignment section. Also, once the code is written to export the alignments, it can be used to produce either dasalignments or eFamily alignments.
Okay, we have walked through the schema element by element. Let's put the whole schema together. Click here to view the whole schema.