All data in the database and on the FTP server is non-proprietary or is derived from a non-proprietary source. It is thus freely accessible and available to anyone. In addition, each data item is fully traceable and explicitly referenced to the original source.
All the files which are available for download are based on the internal ChEBI data model.
Most of the files follow the ChEBI database architecture and the domain model.
A diagram of the ChEBI data model is shown in Figure a. The notation is based on UML (Unified Modelling Language), which models the object-oriented nature of the data. What follows is a short description of what each data object does.
Figure a: UML notation of domain model used in the FTP downloads.
The Compound object is the main entry point into the data. It holds all the references to other data objects and stores the ChEBI recommended name and definition.
The Compound object has an association with itself: an instance of a Compound object will sometimes hold a reference to another Compound object instance. This occurs when we merge duplicate Compound objects within the database. We refer to the referenced instance as the parent compound and to the entry holding the reference as the child compound. A parent compound may have more than one child compound, but it may not itself be a child of another parent compound. The whole collection of Compound objects, i.e. parent and children, forms one compound entry or compound set. This mini-hierarchy allows us to combine redundant entries which were automatically loaded from our main sources, KEGG Compound and IntEnz. It also ensures the maintenance of stable identifiers.
Consider an excerpt from the flat file compounds.tsv, which is available from the FTP downloads. The example below illustrates the entry for water (CHEBI:15377). It has four child entries, namely CHEBI:5585, CHEBI:27313, CHEBI:13352 and CHEBI:10743. Each of them carries the identifier of the parent compound, CHEBI:15377, in its parent_id column. All data belonging to the child compounds and to the parent compound belong to the entire entry; for example, a name belonging to the child compound CHEBI:27313 appears as part of the entire entry. Note that the name and definition columns of the parent compound contain the official ChEBI recommended name and, optionally, the definition. In the child compounds this information is ignored and a null value is stored instead.
Figure b: Excerpt from the compounds.tsv file available on the FTP server.
The compounds.tsv file also has a column which records the status of an entry. If the entry has been manually curated and verified by our curators, its status is 'C', which stands for checked. Entries with any other status have not been curated; we include them only for the sake of completeness of the ontology.
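The parent and child mechanism can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column headers, the literal "null" marker and the status values in the sample are placeholders, not the exact layout of compounds.tsv. Only the identifiers for water are taken from the text above.

```python
import csv
import io

# Hypothetical excerpt. Column names, the "null" marker and the status
# values are assumptions for illustration, not the real file layout.
SAMPLE_TSV = (
    "ID\tSTATUS\tCHEBI_ACCESSION\tPARENT_ID\tNAME\n"
    "15377\tC\tCHEBI:15377\tnull\twater\n"
    "5585\tC\tCHEBI:5585\t15377\tnull\n"
    "27313\tC\tCHEBI:27313\t15377\tnull\n"
    "13352\tC\tCHEBI:13352\t15377\tnull\n"
    "10743\tC\tCHEBI:10743\t15377\tnull\n"
)


def group_compound_sets(tsv_text, checked_only=True):
    """Map each parent ID to the IDs of all members of its compound set."""
    sets = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if checked_only and row["STATUS"] != "C":
            continue  # skip entries that have not been curated
        # A row without a parent is itself the parent of its set.
        parent = row["ID"] if row["PARENT_ID"] == "null" else row["PARENT_ID"]
        sets.setdefault(parent, []).append(row["ID"])
    return sets


compound_sets = group_compound_sets(SAMPLE_TSV)
```

With the sample above, the single key 15377 collects the parent and its four children, which is the compound set described in the text.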
The DatabaseAccession object holds all the manually curated database links and registry numbers available in ChEBI. It has a composite aggregation to the Compound object. This means that a DatabaseAccession object cannot exist on its own and needs to have a reference to a Compound object.
The CompoundName object holds all the synonyms and IUPAC names. It also has a composite aggregation to the Compound object. Note that in the downloadable files you will find names of type 'NAME' which do not refer to the ChEBI recommended name but to the name of the source of this compound entry if available. For example, the first name in the list of names within a KEGG Compound entry is created as the KEGG Compound name.
The ChemicalData object holds all the formulae, charge and mass. The ChemicalData has a composite aggregation to the Compound object.
The Comment object holds any comments about the Compound object. It also has a composite aggregation to the Compound object.
The Reference object contains all automatically generated links to other databases with valid ChEBI data. The Reference has a composite aggregation to the Compound object.
The Structure object contains all the chemical structures of the Compound object. The data includes MDL molfiles, InChI and SMILES. The structure object has a composite aggregation to the Compound object.
The DefaultStructure object contains a reference to the MDL molfile found in the Structure object. A Compound object may have zero or more Structure objects. If there is more than one MDL molfile Structure object instance, then one is selected as the default chemical structure. The default chemical structure is the one which our curators believe best represents the entity. Note that the DefaultStructure object has a composite aggregation with the Structure object, as it depends on and is part of the Structure object.
A reference to the AutogenStructure object determines whether a structure has been automatically generated by a program or manually curated. By default no MDL molfiles can be found in the AutogenStructure object as they are all manually curated. SMILES and InChIs will be found here as they are automatically generated from the default structure. The AutogenStructure object has a composite aggregation to the Structure object as it cannot survive without a reference to a Structure object.
The OntologyModel describes all the ontologies stored in ChEBI.
The Vertice object links the compound entries to the ontology. Each Vertice object has a composite aggregation with the Compound object. A Compound object can have more than one reference to a Vertice object.
The Relation object stores all the directed relationships of the ontology. The Relation object has two composite aggregations with the Vertice objects. Each aggregation signifies the beginning and the end of a relationship between two vertices.
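As a rough illustration of how Vertice and Relation objects fit together, the sketch below models them as small Python classes. The field names, the placeholder identifiers and the relation type string are assumptions made for this example and are not taken from the ChEBI schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertice:
    vertice_id: int   # hypothetical field name
    compound_id: int  # reference to the owning Compound object


@dataclass(frozen=True)
class Relation:
    start: Vertice      # vertice where the directed relationship begins
    end: Vertice        # vertice where the directed relationship ends
    relation_type: str  # hypothetical label for the relationship


def targets_of(vertice, relations):
    """Return the end vertices of all relations starting at `vertice`."""
    return [r.end for r in relations if r.start == vertice]


# Placeholder data: two vertices joined by one directed relation.
child = Vertice(vertice_id=1, compound_id=1)
parent = Vertice(vertice_id=2, compound_id=2)
relations = [Relation(start=child, end=parent, relation_type="is a")]
```

Because each Relation is directed, following the ontology in the opposite direction means filtering on `end` instead of `start`.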
Below is a list of example queries which will hopefully help developers understand the data model.
An example query could be to retrieve all the synonyms as well as the ChEBI names within the database. In order to get the full set of names you need to query for the names of all the compounds linked together via the parent_id. The actual ChEBI recommended name always appears in the compounds table where the parent_id is null.
select distinct chebi_id, name
from (
  select nvl(c.parent_id, c.id) chebi_id, n.name name
  from names n, compounds c
  where n.compound_id = c.id
    and c.status = 'C'
  union
  select c.id chebi_id, c.name name
  from compounds c
  where c.name is not null
    and c.status = 'C'
)
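The query above uses the Oracle function nvl. For readers working with another database, the following sketch runs an equivalent query against an in-memory SQLite database from Python, using the standard COALESCE function instead. The table definitions and the toy rows are assumptions for illustration; in particular, which name belongs to which identifier is invented.

```python
import sqlite3

# Toy schema and rows. Column types and the assignment of names to
# particular IDs are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compounds (id INTEGER PRIMARY KEY, parent_id INTEGER,
                            name TEXT, status TEXT);
    CREATE TABLE names (compound_id INTEGER, name TEXT);
    INSERT INTO compounds VALUES (15377, NULL, 'water', 'C');
    INSERT INTO compounds VALUES (27313, 15377, NULL, 'C');
    INSERT INTO names VALUES (15377, 'H2O');
    INSERT INTO names VALUES (27313, 'HOH');
""")

# COALESCE(parent_id, id) plays the role of Oracle's nvl(parent_id, id):
# names attached to a child are reported under the parent's identifier.
ALL_NAMES_SQL = """
    SELECT DISTINCT chebi_id, name FROM (
        SELECT COALESCE(c.parent_id, c.id) AS chebi_id, n.name AS name
        FROM names n JOIN compounds c ON n.compound_id = c.id
        WHERE c.status = 'C'
        UNION
        SELECT c.id AS chebi_id, c.name AS name
        FROM compounds c
        WHERE c.name IS NOT NULL AND c.status = 'C'
    )
    ORDER BY chebi_id, name
"""

all_names = conn.execute(ALL_NAMES_SQL).fetchall()
```

In this toy data set the name stored on the child 27313 is returned under the parent identifier 15377, alongside the recommended name from the compounds table.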
An example query could be to retrieve the ChEBI identifier and the ChEBI name for a specific KEGG Compound xref. Remember that this can return more than one result, as we sometimes have a one-to-many mapping for a KEGG Compound xref.
As explained in section 2.1.1 of the developer manual, compounds are merged into sets when they are redundant. In order to query across the whole compound set you need to check whether the KEGG Compound xref is linked directly to the parent of the set or to one of the other members of the set.
Here is an example query for KEGG Compound xref C05279:
select distinct p.chebi_accession, p.name
from database_accession d, compounds c, compounds p
where p.status = 'C'
  and d.type = 'KEGG COMPOUND accession'
  and d.accession_number = 'C05279'
  and ((d.compound_id = c.id and c.parent_id = p.id)
    or (d.compound_id = p.id and p.parent_id is null))
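The two branches of the condition can be tried out on a small data set. The sketch below wraps the same query in a parameterised Python function and runs it against an in-memory SQLite database. The table definitions, the identifiers 100, 200 and 300 and the names are all hypothetical; only the accession C05279 and the xref type string come from the query above.

```python
import sqlite3

KEGG_XREF_SQL = """
    SELECT DISTINCT p.chebi_accession, p.name
    FROM database_accession d, compounds c, compounds p
    WHERE p.status = 'C'
      AND d.type = 'KEGG COMPOUND accession'
      AND d.accession_number = ?
      AND ((d.compound_id = c.id AND c.parent_id = p.id)
        OR (d.compound_id = p.id AND p.parent_id IS NULL))
    ORDER BY p.chebi_accession
"""


def chebi_for_kegg(conn, kegg_accession):
    """Return (chebi_accession, name) of every parent linked to the xref."""
    return conn.execute(KEGG_XREF_SQL, (kegg_accession,)).fetchall()


def make_toy_db():
    """Build a hypothetical database: one xref on a child, one on a parent."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE compounds (id INTEGER PRIMARY KEY, parent_id INTEGER,
                                chebi_accession TEXT, name TEXT, status TEXT);
        CREATE TABLE database_accession (compound_id INTEGER, type TEXT,
                                         accession_number TEXT);
        INSERT INTO compounds VALUES (100, NULL, 'CHEBI:100', 'parent A', 'C');
        INSERT INTO compounds VALUES (200, 100, 'CHEBI:200', NULL, 'C');
        INSERT INTO compounds VALUES (300, NULL, 'CHEBI:300', 'parent B', 'C');
        INSERT INTO database_accession
            VALUES (200, 'KEGG COMPOUND accession', 'C05279');
        INSERT INTO database_accession
            VALUES (300, 'KEGG COMPOUND accession', 'C05279');
    """)
    return conn
```

Calling `chebi_for_kegg(make_toy_db(), "C05279")` returns both parents: the first is reached through its child 200, the second because the xref is attached to the parent itself. This is the one-to-many case mentioned above.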
3. Download Formats
The ChEBI data is provided in several different formats and can be downloaded from the ChEBI FTP server.
ChEBI is stored in a relational database and we currently provide the ChEBI tables in a flat-file tab delimited format. Files are provided as both 3-star and all-star files. 3-star files are suffixed with '_3star' before the file extension. There are various spreadsheet tools available to import this into a relational database. The files are stored in the same structure as the relational database.
3.2 Oracle binary table dumps
ChEBI provides two Oracle binary table dumps, one containing the 3-star entries and one containing all-star entries, that can be imported into an Oracle relational database using the 'imp' command.
The parameter file import.par or import_all_star.par should reside in the same directory when the import is done. Depending on the dump, the command to execute is one of the following:
imp database_name/database_password@Instance_name PARFILE=import.par
imp database_name/database_password@Instance_name PARFILE=import_all_star.par
Note that ChEBI uses Oracle 9i version 126.96.36.199.0 and the export is performed in US7ASCII character set and AL16UTF16 NCHAR character set.
ChEBI provides the ChEBI ontology in OBO format version 1.2. More information about the OBO format can be found on the OBO website or the Gene Ontology website. The tool OBO-edit can be used to view the OBO file.
ChEBI provides a generic SQL dump which consists of SQL insert statements. The archive files are called generic_dump_3star.zip and generic_dump_allstar.zip. Each consists of 12 files which contain SQL table insert statements for the entire database. The file called compounds.sql should always be inserted first in order to avoid any constraint errors. Included in the folder are MySQL and PostgreSQL scripts for creating the tables in the user's database. These insert statements should be usable in any database which accepts SQL as its query language.
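When loading the dump from a script, the ordering rule can be expressed as a sort key. The following is a minimal sketch: only compounds.sql is named in the text above, so the other file names are placeholders, and loading the remaining files alphabetically is an assumption rather than a documented requirement.

```python
def insert_order(sql_files):
    """Return the file names with compounds.sql first, the rest sorted."""
    # The tuple key sorts compounds.sql (False) ahead of all others (True).
    return sorted(sql_files, key=lambda name: (name != "compounds.sql", name))


# Placeholder file names, apart from compounds.sql.
example_files = ["names.sql", "compounds.sql", "comments.sql"]
ordered = insert_order(example_files)
```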
ChEBI provides its chemical structures and additional data in SDF format version 2000.
The ChEBI SDF is a customised version of the official SDF format.
The main deviations from the standard SDF file format are as follows.
- Each data item may be longer than 80 characters and has no maximum limit.
- Each line after the Data Header is a separate data item. For example, each new line in the synonyms is a separate synonym.
One version of the file contains the chemical structure and just three additional tags, namely the ChEBI identifier, the ChEBI Name and the ChEBI star rating.
Below is an example file for the entry water.
Marvin 02220718252D
3 2 0 0 0 0 999 V2000
-0.4125 0.7145 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.4125 -0.7145 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0
2 3 1 0 0 0 0
M END
> <ChEBI ID>
CHEBI:15377

> <ChEBI Name>
water

> <Star>
3
A more complete version of the file contains the chemical structure as well as additional tags describing the ChEBI data items. The data items are listed below:
- ChEBI ID - is the main identifier of the entry and is always present.
- ChEBI Name - is the unambiguous name and is always present.
- Star rating - is the star rating and is always present.
- Definition - is a textual description of the entry and may sometimes be present.
- Secondary ChEBI ID - where various entries have been merged to avoid duplication, their secondary identifiers (if there are any) are listed here.
- SMILES - may sometimes be present.
- InChI - may sometimes be present.
- InChIKey - may sometimes be present.
- Charge - may sometimes be present.
- Mass - may sometimes be present.
- Formulae - may sometimes be present.
- IUPAC Names - may sometimes be present.
- Synonyms - may sometimes be present.
- BRAND Names - may sometimes be present.
- INN - may sometimes be present.
- Registry Numbers - ordered alphabetically and may sometimes be present.
  - Beilstein Registry Numbers
  - CAS Registry Numbers
  - Gmelin Registry Numbers
- Database Links - these are ordered alphabetically and may sometimes be present for certain databases. The tag is always created as the database name followed by " Database Links". For example, ArrayExpress Database Links, BioModels Database Links, Patent Database Links. For a full list of database links in ChEBI please refer to the Data Sources and Automatically generated cross-references. Note that as new cross-references are added the ordering might change.
- Last Modified - always present.
- Submitter Name - if the entry is a submission then this will be indicated.
Below is an example file for the entry water. Please note that this example is not kept up to date, so the data shown may be outdated.
Marvin 02220718252D
3 2 0 0 0 0 999 V2000
-0.4125 0.7145 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.4125 -0.7145 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0
2 3 1 0 0 0 0
M END
> <ChEBI ID>
CHEBI:15377

> <ChEBI Name>
water

> <Star>
3

> <Secondary ChEBI ID>
CHEBI:10743
CHEBI:13352
...
CHEBI:44819

> <SMILES>
[H]O[H]

> <InChI>
InChI=1/H2O/h1H2

> <InChIKey>
InChIKey=XLYOFNOQVPJJNP-UHFFFAOYAF

> <Formulae>
H2O

> <Charge>
0

> <Mass>
18.01528

> <IUPAC Names>
oxidane
water

> <Synonyms>
H(2)O
H2O
HOH
....
eau

> <Beilstein Registry Numbers>
3587155

> <CAS Registry Numbers>
7732-18-5

> <Gmelin Registry Numbers>
117

> <ArrayExpress Database Links>
E-TOXM-12
E-TOXM-14

> <BioModels Database Links>
BIOMD0000000090

> <IntEnz Database Links>
EC 188.8.131.52

> <IntEnz Database Links>
EC 184.108.40.206
...
EC 220.127.116.11

> <KEGG COMPOUND Database Links>
C00001

> <MolBase Database Links>
1

> <PDBeChem Database Links>
HOH

> <Patent Database Links>
EP0769531
...
WO2008157552

> <PubChem Database Links>
8145132

> <Reactome Database Links>
REACT_10000
...
REACT_9996

> <Rhea Database Links>
RHEA:10000
...
RHEA:26115

> <SABIO-RK Database Links>
1000
...
15333

> <UniProt Database Links>
1A1D_ENTCL
...
RHEA:26115

> <Last Modified>
03 July 2008
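Because every line after a data header is a separate data item, the tagged part of a record can be read with a very small parser. The sketch below is one possible approach in Python and not an official ChEBI tool. It assumes the usual SDF layout in which each data header sits on its own line, a blank line ends the data item block, and a line starting with $$$$ ends the record. Function and variable names are invented for the example.

```python
import re

# Matches a data header line such as "> <ChEBI ID>".
HEADER = re.compile(r"^>\s*<(.+)>\s*$")


def parse_data_items(lines):
    """Collect the data items of one SDF record as {tag: [values]}."""
    items = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            items.setdefault(current, [])  # repeated tags accumulate
        elif line.strip() == "" or line.startswith("$$$$"):
            current = None  # blank line or record delimiter ends the block
        elif current is not None:
            items[current].append(line.strip())  # one data item per line
    return items


def database_links(items):
    """Keep only tags built as '<database name> Database Links'."""
    suffix = " Database Links"
    return {tag[: -len(suffix)]: values
            for tag, values in items.items() if tag.endswith(suffix)}


SAMPLE = """\
> <ChEBI ID>
CHEBI:15377

> <IUPAC Names>
oxidane
water

> <KEGG COMPOUND Database Links>
C00001
"""

record = parse_data_items(SAMPLE.splitlines())
links = database_links(record)
```

Lines of the structure block are skipped because no data header has been seen at that point. A tag that occurs more than once in a record simply adds its values to the same list.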
For more information on how to use the ChEBI Web Services please refer to the ChEBI Web Services documentation.