Chemical Entities of Biological Interest (ChEBI)
is a freely available dictionary of 'small molecular entities'.
The term 'molecular entity' encompasses any constitutionally or isotopically distinct atom, molecule,
ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately
distinguishable entity. The molecular entities in question are either products of nature or
synthetic products used to intervene in the processes of living organisms (either on purpose,
as for drugs, or by accident, as for chemicals in the environment). The qualifier 'small' implies the exclusion of
entities directly encoded by the genome, and thus as a rule nucleic acids, proteins and peptides derived from
proteins by cleavage are not included.
Classes of molecular entities and part-molecular entities (in the form of substituent groups
or atoms) are also included in ChEBI.
ChEBI employs nomenclature and terminology recommended by the following international bodies:
- International Union of Pure and Applied Chemistry (IUPAC)
- Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB)
In addition, ChEBI incorporates an ontological classification, whereby the relationships between compounds, groups or classes of compounds and their parents, children and/or siblings are specified.
All data in the database is non-proprietary or is derived from a non-proprietary source. It is thus freely accessible and available to anyone. In addition, each data item is fully traceable and explicitly referenced to the original source.
2. Data Fields
2.1 ChEBI ID
A unique and stable identifier for the entity, for example, CHEBI:16236. It has no chemical significance and may be cited by external users.
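As an illustration of how such identifiers might be handled by an external user, the following minimal sketch checks that a string has the form shown in the example above and extracts its numeric part. The pattern (the prefix "CHEBI:" followed by digits) and the function name parse_chebi_id are assumptions made for this example; they are not part of any ChEBI software.

```python
import re

# Assumed format, inferred from the example CHEBI:16236:
# the literal prefix "CHEBI:" followed by one or more digits.
CHEBI_ID_PATTERN = re.compile(r"^CHEBI:(\d+)$")


def parse_chebi_id(text):
    """Return the numeric part of a ChEBI ID, or None if text is not one."""
    match = CHEBI_ID_PATTERN.match(text.strip())
    return int(match.group(1)) if match else None


print(parse_chebi_id("CHEBI:16236"))
print(parse_chebi_id("16236"))
```

Because the identifier has no chemical significance, the numeric part should be treated only as an opaque key.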
The name for an entity recommended for use by the biological community. In general traditional names have been retained by ChEBI but these may have been modified to enhance clarity, avoid ambiguity and follow more closely current IUPAC recommendations on chemical nomenclature.
For more information see the Annotation Manual.
The ChEBI Name is also provided in ASCII format if the original includes special characters which require a Unicode presentation.
A short verbal definition is included in some entries (and for all new entries annotated after June 2009). For more information see the Annotation Manual.
Wikipedia: In addition to a definition, for those compounds or classes for which ChEBI provides a database accession link to Wikipedia, the first paragraph of the Wikipedia entry is reproduced, along with a link to the full article.
Indicating the date that the entity was last modified by an annotator.
Entries which have been manually annotated by the ChEBI team are indicated by the presence of a '3-star' symbol. This is shown on the main display screen for an entity and on the search results page. An absence of a '3-star' symbol indicates that the entity has been manually annotated by a third party, or (occasionally) that it has been marked as deleted or obsolete. [Preliminary Entries – those loaded automatically from a data source but which have not been manually annotated – are not shown on the public website.]
Here are listed the IDs of any entries which may have been subsumed into the parent.
If an entry is present by virtue of its having been submitted via the ChEBI Submissions Tool, the name of the submitter is displayed here (unless the submitter has elected to remain anonymous).
ChEBI stores the two-dimensional or three-dimensional structural diagrams as connection tables in MDL molfile format. One entity can have one or more connection tables.
One or more structures may be displayed for an entity. Where there is more than one structure available, the additional ones may be viewed by clicking on the 'more structures' link beside the main displayed structure. By default, the diagrams are shown as the static PNG images generated by ChemAxon MarvinBeans, while clicking on 'Applet' will open an interactive MarvinView applet which allows the structure to be manipulated. Clicking on 'Image' restores the static image view. A link is provided beneath a structure to the corresponding MDL molfile.
For more information see the Annotation Manual.
The InChI is a non-proprietary identifier for chemical substances that can be used in printed and electronic data sources thus enabling easier linking of diverse data compilations. It expresses chemical structures in terms of atomic connectivity, tautomeric state, isotopes, stereochemistry and electronic charge in order to produce a sequence of machine-readable characters unique to the respective molecule. Further information on the InChI is available at http://www.iupac.org/inchi/.
A very useful 'Unofficial InChI FAQ' is also accessible at http://wwmm.ch.cam.ac.uk/inchifaq.
The InChIKey is a 25-character hashed version of the full InChI, designed to allow for easy web searches of chemical compounds. InChIKeys consist of 14 characters resulting from a hash of the connectivity information of the InChI, followed by a hyphen, followed by 8 characters resulting from a hash of the remaining layers of the InChI, followed by a single character indicating the version of InChI used, followed by a single checksum character. There is a finite, but very small, probability of finding two structures with the same InChIKey. However, the probability of duplication of only the first block of 14 characters has been estimated as one duplication in 75 databases each containing one billion unique structures; such duplication therefore appears unlikely at present.
Further information on the InChIKey is available at http://old.iupac.org/inchi/release102.html.
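The layout described above (14 characters, a hyphen, 8 characters, one version character and one checksum character) can be sketched as a simple splitter. This is only an illustration: the assumption that every character is an upper-case letter, the function name split_inchikey and the sample key are all hypothetical, and the checksum is not verified. Keys produced by later InChI software releases are longer (27 characters) and would not match this pattern.

```python
import re

# Layout as described in the text: 14 + hyphen + 8 + 1 + 1 = 25 characters.
# Upper-case letters only is an assumption made for this sketch.
INCHIKEY_25 = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])$")


def split_inchikey(key):
    """Split a 25-character InChIKey into its four described parts."""
    match = INCHIKEY_25.match(key)
    if match is None:
        return None
    connectivity, remaining_layers, version, checksum = match.groups()
    return {
        "connectivity_hash": connectivity,
        "remaining_layers_hash": remaining_layers,
        "version": version,
        "checksum": checksum,  # not validated here
    }


# Hypothetical placeholder key, not a real compound.
print(split_inchikey("ABCDEFGHIJKLMN-OPQRSTUVWX"))
```

Comparing only the connectivity block of two keys is one way to find entries that share a skeleton but differ in the remaining layers.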
SMILES (Simplified Molecular Input Line Entry System) is a simple but comprehensive chemical line notation, created in 1986 by David Weininger and further extended by Daylight Chemical Information Systems, Inc. SMILES specifically represents a valence model of a molecule and is widely used as a data exchange format.
Further information on SMILES is available at http://www.daylight.com/smiles/.
Where possible, formulae are assigned for entities and groups. For compounds consisting of discrete molecules, this is generally the molecular formula, a formula according with the relative molecular mass (or the structure). To facilitate searching and downloading of data from external sources, the use of subscripts to indicate multipliers is avoided.
The following conventions regarding ChEBI formulae are followed:
- Unless immediately following a dot '.' any numeral refers to the preceding element in the formula. Example: H2O really means there are two hydrogen atoms and one oxygen atom.
- The dot '.' convention is used when dividing a formula into parts. Any numeral following a dot refers to all the elements within that part of the formula that follow it. Example: C2H3O2.Na.3H2O (CHEBI:32138) really means that after C2H3O2 there is one sodium (Na), six hydrogen and three oxygen atoms.
- Parentheses are used within ChEBI formulae to mean multiplication of elements.
- The 'n' convention is used to show an unknown quantity by which a formula is multiplied. For example: (C12H20O11)n from CHEBI:15443 really means that a C12H20O11 unit is multiplied by an unknown quantity.
- A comma can be used to indicate that there is one or more of the elements divided by the comma but that the exact stoichiometry can vary. For instance, actinolite is a mineral with the chemical formula Ca2(Mg,Fe)5Si8O22(OH)2, which means that it could be anything in the continuous series between Ca2Mg5Si8O22(OH)2 and Ca2Fe5Si8O22(OH)2.
For more information see the Annotation Manual.
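The numeral and dot conventions listed above can be illustrated with a minimal sketch that counts the atoms in a formula. It handles only element symbols, trailing numerals and dot-separated parts with an optional leading multiplier; parentheses, the 'n' convention and the comma convention are deliberately left out. The function name count_atoms is hypothetical.

```python
import re
from collections import Counter

# A part may start with a numeral that multiplies everything after it.
PART = re.compile(r"^(\d*)(.*)$")
# An element symbol (one capital, optional lower-case letter) and its count.
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula):
    """Count atoms in a formula that uses the numeral and dot conventions."""
    totals = Counter()
    for part in formula.split("."):
        multiplier_text, body = PART.match(part).groups()
        multiplier = int(multiplier_text) if multiplier_text else 1
        for symbol, count_text in ELEMENT.findall(body):
            count = int(count_text) if count_text else 1
            totals[symbol] += multiplier * count
    return dict(totals)


print(count_atoms("H2O"))
print(count_atoms("C2H3O2.Na.3H2O"))
```

For C2H3O2.Na.3H2O the part 3H2O contributes six hydrogen and three oxygen atoms, which are added to the three hydrogen and two oxygen atoms of the first part.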
The charge is the sum of all the positive and negative charges shown in the structure. For ions the magnitude of the charge is given in arabic numerals preceded by the sign of the charge. For neutral molecules the charge is indicated as a numerical zero. For instance, the charge of 5,10,15,20-tetrakis(1-methylpyridinium-4-yl)porphyrin (CHEBI:37447) is +4; the charge of borate (CHEBI:22908) is -3.
Relative molecular, atomic and ionic masses are shown for molecular, atomic and ionic entities respectively. The relative masses are calculated from tables of relative atomic masses (atomic weights) published by IUPAC.
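The calculation amounts to summing the relative atomic mass of each element multiplied by the number of its atoms. The sketch below shows this for a few elements only; the rounded atomic weights are illustrative values chosen for the example and are not taken from ChEBI, and the function name relative_mass is hypothetical.

```python
# Rounded atomic weights for a few elements; illustrative values only,
# not a substitute for the IUPAC tables referred to in the text.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "O": 15.999, "Na": 22.990}


def relative_mass(atom_counts):
    """Sum atomic weight times atom count over all elements."""
    return sum(
        ATOMIC_WEIGHTS[symbol] * count
        for symbol, count in atom_counts.items()
    )


# Water: two hydrogen atoms and one oxygen atom.
print(round(relative_mass({"H": 2, "O": 1}), 3))
```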
See Section 5 below.
A name provided for an entity based on current recommendations of IUPAC. It need not be fully systematic as it makes use of 'retained names'.
Example: The IUPAC Name for abietic acid (CHEBI:28987) is abieta-7,13-dien-18-oic acid, based on the retained name 'abietane', rather than the fully systematic name (1R,4aR,10aR)-1,4a-dimethyl-7-(propan-2-yl)-1,2,3,4,4a,5,6,10,10a-decahydrophenanthrene-1-carboxylic acid (which is cited in ChEBI within the list of synonyms for this compound).
In most cases, a single IUPAC Name is provided for a molecular entity or a group. For organic compounds this name will, if necessary, be amended when the IUPAC rules for providing a 'Preferred IUPAC Name' for any organic compound are published.
For further information on IUPAC's preferred names project see the relevant web page: http://www.iupac.org/projects/2001/2001-043-1-800.html For more information see the Annotation Manual.
In cases where an entity is a pharmaceutical substance, an International Nonproprietary Name (INN) may be shown. The INN is the official non-proprietary or generic name given to a pharmaceutical substance, as designated by the World Health Organization (WHO). INNs may appear in ChEBI in English, Latin, Spanish and French language versions.
Alternative names for an entity which either have been used in EBI or external sources or have been devised by the curators based on recommendations of IUPAC, NC-IUBMB or their associated bodies. The source of each synonym is clearly identified (see 'Data sources' below). Systematic names may also be included in this section. In addition to English-language synonyms, versions may be shown in French, German, Spanish and Latin, the language being indicated by a flag.
For more information see the Annotation Manual.
Synonyms are normally reproduced in the exact form in which they appear in the source. However, where changes have been made, e.g. to correct syntax or to convert from an index style of presentation, then this is indicated by .
Where an entity is an active ingredient of a proprietary pharmaceutical preparation, the brand name of the preparation may be shown.
Direct links to the entries for an entity in the databases cited.
The Chemical Abstracts Service (CAS) Registry Number is a unique numeric identifier assigned to a substance when it enters the CAS REGISTRY database. Registry Numbers have no chemical significance and are assigned in sequential order to unique, new substances identified by CAS scientists for inclusion in the database.
Two principles of ChEBI are that (1) nothing held in the database must be proprietary or derived from a proprietary source that would limit its free distribution and/or availability and (2) every data item in the database should be fully traceable and explicitly referenced to the original source. As such, it is impossible for ChEBI to cite CAS as a source for Registry Numbers as this organization's products are not freely accessible. ChEBI therefore cites other reliable and freely accessible sources for CAS Registry Numbers which are always fully referenced.
A free-text comment may be added to some terms especially in cases where confusing terminology has been historically used. A comment may relate to a single term or to the entry as a whole.
Publications which cite the entity are listed here, along with hyperlinks to the PubMed entry via CiteXplore, a web application of the EBI for the exploration of literature related to biological research and bioinformatics. Clicking on the 'Show Abstract' link displays the abstract as contained within CiteXplore.
For entries initiated via the ChEBI Submission tool, a record of any discussion had between the submitter and annotator.
Neither ChEBI nor the EBI stock or sell chemical entities. Supplier information displayed in this section provides links to the ZINC and/or the eMolecules databases of commercially available compounds. Note that these links are obtained by automatic matching of InChIKeys, so no Supplier Information will be shown for entities which do not have an associated structure in ChEBI.
3. Data sources
The Integrated relational Enzyme database of the EBI. IntEnz is the master copy of the Enzyme Nomenclature, the recommendations of the NC-IUBMB on the Nomenclature and Classification of Enzyme-Catalysed Reactions.
One part of the Kyoto Encyclopedia of Genes and Genomes LIGAND composite database, COMPOUND is a collection of biochemical compound structures.
A database of approximately 1,500,000 bioactive compounds, their quantitative properties and bioactivities, abstracted from the primary scientific literature. It is part of the ChEMBL resources at the EBI.
These sources are manually entered into the database by a ChEBI curator.
Indicates entry initiated by a ChEBI curator.
A free, web-based search system, ChemIDplus provides access to structure and nomenclature authority files used for the identification of chemical substances cited in National Library of Medicine (NLM) databases.
Name based on the recommendations of IUPAC.
Name based on the recommendations of the IUPAC-IUBMB Joint Commission on Biochemical Nomenclature, a body jointly responsible to both IUBMB and IUPAC, which deals with matters of biochemical nomenclature that have importance in both biochemistry and chemistry.
Name based on the recommendations of the IUPAC-IUB Commission on Biochemical Nomenclature, the forerunner of JCBN, which was discontinued in 1977.
The National Institute of Standards and Technology operates a Chemistry WebBook providing access to chemical and physical property data for chemical species. The data provided are from collections maintained by the NIST Standard Reference Data Program and outside contributors.
The Protein Data Bank (PDB) is a repository for 3D structural data on biological macromolecules and their complexes. It is maintained by the Worldwide PDB (wwPDB; wwpdb.org) organisation. EMBL-EBI's Protein Data Bank in Europe (PDBe; pdbe.org) is one of the founding members of wwPDB.
The University of Minnesota Biocatalysis/Biodegradation Database maintains a list of compounds involved in microbial biocatalytic reactions and biodegradation pathways.
The RESID Database of Protein Modifications at the EBI is a comprehensive collection of annotations and structures for protein modifications including amino-terminal, carboxyl-terminal and peptide chain cross-link post-translational modifications.
COMe (Co-Ordination of Metals) at the EBI represents an ontology for bioinorganic and other small molecule centres in complex proteins, using a classification system based on the concept of a bioinorganic motif.
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. It is produced by the EBI in international collaboration with GenBank at the NCBI (National Center for Biotechnology Information, USA) and DDBJ (DNA Data Bank of Japan).
The UniProt Knowledgebase is a central access point for extensive curated protein information, including function, classification, and cross-reference, created in 2002 by joining information contained in Swiss-Prot, TrEMBL, and PIR.
An online database of inorganic compounds, MolBase was constructed by Dr Mark Winter of the University of Sheffield with input from undergraduate students.
A part of the KEGG LIGAND database, GLYCAN is a collection of experimentally determined glycan structures.
Authored by Dr Mark Winter of the University of Sheffield, WebElements is a high-quality web-based source of chemistry information relating to the periodic table.
A comprehensive classification system for lipids developed by the Lipid Metabolites and Pathways Strategy (LIPID MAPS) consortium.
EuroFir (European Food Information Resource Network), the world-leading European Network of Excellence on Food Composition Databank systems, is a partnership between 48 universities, research institutes and small-to-medium sized enterprises (SMEs) from 25 European countries.
Links to patent documents which either cite the preparation, properties or uses of an entity, or are the source of a synonym, are provided via the esp@cenet service of the European Patent Office.
Developed at the University of Alberta, the DrugBank database is a bio- and chemo-informatics resource that combines detailed drug data with comprehensive drug target information.
The EBI Industry Programme is a forum through which the EBI can provide training and research of benefit to the European pharmaceutical, biotechnology, consumer-goods, chemical and agricultural industries. The membership comprises many of the world's leading pharmaceutical, biotechnology and consumer-goods companies.
Enhanced automatically generated cross-references to a number of external databases are provided on a separate viewing screen reached via a tab on the main results screen. At the time of writing, automatically generated cross-references are provided to the following databases:
UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL and PIR. UniProtKB (UniProt Knowledgebase) is one component and is the central access point for extensive curated protein information, including function, classification, and cross-reference. The links from a ChEBI entry enable a user to view the UniProtKB entries for all proteins associated with that particular compound and are updated monthly.
A service of EMBL-EBI, IntAct provides a freely available, open source database system and analysis tools for protein interaction data. As for UniProt KB (see above), the links from a ChEBI entry enable a user to view the IntAct entries for all proteins associated with that particular compound.
BioModels Database is a data resource, developed by a consortium including EMBL-EBI and Caltech, that allows biologists to store, search and retrieve published mathematical models of biological interest. Models present in BioModels Database are annotated and linked to relevant data resources, such as publications, databases of compounds and pathways and controlled vocabularies.
The Reactome project is a curated resource of core pathways and reactions in human biology, developed as a collaboration among Cold Spring Harbor Laboratory, EMBL-EBI, and the Gene Ontology Consortium.
PubChem is a database maintained by the National Center for Biotechnology Information (NCBI). It contains substance descriptions and information on small molecules with fewer than 1000 atoms and 1000 bonds.
The SABIO-RK (System for the Analysis of Biochemical Pathways - Reaction Kinetics) is a database that contains information about biochemical reactions, the corresponding kinetic equations with their parameters, and the experimental conditions under which these parameters were measured.
Rhea, a collaboration between EMBL-EBI and the Swiss Institute of Bioinformatics (SIB), is a manually annotated database of chemical reactions in which all reaction participants (reactants and products) are linked to ChEBI. While its main focus is enzymatic reactions, other biochemical reactions are included.
IntEnz (Integrated relational Enzyme database) is a freely available resource focused on enzyme nomenclature. A collaboration between EMBL-EBI and the Swiss Institute of Bioinformatics (SIB), it contains the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) on the nomenclature and classification of enzyme-catalysed reactions.
BRENDA (BRaunschweig ENzyme DAtabase) represents an information system containing a huge amount of biochemical and molecular information on all classified enzymes as well as software tools for querying the database and calculating molecular properties.
NMRShiftDB is an NMR database for organic structures and their nuclear magnetic resonance (NMR) spectra.
5. ChEBI Ontology
The ChEBI Ontology is a structured classification of the entities contained within ChEBI. Originally developed as 'Chemical Ontology' by Michael Ashburner and Pankaj Jaiswal, the initial alpha release was subsumed into ChEBI and is currently in the process of being refined and extended. Its structure is essentially that of a directed acyclic graph (DAG), which differs from a simple taxonomy in that a child term can have many parent terms. Additionally, a number of relationships are incorporated which are cyclic in nature.
The ChEBI Ontology is subdivided into three separate sub-ontologies:
- Chemical Entity, in which molecular entities or parts thereof are classified according to composition and structure, e.g. hydrocarbons, carboxylic acids, tertiary amines;
- Role, divided into three sub-categories: 'chemical role', which classifies entities on the basis of their role within a chemical context, e.g. as ligand, inhibitor, surfactant; 'biological role', which classifies entities on the basis of their role within a biological context, e.g. antibiotic, antiviral agent, coenzyme, hormone; and 'application', which classifies on the basis of their intended use by humans, e.g. pesticide, antirheumatic drug, fuel;
- Subatomic Particle, which classifies particles which are smaller than atoms, e.g. electron, photon, nucleon.
Two options for visualising the ontology relationships for an entry in ChEBI are provided:
The default view which states in words the relationships between a ChEBI entry and its immediate related entities.
A view, accessed via the link at the foot of the Outgoing and Incoming View, which by means of graphic illustration places a ChEBI entry into context within the ontology structure. All parents within the hierarchy are shown, as well as the immediate children. Adjacent is a key identifying the relationships used within the tree structure. Entries and relationships which have been checked by a curator are shown in blue while preliminary (unchecked) ones are in grey. Clicking on a node within the tree will take the user to the ChEBI entry for that node. Unchecked ChEBI entries accessed by this route will display the heading 'Preliminary ChEBI Entry'.
For each relationship a formal definition is included beneath the description.
5.3.1 is a
Implies that 'Entity A' is a subtype of 'Entity B'. E.g.
or, in words, chloroform (CHEBI:35255) is a subtype of the class of chloromethanes (CHEBI:23148), which means that all instances of chloroform are also instances of chloromethanes. Chloromethanes is itself a subtype of the class of chloroalkanes (CHEBI:23143), and so forth.
Definition: "C is_a C' if and only if: given any c that instantiates C at a time t, c instantiates C' at t."
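Because 'is a' chains from class to class, every ancestor of a term can be found by walking the graph upwards. The following minimal sketch does this for the chloroform example; the dictionary IS_A and the function name ancestors are hypothetical and hold only the two links named in the text, not the full ontology.

```python
def ancestors(term, is_a):
    """Collect every class reachable from term by following 'is a' links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        # A term may have several parents, as allowed in a DAG.
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Only the links mentioned in the text; a real entry has many more.
IS_A = {
    "chloroform": ["chloromethanes"],
    "chloromethanes": ["chloroalkanes"],
}

print(sorted(ancestors("chloroform", IS_A)))
```

The seen set keeps the walk from visiting a class twice when it is reachable through more than one parent.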
Used to indicate the relationship between part and whole. E.g.
or, in words, potassium tetracyanonickelate(2−) (CHEBI:30071) has part tetracyanonickelate(2−) (CHEBI:30025).
Definition: "C has_part C' if and only if: given any c that instantiates C at a time t, there is some c' such that c' instantiates C' at time t, and c has c' as a part at t."
Cyclic relationships used to connect acids with their conjugate bases. E.g.
Thus, the neutral pyruvic acid (CHEBI:32816) is the conjugate acid of the pyruvate anion (CHEBI:15361), while as a corollary pyruvate is the conjugate base of the acid.
Definition: "A is_conjugate_acid_of B if and only if, given any a, a instantiates A and has the disposition to be a Bronsted Acid, then there is some b, such that b instantiates B and has the disposition to be a Bronsted Base, such that b derives from a through the removal of a proton as the result of a chemical transformation process."
A cyclic relationship used to show the interrelationship between two tautomers, where the differences between the structures are significant enough to warrant their separate inclusion in ChEBI. E.g.
Thus, L-serine (CHEBI:17115) and its zwitterion (CHEBI:33384) are tautomers.
Definition: "A is_tautomer_of B if and only if, given any a which instantiates A and has composition ca and is described by a molecular graph ag, there is some b that instantiates B, has composition cb and is described by a molecular graph bg, such that ca equals cb, ag is different from bg and a derives from b as the result of an intramolecular chemical transformation process (i.e. a chemical transformation process which has only one participant), in which only bonds to hydrogen are broken or formed."
A cyclic relationship used in cases when two entities are mirror images of and non-superposable upon each other. E.g.
Each relationship shows that D-alanine (CHEBI:15570) is an enantiomer of L-alanine (CHEBI:16977) and vice versa.
Definition: "A is_enantiomer_of B if and only if, given any a that instantiates A, has molecular graph ag, there is some b such that b instantiates B, is described by molecular graph bg, such that ca is equal to cb and ag is transformed into bg through a C2 symmetric transform."
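The conjugate acid/base, tautomer and enantiomer relationships are called cyclic because each one implies a relationship in the opposite direction. A minimal sketch of that idea follows, storing relationships as (subject, relation, object) triples and adding the implied reverse triple. The relation name is_conjugate_base_of is assumed by analogy with is_conjugate_acid_of, and the function name with_reciprocals is hypothetical.

```python
# Each cyclic relation and the relation implied in the reverse direction.
# "is_conjugate_base_of" is an assumed name, by analogy with the definitions.
INVERSE = {
    "is_conjugate_acid_of": "is_conjugate_base_of",
    "is_conjugate_base_of": "is_conjugate_acid_of",
    "is_tautomer_of": "is_tautomer_of",
    "is_enantiomer_of": "is_enantiomer_of",
}


def with_reciprocals(triples):
    """Return the triples plus the reverse triple each cyclic one implies."""
    result = set(triples)
    for subject, relation, obj in triples:
        if relation in INVERSE:
            result.add((obj, INVERSE[relation], subject))
    return result


stated = [
    ("pyruvic acid", "is_conjugate_acid_of", "pyruvate"),
    ("D-alanine", "is_enantiomer_of", "L-alanine"),
]
for triple in sorted(with_reciprocals(stated)):
    print(triple)
```

Relationships that are not in the table, such as 'is a' or 'has part', are left untouched because they do not imply a reverse link.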
Used to denote the relationship between two molecular entities (or classes of entities), one of which possesses one or more characteristic groups from which the other can be derived by functional modification. E.g.
Or, in words, 16α-hydroxyprogesterone (CHEBI:15826) can be derived by functional modification (i.e. 16α-hydroxylation) of progesterone (CHEBI:17026).
Definition: "A has_functional_parent B if and only if given any a, a instantiates A, has molecular graph ag and a obo:has_part some functional group fg, then there is some b such that b instantiates B, has molecular graph bg and has functional group fg' such that bg is the result of a graph transformation process on ag resulting in the conversion of fg into fg'."
Denotes the relationship between an entity and its parent hydride (defined by IUPAC as "an unbranched acyclic or cyclic structure or an acyclic/cyclic structure having a semisystematic or trivial name to which only hydrogen atoms are attached"). E.g.
Thus 1,4-naphthoquinone (CHEBI:27418) has as its parent hydride the cyclic hydrocarbon naphthalene (CHEBI:16482).
Definition: "A has_parent_hydride B if and only if given any a, a instantiates A, has molecular graph ag and a obo:has_part some functional group fg, then there is some b such that b instantiates B, has molecular graph bg such that bg is the result of a graph transformation process on ag resulting in the removal of fg and its replacement by a hydrogen atom."
Indicates the relationship between a substituent group (or atom) and its parent molecular entity, from which it is formed by loss of one or more protons or simple groups such as hydroxy groups. E.g.
The L-valino group (CHEBI:32854) is derived by a proton loss from the N atom of L-valine (CHEBI:16414).
Definition: "A is_substituent_group_from B if and only if A is a group and B is a molecular entity; given any a that instantiates A, a has molecular graph ag and specified attachment point agap, and there is some b that instantiates B and has molecular graph bg, then it is the case that bg is the result of a graph transformation process on ag resulting in the replacement of agap by some group bgg (which may be a hydrogen atom or a more complex group)."
Indicates the particular behaviour which an entity may exhibit, either naturally or by human application. E.g.
Thus morphine (CHEBI:17303) has a role opioid analgesic (CHEBI:35482).
Definition: "Chemical entity C has_role role R if and only if: given any c that instantiates C at t, there exists some r that instantiates R at t, and c is the bearer of r at t."
The status of each entry and relationship shown within the denormalised tree view is indicated as follows:
Entries and relationships which have been checked by a curator are shown in blue in the tree view.
Entries and relationships which have not been checked by a curator are shown in grey in the tree view. Such entries and relationships must be regarded as preliminary. All unchecked entries accessed via the tree view carry a heading 'Preliminary ChEBI Entry'.
There are different types of classes in the ChEBI ontology.
Closed classes that include:
- Superentities (families of isomers):
- alanine (CHEBI:16449) can contain two and only two stereoisomers: L-alanine (CHEBI:16977) and D-alanine (CHEBI:15570)
- phytoene (CHEBI:26119) theoretically can contain a large but limited number of geometric isomers (128), even though ChEBI only has two instances.
- cresol (CHEBI:25399) can have three structural isomers
- pyrrole (CHEBI:35556) can have three tautomers
e.g. methylbenzene (CHEBI:38975) includes three compounds: toluene (CHEBI:17578), pentamethylbenzene (CHEBI:38998) and hexamethylbenzene (CHEBI:39001), and three superentities: xylene (CHEBI:27338) (3 isomers), trimethylbenzene (CHEBI:38641) (3 isomers) and tetramethylbenzene (CHEBI:38977) (3 isomers)
As "methylbenzene" means benzene substituted with one or more methyl groups, the class is closed, i.e. limited in size.
Anything else is an open class. For instance, toluenes (CHEBI:27024) includes toluene and various substituted toluenes, e.g. hydroxytoluenes (CHEBI:24751).
6. Developer's Reference
See the ChEBI Developer Manual for further information.
7.1 Searching the ChEBI database
The ChEBI search interface comprises two parts: a quick text search and an Advanced Search. Text searching in both the quick and Advanced searches employs Lucene, a full-featured text search engine library written entirely in Java, while the structure search facility of the Advanced Search uses the new chemical structure search algorithm OrChem, an Oracle chemistry plug-in using the Chemistry Development Kit (CDK).
7.1.1 Quick text search
A text search box is provided on the home page. This enables users to enter either a precise search term or one employing wild cards. The wild-card character is the asterisk (*). The search engine will then search for that term through all the fields within the ChEBI entries, and then list the results using a scoring mechanism, the compound with the highest score being listed first. The table of results shows, for each result, the structure (if one exists within the database), the ChEBI ID and Name, the Text Search Score and the '3-star' symbol (where appropriate – see Section 2.5). Clicking on the ChEBI ID takes the user directly to that entry, while hovering the cursor over the structure enlarges it.
7.1.2 Advanced Search
The chemical structure search algorithm OrChem allows substructure and similarity searching to be performed on an Oracle 11g database. It allows the user to search on groups and residues, as well as on complete molecular entities. OrChem works in combination with the JChemPaint applet, an editor and viewer included in CDK for 2D chemical structures, and converts chemical structures into fingerprints, each fingerprint representing the occurrence of a particular structural feature. It is important to remember that fingerprints have limitations: they are good at indicating that a particular structure feature is not present but they can only indicate a structure feature's presence with some probability.
Fingerprints are used to eliminate candidates for further examination in substructure searching. For molecule A to be a substructure of molecule B, all bits set in the fingerprint of molecule A must also be set in the fingerprint of molecule B. Once this initial screening is performed, the potential substructure candidates are subjected to a more rigorous inspection to determine whether molecule A is a substructure of molecule B.
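The screening step can be sketched with fingerprints held as integers whose bits stand for structural features. This is only an illustration of the bit test; it does not reproduce OrChem's fingerprints, and the function name may_be_substructure and the example bit patterns are hypothetical.

```python
def may_be_substructure(query_fp, target_fp):
    """Screen: every bit set in the query must also be set in the target.

    A True result only means the target survives screening; a full
    substructure match still has to be carried out afterwards.
    """
    return (query_fp & target_fp) == query_fp


# Hypothetical 4-bit fingerprints.
print(may_be_substructure(0b0101, 0b1101))  # all query bits present
print(may_be_substructure(0b0111, 0b1101))  # one query bit missing
```

A False result rules the candidate out with certainty, which is why the test is useful for discarding most of the database before the slower inspection.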
To perform a substructure search in ChEBI draw your chemical structure using the MarvinSketch applet. Then select the 'Chemical Structure Search' option 'Substructure' and click 'Search'. If your substructure is found within the database the results will be displayed with relevant links to the entities found.
Similarity searching is performed by calculating the Tanimoto coefficient between the query structure and each structure within the database. The Tanimoto coefficient measures the proportion of structural features that two chemical structures have in common, based on the fingerprints described above. A Tanimoto score of 1.0 indicates that the two fingerprints are identical. However, because the fingerprints are calculated on a chemical structure path depth of eight, many structures will have similar fingerprints and very high similarity scores even though they might not be very structurally similar.
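For bit fingerprints, the Tanimoto coefficient is commonly computed as the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below assumes that definition and uses hypothetical 4-bit fingerprints; the function name tanimoto and the choice to return 0.0 for two empty fingerprints are conventions of this example, not of ChEBI.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one
    # Two empty fingerprints share nothing; 0.0 is this sketch's convention.
    return common / total if total else 0.0


# Hypothetical fingerprints, used only for illustration.
fp_query = 0b1101
fp_other = 0b1001

print(tanimoto(fp_query, fp_query))  # identical fingerprints
print(tanimoto(fp_query, fp_other))  # two shared bits out of three set overall
```

Because the score depends only on the fingerprints, two different structures that happen to set the same bits would also score 1.0.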
Identity searching is performed using the InChI as a chemical identifier.
Advanced text search
The text search facility of the Advanced Search allows users to search all the data or to filter a search by category (see below). Mass and charge can be searched within ranges: for example, one can search for all entities with a mass of between 150 and 300 atomic mass units. Furthermore, searches can be filtered by database: for example, one can search for entities used in the NMRShiftDB or PubChem databases.
As in the Quick search, the asterisk (*) is provided as the wildcard character. A wildcard character allows you to find compounds by typing in a partial name. The search engine will then try to find names matching the pattern you have specified using the wildcard character. You can place wildcards in any of the search options and in any of the search combinations, making this character very valuable in terms of searching.
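As a rough illustration of asterisk-style matching, the sketch below turns a wildcard pattern into a regular expression and filters a list of names. It is not the ChEBI search engine: the helper wildcard_to_regex is hypothetical, the name list is a small sample chosen for the example, and case-insensitive matching is an assumption.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Compile a pattern in which '*' matches any run of characters."""
    # Escape everything except the asterisks, then join the pieces with '.*'.
    parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(parts), re.IGNORECASE)


names = ["aspirin", "aspartame", "L-aspartic acid", "caffeine"]

# 'asp*' matches names that begin with 'asp'.
prefix = wildcard_to_regex("asp*")
matches = [n for n in names if prefix.fullmatch(n)]
print(matches)

# '*asp*' matches names that contain 'asp' anywhere.
anywhere = wildcard_to_regex("*asp*")
print([n for n in names if anywhere.fullmatch(n)])
```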
Users can also filter on the ChEBI ontology. This functionality retrieves all the children of a specific entity based on the relationship given. For example, all cofactors (CHEBI:23357) can be retrieved by entering the term 'cofactors' and selecting the 'has role' relationship. This retrieves not only the direct children, such as pantothenic acids (CHEBI:25848), but also further entities in the graph related via an 'is a' relationship, such as NADPH (CHEBI:16474). Ticking a specific checkbox restricts the results to entities that have chemical structures.
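The ontology filter can be pictured as a graph traversal: first collect the entities linked to the target by the chosen relationship, then follow 'is a' links downwards from those entities. The sketch below uses a toy edge list with placeholder names; it is not real ChEBI data, and the function retrieve is a hypothetical helper rather than part of any ChEBI interface.

```python
from collections import deque

# Toy edges (hypothetical, not real ChEBI data): (child, relationship, parent)
EDGES = [
    ("class_X", "has role", "role_R"),
    ("entity_X1", "is a", "class_X"),
    ("entity_X2", "is a", "entity_X1"),
    ("entity_Y", "is a", "class_Z"),
]


def retrieve(target, relationship, edges):
    """Entities linked to target by relationship, plus their 'is a' descendants."""
    # Index 'is a' edges from parent to children for the downward walk.
    is_a_children = {}
    for child, rel, parent in edges:
        if rel == "is a":
            is_a_children.setdefault(parent, set()).add(child)

    # Direct children of the target via the chosen relationship.
    found = {c for c, rel, p in edges if rel == relationship and p == target}

    # Breadth-first walk down the 'is a' hierarchy.
    queue = deque(found)
    while queue:
        node = queue.popleft()
        for child in is_a_children.get(node, ()):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


print(sorted(retrieve("role_R", "has role", EDGES)))
```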
All of the above searches can be combined using the logical operators AND, OR and BUT NOT. The Results page provides options for exporting the search results in MDL SD file, tab-delimited or XML format.
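One way to picture the logical operators is as set operations on the result lists of the individual searches. The sketch below assumes that interpretation: AND as intersection, OR as union and BUT NOT as difference. The identifiers are placeholders, not real ChEBI IDs.

```python
# Hypothetical result sets from two separate searches (placeholder IDs).
name_hits = {"ID:1", "ID:2", "ID:3"}
mass_hits = {"ID:2", "ID:3", "ID:4"}

both = name_hits & mass_hits      # AND: entities found by both searches
either = name_hits | mass_hits    # OR: entities found by at least one search
only_name = name_hits - mass_hits # BUT NOT: in the first search, not the second

print(sorted(both))
print(sorted(either))
print(sorted(only_name))
```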
As mentioned above, users can also search by category. This option allows searches to be narrowed down by selecting from the categories provided, a summary of which is below:
- All – allows you to search all the categories.
- ChEBI Identifier – allows searching for specific ChEBI identifiers.
- ChEBI Name – will search only for ChEBI Names matching your search term.
- Definition – allows you to search within the Definition field.
- IUPAC Name – will search for IUPAC Names.
- All Names – will search for ChEBI Names, IUPAC Names and Synonyms.
- InChI/InChIKey – will search for InChIs and InChIKeys matching the search criteria.
- SMILES – will search for SMILES strings matching the search criteria.
- Cross references – allows searching for accession numbers from other linked sources.
- Registry Numbers – will search for CAS, Beilstein and Gmelin Registry Numbers matching the search criteria.
- Citations – will search within the Citations for matches with the search criteria.
- Formula – will search within the Formula field.
- Mass – will search within the Mass field.
Categories can be used within any combination of the logical operators described above.
7.2 RSS Feed
You can subscribe to the ChEBI RSS feed by downloading and installing an RSS reader. Once you have installed the reader, you can copy and paste the RSS feed address into its subscription toolbar and save it. Click on the RSS icon to subscribe to the RSS feed.
Firefox users can subscribe to the ChEBI RSS feed by clicking on the RSS link at the top right corner of the address bar.
Once you have bookmarked the RSS feed, you can view the most up-to-date news via your bookmarks folder.
7.3 Browser Search Plugins
You can install the ChEBI search engine into your web browser's search box. ChEBI uses the OpenSearch description document format, which is supported by web browsers such as Internet Explorer 7 and Mozilla Firefox.
For Internet Explorer 7, follow these steps:
- Click on the search options tab found next to the search engine box in the top right-hand corner of your browser.
- When the menu appears, navigate to the 'Add Search Providers' option.
- The 'ChEBI' option should appear. Click on it to automatically install it as one of your browser's search engines.
For Mozilla Firefox, follow these steps:
- Click on the search options tab found next to the search engine box.
- When the menu appears, navigate to the 'Add ChEBI' option.
- Click on it to automatically install it as one of your browser's search engines.