ChEMBL logo

ChEMBL

spacer
ChEMBL Statistics
  Loading Statistics...
spacer

Compounds Targets Assays Documents Cells Exact Match Activity Source Filter

Please enter a search term

Selected Source Assay Counts Activity Counts
Scientific Literature9305915384310 (38.55%)
Open TG-GATEs158199158199 (1.13%)
DrugMatrix113678350929 (2.51%)
TP-search Transporter Database35926765 (0.05%)
PubChem BioAssays27677432508 (53.21%)
FDA Approval Packages13861387 (0.01%)
Sanger Institute Genomics of Drug Sensitivity in Cancer71473169 (0.52%)
Millipore Kinase Screening46873944 (0.53%)
GSK Published Kinase Inhibitor Set456169451 (1.21%)
Drugs for Neglected Diseases Initiative (DNDi)23314452 (0.1%)
MMV Malaria Box13845158 (0.32%)
Open Source Malaria Screening22344 (0%)
WHO-TDR Malaria Screening165853 (0.04%)
St Jude Malaria Screening165456 (0.04%)
AstraZeneca Deposited Data1511687 (0.08%)
GSK Kinetoplastid Screening137235 (0.05%)
Deposited Supplementary Bioactivity Data134817 (0.03%)
Novartis Malaria Screening627888 (0.2%)
GSK Malaria Screening681198 (0.58%)
St Jude Leishmania Screening642105 (0.3%)
GSK Tuberculosis Screening51406 (0.01%)
Harvard Malaria Screening4111 (0%)
Gates Library compound collection269444 (0.5%)
Check/Uncheck All

ChEMBL FAQ

ChEMBL Interface Questions


ChEMBL Data Questions


ChEMBL Download Questions


General Questions


Schema Questions and SQL Examples


ChEMBL Interface Questions

ChEMBL Data Questions

  • General

    • How often do you update the data?

      The data is updated regularly, with releases approximately every 3-4 months.


    • Can you provide of list of all changes and additions you've made since the last release?

      We provide a list of all downgraded compounds with each release. However, it would be impossible to note all structural changes and/or changes to compound names etc. If there is something specific that you need to check, please contact us and we will be able to let you know what, if anything, has happened to that compound.


    • Are my queries stored and if so, are they routinely deleted?

      We do not store any queries that our users run. We do not analyse the nature of any queries or note structures that are processed through the website. In fact, we use https:// wherever possible to maintain the security of the searches. Additionally, the EBI has terms of use and these can be found at: http://www.ebi.ac.uk/Information/termsofuse.html. This will tell you what information is retained by the EBI, if any.


    • Why are there so many different types of Standard Units in the database?

      The published units are taken directly from the literature and we then attempt to standardise these to report as a standard type. This is an ongoing curation task, which was started with the most common units first.


    • What is the 'Confidence Score'?

      As part of the manual curation process applied to the data we assign a confidence score to the assay-to-target relationships represented in the database. The confidence score value reflects both the type of target assigned to a particular assay and the confidence that the target assigned is the correct target for that assay. The confidence scores range from 0, for as yet uncurated data entries, to 9, where a single protein target has been assigned a high degree of confidence. Assays assigned a non-molecular target type, for example a cell-line or an organism, receive a confidence score of 1, while assays with assigned protein targets receive a confidence score of at least 4.

      CONFIDENCE_SCORE	DESCRIPTION
      0			Default value - Target assignment has yet to be curated
      1			Target assigned is non-molecular
      3			Target assigned is molecular non-protein target
      4			Multiple homologous protein targets may be assigned
      5			Multiple direct protein targets may be assigned
      6			Homologous protein complex subunits assigned
      7			Direct protein complex subunits assigned
      8			Homologous single protein target assigned
      9			Direct single protein target assigned
      


    • Do you show which compounds and/or bioactivity data has been changed or deleted with each release of ChEMBL?

      As the data and the compounds are continually being curated, it is not possible to keep a track of these alterations or additions. However, if our users contact us personally, we are always happy to try and give them this type of information on individual data or compounds, wherever possible.


    • Is there a list of all the Activity Types in ChEMBL?

      No, we do not keep a list of activity types but we are happy to create such a list as and when a user would require it. This can also be found if the ChEMBL data is downloaded from the FTP site and installed into a database.


    • What literature coverage is there in ChEMBL?

      We have coverage of over 47 journal papers in ChEMBL, most notably Journal of Medicinal Chemistry (http://pubs.acs.org/toc/jmcmar/), Bioorganic and Medicinal Chemistry Letters (http://www.sciencedirect.com/science/journal/0960894X ) and Journal of Natural Products (http://pubs.acs.org/toc/jnprdf/). The ChEMBL data has been extracted from over 48,000 individual papers in total.


    • How are the SMILES and InChI created for ChEMBL?

      The canonical SMILES are calculated using Accelrys's Pipeline Pilot using an algorithm that belongs to Accelrys but it was derived from Daylight's algorithm (http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html). The standard InChI is calculated using the command line InChI generator and was developed by the InChI Trust (http://www.inchi-trust.org/inchi/) and is called up via the command line.


    • How is the LogP calculated?

      The LogP, and other compound properties are caluculated according to the following document: Property_Definitions


    • How can I be informed of updates in ChEMBL?

      You can sign up to our ChEMBL Announce mailing list (http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce), where you will be kept up to date with all new releases and changes to the database.


    • Which sub-set of the PubChem Bioassay data has been integrated into ChEMBL ?

      In PubChem, depositors may assign multiple result types to an assay. However, if an assay is deposited as a 'confirmatory' assay (defined as an assay where a range of SID concentrations have been tested, with a view to determining a measurement of potency), then one of the result types must be marked up as an 'Active Concentration' (AC) result type. Panel assays may contain many 'AC' result types, one per panel member. The AC result type is the calculated potency measurement from the data, and is typically an IC50, EC50, AC50, GI50 or Ki. In addition, the PubChem deposition process requires that each SID in an assay must be assigned a single 'Activity Summary', from a controlled vocabulary which includes 'inactive', 'active' and 'inconclusive'. Only assays containing 'AC' result types have been integrated into ChEMBL, and from these assays, only activity data and SIDs associated with 'AC' result types have been integrated. The 'Activity Summary' field in PubChem associated with each integrated activity is also captured and shown in the 'Activity Comment' field in ChEMBL. Panel assays are divided into separate assays in ChEMBL, one ChEMBL assay for each panel member. A number of additional assays which do not match the above criteria, have also been included in the PubChem integration. These have been chosen individually, on the basis that they have been specifically requested to be included by ChEMBL users. These are: AID1, AID3, AID5, AID7, AID9, AID11, AID13, AID15, AID17, AID19, AID21, AID23, AID25, AID27, AID29, AID31, AID33, AID35, AID37, AID39, AID41, AID43, AID45, AID47, AID49, AID51, AID53, AID55, AID57, AID59, AID61, AID63, AID65, AID67, AID69, AID71, AID73, AID75, AID77, AID79, AID81, AID83, AID85, AID87, AID89, AID91, AID93, AID95, AID97, AID99, AID101, AID103, AID105, AID107, AID109, AID111, AID113, AID115, AID117, AID119, AID121, AID123, AID125, AID127, AID129, AID131, AID133, AID135, AID137, AID139, AID141, AID143, AID145, AID1851 and AID493040. A number of assays have been excluded since they contain data already present in ChEMBL. These assays include all ChEMBL-derived assays, and AID1433 (A data set that has been manually added to ChEMBL). An automatic 'standardization' of SID structures downloaded from PubChem is carried out prior to integration (using in house protocols). Standard inchis are generated from the standardized mol files, and used to normalize with existing ChEMBL structures. SIDs matching exactly on standard inchi to existing ChEMBL structures are assigned to the existing CHEMBLID (and the mol file already associated with the existing ChEMBL structure is used to represent the searchable structure for this CHEMBLID). Where no match to a standard inchi is achieved, the incoming SID is assigned to a new CHEMBLID, and the standardized mol file for the SID is used to represent the searchable structure. A very small number of SIDs (<0.1%) with standardized mol files that fail to produce valid standard inchis, or to load into a oracle symyx cartridge without errors, are each assigned a new CHEMBLID, and associated with a 'null' structure (ie: no mol file is associated with this new CHEMBLID). Mappings to ChEMBL targets for each integrated PubChem assay has been automated for the initial load. However, manual review of these mappings by expert curators may result in ongoing changes. Users who prefer to exclude the integrated PubChem data (or any other integrated external data set) from their ChEMBL web-interface searches can do so by clicking 'Activity Source Filter' next to the main ChEMBL search bar, and deselecting the sources not required in future searches. Note, however, that these deselections persist between browser sessions. Users querying ChEMBL database dumps directly using SQL, and wishing to achieve this same filtering, should inspect the 'source' table, and the foreign keys to this table in the 'assays' and 'compound_records' tables.


    • What is the difference between ChEMBLdb, Malaria Data, ChEMBL-NTD and Kinase SARfari?

      ChEMBLdb is a database of bioactive drug-like small molecules, it contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). We attempt to normalise the bioactivities into a uniform set of end-points and units where possible, and also to tag the links between a molecular target and a published assay with a set of varying confidence levels. The data is abstracted and curated from the primary scientific literature, and cover a significant fraction of the SAR and discovery of modern drugs.

      Malaria Data provides a service to access to hundreds of thousands of data points on malaria-related compounds, assays and targets, thus facilitating research for this neglected tropical disease. Inspired by the successful ChEMBL interface, a user may query the database using keywords, synonyms, chemical structures or protein sequences, review and filter the hits using tables or charts and then download the resulting subset.

      ChEMBL-NTD is a repository for Open Access primary screening and medicinal chemistry data directed at neglected diseases - endemic tropical diseases of the developing regions of the Africa, Asia, and the Americas. The primary purpose of ChEMBL-NTD is to provide a freely accessible and permanent archive and distribution centre for deposited data. ChEMBL-NTD is a subset of the data in the free medicinal chemistry and drug discovery database ChEMBLdb.

      Kinase SARfari is an integrated chemogenomics workbench focused on Kinases. The system incorporates and links Kinase sequence, structure, compounds and screening data. It is both biology and chemistry aware and provides a central resource for Protein Kinase knowledge.


    • What data sources are stored in ChEMBLdb?

      The majority of the data is made up of the PubChem data, with the curated data from primary literature coming in second highest. As ChEMBL is part of the eTox project we also have data from TG Gates, GRAC, DrugMatrix and TP-search Transporter Database. As part of the Neglected Tropical Disease portal, we have deposited data from GSK for malaria and tuberculosis, Novartis malaria data, St Jude's Children's hospital malaria data. We also have the GSK published Kinase Inhibitor set and deposited supplementary data.


    • Can you provide some details on how ChEMBL compound properties (e.g. LogP, MW, PSA,..) are calculated?

      Yes, please refer to the following document: Property_Definitions


    • What is pChEMBL?

      In addition to the conversion of published activity types/values/units to standard activity types/values/units, we have now added an additional field called pChEMBL to the Activities table. This value allows a number of roughly comparable measures of half-maximal response concentration/potency/affinity to be compared on a negative logarithmic scale. For example, an IC50 measurement of 1nM would have a pChEMBL value of 9. pChEMBL is defined as: -Log(molar IC50, XC50, EC50, AC50, Ki, Kd or Potency.


    • What is the Data Validity column?

      The Data Validity column has been added to the interface and to the database (as column data_validity_comment) to flag activity values that are outside a range typical for that activity type, or even for potentially missing data. For example, an IC50 of 23000000nM would be flagged as "Outside Typical Range". The use of the flag will allow users to decide whether or not to retain those values in their results set after running a search on the interface or using the downloaded database. The flags currently in use in this field are:

      • Potential missing data
      • Potential author error
      • Manually validated
      • Potential transcription error
      • Outside typical range
      • Non standard unit for type
      • Author confirmed error


    • Can you provide more details on the removal of Metal-Containing compounds?

      The molfiles and images of a proportion of metal-containing compounds we removed from the ChEMBL interface and downloads set in ChEMBL_17. This was partially due to some of these compounds having coordinated metal bonds. As InChI limitations are such that these coordinate bonds could not generate a Standard InChI, our main compound indicator of uniqueness in ChEMBL, it was decided to exclude the structures altogether.

      The compound image on the interface was replaced with an icon that shows it is a metal-containing compound and the molfiles were removed from the download set on the FTP site. We will retain the molecular formula in both the download files and on the ChEMBL interface, so that the elemental make up of the compound is visible. This change does not affect the storage or display of the associated bioactivity data for these compounds.


    • Can you provide more details on bioactivity data standardisation?

      In addition to the conversion of published activity types/values/units to standard activity types/values/units, described in previous releases, a number of further enhancements have been made to the data in the activities table:

      1. Conversion of log/-log values to nM concentrations. For example pIC50 and log Ki have been converted to IC50 and Ki values.
      2. Rounding of standard values to three significant figures (or 2 decimal places for values > 10)
      3. Flagging of data with possible errors (using the data_validity_comment field), such as unusual units for the activity type, or very large/small numbers.
      4. Identification of potential duplicate values - where an activity measurement is likely to be a repeat citation of an earlier measurement, rather than an independent measurement (flagged using the potential_duplicate column).

      An additional table (ACTIVITY_STDS_LOOKUP) has been included in this release, which contains details of the standard_types that have been standardised, their permitted standard_units, and acceptable value ranges.


    • Can you provide more details on how to detect potentially duplicated data?

      We detected and flagged duplicated activity entries and potential transcription errors in activity records that come from publications. The former are records with identical compound, target, activity, type and unit values that were most likely reported as citations of measurements from previous papers, even when these measurements were subsequently rounded. The latter cases consist of otherwise identical entries whose activity values differ by exactly 3 or 6 orders of magnitude indicating a likely error in the units (e.g. uM instead of nM).


ChEMBL Download Questions

General Questions

Schema Questions and SQL Examples

  • General

    • Can you show the relationship between all the tables in the ChEMBL database?

      A PNG of the schema relationships can be found in each release, the most recent being: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_15/chembl_15_erd.png


    • Where can I get more information about the new classifications of target types in ChEMBL_15?

      This can be found in the following slides (Current for ChEMBL_15): Target Types


    • Retrieve all the bioactivity data for bacterial targets:

      SELECT md.chembl_id AS compound_chembl_id, 
      cs.canonical_smiles,   
      act.standard_type,   
      act.standard_value,   
      act.standard_units,   
      td.chembl_id AS target_chembl_id,   
      td.organism,   td.pref_name 
      FROM target_dictionary td,   
      assays a,   
      activities act,   
      molecule_dictionary md,   
      organism_class oc,   
      compound_structures cs 
      WHERE td.tid      = a.tid 
      AND a.assay_id    = act.assay_id 
      AND act.molregno  = md.molregno 
      AND md.molregno   = cs.molregno 
      AND td.tax_id     = oc.tax_id 
      AND oc.L1         = 'Bacteria';
      


    • Retrieve all the compounds which have an IC50 bioactivity value in nM:

      SELECT DISTINCT md.molregno, 
      md.chembl_id,   
      act.standard_type,   
      act.standard_value, 
      act.standard_units 
      FROM activities act, 
      molecule_dictionary md 
      WHERE act.molregno     = MD.molregno 
      AND act.standard_type  = 'IC50' 
      AND act.standard_units = 'nM';
      


    • Retrieve compounds which are selective to one target over a second target:

      -- Compounds which are selective for Human CDK2 (CHEMBL301) over Human CDK5 (CHEMBL4036) 
      -- Selectivity is based on comparing binding affinities using IC50 values.  
      SELECT md.chembl_id FROM target_dictionary td, 
      assays a, 
      activities act, 
      molecule_dictionary md 
      WHERE td.tid             =a.tid 
      AND a.assay_id           =act.assay_id 
      AND md.molregno          =act.molregno 
      AND act.standard_relation='=' 
      AND act.standard_type   IN ('IC50') 
      AND act.standard_units   = 'nM' 
      AND act.standard_value   < 50 
      AND td.chembl_id         = 'CHEMBL301' 
      INTERSECT SELECT md.chembl_id 
      FROM target_dictionary td,   assays a,   activities act,   molecule_dictionary md 
      WHERE td.tid             = a.tid 
      AND a.assay_id           = act.assay_id 
      AND md.molregno          = act.molregno 
      AND act.standard_relation='=' 
      AND act.standard_type   IN ('IC50') 
      AND act.standard_units   = 'nM' 
      AND act.standard_value   > 200 
      AND td.chembl_id         = 'CHEMBL4036';
      


    • Retrieve target ChEMBL_ID, target_name, target_type, protein accessions and sequences for all protein targets:

      SELECT t.chembl_id AS target_chembl_id,   
      t.pref_name      AS target_name,   
      t.target_type,   
      c.accession      AS protein_accession,   
      c.sequence       AS protein_sequence 
      FROM target_dictionary t,   
      target_type tt,   
      target_components tc,   
      component_sequences c 
      WHERE t.target_type = tt.target_type 
      AND t.tid           = tc.tid 
      AND tc.component_id = c.component_id 
      AND tt.parent_type  = 'PROTEIN';
      


    • Retrieve target activity details for a compound and all salts:

      -- The compound is Atorvastatin (CHEMBL1487) 
      SELECT DISTINCT m1.chembl_id AS parent_chembl_id,   
      m2.chembl_id                 AS salt_chembl_id,   
      r.compound_key,   
      NVL(TO_CHAR(d.pubmed_id),d.doi) AS pubmed_id_or_doi,   
      a.description                   AS assay_description,   act.standard_type,   
      act.standard_relation,   
      act.standard_value,   
      act.standard_units,   
      act.activity_comment,   
      t.chembl_id AS target_chembl_id,   
      t.pref_name AS target_name,   
      t.target_type FROM molecule_hierarchy h,   
      molecule_dictionary m1,   
      molecule_dictionary m2,   
      compound_records r,   
      docs d,   
      activities act,   
      assays a,   
      target_dictionary t 
      WHERE m1.molregno = h.parent_molregno 
      AND h.molregno    = m2.molregno 
      AND m2.molregno   = r.molregno 
      AND r.record_id   = act.record_id 
      AND r.doc_id      = d.doc_id 
      AND act.assay_id  = a.assay_id 
      AND a.tid         = t.tid 
      AND m1.chembl_id  = 'CHEMBL1487';
      


    • Retrieve compound activity details for all targets containing a protein of interest:

      -- Protein of interest is human M2 muscarinic receptor (P08172) 
      SELECT DISTINCT m.chembl_id AS compound_chembl_id,   
      s.canonical_smiles,   
      r.compound_key,   
      NVL(TO_CHAR(d.pubmed_id),d.doi) AS pubmed_id_or_doi,   
      a.description                   AS assay_description,   act.standard_type,   
      act.standard_relation,   
      act.standard_value,   
      act.standard_units,   
      act.activity_comment,   
      t.chembl_id AS target_chembl_id,   
      t.pref_name AS target_name,   
      t.target_type FROM compound_structures s,   
      molecule_dictionary m,   
      compound_records r,   
      docs d,   
      activities act,   
      assays a,   
      target_dictionary t,   
      target_components tc,   
      component_sequences cs 
      WHERE s.molregno (+) = m.molregno 
      AND m.molregno       = r.molregno 
      AND r.record_id      = act.record_id 
      AND r.doc_id         = d.doc_id 
      AND act.assay_id     = a.assay_id 
      AND a.tid            = t.tid 
      AND t.tid            = tc.tid 
      AND tc.component_id  = cs.component_id 
      AND cs.accession     = 'P08172'; 
      


    • Retrieve compound activity details for a target:

      -- Target is Human PDE5 (CHEMBL1827) 
      SELECT m.chembl_id AS compound_chembl_id,   
      s.canonical_smiles,   
      r.compound_key,   
      NVL(TO_CHAR(d.pubmed_id),d.doi) AS pubmed_id_or_doi,   
      a.description                   AS assay_description,   act.standard_type,   
      act.standard_relation,   
      act.standard_value,   
      act.standard_units,   
      act.activity_comment 
      FROM compound_structures s,   
      molecule_dictionary m,   
      compound_records r,   
      docs d,   
      activities act,   
      assays a,   
      target_dictionary t 
      WHERE s.molregno (+) = m.molregno 
      AND m.molregno       = r.molregno 
      AND r.record_id      = act.record_id 
      AND r.doc_id         = d.doc_id 
      AND act.assay_id     = a.assay_id 
      AND a.tid            = t.tid 
      AND t.chembl_id      = 'CHEMBL1827';
      


    • Can you show me how to use SQL to only extract the PubChem data from ChEMBL?

      -- The source id for PubChem data is found in the SOURCE table and is ‘7’.
      -- Please note that this will bring back over 4,000,000 data points
      SELECT DISTINCT md.molregno, 
      md.chembl_id,   
      act.standard_type,   
      act.standard_value, 
      act.standard_units 
      FROM activities act, 
      molecule_dictionary md,
      compound_records cr,
      source src 
      WHERE act.molregno     = MD.molregno 
      AND cr.molregno = act.molregno
      AND src.src_id = cr.src_id
      AND src.src_id = ‘7’;
      


spacer
spacer