ChEMBL logo

ChEMBL

spacer
ChEMBL Statistics
  Loading Statistics...
spacer

Compounds Targets Assays Documents Cells Tissues Exact Match Activity Source Filter

Please enter a search term

Selected Source Assay Counts Activity Counts
Scientific Literature10417336043592 (39.74%)
Patent Bioactivity Data425874050 (0.49%)
BindingDB Database4147303295 (1.99%)
TP-search Transporter Database35926765 (0.04%)
PubChem BioAssays28607434992 (48.89%)
FDA Approval Packages13861387 (0.01%)
Sanger Institute Genomics of Drug Sensitivity in Cancer71473169 (0.48%)
GSK Published Kinase Inhibitor Set456169451 (1.11%)
Drugs for Neglected Diseases Initiative (DNDi)23314452 (0.1%)
MMV Malaria Box13845158 (0.3%)
Curated Drug Pharmacokinetic Data1361163 (0.01%)
DrugMatrix134494046 (3.25%)
MMV Pathogen Box844657 (0.03%)
K4DD Project462064 (0.01%)
Open Source Malaria Screening22344 (0%)
St Jude Malaria Screening165456 (0.04%)
WHO-TDR Malaria Screening165853 (0.04%)
GSK Tuberculosis Screening151814 (0.01%)
CO-ADD antimicrobial screening data15180 (0%)
AstraZeneca Deposited Data1511687 (0.08%)
GSK Kinetoplastid Screening137235 (0.05%)
Deposited Supplementary Bioactivity Data134817 (0.03%)
Curated Drug Metabolism Pathways1111 (0%)
St Jude Leishmania Screening642105 (0.28%)
Novartis Malaria Screening627888 (0.18%)
GSK Malaria Screening681198 (0.53%)
Harvard Malaria Screening4111 (0%)
Gates Library compound collection4139775 (0.92%)
Open TG-GATEs2210708 (1.39%)
Published Kinase Inhibitor Set 21491 (0%)
Check/Uncheck All

ChEMBL FAQ

ChEMBL Interface Questions


ChEMBL Data Questions


ChEMBL Download Questions


General Questions


Schema Questions and SQL Examples


ChEMBL Interface Questions

  • General

    • How can you run a compound/ BLAST search?

      To run a compound (ligand) search, you can use the drawing window, found at https://www.ebi.ac.uk/chembl and draw in the exact structure or substructure of the compound you want to search on (1). Additionally, you can use the 'Search ChEMBLdb....' field to search for compound names (2).

      FAQ

      To run a BLAST search you can go to https://www.ebi.ac.uk/chembldb/target and paste in a FASTA file of your target of interest. Alternatively, you can paste ChEMBLIDs of your targets into this box and run a search.

      FAQ


    • How do you filter on the bioactivity results?

      Once you have run your search, whether it be a target or ligand search, you can go to the top right hand section of the results page and use the drop down box to choose the option to 'Filter Bioactivities'.

      FAQ

      Once this option has been chosen, you will be taken to the next page where you can choose the activity endpoints to filter the data on.

      FAQ


    • Can you search on a chemical molecular formula?

      At the moment, you are unable to search on the interface using molecular formula.


    • Can you search on a compound name, rather than a target name?

      Yes, you can choose the Keywords field for a single search term, which can be found at the top of all of the ChEMBLdb pages or, for multiple compound name searched, you can use the field on the Search Compounds page. Both will search across any of the data fields that may contain names of compounds, including Synonyms and Compound Name.


    • Once your search has returned results, can you click on one of those compounds and run it as a substructure?

      Yes you can do this. Once you have run your search, click on the structure itself. This will open up the drawing pop-up window which has a 'Use As Query' button. Once selected this will put this structure under the Compound Search tab automatically.


    • Can you input SMILES strings for compound searching?

      Yes, you can now use a SMILES string to run a compound search in ChEMBLdb. This feature can be found under the Ligand Search tab on the home page. You can also use multiple SMILES strings at once, as long as each string is on a new line.


    • How do I run an exact match for a compound?

      To run an exact match, choose the Ligand Search tab, found on the main home page. Use the embedded drawing window to input your structure of interest. You can then choose 'Similarity' from the search options dropdown box and '100%' from the 'Similarity Cut-off'. This will allow you to search for that compound specifically.


    • What is the maximum size of data file that you can download from the interface?

      We don't have a specific size restriction, but a generally rule is that anything with over 20,000 results with not be downloaded from the interface. The file is too big to be created.


    • Why don't I see my target of interest?

      We only store information for targets that we have data on - i.e. if we haven't extracted the target information from a journal, then it will not be stored in ChEMBLdb. However, it is possible to find similar targets using a Blast search and a FASTA sequence. A FASTA sequence can be copied from the Uniprot website (< a href=https://www.uniprot.org>https://www.uniprot.org) and pasted into the protein sequence search box - https://www.ebi.ac.uk/chembldb/target


    • Can you just search for known drugs?

      Yes you can. You can go to the 'Browse Drugs' tab on the home page and extract the bioactivity data for only the known drug. You can also search the compounds with criteria such as 'Natural Product' or using USAN names.


    • How do you download all the data from ChEMBL and its associated databases?

      All downloads can be found on our FTP site: https://www.ebi.ac.uk/chembldb/downloads


    • Can I search on your interface using an SDFile of compounds?

      This is not an available option at the moment. However, it has been added to the list of potential new features to be added, but there is no projected date for its completion.


    • How do I download a subset of biological data for a particular target?

      If you search on the interface for a particular target and then you want to download all the biological data related to this, you can do so by doing the following:

      FAQ

      In the Target Report Card, scroll to the Target Associated Bioactivities and click on the small grey square to the lower right of the pie chart (highlighted above). Choose 'Display All Records', which will take you to the page, which will show all the biological data relating to this target. In the top right hand corner of the page, the drop down box will allow you to choose to download all of the data as a tab-delimited or XLS file.

      FAQ


    • Can I have multiple tabs open to the ChEMBL interface and run concurrent searches?

      No, this is not possible as they are running off the same servers. It is best to run one search at a time.


    • Why do I not see all the salt forms in my compound result set on the interface?

      Salts were removed from the compounds results set to prevent confusion and to reduce the amount of compounds brought back from a search. However, if you search on a salt form of a compound, for example, Sildenafil Citrate, you will still see its parent, i.e. Sildenafil, in the compound list. If you then choose to display all the bioactivities, you will then see the individual salts relating to the activities.


    • How do you see more columns in the Bioactivity Search Results?

      All of the columns are exported by default in the text files, but not all are shown on the interface. More columns can be added to the search results window by clicking on 'Show/hide columns', which can be located to the top right hand corner of the results list.

      FAQ


    • How can I download all the bioactivity data for a particular target, for example IRAK4 (Interleukin-1 receptor-associated kinase 4)?

      Type 'IRAK4' into the ChEMBL search field and click on 'Targets'. Once in the results window, go to the top right hand corner and choose 'Display Bioactivities' from the drop-down box. In the bioactivity results window, to download the data as an XLS file, go to the top right hand corner drop down box and choose 'Download All Data (XLS)'


    • Can you search for all the compounds tested against HIV Protease 1?

      Type 'Human immunodeficiency virus type 1 protease' into the ChEMBL search field and click on 'Targets'. Click on the hyperlink for the ChEMBL ID of the target, in this case, ChEMBL243. This will take you to the Target Report Card. Once in the Target Report Card, scroll down to 'Target Associated Compound Properties' and click 'Display All Records' from the lower left hand grey box to access all of the compounds for this target. In the compound results window, to download all of the compounds as an SDFile, go to the top right hand corner drop-down box and choose 'Download 'SDF''.


    • How do I only search on the GSK PKIS (published kinase inhibitor set) data from ChEMBL?

      To only search using this data source, you must deselect all the other sources from the list. To do this, click on 'Activity Source Filter' and deselect all the choices other than the GSK PKIS data set from the list. Subsequent searches will then only use this single data source to bring back results.


ChEMBL Data Questions

  • General

    • How often do you update the data?

      The data is updated regularly, with releases approximately every 3-4 months.


    • Can you provide of list of all changes and additions you've made since the last release?

      We provide a list of all downgraded compounds with each release. However, it would be impossible to note all structural changes and/or changes to compound names etc. If there is something specific that you need to check, please contact us and we will be able to let you know what, if anything, has happened to that compound.


    • Are my queries stored and if so, are they routinely deleted?

      We do not store any queries that our users run. We do not analyse the nature of any queries or note structures that are processed through the website. In fact, we use https:// wherever possible to maintain the security of the searches. Additionally, the EBI has terms of use and these can be found at: http://www.ebi.ac.uk/Information/termsofuse.html. This will tell you what information is retained by the EBI, if any.


    • Why are there so many different types of Standard Units in the database?

      The published units are taken directly from the literature and we then attempt to standardise these to report as a standard type. This is an ongoing curation task, which was started with the most common units first.


    • What is the 'Confidence Score'?

      As part of the manual curation process applied to the data we assign a confidence score to the assay-to-target relationships represented in the database. The confidence score value reflects both the type of target assigned to a particular assay and the confidence that the target assigned is the correct target for that assay. The confidence scores range from 0, for as yet uncurated data entries, to 9, where a single protein target has been assigned a high degree of confidence. Assays assigned a non-molecular target type, for example a cell-line or an organism, receive a confidence score of 1, while assays with assigned protein targets receive a confidence score of at least 4.

      CONFIDENCE_SCORE	DESCRIPTION
      0			Default value - Target assignment has yet to be curated
      1			Target assigned is non-molecular
      3			Target assigned is molecular non-protein target
      4			Multiple homologous protein targets may be assigned
      5			Multiple direct protein targets may be assigned
      6			Homologous protein complex subunits assigned
      7			Direct protein complex subunits assigned
      8			Homologous single protein target assigned
      9			Direct single protein target assigned
      


    • Do you show which compounds and/or bioactivity data has been changed or deleted with each release of ChEMBL?

      As the data and the compounds are continually being curated, it is not possible to keep a track of these alterations or additions. However, if our users contact us personally, we are always happy to try and give them this type of information on individual data or compounds, wherever possible.


    • Is there a list of all the Activity Types in ChEMBL?

      No, we do not keep a list of activity types but we are happy to create such a list as and when a user would require it. This can also be found if the ChEMBL data is downloaded from the FTP site and installed into a database.


    • What literature coverage is there in ChEMBL?

      Data are routinely extracted from seven core journals:


      However, we also have data from selected articles in more than 200 journals including Antimicrob Agents Chemother, Med Chem Res and Drug Metab Dispos. ChEMBL currently contains data from more than 65,000 journal articles, and we also now include data from selected patents.


    • How are the SMILES and InChI created for ChEMBL?

      The canonical SMILES are calculated using Accelrys's Pipeline Pilot using an algorithm that belongs to Accelrys but it was derived from Daylight's algorithm (http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html). The standard InChI is calculated using the command line InChI generator and was developed by the InChI Trust (http://www.inchi-trust.org/inchi/) and is executed via the command line. The version of InChI used in ChEMBL is 1.05


    • How is the LogP calculated?

      The LogP, and other compound properties are calculated according to the following document: Property_Definitions. Please note, we are now using RDKit to calculate many of these properties.


    • How can I be informed of updates in ChEMBL?

      You can sign up to our ChEMBL Announce mailing list (http://listserver.ebi.ac.uk/mailman/listinfo/chembl-announce), where you will be kept up to date with all new releases and changes to the database.


    • Which sub-set of the PubChem Bioassay data has been integrated into ChEMBL ?

      In PubChem, depositors may assign multiple result types to an assay. However, if an assay is deposited as a 'confirmatory' assay (defined as an assay where a range of SID concentrations have been tested, with a view to determining a measurement of potency), then one of the result types must be marked up as an 'Active Concentration' (AC) result type. Panel assays may contain many 'AC' result types, one per panel member. The AC result type is the calculated potency measurement from the data, and is typically an IC50, EC50, AC50, GI50 or Ki. In addition, the PubChem deposition process requires that each SID in an assay must be assigned a single 'Activity Summary', from a controlled vocabulary which includes 'inactive', 'active' and 'inconclusive'. Only assays containing 'AC' result types have been integrated into ChEMBL, and from these assays, only activity data and SIDs associated with 'AC' result types have been integrated. The 'Activity Summary' field in PubChem associated with each integrated activity is also captured and shown in the 'Activity Comment' field in ChEMBL. Panel assays are divided into separate assays in ChEMBL, one ChEMBL assay for each panel member. A number of additional assays which do not match the above criteria, have also been included in the PubChem integration. These have been chosen individually, on the basis that they have been specifically requested to be included by ChEMBL users. These are: AID1, AID3, AID5, AID7, AID9, AID11, AID13, AID15, AID17, AID19, AID21, AID23, AID25, AID27, AID29, AID31, AID33, AID35, AID37, AID39, AID41, AID43, AID45, AID47, AID49, AID51, AID53, AID55, AID57, AID59, AID61, AID63, AID65, AID67, AID69, AID71, AID73, AID75, AID77, AID79, AID81, AID83, AID85, AID87, AID89, AID91, AID93, AID95, AID97, AID99, AID101, AID103, AID105, AID107, AID109, AID111, AID113, AID115, AID117, AID119, AID121, AID123, AID125, AID127, AID129, AID131, AID133, AID135, AID137, AID139, AID141, AID143, AID145, AID1851 and AID493040. A number of assays have been excluded since they contain data already present in ChEMBL. These assays include all ChEMBL-derived assays, and AID1433 (A data set that has been manually added to ChEMBL). An automatic 'standardization' of SID structures downloaded from PubChem is carried out prior to integration (using in house protocols). Standard inchis are generated from the standardized mol files, and used to normalize with existing ChEMBL structures. SIDs matching exactly on standard inchi to existing ChEMBL structures are assigned to the existing CHEMBLID (and the mol file already associated with the existing ChEMBL structure is used to represent the searchable structure for this CHEMBLID). Where no match to a standard inchi is achieved, the incoming SID is assigned to a new CHEMBLID, and the standardized mol file for the SID is used to represent the searchable structure. A very small number of SIDs (<0.1%) with standardized mol files that fail to produce valid standard inchis, or to load into a oracle symyx cartridge without errors, are each assigned a new CHEMBLID, and associated with a 'null' structure (ie: no mol file is associated with this new CHEMBLID). Mappings to ChEMBL targets for each integrated PubChem assay has been automated for the initial load. However, manual review of these mappings by expert curators may result in ongoing changes. Users who prefer to exclude the integrated PubChem data (or any other integrated external data set) from their ChEMBL web-interface searches can do so by clicking 'Activity Source Filter' next to the main ChEMBL search bar, and deselecting the sources not required in future searches. Note, however, that these deselections persist between browser sessions. Users querying ChEMBL database dumps directly using SQL, and wishing to achieve this same filtering, should inspect the 'source' table, and the foreign keys to this table in the 'assays' and 'compound_records' tables.


    • What is the difference between ChEMBLdb, Malaria Data, ChEMBL-NTD and Kinase SARfari?

      ChEMBLdb is a database of bioactive drug-like small molecules, it contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). We attempt to normalise the bioactivities into a uniform set of end-points and units where possible, and also to tag the links between a molecular target and a published assay with a set of varying confidence levels. The data is abstracted and curated from the primary scientific literature, and cover a significant fraction of the SAR and discovery of modern drugs.

      Malaria Data provides a service to access to hundreds of thousands of data points on malaria-related compounds, assays and targets, thus facilitating research for this neglected tropical disease. Inspired by the successful ChEMBL interface, a user may query the database using keywords, synonyms, chemical structures or protein sequences, review and filter the hits using tables or charts and then download the resulting subset.

      ChEMBL-NTD is a repository for Open Access primary screening and medicinal chemistry data directed at neglected diseases - endemic tropical diseases of the developing regions of the Africa, Asia, and the Americas. The primary purpose of ChEMBL-NTD is to provide a freely accessible and permanent archive and distribution centre for deposited data. ChEMBL-NTD is a subset of the data in the free medicinal chemistry and drug discovery database ChEMBLdb.

      Kinase SARfari is an integrated chemogenomics workbench focused on Kinases. The system incorporates and links Kinase sequence, structure, compounds and screening data. It is both biology and chemistry aware and provides a central resource for Protein Kinase knowledge.


    • What data sources are stored in ChEMBLdb?

      ChEMBL consists of data from a wide variety of data sources including scientific literature and patents, deposited data sets, PubChem BioAssay and BindingDB databases, toxicology data sets and drug/clinical candidate resources. The current list of sources is:

      1 Scientific Literature
      2 GSK Malaria Screening
      3 Novartis Malaria Screening
      4 St Jude Malaria Screening
      5 Sanger Institute Genomics of Drug Sensitivity in Cancer
      7 PubChem BioAssays
      8 Clinical Candidates
      9 Orange Book
      11 Open TG-GATEs
      12 Manually Added Drugs
      13 USP Dictionary of USAN and International Drug Names
      14 Drugs for Neglected Diseases Initiative (DNDi)
      15 DrugMatrix
      16 GSK Published Kinase Inhibitor Set
      17 MMV Malaria Box
      18 TP-search Transporter Database
      19 Harvard Malaria Screening
      20 WHO-TDR Malaria Screening
      21 Deposited Supplementary Bioactivity Data
      22 GSK Tuberculosis Screening
      23 Open Source Malaria Screening
      25 External Project Compounds
      26 Gene Expression Atlas Compounds
      27 AstraZeneca Deposited Data
      28 FDA Approval Packages
      29 GSK Kinetoplastid Screening
      30 K4DD Project
      31 Curated Drug Metabolism Pathways
      32 St Jude Leishmania Screening
      33 Gates Library compound collection
      34 MMV Pathogen Box
      35 HeCaToS Compounds
      36 Withdrawn Drugs
      37 BindingDB Database
      38 Patent Bioactivity Data
      39 Curated Drug Pharmacokinetic Data
      40 CO-ADD antimicrobial screening data
      41 WHO Anatomical Therapeutic Chemical Classification
      42 British National Formulary
      43 Published Kinase Inhibitor Set 2


    • Can you provide some details on how ChEMBL compound properties (e.g. LogP, MW, PSA,..) are calculated?

      Yes, please refer to the following document: Property_Definitions. Please note, we are now using RDKit to calculate many of these properties.


    • What is pChEMBL?

      In addition to the conversion of published activity types/values/units to standard activity types/values/units, we have now added an additional field called pChEMBL to the Activities table. This value allows a number of roughly comparable measures of half-maximal response concentration/potency/affinity to be compared on a negative logarithmic scale. For example, an IC50 measurement of 1nM would have a pChEMBL value of 9. pChEMBL is defined as: -Log(molar IC50, XC50, EC50, AC50, Ki, Kd or Potency).


    • What is the Data Validity column?

      The Data Validity column has been added to the interface and to the database (as column data_validity_comment) to flag activity values that are outside a range typical for that activity type, or even for potentially missing data. For example, an IC50 of 23000000nM would be flagged as Outside Typical Range. The use of the flag will allow users to decide whether or not to retain those values in their results set after running a search on the interface or using the downloaded database. The flags currently in use in this field are:

      • Potential missing data
      • Potential author error
      • Manually validated
      • Potential transcription error
      • Outside typical range
      • Non standard unit for type
      • Author confirmed error


    • Can you provide more details on the removal of Metal-Containing compounds?

      The molfiles and images of a proportion of metal-containing compounds we removed from the ChEMBL interface and downloads set in ChEMBL_17. This was partially due to some of these compounds having coordinated metal bonds. As InChI limitations are such that these coordinate bonds could not generate a Standard InChI, our main compound indicator of uniqueness in ChEMBL, it was decided to exclude the structures altogether.

      The compound image on the interface was replaced with an icon that shows it is a metal-containing compound and the molfiles were removed from the download set on the FTP site. We will retain the molecular formula in both the download files and on the ChEMBL interface, so that the elemental make up of the compound is visible. This change does not affect the storage or display of the associated bioactivity data for these compounds.


    • Can you provide more details on bioactivity data standardisation?

      In addition to the conversion of published activity types/values/units to standard activity types/values/units, described in previous releases, a number of further enhancements have been made to the data in the activities table:

      1. Conversion of log/-log values to nM concentrations. For example pIC50 and log Ki have been converted to IC50 and Ki values.
      2. Rounding of standard values to three significant figures (or 2 decimal places for values > 10)
      3. Flagging of data with possible errors (using the data_validity_comment field), such as unusual units for the activity type, or very large/small numbers.
      4. Identification of potential duplicate values - where an activity measurement is likely to be a repeat citation of an earlier measurement, rather than an independent measurement (flagged using the potential_duplicate column).

      An additional table (ACTIVITY_STDS_LOOKUP) has been included in this release, which contains details of the standard_types that have been standardised, their permitted standard_units, and acceptable value ranges.


    • Can you provide more details on how to detect potentially duplicated data?

      We detected and flagged duplicated activity entries and potential transcription errors in activity records that come from publications. The former are records with identical compound, target, activity, type and unit values that were most likely reported as citations of measurements from previous papers, even when these measurements were subsequently rounded. The latter cases consist of otherwise identical entries whose activity values differ by exactly 3 or 6 orders of magnitude indicating a likely error in the units (e.g. uM instead of nM).


ChEMBL Download Questions

General Questions

Schema Questions and SQL Examples

  • General

    • Can you show the relationship between all the tables in the ChEMBL database?

      A PNG of the schema relationships can be found on the FTP site in the latest release directory: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest


    • Where can I get more information about the new classifications of target types in ChEMBL_15?

      This can be found in the following slides (Current for ChEMBL_15): Target Types


    • Retrieve all the bioactivity data for bacterial targets:

      SELECT md.chembl_id AS compound_chembl_id,
      cs.canonical_smiles,
      act.standard_type,
      act.standard_value,
      act.standard_units,
      td.chembl_id AS target_chembl_id,
      td.organism,   td.pref_name
      FROM target_dictionary td
        JOIN assays a ON td.tid = a.tid
        JOIN activities act ON a.assay_id = act.assay_id
        JOIN molecule_dictionary md ON act.molregno = md.molregno
        JOIN compound_structures cs ON md.molregno   = cs.molregno
        JOIN organism_class oc ON td.tax_id = oc.tax_id
          AND oc.L1 = 'Bacteria';
      


    • Retrieve activity details for compound and all its salts which have an IC50 bioactivity value in nM against a target of interest:

      -- Compound is Sildenafil (CHEMBL192)
      -- Target is human PDE5 (CHEMBL1827)
      SELECT m.chembl_id AS compound_chembl_id,
      s.canonical_smiles,
      r.compound_key,
      NVL(TO_CHAR(d.pubmed_id),d.doi) AS pubmed_id_or_doi,
      a.description                   AS assay_description,
      act.standard_type,
      act.standard_relation,
      act.standard_value,
      act.standard_units,
      act.activity_comment,
      t.chembl_id                    AS target_chembl_id,
      t.pref_name                    AS target_name,
      t.organism                     AS target_organism
      FROM compound_structures s
        RIGHT JOIN molecule_dictionary m ON s.molregno = m.molregno
        JOIN compound_records r ON m.molregno = r.molregno
        JOIN docs d ON r.doc_id = d.doc_id
        JOIN activities act ON r.record_id = act.record_id
        JOIN assays a ON act.assay_id = a.assay_id
        JOIN target_dictionary t ON a.tid = t.tid
          AND t.chembl_id      = 'CHEMBL1827'
          AND m.chembl_id IN
               (SELECT DISTINCT
                  m1.chembl_id
                FROM molecule_dictionary m1
                  JOIN molecule_hierarchy mh ON mh.molregno = m1.molregno
                  JOIN molecule_dictionary m2 ON mh.parent_molregno = m2.molregno
                    AND m2.chembl_id = 'CHEMBL192')
          AND act.standard_type = 'IC50'
          AND act.standard_units = 'nM';
      


    • Retrieve compounds which are selective to one target over a second target:

      -- Compounds which are selective for Human CDK2 (CHEMBL301) over Human CDK5 (CHEMBL4036) 
      -- Selectivity is based on comparing binding affinities using IC50 values.  
      SELECT md.chembl_id,
      cs.canonical_smiles
      FROM target_dictionary td
        JOIN assays a ON td.tid = a.tid
        JOIN activities act ON a.assay_id = act.assay_id
        JOIN molecule_dictionary md ON md.molregno = act.molregno
        JOIN compound_structures cs ON md.molregno = cs.molregno
          AND act.standard_relation = '='
          AND act.standard_type     IN ('IC50')
          AND act.standard_units    = 'nM'
          AND act.standard_value    < 50
          AND td.chembl_id          = 'CHEMBL301'
      INTERSECT
      SELECT md.chembl_id,
      cs.canonical_smiles
      FROM target_dictionary td
        JOIN assays a ON td.tid = a.tid
        JOIN activities act ON a.assay_id = act.assay_id
        JOIN molecule_dictionary md ON md.molregno = act.molregno
        JOIN compound_structures cs ON md.molregno = cs.molregno
      AND act.standard_relation     = '='
      AND act.standard_type         IN ('IC50')
      AND act.standard_units        = 'nM'
      AND act.standard_value        > 200
      AND td.chembl_id              = 'CHEMBL4036';
      


    • Retrieve target ChEMBL_ID, target_name, target_type, protein accessions and sequences for all protein targets:

      SELECT t.chembl_id AS target_chembl_id,
      t.pref_name        AS target_name,
      t.target_type,
      c.accession        AS protein_accession,
      c.sequence         AS protein_sequence
      FROM target_dictionary t
        JOIN target_type tt ON t.target_type = tt.target_type
        JOIN target_components tc ON t.tid = tc.tid
        JOIN component_sequences c ON tc.component_id = c.component_id
      AND tt.parent_type  = 'PROTEIN';
      


    • Retrieve PK data from 'Curated Drug Pharmacokinetic Data' source in ChEMBL for drug:

      -- Data for levofloxacin
      SELECT DISTINCT
        d.title,
        min(decode(ap.standard_type, 'DATASET', nvl(to_char(ap.standard_value), ap.standard_text_value)))         dataset,
        a.assay_id,
        a.description,
        min(decode(actp.standard_type, 'DOSED_COMPOUND_NAME',
                   nvl(to_char(actp.standard_value), actp.standard_text_value) || ' ' ||
                   actp.standard_units))                                                                          dosed_compound_name,
        min(decode(actp.standard_type, 'DOSE',
                   nvl(to_char(actp.standard_value), actp.standard_text_value) || ' ' || actp.standard_units))    dose,
        min(decode(actp.standard_type, 'DOSAGE_FORM',
                   nvl(to_char(actp.standard_value), actp.standard_text_value) || ' ' || actp.standard_units))    dosage_form,
        min(decode(actp.standard_type, 'REGIMEN',
                   nvl(to_char(actp.standard_value), actp.standard_text_value) || ' ' || actp.standard_units))    regimen,
        min(decode(actp.standard_type, 'ROUTE', nvl(to_char(actp.standard_value), actp.standard_text_value)))     route,
        min(decode(actp.standard_type, 'GENDER', nvl(to_char(actp.standard_value), actp.standard_text_value)))    gender,
        min(decode(actp.standard_type, 'AGE_RANGE', nvl(to_char(actp.standard_value), actp.standard_text_value))) age_range,
        min(decode(actp.standard_type, 'HEALTH_STATUS', nvl(to_char(actp.standard_value),
                                                            actp.standard_text_value)))                           health_status,
        min(decode(actp.standard_type, 'TISSUE', nvl(to_char(actp.standard_value),
                                                            actp.standard_text_value)))                           tissue,
        cr.molregno,
        cr.compound_name,
        act.activity_id,
        act.toid,
        act.standard_type,
        act.standard_relation,
        act.standard_value,
        act.standard_units,
        act.activity_comment
      FROM source s
        JOIN compound_records cr ON s.src_id = cr.src_id
        JOIN docs d ON d.doc_id = cr.doc_id
        JOIN activities act ON cr.record_id = act.record_id AND cr.doc_id = act.doc_id
        JOIN activity_properties actp ON act.activity_id = actp.activity_id
        JOIN assays a ON act.assay_id = a.assay_id
        JOIN assay_parameters ap ON a.assay_id = ap.assay_id
                                    AND s.src_description = 'Curated Drug Pharmacokinetic Data'
                                    AND cr.compound_name LIKE 'LEVOFLOXACIN%'
      GROUP BY d.title, a.assay_id, a.description, cr.molregno, cr.compound_name, act.activity_id, act.toid,
        act.standard_type, act.standard_relation, act.standard_value, act.standard_units, act.activity_comment
      ORDER BY cr.compound_name, act.toid, act.standard_type;
      


    • Retrieve compound activity details for all targets containing a protein of interest:

      -- Protein of interest is human M2 muscarinic receptor (P08172) 
      SELECT DISTINCT
        m.chembl_id                      AS compound_chembl_id,
        s.canonical_smiles,
        r.compound_key,
        NVL(TO_CHAR(d.pubmed_id), d.doi) AS pubmed_id_or_doi,
        a.description                    AS assay_description,
        act.standard_type,
        act.standard_relation,
        act.standard_value,
        act.standard_units,
        act.activity_comment,
        t.chembl_id                      AS target_chembl_id,
        t.pref_name                      AS target_name,
        t.target_type
      FROM compound_structures s
        RIGHT JOIN molecule_dictionary m ON s.molregno = m.molregno
        JOIN compound_records r ON m.molregno = r.molregno
        JOIN docs d ON r.doc_id = d.doc_id
        JOIN activities act ON r.record_id = act.record_id
        JOIN assays a ON act.assay_id = a.assay_id
        JOIN target_dictionary t ON a.tid = t.tid
        JOIN target_components tc ON t.tid = tc.tid
        JOIN component_sequences cs ON tc.component_id = cs.component_id
          AND cs.accession = 'P08172';
      


    • Retrieve compound activity details for a target:

      -- Target is Human PDE5 (CHEMBL1827) 
      SELECT m.chembl_id AS compound_chembl_id,   
      s.canonical_smiles,   
      r.compound_key,   
      NVL(TO_CHAR(d.pubmed_id),d.doi) AS pubmed_id_or_doi,   
      a.description                   AS assay_description,   act.standard_type,   
      act.standard_relation,   
      act.standard_value,   
      act.standard_units,   
      act.activity_comment 
      FROM compound_structures s,   
      molecule_dictionary m,   
      compound_records r,   
      docs d,   
      activities act,   
      assays a,   
      target_dictionary t 
      WHERE s.molregno (+) = m.molregno 
      AND m.molregno       = r.molregno 
      AND r.record_id      = act.record_id 
      AND r.doc_id         = d.doc_id 
      AND act.assay_id     = a.assay_id 
      AND a.tid            = t.tid 
      AND t.chembl_id      = 'CHEMBL1827';
      


    • Can you show me how to use SQL to only extract the PubChem data from ChEMBL?

      -- The source id for PubChem data is found in the SOURCE table and is ‘7’.
      -- Please note that this will bring back over 4,000,000 data points
      SELECT DISTINCT
        md.molregno,
        cs.canonical_smiles,
        md.chembl_id,
        act.standard_type,
        act.standard_value,
        act.standard_units
      FROM activities act
        JOIN molecule_dictionary md ON act.molregno = md.molregno
        JOIN compound_structures cs ON md.molregno = cs.molregno
        JOIN compound_records cr ON cr.molregno = act.molregno
        JOIN source src ON src.src_id = cr.src_id
          AND src.src_id = '7';
      


spacer
spacer