ChEMBL data overview


What data does ChEMBL contain?

The ChEMBL data is a combination of data extracted from the literature and data sets deposited by sources such as GSK and PubChem. The curated data is made up of targets, organisms, compounds and their associated bioactivities (Figure 3).

Figure 3 ChEMBL resources.

Where does the data come from?

Most of the data is extracted from the literature, drawn from a selection of 47 journals. The most frequently used journals include the Journal of Medicinal Chemistry, Bioorganic and Medicinal Chemistry, and Bioorganic and Medicinal Chemistry Letters (Figure 4).

Figure 4 Data extraction from the literature.

Extracted target types

All target types that are reported in the literature are stored in ChEMBL (Figure 5).

Figure 5 Target types in ChEMBL.

PubChem Data

A subset of PubChem assays (confirmatory and panel assays with dose-response endpoints) has been loaded into ChEMBL. Assays from PubChem are clearly marked, both on the ChEMBL interface and in the database. This allows you to easily determine where the data originated, while still retrieving it through a single point of access. Loading this subset added over 600,000 compounds and 7,000,000 bioactivities to the database (Figure 6).

Figure 6 PubChem data.
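
Because each assay is marked with its origin, records can be separated by source once they have been retrieved. The following is a minimal sketch of that idea only: the records, the field names (assay_id, source) and the source labels are hypothetical examples made up for illustration, not the actual ChEMBL schema or API.

```python
from collections import Counter

# Hypothetical records: the field names and values are illustrative only
# and do not reflect the real ChEMBL schema.
assays = [
    {"assay_id": "A1", "source": "Scientific Literature"},
    {"assay_id": "A2", "source": "PubChem BioAssays"},
    {"assay_id": "A3", "source": "PubChem BioAssays"},
    {"assay_id": "A4", "source": "Scientific Literature"},
]


def assays_from(records, source):
    """Return the records whose source label equals `source`."""
    return [r for r in records if r["source"] == source]


def count_by_source(records):
    """Count how many records carry each source label."""
    return Counter(r["source"] for r in records)


pubchem = assays_from(assays, "PubChem BioAssays")
print([r["assay_id"] for r in pubchem])
print(dict(count_by_source(assays)))
```

The same pattern would apply to any other deposited data set, provided each record carries a source label.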

Other Datasets

As well as the PubChem data, we have also received depositions from companies and consortia, allowing us to expand the database. One such deposition is the Neglected Tropical Disease (NTD) dataset, which was donated by companies such as GSK and Novartis (Figure 7).

Figure 7 Neglected Tropical Disease (NTD) depositions.