Europe PMC and Open Targets Platform

Europe PMC uses machine learning, specifically text mining, to provide articles with annotations:

“Europe PMC is an open science platform that enables access to a worldwide collection of life science publications and preprints from trusted sources around the globe.” One of the tools provided by Europe PMC is the Annotations API. “Annotations are biological terms or relations, such as diseases, chemicals or protein interactions, which can be highlighted for readers on abstracts and full text articles. These terms are identified by text mining algorithms, developed by a variety of text mining groups.” (from the Europe PMC website: https://europepmc.org/Annotations).
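As a rough illustration of how annotations might be requested and read, the sketch below builds a query URL and pulls the annotated terms out of a response. The base URL, the parameter names (articleIds, type, format) and the response fields (annotations, exact, type) are assumptions based on the public Annotations API documentation and should be checked against the current version. The article ID and the sample response are invented placeholders, and no network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Europe PMC Annotations API; verify before use.
BASE_URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


def build_annotations_url(article_ids, annotation_type=None):
    """Build a query URL for a list of article IDs in 'SOURCE:ID' form."""
    params = {"articleIds": ",".join(article_ids), "format": "JSON"}
    if annotation_type:
        params["type"] = annotation_type  # assumed parameter name
    return BASE_URL + "?" + urlencode(params)


def extract_terms(response):
    """Collect (annotation type, matched text) pairs from a parsed response."""
    terms = []
    for article in response:
        for annotation in article.get("annotations", []):
            terms.append((annotation.get("type"), annotation.get("exact")))
    return terms


# Hand-written sample shaped like the assumed response; not real API output.
sample_response = [
    {
        "annotations": [
            {"type": "Gene_Proteins", "exact": "ACE2"},
            {"type": "Diseases", "exact": "COVID-19"},
        ]
    }
]

# "MED:00000000" is a placeholder ID, not a real article.
url = build_annotations_url(["MED:00000000"], "Gene_Proteins")
terms = extract_terms(sample_response)
```

To run this against the live service, the URL could be fetched with any HTTP client and the JSON body passed to extract_terms.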

The scientific literature is a unique source of information for identifying and prioritising targets; the whole corpus represents the accumulated scientific knowledge that drives therapeutic innovation. Open Targets uses the Europe PMC dataset in two ways:

To extract evidence for target-disease associations. A pipeline developed by Europe PMC and Open Targets identifies target-disease co-occurrences in the literature and assesses the confidence of each relationship. The pipeline uses deep learning-based Named Entity Recognition (NER) to identify targets (usually genes or proteins) and diseases mentioned in publications or preprints.

All co-occurrences of both types of entities in the same sentence are considered evidence (Figure 30).


Figure 30 Text mining evidence for the ACE2-COVID-19 association in the Open Targets Platform (release 23.12).
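The sentence-level co-occurrence rule can be sketched in a few lines. This is a simplified illustration, not the actual pipeline: the real system uses deep learning-based NER, whereas this sketch swaps in a plain dictionary lookup. The term lists, the naive sentence splitter and the example text are all invented for the illustration.

```python
import re
from itertools import product

# Hypothetical toy vocabularies standing in for the NER model's output.
TARGETS = {"ACE2", "TMPRSS2"}
DISEASES = {"COVID-19", "hypertension"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(sentence, vocabulary):
    """Return the vocabulary terms that occur in the sentence as whole tokens."""
    found = set()
    for term in vocabulary:
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        if re.search(pattern, sentence):
            found.add(term)
    return found


def cooccurrences(text):
    """List (target, disease, sentence) for every pair sharing a sentence."""
    results = []
    for sentence in split_sentences(text):
        targets = find_mentions(sentence, TARGETS)
        diseases = find_mentions(sentence, DISEASES)
        for target, disease in product(sorted(targets), sorted(diseases)):
            results.append((target, disease, sentence))
    return results


abstract = (
    "ACE2 is the entry receptor for the virus that causes COVID-19. "
    "TMPRSS2 primes the spike protein. "
    "Patients with hypertension were also studied."
)
pairs = cooccurrences(abstract)
```

Only the first sentence mentions both a target and a disease, so it is the only one that yields a pair; the other two sentences each mention a single entity type and contribute no evidence.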

To provide context to entities and identify similar entities in the literature. In the Open Targets Platform, users can browse the available literature for the entity (target, disease, drug) of their choice. An additional functionality developed using a Word2Vec ML model identifies similar entities in the literature, suggesting connections that our users may not be aware of (Figure 31). Two entities are considered similar when they tend to appear in the same contexts, that is, surrounded by the same entities in specific sections of publications across the whole corpus of scientific literature.

For a deep-dive into how this pipeline was developed, read the post on the Open Targets blog.
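The idea that entities are similar when they share the same surrounding entities can be illustrated without a trained model. The sketch below is not Word2Vec: it replaces learned embeddings with raw co-occurrence counts and compares them with cosine similarity, purely to show the underlying intuition. The toy corpus and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each list holds the entities tagged in one
# publication section. The real pipeline works on Platform entity IDs.
sections = [
    ["ACE2", "COVID-19", "TMPRSS2"],
    ["ACE2", "COVID-19", "remdesivir"],
    ["TMPRSS2", "COVID-19", "remdesivir"],
    ["BRCA1", "breast cancer", "olaparib"],
    ["BRCA2", "breast cancer", "olaparib"],
]


def context_vectors(sections):
    """Map each entity to counts of the entities it co-occurs with."""
    vectors = {}
    for entities in sections:
        for entity in entities:
            counts = vectors.setdefault(entity, Counter())
            for other in entities:
                if other != entity:
                    counts[other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(entity, vectors, top_n=3):
    """Rank the other entities by similarity of their contexts."""
    query = vectors.get(entity, Counter())
    scores = [
        (other, cosine(query, vector))
        for other, vector in vectors.items()
        if other != entity
    ]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]


vectors = context_vectors(sections)
ranked = most_similar("BRCA1", vectors)
```

In this toy corpus BRCA1 and BRCA2 never appear together, yet BRCA2 ranks first for BRCA1 because both occur alongside the same disease and drug. A skip-gram model captures the same kind of signal in dense learned vectors rather than raw counts.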

Figure 31 The Open Targets Literature pipeline uses a dictionary-based approach to ground the tagged text from Europe PMC’s NER dataset to the Platform entities: targets, diseases, and drugs (Entity normalisation). Matches in Europe PMC’s dataset are tagged to the Platform’s entity IDs and disambiguated by applying a series of standard NLP tools. The matched entities provide literature evidence for target-disease associations (Association evidence). They are also used to train a Word2Vec skip-gram model, which yields a set of vectors, each a normalised word embedding for an entity in the target, disease, or drug collections (Word2Vec ML model). This allows the user to query for entities similar to the one selected. As the user selects additional entities, the widget narrows the subset of publications to the intersection of all publications that mention the selected terms (Universe of entities).
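The narrowing behaviour described in the caption amounts to a set intersection. The following minimal sketch assumes a hypothetical in-memory index from entity to publication identifiers; the identifiers are invented placeholders and the function name is not taken from the Platform.

```python
# Hypothetical index: entity -> IDs of the publications that mention it.
publications_by_entity = {
    "ACE2": {"pub-1", "pub-2", "pub-3"},
    "COVID-19": {"pub-2", "pub-3", "pub-4"},
    "remdesivir": {"pub-3", "pub-5"},
}


def publications_mentioning_all(selected, index):
    """Return the publications that mention every selected entity."""
    if not selected:
        return set()
    # An entity missing from the index matches no publications.
    id_sets = [index.get(entity, set()) for entity in selected]
    return set.intersection(*id_sets)


one = publications_mentioning_all(["ACE2"], publications_by_entity)
two = publications_mentioning_all(["ACE2", "COVID-19"], publications_by_entity)
three = publications_mentioning_all(
    ["ACE2", "COVID-19", "remdesivir"], publications_by_entity
)
```

Each additional selection can only keep or shrink the result, which matches the way the widget narrows the list as more entities are chosen.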

Find out more in the webinar run jointly by Europe PMC and Open Targets.