Nassar2022 - Metagenomics Classification Task for Scientific Literature Text

Short description
This entry is a use case showing that any automatic metagenomic classification model for documents can be converted to the ONNX (Open Neural Network Exchange) format. It also includes a Dockerfile that can be used to build a Docker image. The conversion supports interoperability and open access. ONNX tooling covers the following essential tasks: model conversion, inference, inspection, and optimization.
This model is built upon the model described in the following publication: Maaly Nassar, Alexander B Rogers, Francesco Talo', Santiago Sanchez, Zunaira Shafique, Robert D Finn, Johanna McEntyre, A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications, GigaScience, Volume 11, 2022, giac077.
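As an illustration of the inspection task mentioned in the description, the following is a minimal sketch that lists the inputs and outputs of an ONNX model through ONNX Runtime. It assumes the `onnxruntime` package is installed and that `doc2vec-model.onnx` has been downloaded to the working directory; the helper names `summarize_io` and `load_session` are hypothetical, and the actual input names and shapes of this model are not stated in this entry.

```python
def summarize_io(session):
    """Return (name, shape, type) tuples for a session's inputs and outputs.

    `session` is expected to behave like onnxruntime.InferenceSession: it must
    provide get_inputs() and get_outputs(), each returning objects that carry
    .name, .shape and .type attributes.
    """
    def describe(nodes):
        return [(node.name, node.shape, node.type) for node in nodes]

    return {
        "inputs": describe(session.get_inputs()),
        "outputs": describe(session.get_outputs()),
    }


def load_session(path="doc2vec-model.onnx"):
    """Open the ONNX file on CPU (the local path is an assumption)."""
    # Imported lazily so summarize_io can be used without onnxruntime installed.
    import onnxruntime as ort

    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])
```

With the file in place, `summarize_io(load_session())` would report what the model expects as input and what it returns, which is a sensible first check before running inference.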
Related Publication
  • A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications.
  • Nassar M, Rogers AB, Talo' F, Sanchez S, Shafique Z, Finn RD, McEntyre J
  • GigaScience, 8/2022, Volume 11, pages: giac077, DOI: 10.1093/gigascience/giac077
  • European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
  • Metagenomics is a culture-independent method for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally/taxonomically), either from a longitudinal study or cross-sectional studies, can provide clues into how the microbiota has adapted to the environment. However, a recurring challenge, especially when comparing results between independent studies, is that key metadata about the sample and molecular methods used to extract and sequence the genetic material are often missing from sequence records, making it difficult to account for confounding factors. Nevertheless, these missing metadata may be found in the narrative of publications describing the research. Here, we describe a machine learning framework that automatically extracts essential metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework has enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in European Nucleotide Archive (ENA) and MGnify. Using this framework, a new metagenomics annotations pipeline was developed and integrated into Europe PMC to regularly enrich up-to-date ENA and MGnify metagenomics studies with metadata extracted from research articles. These metadata are now available for researchers to explore and retrieve in the MGnify and Europe PMC websites, as well as Europe PMC annotations API.
Submitter of the first revision: Sucheta Ghosh
Submitter of this revision: Sucheta Ghosh
Modellers: Sucheta Ghosh

Metadata information

is (1 statement)
BioModels Database MODEL2304210001

isDescribedBy (1 statement)

Model files
  • Name: doc2vec-model.onnx
  • Description: This is the best performing model in ONNX format
  • Size: 845.42 KB

Additional files
  • Description: The compressed zip file consists of code and files required to reproduce the result
  • Size: 286.40 KB

  • Model originally submitted by : Sucheta Ghosh
  • Submitted: May 25, 2023 3:40:45 PM
  • Last Modified: May 25, 2023 3:40:45 PM
  • Version: 2 (public model)
    • Submitted on: May 25, 2023 3:40:45 PM
    • Submitted by: Sucheta Ghosh
    • With comment: minor revision
Curator's comment:
(added: 25 May 2023, 15:39:36, updated: 25 May 2023, 15:39:36)
Classifying documents to triage publications across microbiome environments is essential. Nassar et al. (2022) addressed this need by constructing supervised training data sets in which GOLD annotations were assigned to metagenomics studies; these data sets were then mapped to the corresponding MGnify cross-referenced publications. Random forest models were trained on the hierarchical levels of the GOLD ontology, yielding a set of biome prediction models. Nassar et al. (2022) used ten documents as a test set and set a prediction-probability threshold of 0.40 for the top GOLD hierarchical levels (Engineered, Environmental, and Host-associated). We compared the performance of the original model with that of the ONNX model: the two give equivalent results to two decimal places.
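The check described in this comment can be sketched as follows. This is an illustrative outline only, not the curator's actual script: the probability values are made up, the function names are hypothetical, and treating a probability equal to the threshold as a positive prediction is an assumption.

```python
# Top-level GOLD biomes and the 0.40 threshold named in the curator's comment.
GOLD_TOP_LEVELS = ("Engineered", "Environmental", "Host-associated")
THRESHOLD = 0.40


def predicted_biomes(probabilities, threshold=THRESHOLD):
    """Return the top-level biomes whose probability is at or above the threshold."""
    return [b for b in GOLD_TOP_LEVELS if probabilities.get(b, 0.0) >= threshold]


def agree_to_two_decimals(original, converted):
    """True if both models give the same probability, rounded to 2 decimals, for every biome."""
    return all(
        round(original[b], 2) == round(converted[b], 2) for b in GOLD_TOP_LEVELS
    )


# Made-up probabilities for a single document (not taken from the model files).
original_probs = {"Engineered": 0.051234, "Environmental": 0.218700, "Host-associated": 0.730066}
onnx_probs = {"Engineered": 0.0512341, "Environmental": 0.2187003, "Host-associated": 0.7300657}

same_at_two_decimals = agree_to_two_decimals(original_probs, onnx_probs)
labels_original = predicted_biomes(original_probs)
labels_onnx = predicted_biomes(onnx_probs)
```

Comparing rounded values mirrors the two-decimal criterion stated above. An absolute tolerance (for example 0.005) would be an alternative that avoids spurious mismatches when a value lies close to a rounding boundary.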