Recorded webinar

Developing a dataset for LLM projects

As large language models (LLMs) continue to revolutionise artificial intelligence applications, high-quality data preparation has never been more important. This webinar dives into the art and science of preparing datasets for effective LLM training, offering actionable insights for AI practitioners, data scientists, and engineers.

We will explore the end-to-end process of data preparation, beginning with data collection strategies and progressing through cleaning, preprocessing, tokenisation, and annotation. Emphasis will be placed on identifying and mitigating biases, managing multilingual datasets, and ensuring data quality and diversity to enhance model performance. Real-world case studies will illustrate common pitfalls and solutions, while hands-on demonstrations will provide practical techniques for optimising datasets.

Participants will gain a deeper understanding of how well-structured and curated data can significantly impact an LLM’s capabilities, reduce training costs, and improve ethical AI outcomes. Whether you are building LLMs from scratch or fine-tuning existing models, this session will equip you with the knowledge to leverage your data assets effectively.

Join us to unlock the potential of data preparation and enable your LLMs to achieve unparalleled performance and generalisation.

Who is this webinar for?

This webinar is designed for bioinformaticians, computational biologists, data scientists, and researchers interested in applying AI and language models to biological problems.

This event is part of a webinar series exploring the revolutionary potential of large language models (LLMs) in bioinformatics and computational biology. For details of all topics covered in the series and registration information, see the series page: Large Language Models and their applications in Bioinformatics

Outcomes

By the end of the webinar you will be able to:

Describe the complete data preparation pipeline for LLM training, including practical techniques for collection, cleaning, preprocessing, tokenisation, and annotation of datasets
Explore strategies to identify and address data biases, handle multilingual content, and maintain data quality standards that directly impact model performance and ethical considerations
Learn from real-world examples and common challenges in dataset preparation, with demonstrated solutions and best practices for optimising training efficiency and reducing costs
Utilise practical data curation techniques that can be applied to both building new LLMs and fine-tuning existing models

DOI: 10.6019/TOL.DatasetLLM-w.2025.00001.1

Duration: 00:58:54

5 March 2025

Online

Free

Contact
Ajay Mishra

Organisers

Andrew Green
EMBL-EBI
Ajay Mishra
EMBL-EBI

Speakers

Melanie Vollmar
EMBL-EBI

All materials are free cultural works licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, except where further licensing details are provided.
