Recorded webinar
Developing a dataset for LLM projects
As large language models (LLMs) continue to revolutionise artificial intelligence applications, the importance of high-quality data preparation has never been more critical. This webinar dives into the art and science of preparing datasets for effective LLM training, offering actionable insights for AI practitioners, data scientists, and engineers.
We will explore the end-to-end process of data preparation, beginning with data collection strategies and progressing through cleaning, preprocessing, tokenisation, and annotation. Emphasis will be placed on identifying and mitigating biases, managing multilingual datasets, and ensuring data quality and diversity to enhance model performance. Real-world case studies will illustrate common pitfalls and solutions, while hands-on demonstrations will provide practical techniques for optimising datasets.
Participants will gain a deeper understanding of how well-structured and curated data can significantly impact an LLM’s capabilities, reduce training costs, and improve ethical AI outcomes. Whether you are building LLMs from scratch or fine-tuning existing models, this session will equip you with the knowledge to leverage your data assets effectively.
Join us to unlock the potential of data preparation and enable your LLMs to achieve unparalleled performance and generalisation.
Who is this webinar for?
This webinar is designed for bioinformaticians, computational biologists, data scientists, and researchers interested in applying AI and language models to biological problems.
This event is part of a webinar series exploring the potential of LLMs in bioinformatics and computational biology. For details on all topics covered in the series and registration information, see the series page: Large Language Models and their applications in Bioinformatics
Outcomes
By the end of the webinar you will be able to:
- Describe the complete data preparation pipeline for LLM training, including practical techniques for collection, cleaning, preprocessing, tokenisation, and annotation of datasets
- Apply strategies to identify and address data biases, handle multilingual content, and maintain the data quality standards that directly affect model performance and ethical considerations
- Recognise common challenges in dataset preparation from real-world examples, and apply the demonstrated solutions and best practices for optimising training efficiency and reducing costs
- Utilise practical data curation techniques that can be applied to both building new LLMs and fine-tuning existing models
DOI: 10.6019/TOL.DatasetLLM-w.2025.00001.1
This webinar took place on 5th March 2025. Please click the 'Watch video' button to view the recording.