Tutorial 1: Working with next-generation sequencing data - A short primer on QC, alignment, and variation analysis of next-generation sequencing data

Presenters

Thomas Keane

Thomas Keane completed his PhD in distributed computing and high-throughput phylogenomics at NUI Maynooth (Ireland) in 2006. He subsequently moved to the Pathogen Genomics group at the Wellcome Trust Sanger Institute to work on sequence assembly of several pathogen genomes, including Plasmodium falciparum strains and Trypanosoma brucei. In 2008, he co-founded the Vertebrate Resequencing Informatics group, where he manages the sequencing, informatics, and variation pipelines for large projects such as the 1000 Genomes Project (http://www.1000genomes.org) and the Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes).

Jan Aerts

Jan Aerts received his PhD from Wageningen University (Netherlands) in 2005 on the subject of chicken genome mapping and sequencing. After a postdoctoral position at the Roslin Institute near Edinburgh (Scotland), where he worked on the cow genome assembly, he now works at the Wellcome Trust Sanger Institute near Cambridge (UK). His current work includes downstream analysis of next-generation sequencing data to identify putative SNPs and indels. He will start an assistant professorship at the University of Leuven in October.

Motivation

In recent years, the arrival of next-generation sequencing technologies has revolutionised DNA sequencing. These technologies have greatly reduced the cost of sequencing, which means that many new researchers now have access to raw sequencing data. The type and volume of the data produced by next-generation sequencing machines present many previously unseen informatics challenges.

Goals

This tutorial/workshop will help people who are just getting started with next-generation sequencing get an idea of the tools, workflows, and procedures that they may need to set up to handle this data. In this short course, we will introduce the participants to the different next-generation sequencing technologies and show how to do some basic quality checking of the data, run the various next-generation alignment tools, create de novo sequence assemblies, and call variants (such as SNPs, short indels, and structural variants) relative to a reference sequence.

Prerequisites

  • An interest in genome sequencing
  • Basic UNIX skills

Tutorial outline

Slot 1 - Thomas

Overview of next-generation sequencing technologies
Applications of next-generation sequencing
Quality control measures and metrics of libraries/lanes
Data storage (file formats) and metadata

Slot 2 - Thomas

Introduction to short read alignment algorithms and tools
Practical instructions and examples on use of short read aligners
Parallelising short read alignment
Introduction to sequence assembly methods and tools
Practical instructions on use of short read assembly tools to get the optimal sequence assembly

Slot 3 - Jan

Overview of variation calling from next-generation sequence data
SNP calling theory and tools
Short indel calling theory and tools
Practical examples of variation calling
File formats for variation calling and storage

Slot 4 - Jan

Introduction to structural variation
Summary of different types of structural variants (large insertions, deletions, inversions, translocations, copy number variants)
Overview of algorithms and tools for calling structural variants
Practical examples of calling structural variants
Visualisation of structural variants

Practical issues

All slides and example data will be made available online.