Prevalence, causes and significance of short-range template-switch mutations in humans and model organisms

EBPOD 2017: Project 6

This is one of 11 joint postdoctoral fellowships offered by EMBL-EBI, the NIHR Cambridge Biomedical Research Centre and the University of Cambridge’s School of the Biological Sciences in 2017.

Principal Investigators

This project builds on work recently published by Nick Goldman (co-author Ari Löytynoja, University of Helsinki; in press at Genome Research), which found that human populations (and presumably those of other eukaryotes) undergo mutations in which short genome regions (typically up to 50 bp) are replaced during replication by similar-length fragments copied from a nearby location on the complementary strand. Analyzed bioinformatically, the process seems similar to that described for large-scale genome rearrangements, where thousands to millions of bp may be inserted from distant genome locations. However, this local switching has been overlooked to date, and there has been no investigation of the underlying molecular mechanisms.

The local template-switching process can explain many apparent clusters of mutations as each being the result of a single template switch-and-return event, rather than an implausible coincidence of multiple substitutions, insertions and deletions. Löytynoja and Goldman detected thousands of such events since the divergence of humans and chimpanzees, and hundreds in the comparison of two independently assembled human genomes. They also showed that current (human) population resequencing pipelines based on mapping short read data to reference sequences make systematic errors around regions close to these events, due to the failure of mapping algorithms to place reads with multiple differences from the reference (Fig. 1). Mutations are frequently overlooked or, if detected, miscalled as numerous close SNPs and indels. This could have consequences for our understanding of local mutation rates and for the detection of selection.
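
To illustrate why a single switch-and-return event can look like a cluster of independent mutations, the following is a minimal sketch in Python. It rests on simplifying assumptions that are not taken from the published method: the replaced region and the copied region have exactly the same length, and a fragment copied from the complementary strand is modelled as the reverse complement of the source window on the reference strand. The function names (`reverse_complement`, `template_switch`, `count_mismatches`) and the toy sequence are hypothetical and chosen only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.translate(complement)[::-1]


def template_switch(seq: str, target_start: int, target_end: int,
                    source_start: int) -> str:
    """Simulate one simplified template-switch event.

    The region seq[target_start:target_end] is replaced by the reverse
    complement of an equal-length window beginning at source_start.
    Equal lengths are an assumption of this sketch; real events may
    also change the length of the region.
    """
    length = target_end - target_start
    source = seq[source_start:source_start + length]
    if length <= 0 or len(source) != length:
        raise ValueError("target and source windows must fit inside seq")
    return seq[:target_start] + reverse_complement(source) + seq[target_end:]


def count_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Toy example: one event rewrites positions 5-9 of the reference.
reference = "AAAAACCCCCGGGGGTTTTT"
mutant = template_switch(reference, target_start=5, target_end=10,
                         source_start=0)
print(mutant)                                # AAAAATTTTTGGGGGTTTTT
print(count_mismatches(reference, mutant))   # 5
```

In this toy case, a position-by-position comparison reports five adjacent substitutions, although only one event occurred. A caller that treats each difference independently would therefore overcount mutations in the affected region.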

More generally, our understanding of the diversity and impact of genomic mutations, their variation in different contexts, and the mechanisms involved remains limited. Germline de novo mutations in children are an important cause of rare genetic disease, and somatic mutations are a defining feature of ageing and cancer cells. Studies from yeast to humans have suggested links between genome duplication errors and many different types of genetic instability at particular chromosomal loci. Even if they are rare, multi-nucleotide mutation events such as these may have a disproportionate functional impact. Understanding their causes will therefore be important to realize the diagnostic and prognostic potential of patient genome sequencing.

Building on Löytynoja and Goldman’s study, we propose a project that has multiple aims and can be adapted to suit a PhD student or a postdoc, depending on the bioinformatics and laboratory skills and experience of applicants. Having established that short-range template-switching occurs, we want to characterise the process better, gain more accurate estimates of its prevalence and causes, and investigate the consequences for understanding population variation, evolution and disease. Data are now available for humans and other species (particularly germline data from family trios and matched tumour/normal cancer data) which enable us to address these questions. (Moritz Gerstung from EMBL-EBI has considerable expertise in computational cancer biology and has agreed to collaborate with us on cancer analyses.)

A goal for the initial stage of the project will be to use published datasets (for example, human de novo data from the Genomes of the Netherlands project) to better quantify the true prevalence of template-switch mutations, which may be underestimated in variation datasets. This will involve data-processing challenges, not least the need for accurate de novo assemblies (or at least local assemblies, or the development of a hybrid strategy based on the potential existence of difficult-to-map reads covering template-switch events). The project will therefore offer useful training opportunities in computational and bioinformatic analysis, starting with relatively straightforward application of existing tools and potentially progressing to more advanced methods, and will thus be appropriate for strong applicants with an experimental background. There will also be an opportunity to explore the population genetic and evolutionary implications of these processes.

Further work will then be possible on the forthcoming 100k Genomes Project, for which data are expected to be available by late 2017. We will also investigate the same issues in model organisms such as budding yeast and C. elegans, where there should be enough data to detect numerous potential events; this (perhaps in combination with the human data) will potentially allow identification of promoting or inhibiting genomic factors. The emphasis on these models will mean an easy transfer to experimental work in the Zegerman lab, with a view to testing or screening in vivo. This may include targeted mutagenesis assays, followed by genome sequencing. Of particular relevance for this proposal, a researcher inexperienced in laboratory work will have access to the simple budding yeast model, while more sophisticated work on other organisms is also possible there. The project can therefore be adapted to suit postgraduate or postdoctoral scientists, with or without advanced laboratory training.

Figure 1. Complex mutation partially called in 1000 Genomes (1kG) data. a. A mutation pattern between two humans appears as 20 SNPs and an indel in a simple pairwise alignment, but is perfectly explained by a single template-switch event. b. Sequence reads indicate that HuRef is actually a heterozygote; NA12872 and NA12873 are homozygous for the variant and heterozygous, respectively, contrary to the interpretations made in the 1kG study. c. Failure to map reads with significant overlap with the affected region leads to reduced coverage for heterozygotes and variant homozygotes across the 1kG study. d. As a consequence, terminal differences are called more reliably and central ones are recognized at lower frequency; in fact, they should appear at equal frequency within each subpopulation.

Supervisory team

Aylwyn Scally leads a group focused on human evolutionary genetics at the Department of Genetics in Cambridge, with particular interest in the origins of germline mutation and its variation within and between species. He also co-leads a Germline Mutation subproject of the 100k Genomes Project, which will provide access (from late 2017) to whole-genome sequence and parental age data for 20,000 family trios. This will be an ideal dataset for investigating diverse mutation mechanisms in general, and for the student or postdoc on this project to study template-switching errors in particular.

Nick Goldman leads a research group at EMBL-European Bioinformatics Institute primarily focussed on data analysis methods for evolutionary studies. This includes study of the mathematical and statistical fundamentals of phylogenetic inference, probabilistic modelling of genome evolution, incorporation of evolutionary models into algorithms and statistical analyses, and applications in comparative genomics.

Philip Zegerman’s research group at the Gurdon Institute, Cambridge, uses multiple model systems including budding yeast, C. elegans, X. laevis and human cells to address the conserved mechanisms of DNA replication control and the cellular consequences when this control is lost.