Repeat associated mechanisms of genome evolution and function revealed by the Mus caroli and Mus pahari genomes

David Thybert1,2, Maša Roller1, Fábio C.P. Navarro3, Ian Fiddes4, Ian Streeter1, Christine Feig5, David Martin-Galvez1, Mikhail Kolmogorov6, Václav Janoušek7, Wasiu Akanni1, Bronwen Aken1, Sarah Aldridge5,8, Varshith Chakrapani1, William Chow8, Laura Clarke1, Carla Cummins1, Anthony Doran8, Matthew Dunn8, Leo Goodstadt9, Kerstin Howe3, Matthew Howell1, Ambre-Aurore Josselin1, Robert C. Karn10, Christina M. Laukaitis10, Lilue Jingtao8, Fergal Martin1, Matthieu Muffato1, Stephanie Nachtweide11, Michael A. Quail8, Cristina Sisu3, Mario Stanke11, Klara Stefflova5, Cock Van Oosterhout12, Frederic Veyrunes13, Ben Ward2, Fengtang Yang8, Golbahar Yazdanifar10, Amonida Zadissa1, David Adams8, Alvis Brazma1, Mark Gerstein3, Benedict Paten4, Son Pham14, Thomas Keane1,8, Duncan T Odom5,8*, Paul Flicek1,8*

 

1. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
2. Earlham Institute, Norwich Research Park, Norwich, NR4 7UH, United Kingdom
3. Yale University Medical School, Computational Biology and Bioinformatics Program, New Haven, Connecticut 06520, USA
4. Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
5. University of Cambridge, Cancer Research UK Cambridge Institute, Robinson Way, Cambridge CB2 0RE, UK
6. Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92092
7. Department of Zoology, Faculty of Science, Charles University in Prague, Prague, Czech Republic; Institute of Vertebrate Biology, ASCR, Brno, Czech Republic
8. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
9. Wellcome Trust Centre for Human Genetics, Oxford, UK.
10. Department of Medicine, College of Medicine, University of Arizona.
11. Institute of Mathematics and Computer Science, University of Greifswald, Greifswald, 17487, Germany
12. School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, United Kingdom
13. Institut des Sciences de l’Evolution de Montpellier, Université Montpellier / CNRS, 34095 Montpellier, France
14. Bioturing Inc, San Diego, California

Description

Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 MYA, but that are absent in the Hominidae. In fact, Hominidae show between four- and seven-fold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. For example, recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli. This process resulted in thousands of novel, species-specific CTCF binding sites. Our results demonstrate that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.

Full details are provided in our open access publication in Genome Research.

Data access

The genome assemblies of Mus caroli and Mus pahari were submitted to the European Nucleotide Archive and are available with accession numbers GCA_900094665 for Mus caroli and GCA_900095145 for Mus pahari. All reads from the ChIP-seq and RNA-seq experiments in this study were submitted to ArrayExpress and are available with accession numbers E-MTAB-5768 (RNA-seq) and E-MTAB-5769 (ChIP-seq).

Transposable element annotation

The results of RepeatMasker identifications were postprocessed to merge fragmented hits and remove non-transposable repeats to create the final set used in this study. The results of postprocessing are available here. The columns in the postprocessed files correspond to:

Column number Description
1 Toplevel genomic segment, i.e. chromosome, scaffold
2 Start position of match in genomic segment
3 End position of match in genomic segment
4 RepeatMasker result: % substitutions in matching region compared to the consensus
5 RepeatMasker result: % of bases opposite a gap in the query sequence (deleted bp)
6 RepeatMasker result: % of bases opposite a gap in the repeat consensus (inserted bp)
7 Transposable element classification: transposable element class
8 Transposable element classification: transposable element family
9 Transposable element classification: transposable element subfamily
10 Unique id created for this study
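
As an illustration of the column layout above, the following Python sketch reads one record into named fields. It assumes the postprocessed files are tab-delimited with no header row, which is not stated here and should be checked against the actual files. The names RepeatHit and parse_repeat_line, and the example record, are hypothetical and made up for this sketch.

```python
from typing import NamedTuple


class RepeatHit(NamedTuple):
    """One postprocessed RepeatMasker record (field names are illustrative)."""
    segment: str              # column 1: chromosome or scaffold
    start: int                # column 2: start of match
    end: int                  # column 3: end of match
    pct_substitutions: float  # column 4: % substitutions vs. consensus
    pct_deleted: float        # column 5: % bases opposite a gap in the query
    pct_inserted: float       # column 6: % bases opposite a gap in the consensus
    te_class: str             # column 7: transposable element class
    te_family: str            # column 8: transposable element family
    te_subfamily: str         # column 9: transposable element subfamily
    unique_id: str            # column 10: id created for this study


def parse_repeat_line(line: str) -> RepeatHit:
    """Parse one record, ASSUMING tab-delimited fields in the order listed above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return RepeatHit(
        fields[0], int(fields[1]), int(fields[2]),
        float(fields[3]), float(fields[4]), float(fields[5]),
        fields[6], fields[7], fields[8], fields[9],
    )


# Made-up record for illustration only; not taken from the real data.
example = "chr1\t3000001\t3000156\t12.5\t1.3\t0.0\tSINE\tB2\tB2_Mm1\tTE_000001"
hit = parse_repeat_line(example)
print(hit.te_subfamily, hit.end - hit.start)
```

If the files turn out to use a different delimiter, only the split call needs to change.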

Whole genome alignments

The whole genome alignments used in this study are available here in Ensembl Multi Format (EMF) and multiple alignment format (MAF). Pairwise whole genome alignments were generated using LastZ and multiple whole genome alignments with the Enredo-Pecan-Ortheus (EPO) pipeline. For more information please refer to the Methods in our paper.

CTCF occupancy sites

The peak sets per biological replicate are available in ArrayExpress under the accession number E-MTAB-5769. The peaks present in at least two biological replicates are available here. The B2_Mm1 transposable elements used as input for building the neighbour joining tree are available here. For more information please refer to the Methods in our paper.

CTCF occupancy sites associated with repetitive elements are available here. The columns in the postprocessed files describe:

Column name Description
PeakChr Toplevel genomic segment containing the CTCF peak, i.e. chromosome, scaffold
PeakStart Start position of CTCF peak in genomic segment
PeakEnd End position of CTCF peak in genomic segment
RepeatChr Toplevel genomic segment containing the repeat element, i.e. chromosome, scaffold
RepeatStart Start position of repeat element in genomic segment
RepeatEnd End position of repeat element in genomic segment
RepeatClass Transposable element classification: transposable element class
RepeatFamily Transposable element classification: transposable element family
RepeatSubfamily Transposable element classification: transposable element subfamily

Results of multiple alignments of CTCF sites between the rodents are available here. The name of each file corresponds to the species that was used as the query to align to the other species. The first column contains all CTCF sites from the query species, and each following column contains the alignment to another species. If a column contains genomic coordinates, a bound CTCF site is aligned. A lack of aligned CTCF sites is marked with an "X". A row with "X" in all columns except the first represents a species-specific CTCF binding site. The position of peaks is in "Chr:Start-End" format.
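
The "X" convention described above can be checked with a few lines of Python. This is a minimal sketch that assumes tab-delimited columns, which is not stated here. The function names and the example rows are hypothetical.

```python
from typing import Tuple


def is_species_specific(row: str, sep: str = "\t") -> bool:
    """True if every column after the first is "X" (no aligned bound site)."""
    fields = row.rstrip("\n").split(sep)
    if len(fields) < 2:
        raise ValueError("expected a query column plus at least one species column")
    return all(field.strip() == "X" for field in fields[1:])


def parse_position(position: str) -> Tuple[str, int, int]:
    """Split a "Chr:Start-End" string into (chromosome, start, end)."""
    chrom, _, span = position.rpartition(":")
    start, _, end = span.partition("-")
    # Commas are stripped defensively in case coordinates carry thousands separators.
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


# Made-up rows for illustration only.
specific_row = "chr1:100-200\tX\tX"
shared_row = "chr1:100-200\tchr2:50-150\tX"

print(is_species_specific(specific_row))  # site bound only in the query species
print(is_species_specific(shared_row))    # site aligned to a bound site elsewhere
print(parse_position("chr1:100-200"))
```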

Mus pahari breakpoints

chr1: mm_7:27,524,252-end + mm_19:start-end
chr2: mm_5:start-30,341,378 + mm_6:start-end
chr3: mm_2:23,133,000-end
chr4: mm_3:start-end
chr5: mm_1:18,015,577-end
chr6: mm_4:45,383,207-end
chr7: mm_12:start-end
chr8: mm_14:start-end
chr9: mm_10:33,415,951-end
chr10: mm_9:start-end
chr11: mm_13:[66,989,676/67,139,392]-end + mm_15:start-32,682,454
chr12: mm_16:start-end
chr13: mm_11:start-31,004,372 + mm_5:33,090,849-[109,638,275/110,356,165]
chr14: mm_11:31,014,735-end
chr15: mm_18:start-end
chr16: mm_2:start-23,116,788 + mm_13:start-[66,989,676/67,139,392]
chr17: mm_15:32,682,454-end
chr18: mm_17:31,811,330-end
chr19: mm_7:start-27,524,252 + mm_8:start-75,294,883
chr20: mm_8:75,299,631-end
chr21: mm_17:start-27,052,643 + mm_10:start-33,381,753
chr22: mm_1:start-17,966,305 + mm_4:start-45,383,207
chr23: mm_5:[109,638,275/110,356,165]-end

The above lines describe the composition of the Mus pahari chromosomes. For example, chr1 of Mus pahari is composed of chr7 of Mus musculus from position 27,524,252 to the end, followed by chr19 of Mus musculus from start to end. This gives a breakpoint from the Mus musculus perspective.
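
The composition notation can be read programmatically. The following Python sketch parses one line into a Mus pahari chromosome name and a list of Mus musculus segments. The function names and the returned structure are hypothetical choices for this sketch, not part of the published data.

```python
from typing import List, Tuple, Union

# A bound is "start", "end", a single coordinate, or a pair of alternative coordinates.
Bound = Union[str, int, Tuple[int, int]]
Segment = Tuple[str, Bound, Bound]


def parse_bound(text: str) -> Bound:
    """Convert one boundary token from the list above into a Python value."""
    text = text.strip()
    if text in ("start", "end"):
        return text
    if text.startswith("[") and text.endswith("]"):
        first, second = text[1:-1].split("/")
        return (int(first.replace(",", "")), int(second.replace(",", "")))
    return int(text.replace(",", ""))


def parse_composition(line: str) -> Tuple[str, List[Segment]]:
    """Parse e.g. 'chr1: mm_7:27,524,252-end + mm_19:start-end'."""
    pahari_chr, rest = line.split(":", 1)
    segments: List[Segment] = []
    for part in rest.split("+"):
        mm_chr, span = part.strip().split(":", 1)
        low, high = span.split("-", 1)
        segments.append((mm_chr.replace("mm_", "chr"), parse_bound(low), parse_bound(high)))
    return pahari_chr.strip(), segments


name, segments = parse_composition("chr1: mm_7:27,524,252-end + mm_19:start-end")
print(name, segments)

name11, segments11 = parse_composition(
    "chr11: mm_13:[66,989,676/67,139,392]-end + mm_15:start-32,682,454"
)
print(name11, segments11)
```

Keeping "start" and "end" as strings avoids assuming chromosome lengths, which are not given in the list.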

Data points such as [66,989,676/67,139,392] arise because the array CGH was done twice, giving two values for the same breakpoint. We used the mean of these two values for the analysis of repeat enrichment in the paper.
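
As a worked example of taking the mean of a bracketed pair, the Python sketch below computes the midpoint for the two pairs that appear in the list. The helper name is hypothetical, and how the value was rounded in the original analysis is not stated here, so the result is left as a float.

```python
def mean_breakpoint(token: str) -> float:
    """Return the mean of a bracketed pair such as '[66,989,676/67,139,392]'."""
    first, second = token.strip("[]").split("/")
    values = [int(first.replace(",", "")), int(second.replace(",", ""))]
    return sum(values) / len(values)


mm13_breakpoint = mean_breakpoint("[66,989,676/67,139,392]")
mm5_breakpoint = mean_breakpoint("[109,638,275/110,356,165]")
print(mm13_breakpoint)  # 67064534.0
print(mm5_breakpoint)   # 109997220.0
```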