Repeat associated mechanisms of genome evolution and function revealed by the Mus caroli and Mus pahari genomes
David Thybert1,2, Maša Roller1, Fábio C.P. Navarro3, Ian Fiddes4, Ian Streeter1, Christine Feig5, David Martin-Galvez1, Mikhail Kolmogorov6, Václav Janoušek7, Wasiu Akanni1, Bronwen Aken1, Sarah Aldridge5,8, Varshith Chakrapani1, William Chow8, Laura Clarke1, Carla Cummins1, Anthony Doran8, Matthew Dunn8, Leo Goodstadt9, Kerstin Howe3, Matthew Howell1, Ambre-Aurore Josselin1, Robert C. Karn10, Christina M. Laukaitis10, Lilue Jingtao8, Fergal Martin1, Matthieu Muffato1, Stephanie Nachtweide11, Michael A. Quail8, Cristina Sisu3, Mario Stanke11, Klara Stefflova5, Cock Van Oosterhout12, Frederic Veyrunes13, Ben Ward2, Fengtang Yang8, Golbahar Yazdanifar10, Amonida Zadissa1, David Adams8, Alvis Brazma1, Mark Gerstein3, Benedict Paten4, Son Pham14, Thomas Keane1,8, Duncan T Odom5,8*, Paul Flicek1,8*
1. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
2. Earlham Institute, Norwich Research Park, Norwich, NR4 7UH, United Kingdom
3. Yale University Medical School, Computational Biology and Bioinformatics Program, New Haven, Connecticut 06520, USA
4. Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
5. University of Cambridge, Cancer Research UK Cambridge Institute, Robinson Way, Cambridge CB2 0RE, UK
6. Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92092
7. Department of Zoology, Faculty of Science, Charles University in Prague, Prague, Czech Republic; Institute of Vertebrate Biology, ASCR, Brno, Czech Republic
8. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
9. Wellcome Trust Centre for Human Genetics, Oxford, UK.
10. Department of Medicine, College of Medicine, University of Arizona.
11. Institute of Mathematics and Computer Science, University of Greifswald, Greifswald, 17487, Germany
12. School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, United Kingdom
13. Institut des Sciences de l’Evolution de Montpellier, Université Montpellier / CNRS, 34095 Montpellier, France
14. Bioturing Inc, San Diego, California
Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 MYA, but that are absent in the Hominidae. In fact, Hominidae show between four- and seven-fold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. For example, recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli. This process resulted in thousands of novel, species-specific CTCF binding sites. Our results demonstrate that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.
Full details are provided in our open access publication in Genome Research.
The genome assemblies of Mus caroli and Mus pahari were submitted to the European Nucleotide Archive and are available with accession numbers GCA_900094665 for Mus caroli and GCA_900095145 for Mus pahari. All reads from the ChIP-seq and RNA-seq experiments in this study were submitted to ArrayExpress and are available with accession numbers E-MTAB-5768 (RNA-seq) and E-MTAB-5769 (ChIP-seq).
Transposable element annotation
The results of RepeatMasker identifications were postprocessed to merge fragmented hits and remove non-transposable repeats to create the final set used in this study. The results of postprocessing are available here. The columns in the postprocessed files correspond to:
|1||Toplevel genomic segment, i.e. chromosome, scaffold|
|2||Start position of match in genomic segment|
|3||End position of match in genomic segment|
|4||RepeatMasker result: % substitutions in matching region compared to the consensus|
|5||RepeatMasker result: % of bases opposite a gap in the query sequence (deleted bp)|
|6||RepeatMasker result: % of bases opposite a gap in the repeat consensus (inserted bp)|
|7||Transposable element classification: transposable element class|
|8||Transposable element classification: transposable element family|
|9||Transposable element classification: transposable element subfamily|
|10||Unique id created for this study|
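As an illustration of the column layout above, the following is a minimal sketch of reading one row of a postprocessed file into a named record. It assumes the files are tab-delimited text with exactly the ten columns listed; the delimiter, the `RepeatHit` and `parse_repeat_line` names, and the example row are assumptions made for illustration and are not taken from the published files.

```python
from typing import NamedTuple


class RepeatHit(NamedTuple):
    """One row of a postprocessed repeat file (columns 1-10 as listed above)."""
    segment: str              # 1: toplevel genomic segment (chromosome, scaffold)
    start: int                # 2: start position of match
    end: int                  # 3: end position of match
    pct_substitutions: float  # 4: % substitutions versus the consensus
    pct_deleted: float        # 5: % of bases opposite a gap in the query
    pct_inserted: float       # 6: % of bases opposite a gap in the consensus
    te_class: str             # 7: transposable element class
    te_family: str            # 8: transposable element family
    te_subfamily: str         # 9: transposable element subfamily
    uid: str                  # 10: unique id created for the study


def parse_repeat_line(line: str, sep: str = "\t") -> RepeatHit:
    """Parse one row; the tab separator is an assumption about the file format."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return RepeatHit(
        fields[0], int(fields[1]), int(fields[2]),
        float(fields[3]), float(fields[4]), float(fields[5]),
        fields[6], fields[7], fields[8], fields[9],
    )


# Invented example row, for illustration only (not real data from the study).
example = "chr1\t3000\t3150\t12.5\t1.0\t0.5\tSINE\tB2\tB2_Mm1\tTE_000001"
hit = parse_repeat_line(example)
print(hit.te_subfamily, hit.end - hit.start)
```

If the files turn out to use a different delimiter, only the `sep` argument needs to change.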
Whole genome alignments
The whole genome alignments used in this study are available here in Ensembl Multi Format (EMF) and multiple alignment format (MAF). Pairwise whole genome alignments were generated using LastZ and multiple whole genome alignments with the Enredo-Pecan-Ortheus (EPO) pipeline. For more information please refer to the Methods in our paper.
CTCF occupancy sites
The peak sets per biological replicate are available in ArrayExpress under the accession number E-MTAB-5769. The peaks present in at least two biological replicates are available here. The B2_Mm1 transposable elements used as input for building the neighbour joining tree are available here. For more information please refer to the Methods in our paper.
CTCF occupancy sites associated with repetitive elements are available here. The columns in the postprocessed files describe:
|PeakChr||Toplevel genomic segment containing the CTCF peak, i.e. chromosome, scaffold|
|PeakStart||Start position of CTCF peak in genomic segment|
|PeakEnd||End position of CTCF peak in genomic segment|
|RepeatChr||Toplevel genomic segment containing the repeat element, i.e. chromosome, scaffold|
|RepeatStart||Start position of repeat element in genomic segment|
|RepeatEnd||End position of repeat element in genomic segment|
|RepeatClass||Transposable element classification: transposable element class|
|RepeatFamily||Transposable element classification: transposable element family|
|RepeatSubfamily||Transposable element classification: transposable element subfamily|
Results of multiple alignments of CTCF sites between the rodents are available here. The name of each file corresponds to the species that was used as the query to align to the other species. The first column contains all CTCF sites from the query species, and each following column contains the alignment to another species. If a column contains genomic coordinates, a bound CTCF site is aligned. A lack of aligned CTCF sites is marked with an "X". A row with "X" in all columns except the first represents a species-specific CTCF binding site. The position of peaks is in "Chr:Start-End" format.
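The row convention described above can be sketched in a few lines of Python. This is only an illustration: it assumes each row has already been split into its columns and that coordinates are written with plain digits; the function names and the example rows are invented here and do not come from the published files.

```python
import re
from typing import List, Optional, Tuple

Interval = Tuple[str, int, int]

# "Chr:Start-End", assuming plain digits without thousands separators.
_COORD = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_cell(cell: str) -> Optional[Interval]:
    """Return (chrom, start, end) for an aligned site, or None for "X"."""
    cell = cell.strip()
    if cell == "X":
        return None
    match = _COORD.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


def is_species_specific(row: List[str]) -> bool:
    """True when every column after the first (query) column is "X"."""
    query, *others = row
    if parse_cell(query) is None:
        raise ValueError("the first column must hold the query CTCF site")
    return all(parse_cell(cell) is None for cell in others)


# Invented rows for illustration only.
shared = ["chr1:1000-1200", "chr1:980-1190", "X"]
specific = ["chr2:5000-5300", "X", "X"]
print(is_species_specific(shared), is_species_specific(specific))
```

Counting the rows for which `is_species_specific` returns true in each file would give the number of species-specific sites for that query species.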
Mus pahari breakpoints
chr1: mm_7:27,524,252-end + mm_19:start-end
chr2: mm_5:start-30,341,378 + mm_6:start-end
chr3: mm_2:23,133,000-end
chr4: mm_3:start-end
chr5: mm_1:18,015,577-end
chr6: mm_4:45,383,207-end
chr7: mm_12:start-end
chr8: mm_14:start-end
chr9: mm_10:33,415,951-end
chr10: mm_9:start-end
chr11: mm_13:[66,989,676/67,139,392]-end + mm_15:start-32,682,454
chr12: mm_16:start-end
chr13: mm_11:start-31,004,372 + mm_5:33,090,849-[109,638,275/110,356,165]
chr14: mm_11:31,014,735-end
chr15: mm_18:start-end
chr16: mm_2:start-23,116,788 + mm_13:start-[66,989,676/67,139,392]
chr17: mm_15:32,682,454-end
chr18: mm_17:31,811,330-end
chr19: mm_7:start-27,524,252 + mm_8:start-75,294,883
chr20: mm_8:75,299,631-end
chr21: mm_17:start-27,052,643 + mm_10:start-33,381,753
chr22: mm_1:start-17,966,305 + mm_4:start-45,383,207-end
chr23: mm_5:[109,638,275/110,356,165]-end
The above lines describe the composition of the Mus pahari chromosomes. For example, chr1 of Mus pahari is composed of chr7 of Mus musculus from position 27,524,252 to the end, followed by chr19 of Mus musculus from start to end. This gives a breakpoint from the Mus musculus perspective.
Data points such as [66,989,676/67,139,392] arise because the array CGH was done twice, giving two estimates of the same breakpoint. We used the mean of these two values for the analysis of repeat enrichment in the paper.
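For readers who want to work with the breakpoint notation programmatically, here is a minimal sketch that parses a single segment such as mm_13:[66,989,676/67,139,392]-end and replaces a bracketed pair with the mean of its two values, as described above. The function names and the choice to represent "start" and "end" as None are assumptions made for this illustration, not part of the study's own code.

```python
import re
from typing import Optional, Tuple

# A position is "start", "end", a bracketed pair "[a/b]", or a single number.
_POS = r"(start|end|\[[\d,]+/[\d,]+\]|[\d,]+)"
_SEGMENT = re.compile(rf"^\s*(mm_\w+):{_POS}-{_POS}\s*$")


def resolve_position(token: str) -> Optional[float]:
    """Convert a position token to a number; "start"/"end" become None."""
    if token in ("start", "end"):
        return None
    if token.startswith("["):
        first, second = token[1:-1].split("/")
        # Two array CGH estimates: take their mean.
        return (int(first.replace(",", "")) + int(second.replace(",", ""))) / 2
    return float(token.replace(",", ""))


def parse_segment(segment: str) -> Tuple[str, Optional[float], Optional[float]]:
    """Parse one "mm_N:from-to" segment into (chromosome, from, to)."""
    match = _SEGMENT.match(segment)
    if match is None:
        raise ValueError(f"unrecognised segment: {segment!r}")
    chrom, start, end = match.groups()
    return chrom, resolve_position(start), resolve_position(end)


print(parse_segment("mm_7:27,524,252-end"))
print(parse_segment("mm_13:[66,989,676/67,139,392]-end"))
```

A full chromosome line can be handled by splitting the text after the "chrN:" label on "+" and passing each part to `parse_segment`.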