Most known genetic variation in human genomes has been called by comparing short reads to the reference genome, an approach biased against finding complex variation. We sequenced 150 individuals from 50 parent-offspring trios to very high coverage using multiple insert-size libraries. We show that each genome could be independently de novo assembled into a small number of high-quality scaffolds (median N50 > 21 Mb), each of quality comparable to long-read assemblies while being very cost-effective. We show that our variant call set from comparing de novo assemblies is far more complete in terms of complex variation than those of previous studies. Importantly, even the complex 4-5 Mb extended MHC region was assembled and resolved into haplotypes, revealing >700 kb of novel sequence in this important region of the genome, and major parts of the Y chromosome, including some palindromes, were assembled with high accuracy. Finally, we show that our variant call set allows for the genotyping of many more complex variants when used as a reference panel for imputation into SNP-chip data or into previously resequenced genomes.
This study includes 5 datasets:
Alignment of the Genome Denmark Phase II dataset to GRCh38. The dataset consists of 150 Danish individuals (50 trios) sequenced to 80X. Each BAM file contains data from multiple libraries created from one individual, with insert sizes of 180, 500, 800, 2000, 5000, 10000 and 20000 bp. The libraries were created using standard Illumina protocols for paired-end reads (180-800 bp libraries) and mate-pair reads (2-20 kb libraries).
Variants on the Y chromosome for 62 Danish males in VCF format from the GenomeDenmark Phase 2 cohort. Variants were called using reference-based approaches such as the HaplotypeCaller module from GATK, and by aligning de novo assemblies to the reference using AsmVar.
Variants and genotypes called in 50 Danish parent-offspring trios from 80x Illumina sequencing data using BayesTyper. Data were produced using libraries with insert sizes of 180, 500, 800, 2000, 5000, 10000 and 20000 bp. The sample IDs for the fathers and mothers are TrioID-01 and TrioID-02, respectively, and the IDs for the children are TrioID-0x, where x is a number between 3 and 7.
The MHC VCF call set was generated using a modified AsmVar and BayesTyper pipeline. In contrast to the original pipeline, where variant calling is performed by aligning collapsed assemblies to a reference genome, the MHC call set was produced by aligning phased MHC haplotypes. Two iterations of BayesTyper were run: a first iteration for each haplotype separately, and a second iteration performing joint variant calling on all haplotypes. The sample IDs for the fathers and mothers are TrioID-01 and TrioID-02, respectively, and the IDs for the children are TrioID-0x, where x is a number between 3 and 7.
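The sample ID convention used in the trio call sets can be applied when working with the VCF sample columns. The following Python sketch assigns a family role to each sample ID. It assumes the ID is the trio identifier followed by a hyphen and a two-digit member number; the function name `parse_sample_id` and the example ID `T0001` are hypothetical and do not come from the datasets.

```python
def parse_sample_id(sample_id: str) -> tuple[str, str]:
    """Split a sample ID into (trio_id, role).

    Assumed layout: '<TrioID>-<NN>', where NN is 01 (father),
    02 (mother) or 03-07 (child). The exact form of the TrioID
    part is not specified, so everything before the last hyphen
    is treated as the trio identifier.
    """
    trio_id, sep, suffix = sample_id.rpartition("-")
    well_formed = (
        bool(sep)
        and bool(trio_id)
        and len(suffix) == 2
        and suffix.isascii()
        and suffix.isdigit()
    )
    if not well_formed:
        raise ValueError(f"unexpected sample ID format: {sample_id!r}")

    member = int(suffix)
    if member == 1:
        role = "father"
    elif member == 2:
        role = "mother"
    elif 3 <= member <= 7:
        role = "child"
    else:
        raise ValueError(f"member number out of range in {sample_id!r}")
    return trio_id, role


def group_by_trio(sample_ids: list[str]) -> dict[str, dict[str, list[str]]]:
    """Group sample IDs by trio, then by role within each trio."""
    trios: dict[str, dict[str, list[str]]] = {}
    for sample_id in sample_ids:
        trio_id, role = parse_sample_id(sample_id)
        trios.setdefault(trio_id, {}).setdefault(role, []).append(sample_id)
    return trios
```

Calling `group_by_trio` on the sample names read from a VCF header would give, for each trio, the IDs of the father, the mother and the child, which is a convenient starting point for checks such as Mendelian consistency.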