Expression and splicing QTLs recomputed from public datasets

Pipelines

Methods

Detailed description of the methods can be found in our flagship paper. Below is a short overview of the main analysis steps.

Gene expression and splicing quantification

The RNA sequencing quantification pipeline is available from the eQTL-Catalogue/rnaseq GitHub repository. The pipeline is based on the nf-core/rnaseq pipeline that we have modified to include the following quantification methods:

  • gene expression: RNA-seq reads were aligned to the GRCh38 reference genome using HISAT2 and reads overlapping GENCODE v30 transcript annoations were counted using featureCounts.
  • exon expression: DEXSeq was used to convert GENCODE v30 transcript annotations to non-overlapping exon annotations. Reads overlapping exons were counted using featureCounts.
  • transcript usage: Salmon was used to estimate the expression levels of all annotated transcripts in GENCODE v30.
  • txrevise event usage: txrevise was used to convert Ensembl 96 transcript annotations to independent promoter, splice junction and 3ʹ end usage events. Salmon was used to estimate the expression levels of those events.

Molecular trait normalisation and quality control

The molecular trait normalisation and quality control workflow is available from the eQTL-Cataloguemoff/qtl_norm_qc GitHub repository.

Normalisation

Briefly, we use the following normalisation strategies:

  • gene counts: Conditional quantile normalisation with cqn using gene length and GC content as covariates.
  • exon counts: Conditional quantile normalisation with cqn using exon length and GC content as covariates.
  • transcript usage: Transcript usage is calculated by dividing the transcript expression estimates (TPM units) the total expression of all transcripts of the same gene. Transcript usage values (0…1 scale) are further standardised using inverse normal transformation.
  • txrevise event usage: Promoter, splice junction and 3ʹ end event usage is calculated by dividing the event expression estimates (TPM units) by the total expression of all events of the same class (promoters, splicing events, 3ʹ end events) within the same gene. Txrevise event usage values (0…1 scale) are further standardised using inverse normal transformation.

Genotype imputation and quality control

For most datasets, we performed basic QC on the raw genotypes calls from genotyping arrays after which we imputed the remaining genotypes using Michigan Imputation Server and 1000 Genomes Phase 3 reference panel.

Before imputation

  • Exclude all variants with minor allele frequence (MAF) < 0.01, Hardy-Weinberg equilibrium p-value < 1e-6 and that were missing in more than 5% of the samples.

After imputation

  • Convert all genetic variants to GRCh38 coordinates with CrossMap.
  • Exclude variants with minor allele frequence (MAF) < 0.01 and imputation quality score (R2) < 0.4.
  • Check that the genotypes in the VCF file are concordant with those detected from the RNA-seq data using qtltools mbv command. Correct sample swaps and discard any samples with missing genotypes or high levels of cross-contamination.

Association testing

The association testing pipeline is available from eQTL-Catalogue/qtlmap GitHub repository. The main analysis steps are:

  • Perform principal component analysis (PCA) of the genotype data with PLINK 1.9.
  • Perfrom PCA analysis of the molecular phenotype data (prcomp function in R).
  • Perform assocation testing with QTLtools.

In association testing, we use the following parameters:

  • Use first six principal components from the genotype data and first six principal components from the molecular trait data as covariates.
  • Test all variants that are +/- 1Mb from the start of the gene (as defined in Ensembl).

Reference annotatons