Joint README file for all eQTL, quantification and genotype files of Lappalainen et al. Nature 2013

Please see GeuvadisRNASeqAnalysisFiles.xls for a summary of eQTL and quantification files. Further details of the file contents and formats are provided in this file. A description of the analysis methods can be found in the Supplementary Methods.

-----------------------------------------------------------------------

eQTL file set:
- Sample set & sample size : EUR373, YRI89
- Quantitative trait: exon, gene, transcript ratio (trratio), transcribed repetitive element (repeat), miRNA (mi)
- Set of associations included in the file: All associations below a false discovery rate of 5%, or the best association per gene (for exon, transcript ratio) or per unit (gene, repeat, miRNA). If several best-associating variants have the same p-value, one of them has been chosen randomly.

eQTL file format:
- The file contains variant-QT association information, with the following columns:
1   SNP_ID      : Variant identifier according to dbSNP137; position-based identifier for variants that are not in dbSNP (see Supplementary material p. 45)
2   ID          : Null (-)
3   GENE_ID     : Gene identifier according to Gencode v12 or miRBase v18; repeats are identified by their start site
4   PROBE_ID    : Quantitative trait identifier; the same as GENE_ID except for:
                    Exons: GENEID_ExonStartPosition_ExonEndPosition
                    Transcript ratios: Transcript identifier according to Gencode v12
5   CHR_SNP     : Chromosome of the variant
6   CHR_GENE    : Chromosome of the quantitative trait
7   SNPpos      : Position of the variant
8   TSSpos      : Transcription start site of the gene/QT
9   Distance    : | SNPpos - TSSpos |
10  rvalue      : Spearman rank correlation rho (calculated from linear regression slope)
11  pvalue      : Association p-value
12  log10pvalue : -log10 of pvalue

-----------------------------------------------------------------------

Quantification file set:
- Sample set + sample size :
    QC-passed: All QC-passed samples including replicates: 660 (mRNA) or
      480 (miRNA)
    QC-passed unique: Nonredundant set of unique samples used in most analyses: 462 (mRNA); 452 (miRNA)
- Normalization:
    None: raw read counts
    Library depth: Read counts scaled by the total number of mapped reads (mRNA), or the total number of reads mapping to miRNAs (miRNA) per sample, then adjusted to the median of the sample set (45M for mRNA, 1.2M for miRNA)
    Library depth and transcript length: RPKM
    Library depth & expressed & PEER: Library depth scaling as above, removal of units with 0 counts in >50% of samples, and removal of technical variation by PEER normalization

Quantification file format:
- The files provide quantification unit (rows) x samples (columns) tables. The header line gives sample IDs, and the first four columns provide information on the quantification unit:
1   TargetID    : Quantitative trait identifier; the same as GENE_ID except for:
                    Exons: GENEID_ExonStartPosition_ExonEndPosition
                    Transcript ratios: Transcript identifier according to Gencode v12
2   Gene_Symbol : Gene identifier according to Gencode v12 or miRBase v18; repeats are identified by their start site
3   Chr         : Chromosome
4   Coord       : Start site of the element (taking strand into account)

-----------------------------------------------------------------------

Genotype file set:
- Genotype files are split by chromosome, and the sites file contains all sites without genotype data. The files include all samples included in RNA-sequencing. The splice score file contains variants that overlap splice sites, their predicted splicing scores, and genotypes.
- The annotation information relies on the Gencode v12 gene annotation, and the regulatory annotation includes only LCL annotations from the Ensembl regulatory build. For details, see Supplementary Methods and http://sanabre.net/geuvadis/index.php/Variant_annotation .
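
Example (illustrative only): reading a quantification table. The following is a minimal Python sketch, not part of the original analysis pipeline. It assumes that the file is tab-delimited, that the header line names the four unit-annotation fields (TargetID, Gene_Symbol, Chr, Coord) before the sample IDs, and that these four fields lead every row. The function name read_quantification, the sample IDs and all values in the example table are made up for illustration.

```python
import csv
import io

# Assumption: the four unit-annotation fields come first in every row.
N_ANNOTATION_COLUMNS = 4  # TargetID, Gene_Symbol, Chr, Coord


def read_quantification(handle, delimiter="\t"):
    """Return (sample_ids, units) from a quantification table.

    units maps TargetID -> (annotation dict, list of per-sample floats).
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    annotation_names = header[:N_ANNOTATION_COLUMNS]
    sample_ids = header[N_ANNOTATION_COLUMNS:]
    units = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        annotation = dict(zip(annotation_names, row[:N_ANNOTATION_COLUMNS]))
        values = [float(v) for v in row[N_ANNOTATION_COLUMNS:]]
        units[annotation["TargetID"]] = (annotation, values)
    return sample_ids, units


# Made-up two-sample table, for illustration only.
example = (
    "TargetID\tGene_Symbol\tChr\tCoord\tSAMPLE_A\tSAMPLE_B\n"
    "UNIT_1\tUNIT_1\t1\t1000\t0.5\t1.25\n"
    "UNIT_2\tUNIT_2\t2\t2000\t3.0\t0.0\n"
)
sample_ids, units = read_quantification(io.StringIO(example))
print(sample_ids)
print(units["UNIT_1"][1])
```

For a real file, the same function could be given an open file handle (for example from gzip.open(path, "rt") if the file is compressed); the delimiter should be checked against the actual file first.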
Genotype file formats:
- Genotype and sites files are in VCF 4.1 format (https://github.com/amarcket/vcf_spec)
- For details of the annotations in the INFO field, see http://sanabre.net/geuvadis/index.php/Variant_annotation . For coding variants, SEVERE_IMPACT and SEVERE_GENE are key fields that give the most severe impact of the variant across all affected transcripts and genes.
- The splice score file follows the general principles of the VCF format, with the following columns:
1   CHROM  : Chromosome
2   POS    : Position of the splice site, provided as the first/last position included in the adjacent exon (cf. AStalavista default coordinates)
3   ID     : String identifying the splice site uniquely, composed of strand, genomic coordinate, site type symbol (cf. AStalavista conventions) and chromosome ID
4   REF    : Reference sequence of the splice site, as obtained by extracting the corresponding sequence stretch from the genome
5   ALT    : Comma-separated list of the corresponding splice site sequences after applying the corresponding genetic variant(s) to the reference sequence
6   QUAL   : Missing
7   FILTER : "q-1000" when the splice site sequence has not been observed in the training set (-Infinity, in practice represented by a value << -1000), "PASS" otherwise
8   INFO   :
      MOD        : Either alternative (ALT) or constitutive (CON) splice site
      ALTx       : (Combinations of) variants that form each alternative variant (same ordering as in the ALT column)
      REF_SCORE  : Splice score assigned to the reference sequence
      VAR_SCORES : Comma-separated list of scores assigned to the corresponding variant(s); the ordering corresponds to the one used for ALT
      SNPS       : Concatenation of all variants considered for the description of this splice site
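
Example (illustrative only): pairing splice scores with alternative sequences. The sketch below shows one way the INFO column of the splice score file could be read in Python. It assumes that INFO uses the usual VCF layout of semicolon-separated KEY=VALUE entries, and relies on the statement above that VAR_SCORES follows the same ordering as the ALT column. The function names parse_info and scores_by_alt, and the sequences and scores in the example, are hypothetical and do not come from the data.

```python
def parse_info(info):
    """Split a semicolon-separated KEY=VALUE INFO string into a dict.

    Keys without '=' are stored with the value True (VCF flag style).
    """
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def scores_by_alt(alt_column, info):
    """Return (reference score, {alternative sequence: score}).

    Assumes VAR_SCORES lists one score per ALT entry, in the same order.
    """
    fields = parse_info(info)
    alts = alt_column.split(",")
    var_scores = [float(s) for s in fields["VAR_SCORES"].split(",")]
    if len(alts) != len(var_scores):
        raise ValueError("ALT and VAR_SCORES have different lengths")
    return float(fields["REF_SCORE"]), dict(zip(alts, var_scores))


# Made-up ALT and INFO values, for illustration only.
ref_score, by_alt = scores_by_alt(
    "GTAAGA,GTGAGT",
    "MOD=CON;REF_SCORE=8.1;VAR_SCORES=2.5,7.9",
)
print(ref_score)
print(by_alt)
```

Entries whose FILTER column is "q-1000" carry very low placeholder scores (see column 7 above), so it may be sensible to filter on that column before comparing scores.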