The HLasso: simultaneous analysis of many SNPs and covariates ------------------------------------------------------------- Contents of this file: 1. Program description 2. Download, Installation and Running 3. Examples 1. Program description ----------------------- The runHLasso program implements the HyperLasso, or HLasso for short, an algorithm that finds approximate modes of a penalised likelihood function for linear or logistic regression, when the penalty function is the Normal-Exponential-Gamma (NEG) probability density. Equivalently, the HLasso finds posterior modes in Bayesian regression with the NEG prior. Because only the posterior mode is found, the HLasso does not implement full Bayesian inference, but the Bayesian language of "prior" and "posterior" can provide intuitive alternatives to "penalty" and "penalised likelihood". The NEG has two parameters, shape and scale. In the limit as both increase such that sqrt(2*shape)/scale is a constant, say lambda, the NEG converges to the Laplace (double exponential) distribution with rate parameter lambda. In this special case the HLasso reduces to the Lasso procedure of: Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, series B 58: 267-88. HLasso is useful when there are many more predictors than observations, because the penalty function can overcome the problem of over-fitting that would undermine any usefulness of standard logistic regression. When the shape parameter is small (say < 1; we find 0.05 to be the smallest practical value, below this computational problems arise), the NEG penalty function has the properties: (a) sharp peak at zero, reflecting a strong "penalty" on any non-zero regression coefficient; and (b) a very high gradient near zero, but heavy (i.e. flattish) tails away from zero, corresponding to little shrinkage effect on any non-zero value. In Bayesian language, this corresponds to a strong prior on effect sizes very close to zero, but little prior information about the magnitude of non-zero effects. In contrast, the Laplace penalty implemented in the Lasso has lighter tails, with more curvature away from the origin, so that non-zero parameter values tend to be strongly "shrunk" towards zero, reflecting a prior belief that non-zero effects are small. This has the undesirable consequence that if there are two strongly-correlated predictors, only one of which is causal, the mode may correspond to including both predictors in the model so that the causal effect is spread over the two, and its size is thus greatly underestimated at the true causal predictor. The HLasso is more inclined to choose one of the two predictors, and unless the correlation between the predictors is perfect this will usually be the true causal predictor. Although it can also solve other regression problems with many predictors, HLasso has been designed to analyse genome-wide genetic association studies (GWAS) in which the predictors are SNP genotypes coded as 0, 1, and 2. The case-control phenotype is the default, but HLasso can also accommodate a continuous phenotype that is modelled using the usual linear regression assumption of Gaussian (i.e. normal) residuals. The situation described above, with highly-correlated predictors, often arises with tightly-linked SNPs in a GWAS. If several such SNPs are associated with phenotype, it could be that there are multiple tightly-linked causal SNPs, or that there is a single causal SNP and the others display association only through LD with it. HLasso can typically distinguish these scenarios automatically, and include in the fitted model either multiple SNPs, or just one, as appropriate. A note on data input matrices: The covariates for the regression analyses are supplied to HLasso in two files, intended to be the genetic (i.e. SNP) and non-genetic covariates respectively; see below for format. However, for many purposes HLasso treats all covariates the same way, and in effect appends the non-genetic covariates after the genetic covariates. The distinction between the two files is that the genetic covariates are stored as short integers, and specific genetic models are applied assuming three distinct values; non-genetic covariates are stored as double-precision reals and are only analysed as linear covariates. A feature of HLasso is that it can cope with up to 1 million SNPs in a few hours of computation time per posterior mode estimate; we typically evaluate 100 mode estimates. Credits and further information: runHLasso was written by Clive Hoggart with input from the other authors of the paper below. Funding came from the UK Medical Research Council. For further details and the results of simulation and real data studies illustrating its power, see: Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ (2008) Simultaneous Analysis of all SNPs in Genome-Wide and Re-sequencing Association Studies, PLoS Genetics 4(7): e1000130; doi:10.1371/journal.pgen.1000130 If this document and the PLoS Genetics paper do not answer your queries, please contact Clive c.hoggart@imperial.ac.uk or David Balding d.balding@imperial.ac.uk 2. Download, Installation and Running ------------------------------------- The HLasso code will compile under the Linux OS. It requires that gsl, gcc and g77 are already installed. First download HLasso.tar.gz from the EBI website. Then to install type: tar -zxf HLasso.tar.gz cd HyperLasso make HLasso is run from the command line. We first list all possible commandline options, then give commands to run three example analyses of the toy data set (gen.dat, cov.dat, phen.dat, cont_phen.dat) provided with the code in directory data. -genotypes Path name for genotypes file. Rows index individuals and columns index SNPs. Genotypes are coded as 0, 1 and 2. First row contains SNP names, but there is no column for individual identifiers. Columns are separated by any white space. Missing values can be coded as NA, 9 or -1. -covariates Path name for file containing non-genetic covariates. Rows index individuals and columns index covariates. All covariates are analysed as linear (factor covariates with more than 2 levels must be coded as multiple binary covariates). Missing values are coded as above. -target Path name for file containing the (single) phenotype variable, can be either row or column. Unless -linear is set, only 0 and 1 values are allowed. If -linear is set, HLasso automatically standardises the phenotype variable. -linear Switch to fit linear regression model (continuous phenotype); default is logistic regression (case-control phenotype). If selected, the model has an additional parameter: the precision tau (=1/variance) of the target variable, conditional on the covariates. Initially tau=1, but is updated to its maximum likelihood estimate after each cycle through the covariates, and the value is output at each iteraction. -iter Number of mode searches. Default is 1. The search procedure is deterministic for a given order of covariates (both genetic and non-genetic), but the mode identified can vary according to this search order. The first iteration uses the search order supplied by the user (genetic covariates then non-genetic covariates, both in the order of input). Subsequent iterations use a random permutation of all covariates. The search order remains fixed through each iteration, i.e. as the algorithm cycles through the covariates until a convergence criterion is met; we set this criterion in the same way as: Genkin A, Lewis DD, Madigan D (2007) Large-scale Bayesian logistic regression for text categorization. Technometrics 49(3): 291-304. -o Path for output file. Output is an R list which can be read into R using the dget command. Each element of the list is a matrix describing a unique mode found by the search. First column records the number of cycles required to find the mode, the log-posterior and the log-likelihood of the mode (the final entry of this column is always 0). Subsequent columns describe the covariates selected in the model, first the non-genetic and then the genetic covariates: row 1 - position (rank) of the covariate in eah of the input files, starting at zero, for nongenetic and then genetic covariates. row 2 - the covariate name supplied in the input files. row 3 - the value of the regression coefficient (recall that continuous phenotypes are standardised by HLasso; to recover coefficient estimates for the original scale, multiply by the phenotype standard deviation). row 4 - for non-genetic covariates, always 0. For genetic covariates, indicator for effect type (i.e. the genetic model): 0 - additive (genotypes coded 0,1,2), 1 - recessive (genotypes coded 0,0,1), 2 - dominant (genotypes coded 0,1,1), 3 - heterozygous (genotypes coded 0,1,0). -dom Switch to include tests for dominant, recessive and heterozygous effects for the genetic covariates (additive effects are always tested). If set, for a SNP that is not currently included in the model, the four effect types shown above are all considered, and only the one with the largest gradient of the log-likelihood at the origin is considered for entry into the model. If included in the model, only the parameter value is updated at subsequent visits of the HLasso to this SNP, and not the effect type; the latter can only change within an iteration by the parameter value reverting to zero, and then a different model being selected, in subsequent cycles. -std Switch to standardise the genetic covariates. Default is not to standardize. -std_d Switch to standardise the non-genetic covariates. Default is not to standardize (the d denotes that these can be double precision). -model Indicates which shrinkage prior is being used for the genetic covariates: 0 NEG (default) 1 DE 2 Gaussian. -model_d As above but for the non-genetic covariates. -shape Shape parameter of NEG prior for genotypes. -shape_d As above but for the non-genetic covariates. -lambda Either the rate parameter of a DE prior, or precision of a normal prior (also called tau, =1/variance). If used with NEG prior, the scale parameter of NEG will be set such that the derivative of the density at the origin will be the same as for a DE distribution with this rate parameter. -lambda_d As above but for the non-genetic covariates. -scale Scale parameter of NEG prior for genotype data, not required if lambda is set. -scale_d As above but for the non-genetic covariates. 3. Examples ------------ These examples use the following data files that are supplied with the code in directory data: gen.dat Genotype SNP data for 1000 individuals at 100 independent SNPs. cov.dat Data for three covariates for the 1000 individuals. phen.dat Binary outcome for the 1000 individuals generated from the following model: P(Affected) = logit( alpha + SNP1 + SNP2/2 + SNP3/4 + SNP4/8 + cov1/2 + cov2/4 +cov3/8 ) (alpha chosen to give approximately equal numbers of cases and controls) cont_phen.dat Continuous outcome for the 1000 individuals generated from the following model: Trait ~ N( SNP1 + SNP2/2 + SNP3/4 + SNP4/8 + cov1/2 + cov2/4 +cov3/8, 1 ) These datasets are not in any sense realistic, and are only used to show how to run the program and interpret its output, not to demonstrate the power of HLasso. Note that # SNPs << # individuals, and so standard logistic regression should work well - we use this for comparison below. Example 1: ./runHLasso -genotypes data/gen.dat -target data/phen.dat -shape 0.1 -lambda 50 -o test.R -std This fits a logistic regression model to the binary trait. Only the standardized SNP covariates are fitted, with an NEG prior assigned to the regression coefficients. The above command generates the following chatter output to the terminal: Shrinkage parameters for SNP data: lambda = 0.1 gamma = 0.018069 penalty = 50 Shrinkage parameters for other data: lambda = 0 gamma = inf penalty = 0 Loading phen.dat Dimensions of target file: 1000 1 Loading gen.dat 1000 100 0 Intercept: -0.619039 Opening output file test.R Iteration: 0 The final line above indicates that only 1 iteration of algorithm is used (default), so only one mode will be reported. The results viewed in R are: R version 2.7.1 (2008-06-23) Copyright (C) 2008 The R Foundation for Statistical Computing ISBN 3-900051-07-0 ..... > dget("test.R") [[1]] [,1] [,2] [,3] [,4] [1,] "12" "0" "2" "0" [2,] "-613.98" "SNP1" "SNP3" "Intercept" [3,] "-627.781" "0.471651" "0.330511" "-1.25968" [4,] "0" "0" "0" "0" So HLasso has identified SNPs 1 and 3 as being associated with the phenotype, and not SNPs 2 and 4; this is similar to the results of a logistic regression which finds that SNPs 1 and 3 are both significant at 10^3, whereas neither 2 nor 4 are significant at 10^-2. Repeating the HLasso command with -iter 100 does not identify any new modes. The (correct) additive model was associated with the two SNPs identified, and the parameter estimates were approximately 0.47 and 0.33. The negative value for intercept indicates fewer cases than controls (here 350 and 650, respectively). Example 2: ./runHLasso -target data/phen.dat -covariates data/cov.dat -shape 0.1 -lambda 50 -o test.R -model_d 2 -lambda_d 0.1 -std -std_d -genotypes data\gen.dat This fits a logistic regression model, but this time with the non-genetic covariates as well as the SNP data, and all covariates are standardised. The SNP regression coefficients are assigned a NEG prior and the covariates are assigned a normal prior. > dget("test.R") [[1]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [1,] "11" "0" "1" "2" "0" "2" "0" [2,] "-585.01" "cov1" "cov2" "cov3" "SNP1" "SNP3" "Intercept" [3,] "-589.028" "0.512079" "0.34897" "0.097446" "0.482773" "0.3614" "-1.29689" [4,] "0" "0" "0" "0" "0" "0" "0" So all three covariates are correctly found to be associated with the phenotype, with decreasing effect sizes. SNPs 1 and 3 are again found to be associated with the phenotype and, perhaps because of the residual variation explained by the covariates, their effect size estimates are now slightly higher. Example 3: ./runHLasso -target data/cont_phen.dat -covariates data/cov.dat -shape .1 -lambda 100 -o test.R -lambda_d 50 -shape 0.1 -genotypes data\gen.dat -linear This fits a linear regression model to the continuous trait using non-standardized SNP and covariate data. Both types of data are assigned an NEG prior, but with different parameters.