GLEANER is a GWAS matrix factorization tool to estimate sparse latent pleiotropic genetic factors. Factors map traits to a distribution of SNP effects that may capture biological pathways or mechanisms shared by these traits. This repo contains the gleanr
R package (in development), which we recommend using in conjunction with the gleanr_workflow repository.
The bioRxiv preprint describing the gleanr
method in detail is avaialable here:
This can be done directly from github using the devtools
package as follows:
devtools::install_github("aomdahl/gleanr")
This is an ongoing project to develop a flexible, interpretable, and sparse factorization framework to integrate GWAS data across studies and cohorts. We employ a basic alternating least-squares matrix factoriztion algorithm with sparse priors on learned matrices, while accounting for study uncertainty. Our approach was inspired by work from Yuan He here.
Development of tutorials/vignettes for gleanr
are ongoing. For a basic interactive use case in R
, see the vignette associated with this package. If you'd like to run gleanr
directly from the command line (our recommended use), use the script src/gleanr_run.R
available in the gleanr_workflow repository after installing this package to run analysis directly on input matrices of summary statistics.
To run GLEANR, a user must provide:
- a matrix
$B$ of$N$ SNPs by$M$ studies of GWAS effect sizes (e.g.$\beta$ 's) (required)- Each SNP and trait should have a label, as in the example file here
- an
$N \times M$ matrix of GWAS standard error estimates, with the same order as$B$ (required, example file here) - an
$M \times M$ matrix of estimated correlation due to sample sharing ($C$ ); this may be estimated using LDSC and should have (optional, example file here) - an
$N \times M$ matrix of esitmation error correlation due to sample sharing; this will be used to regularize$C$ (optional, example file here) - an
$M \times 1$ list of trait names corresponding to$M$ (required). This can be used to specify cleaner names for columns in$B$ . These should be unique. - an
$M \times 1$ list of standard deviation estimates across trait Z-scores (optional; only provide if using XT- LDSC to estimate degree of sample sharing)
To review development versions of gleanr prior to the reorgnization of this github in Nov. 2024, please see the gleanr_source_backup
directory in the gleanr_workflow repository.