diff --git a/DESCRIPTION b/DESCRIPTION index 055f40a..a5718f9 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,16 +1,17 @@ Package: EPIC Type: Package Title: Estimate the Proportion of Immune and Cancer cells -Version: 1.0.5 +Version: 1.0.6 Authors@R: as.person(c( "Julien Racle [aut, cre]", "David Gfeller [aut]" )) Description: Package implementing EPIC method to estimate the proportion of - immune and cancer cells from bulk gene expression data. It is based on - reference gene expression profiles for the main immune cell types and it - predicts the proportion of these cells and of the remaining "other cells" - that are the remaining cells (cancer, stromal, endothelial) for which no + immune, stromal, endothelial and cancer or other cells from bulk gene + expression data. + It is based on reference gene expression profiles for the main non-malignant + cell types and it predicts the proportion of these cells and of the + remaining "other cells" (that are mostly cancer cells) for which no reference profile is given. Depends: R (>= 3.2.0) License: file LICENSE diff --git a/NEWS b/NEWS index 5597373..8c2fc6f 100644 --- a/NEWS +++ b/NEWS @@ -1,3 +1,10 @@ +Version 1.0.6 +------------------------------------------------------------------------ +* Updated EPIC to use the latest reference profiles and updated some default + values and tests for the package. +* Updated the help files accordingly, explaining also the additional advanced + options available for EPIC. + Version 1.0.5 ------------------------------------------------------------------------ * Updated EPIC to remove genes from bulk that have only NA values. @@ -11,7 +18,7 @@ Version 1.0.4 release the constrain that the sum of cells with known reference profiles must be smaller than 1. Also if withOtherCells is FALSE and constrainedSum is FALSE, we don't use the constraint during the optimization that the sum of all - the cells must also be equal to 0 (note we still rescale afterwards the + the cells must also be equal to 1 (note we still rescale afterwards the proportions to have a sum equal to 1 but it isn't done during the optimization). @@ -40,8 +47,8 @@ Version 1.0.1 the p-value isn't exact in the presence of ties. * Added mRNA_cell values for T cell subtypes. * Added reference profiles including CD4 and CD8 T cells for circulating - immune cells (BRef_s, BRef_s.tpm) and including also CAFs and endothelial - cells for tumor infilatring cells (TRef_s.tpm). + immune cells and including also CAFs and endothelial cells for tumor + infilatring cells. Version 1.0.0 diff --git a/R/EPIC_descr.R b/R/EPIC_descr.R index 92e5745..e4325c5 100644 --- a/R/EPIC_descr.R +++ b/R/EPIC_descr.R @@ -1,23 +1,22 @@ #' EPIC: a package to Estimate the Proportion of Immune and Cancer cells from #' tumor gene expression data. #' -#' EPIC package provides the function and immune cell reference profiles to -#' estimate the proportion of immune and other cells from bulk gene expression -#' data. +#' EPIC package provides the function and cell reference profiles to +#' estimate the proportion of immune, stromal, endothelial and cancer or other +#' cells from bulk gene expression data. #' #' @section EPIC functions: #' \code{\link{EPIC}} is the main function to call to estimate the #' various cells proportions from a bulk sample. #' #' @section Included datasets: -#' \code{\link{BRef}}, \code{\link{BRef.tpm}}: reference profiles from -#' circulating immune cells. +#' \code{\link{BRef}}: reference profiles from circulating immune cells. #' -#' \code{\link{TRef.tpm}}: reference profiles from immune cells obtained -#' from single cell data of tumor infiltrating cells from melanoma patients. +#' \code{\link{TRef}}: reference profiles from tumor infiltrating non-malignant +#' cells obtained from single cell data of melanoma patients. #' -#' \code{\link{hoek_data}}: example dataset containing data from Hoek et al, -#' 2015, PLoS One. +#' \code{\link{melanoma_data}}: example dataset containing data from lymph nodes +#' from patients with metastatic melanoma. #' #' \code{\link{mRNA_cell_default}}: values of mRNA per cell for the main cell #' types. @@ -34,20 +33,17 @@ NULL #' Reference profiles from circulating immune cells. #' #' A dataset containing the reference profiles obtained from immune cell -#' samples of \emph{B-}, \emph{NK-}, \emph{T-cells}, \emph{Monocytes} -#' and \emph{Neutrophils} purified from PBMC or whole blood. +#' samples of \emph{B cells}, \emph{CD4 T cells}, \emph{CD8 T cells}, +#' \emph{Monocytes}, \emph{NK cells} and \emph{Neutrophils}, purified from +#' PBMC or whole blood. #' #' The original samples were obtained from healthy donors and donors after #' influenza vaccination or with diabetes, sepsis or multiple sclerosis. #' -#' @section Similar datasets: -#' \code{BRef} (main reference profiles, using data from sources 1-3 below) -#' -#' #' @format A list of 3 elements: \describe{ \item{$refProfiles, -#' $refProfiles.var}{Matrices (nGenes x nRefCells) of the normalized gene -#' expression from the reference cells and the variability of this gene -#' expression for each gene and each cell type} \item{$sigGenes}{A list of 100 +#' $refProfiles.var}{Matrices (nGenes x nRefCells) of the gene expression (in +#' TPM counts) from the reference cells and the variability of this gene +#' expression for each gene and each cell type} \item{$sigGenes}{A list of #' signature genes used to deconvolve the cell proportions} } #' #' @source \enumerate{ @@ -57,32 +53,32 @@ NULL #' } "BRef" -#' @section Similar datasets: -#' \code{BRef.tpm} (reference profiles based on same data as \code{BRef}, -#' but given in TPM counts instead of raw counts) -#' @rdname BRef -"BRef.tpm" #' Reference profiles obtained from single cell data of tumor infiltrating #' cells. #' -#' A dataset containing the reference profiles given in TPM from various cell -#' types: \emph{B-}, \emph{NK-}, \emph{T-cells} and \emph{Macrophages}. +#' A dataset containing the reference profiles obtained from various +#' tumor infiltrating non-malignant cell types: \emph{B cells}, +#' \emph{cancer-associated fibroblasts}, \emph{CD4 T cells}, \emph{CD8 T cells}, +#' \emph{endothelial cells}, \emph{Macrophages} and \emph{NK cells}. #' #' These were obtained from single-cell RNA-seq data from 9 donors from -#' the publication of Tirosh et al., 2016, Science. The samples -#' come from cancer metastases of melanoma (extracted from primary tumors -#' and non-lymphoid tissue metastases). The classification for each sample with -#' respect to each cell type is the one given by Tirosh et al. +#' the publication of \cite{Tirosh et al., 2016, Science}. The samples +#' come from melanoma tumors (extracted from primary tumors and non-lymphoid +#' tissue metastases). The classification for each sample with +#' respect to each cell type is the one given by Tirosh et al., except for +#' the CD4 T cells and CD8 T cells, that were identified from the T cells based +#' on the expression of CD4, CD8A and CD8B as described in \cite{EPIC} +#' publication. #' #' @format A list of 3 elements: \describe{ \item{$refProfiles, -#' $refProfiles.var}{Matrices (nGenes x nRefCells) of the normalized gene -#' expression from the reference cells and the variability of this gene -#' expression for each gene and each cell type} \item{$sigGenes}{A list of 80 +#' $refProfiles.var}{Matrices (nGenes x nRefCells) of the gene expression (in +#' TPM counts) from the reference cells and the variability of this gene +#' expression for each gene and each cell type} \item{$sigGenes}{A list of #' signature genes used to deconvolve the cell proportions} } #' #' @source \url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72056} -"TRef.tpm" +"TRef" #' Values of mRNA / cell for the main cell types. #' @@ -99,28 +95,32 @@ NULL #' \code{mRNA_cell_default} vector. "mRNA_cell_default" -#' Example dataset containing data from Hoek et al, 2015, PLoS One. +#' Example dataset containing data from lymph nodes from patients with +#' metastatic melanoma. #' -#' This dataset contains a subset of the full Hoek et al data. It contains only -#' the data from the two healthy donors PBMC before influenza vaccination. +#' This is the dataset obtained for \cite{EPIC} publication. It contains the +#' gene expression from lymph node samples from four patients with melanoma, and +#' it contains also the proportions of the main immune cell types and of +#' melanoma cells, as measured by FACS. #' #' @format This is a list of 3 elements: \describe{ -#' \item{$rawCounts}{(matrix of 51574 genes x 2 donors) The raw read counts -#' from the two donors. It has been obtained by mapping the original -#' fastq files to \emph{hg19} genome with help of \emph{tophat} and -#' \emph{htseq-count}.} -#' \item{$cellFractions.obs}{(matrix of 2 donors x 6 cell types) The -#' proportions of the different immune cells in the PBMC from the 2 donors, -#' as measured by FACS by Hoek et al.} -#' \item{$cellFractions.pred}{(matrix of 2 donors x 7 cell types) The -#' proportions of the different immune cells and of a potential additional -#' uncharacterized cell, as predicted by EPIC based on the rawCounts and -#' the profiles \code{reference=BRef}.} +#' \item{$counts}{(matrix of 49902 genes x 4 donors) The TPM normalized counts +#' from the four donors. It has been obtained by mapping RNA-seq data to +#' \emph{hg19} genome with help of \emph{RSEM}. Ensembl ID were then +#' converted to gene names, and genes with duplicated entries were +#' merged together by summing their counts.} +#' \item{$cellFractions.obs}{(matrix of 4 donors x 6 cell types) The +#' proportions of the different cell types measured by FACS (the +#' "other_cells" correspond to the live cells without any marker of the +#' other given cell types).} +#' \item{$cellFractions.pred}{(matrix of 4 donors x 8 cell types) The +#' proportions of the different cell types, as predicted by EPIC based on +#' the reference profiles \code{TRef}.} #' } #' #' @source The description of this data can be found here: -#' \href{http://dx.doi.org/10.1371/journal.pone.0118528}{link to paper} -#' and \href{https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-64655}{link +#' \href{http://www.biorxiv.org/content/early/2017/03/17/117788}{link to paper} +#' and \href{https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93722}{link #' to data}. -"hoek_data" +"melanoma_data" diff --git a/R/EPIC_fun.R b/R/EPIC_fun.R index 8493c1f..eb883fa 100644 --- a/R/EPIC_fun.R +++ b/R/EPIC_fun.R @@ -19,26 +19,26 @@ #' gene signature list need to be the same format (gene symbols are used in the #' predefined reference profiles). The full list of gene names don't need to be #' exactly the same between the reference and bulk samples: \emph{EPIC} -#' will use the intersect of the genes. +#' will use the intersection of the genes. #' -#' @param bulk A matrix (\code{nGenes} x \code{nSamples}) of the raw genes -#' expression from each bulk sample. This matrix needs to have rownames -#' telling the gene names. In principle given as raw read counts, but it -#' could be given in tpm as well if using corresponding reference profiles -#' normalized in tpm as well. +#' @param bulk A matrix (\code{nGenes} x \code{nSamples}) of the genes +#' expression from each bulk sample (the counts should be given in TPM +#' or RPKM when using the prebuilt reference profiles). This matrix needs to +#' have rownames telling the gene names (corresponds to the gene symbol in +#' the prebuilt reference profiles (e.g. CD8A, MS4A1) - no conversion of IDs +#' is performed at the moment). #' @param reference (optional): A string or a list defining the reference cells. #' It can take multiple formats, either: \itemize{ #' \item\code{NULL}: to use the default reference profiles and genes -#' signature \code{\link{BRef}}. -#' \item a char: one of \emph{"BRef"}, \emph{"BRef.tpm"} or \emph{"TRef.tpm"} +#' signature \code{\link{TRef}}. +#' \item a char: either \emph{"BRef"} or \emph{"TRef"} #' to use the reference cells and genes signature of the corresponding -#' datasets (see \code{\link{BRef}}, \code{\link{BRef.tpm}} and -#' \code{\link{TRef.tpm}}). +#' datasets (see \code{\link{BRef}} and \code{\link{TRef}}). #' \item a list. When a list it should include: \describe{ #' \item{\code{$refProfiles}}{a matrix (\code{nGenes} x \code{nCellTypes}) #' of the reference cells genes expression (without the cancer cell type); #' the rownames needs to be defined as well as the colnames giving the -#' names of each reference cell types; +#' names of each gene and reference cell types respectively; #' } #' \item{\code{$sigGenes}}{a character vector of the gene names to use as #' signature - sigGenes can also be given as a direct input to EPIC @@ -63,11 +63,8 @@ #' we would use a value of 3.5 for the "otherCells" that didn't have any #' reference profile and a default value of 1 for the Tcells when computing #' the cell fractions). -#' To note: if the data is given as raw counts, then mRNA per cell should -#' correspond to some weight of mRNA per cell (or to a total number of mRNA -#' nucleotides per cell); while if data is in tpm, then this mRNA per cell -#' would ideally correspond more to some number of transcripts per cell. -#' The default values correspond to data in raw counts. +#' To note: if data is in tpm, this mRNA per cell would ideally correspond +#' to some number of transcripts per cell. #' @param mRNA_cell_sub (optional): This can be given instead of \code{mRNA_cell} (or #' in addition to it). It is also a named numeric vector, used to replace #' only the mRNA/cell values from some cell types (or to add values for new @@ -78,6 +75,30 @@ #' reference as the "reference$sigGenes" but if we give a value for this #' input variable, it is these signature genes that will be used instead of #' the ones given with the reference profile. +#' @param scaleExprs (optional, default is TRUE): boolean telling if the bulk +#' samples and reference gene expression profiles should be rescaled based on +#' the list of genes in common between the them (such a rescaling is +#' recommanded). +#' @param withOtherCells (optional, default is TRUE): if EPIC should allow for +#' an additional cell type for which no gene expression reference profile is +#' available or if the bulk is assumed to be composed only of the cells with +#' reference profiles. +#' @param constrainedSum (optional, default is TRUE): tells if the sum of all +#' cell types should be constrained to be < 1. When +#' \code{withOtherCells=FALSE}, there is additionally a constrain the the sum +#' of all cell types with reference profiles must be > 0.99. +#' @param rangeBasedOptim (optional): when this is FALSE (the default), the +#' least square optimization is performed as described in \cite{EPIC} +#' paper, which is recommanded. +#' When this variable is TRUE, EPIC uses the variability of each gene +#' from the reference profiles in another way: instead of defining weights +#' (based on the variability) for the fit of each gene, we define a range of +#' values accessible for each gene (based on the gene expression value in +#' the reference profile +/- the variability values). The +#' error that the optimization tries to minimize is by how much +#' the predicted gene expression is outside of this allowed range of values. +#' +#' #' @return A list of 3 matrices:\describe{ #' \item{\code{mRNAProportions}}{(\code{nSamples} x (\code{nCellTypes+1})) the #' proportion of mRNA coming from all cell types with a ref profile + the @@ -93,31 +114,28 @@ #' } #' #' @examples -#' res1 <- EPIC(hoek_data$rawCounts) +#' res1 <- EPIC(melanoma_data$counts) #' res1$cellFractions -#' res2 <- EPIC(hoek_data$rawCounts, BRef) -#' res3 <- EPIC(bulk=hoek_data$rawCounts, reference=BRef) -#' res4 <- EPIC(hoek_data$rawCounts, reference="BRef") -#' res5 <- EPIC(hoek_data$rawCounts, mRNA_cell_sub=c(Bcells=1, otherCells=5)) -#' res6 <- EPIC(bulk=hoek_data$rawCounts, reference="BRef.tpm") +#' res2 <- EPIC(melanoma_data$counts, TRef) +#' res3 <- EPIC(bulk=melanoma_data$counts, reference=TRef) +#' res4 <- EPIC(melanoma_data$counts, reference="TRef") +#' res5 <- EPIC(melanoma_data$counts, mRNA_cell_sub=c(Bcells=1, otherCells=5)) #' # Various possible ways of calling EPIC function. res 1 to 4 should #' # give exactly the same outputs, and the elements res1$cellFractions #' # should be equal to the example predictions found in -#' # hoek_data$cellFractions.pred for these first 4 results. +#' # melanoma_data$cellFractions.pred for these first 4 results. #' # The values of cellFraction for res5 will be different due to the use of #' # other mRNA per cell values for the B and other cells. -#' # And res6 will also give different results and is not advised: the reference -#' # BRef.tpm corresponds to tpm while the bulk was given as raw counts. #' #' @export EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL, - sigGenes=NULL, minFunStr="minFun1", scaleExprs=TRUE, - withOtherCells=TRUE, constrainedSum=TRUE){ + sigGenes=NULL, scaleExprs=TRUE, withOtherCells=TRUE, + constrainedSum=TRUE, rangeBasedOptim=FALSE){ # First get the value of the reference profiles depending on the input # 'reference'. with_w <- TRUE if (is.null(reference)){ - reference <- EPIC::BRef + reference <- EPIC::TRef } else if (is.character(reference)){ if (reference %in% prebuiltRefNames){ reference <- get(reference, pos="package:EPIC") @@ -192,14 +210,13 @@ EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL, " matching common genes between bulk and reference profiles,", " but there should be more signature genes than reference cells") - if (length(commonGenes) < 2e3) - warning("there are few genes in common between the bulk samples and ", - "reference cells:", length(commonGenes), ", so the normalization ", - "might be an issue") - # The value of 2e3 is arbitrary, but should be a respectable number for the - # data renormalization. - if (scaleExprs){ + if (length(commonGenes) < 2e3) + warning("there are few genes in common between the bulk samples and ", + "reference cells:", length(commonGenes), ", so the data scaling ", + "might be an issue") + # The value of 2e3 is arbitrary, but should be a respectable number for the + # data renormalization. bulk <- scaleCounts(bulk, sigGenes, commonGenes)$counts temp <- scaleCounts(refProfiles, sigGenes, commonGenes) refProfiles <- temp$counts @@ -223,15 +240,14 @@ EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL, mRNA_cell[names(mRNA_cell_sub)] <- mRNA_cell_sub } - minFunList <- list() - minFunList$minFun1 <- function(x, A, b, w){ + minFun <- function(x, A, b, w){ # Basic minimization function used to minimize the squared sum of the error # between the fit and observed value (A*x - b). We also give a weight, w, # for each gene to give more or less importance to the fit of each (can use # a value 1 if don't want to give any weight). return(sum( (w * (A %*% x - b)^2 ), na.rm = TRUE)) } - minFunList$minFun.range <- function(x, A, b, A.var){ + minFun.range <- function(x, A, b, A.var){ # Other minimization function where we don't use weights but instead give # also the variability on A as input and we'll compute for each gene the # min / max value of the pred based on this variability to keep the smallest @@ -248,7 +264,7 @@ EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL, return(sum(cErr, na.rm = TRUE)) } - if (with_w){ + if (with_w && !rangeBasedOptim){ # Computing the weight to give to each gene w <- rowSums(refProfiles / (refProfiles.var + 1e-12), na.rm=TRUE) # 1e-12 to avoid divisions by 0: like this if refProfiles and refProfiles.var @@ -284,17 +300,14 @@ EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL, # equal. We added the -1e-5 in cInitProp because the optimizer needs to have # the initial guess inside the admissible region and not on its boundary - minFun <- minFunList[[minFunStr]] - # minFun <- minFun1 - # Estimating for each sample the proportion of the mRNA per cell type. tempPropPred <- lapply(1:nSamples, FUN=function(cSample){ b <- bulk[,cSample] - if (minFunStr != "minFun.range"){ + if (!rangeBasedOptim){ fit <- stats::constrOptim(theta = rep(cInitProp, nRefCells), f=minFun, grad=NULL, ui=ui, ci=ci, A=refProfiles, b=b, w=w) } else { - fit <- stats::constrOptim(theta = rep(cInitProp, nRefCells), f=minFun, + fit <- stats::constrOptim(theta = rep(cInitProp, nRefCells), f=minFun.range, grad=NULL, ui=ui, ci=ci, A=refProfiles, b=b, A.var=refProfiles.var) } fit$x <- fit$par @@ -319,29 +332,24 @@ EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL, } regLine <- stats::lm(b_estimated ~ b) regLine_through0 <- stats::lm(b_estimated ~ b+0) - if (minFunStr != "minFun.range"){ - gof <- data.frame(fit$convergence, ifelse(is.null(fit$message), "", fit$message), - sqrt(minFun(x=fit$x, A=refProfiles, b=b, w=w)/nSigGenes), - sqrt(minFun(x=rep(0,nRefCells), A=refProfiles, b=b, w=w)/nSigGenes), - # to only have sum((w*b)^2) or corresponding value based - # on the minFun, in the worst case possible. - corSp.test$estimate, corSp.test$p.value, - corPear.test$estimate, corPear.test$p.value, - regLine$coefficients[2], regLine$coefficients[1], - regLine_through0$coefficients[1], sum(fit$x), - stringsAsFactors=F) + if (!rangeBasedOptim){ + rmse_pred <- sqrt(minFun(x=fit$x, A=refProfiles, b=b, w=w)/nSigGenes) + rmse_0 <- sqrt(minFun(x=rep(0,nRefCells), A=refProfiles, b=b, w=w)/nSigGenes) } else { - gof <- data.frame(fit$convergence, ifelse(is.null(fit$message), "", fit$message), - sqrt(minFun(x=fit$x, A=refProfiles, b=b, A.var=refProfiles.var)/nSigGenes), - sqrt(minFun(x=rep(0,nRefCells), A=refProfiles, b=b, A.var=refProfiles.var)/nSigGenes), - # to only have sum((w*b)^2) or corresponding value based - # on the minFun, in the worst case possible. - corSp.test$estimate, corSp.test$p.value, - corPear.test$estimate, corPear.test$p.value, - regLine$coefficients[2], regLine$coefficients[1], - regLine_through0$coefficients[1], sum(fit$x), - stringsAsFactors=F) + rmse_pred <- sqrt(minFun.range(x=fit$x, A=refProfiles, b=b, + A.var=refProfiles.var)/nSigGenes) + rmse_0 <- sqrt(minFun.range(x=rep(0,nRefCells), A=refProfiles, b=b, + A.var=refProfiles.var)/nSigGenes) } + gof <- data.frame(fit$convergence, ifelse(is.null(fit$message), "", fit$message), + rmse_pred, rmse_0, + # to only have sum((w*b)^2) or corresponding value based + # on the minFun, in the worst case possible. + corSp.test$estimate, corSp.test$p.value, + corPear.test$estimate, corPear.test$p.value, + regLine$coefficients[2], regLine$coefficients[1], + regLine_through0$coefficients[1], sum(fit$x), + stringsAsFactors=F) return(list(mRNAProportions=fit$x, fit.gof=gof)) } ) @@ -358,7 +366,7 @@ EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL, # Some matrix giving information on the goodness of the fit. if (any(fit.gof$convergeCode != 0)) - warning("The optimization didn't converge for some samples:\n", + warning("The optimization didn't fully converge for some samples:\n", paste(samplesNames[fit.gof$convergeCode!=0], collapse="; "), "\n - check fit.gof for the convergeCode and convergeMessage") diff --git a/R/sysdata.rda b/R/sysdata.rda index 76e61bf..3d56927 100644 Binary files a/R/sysdata.rda and b/R/sysdata.rda differ diff --git a/README.Rmd b/README.Rmd index b438aaa..b4e02cc 100644 --- a/README.Rmd +++ b/README.Rmd @@ -15,14 +15,17 @@ knitr::opts_chunk$set( # EPIC - package ## Description -Package implementing EPIC method to estimate the proportion of immune and cancer -cells from bulk gene expression data. It is based on reference gene expression -profiles for the main immune cell types and it predicts the proportion of these -cells and of the remaining "other cells" that are the remaining cells (cancer, -stromal, endothelial) for which no reference profile is given. +Package implementing EPIC method to estimate the proportion of immune, stromal, +endothelial and cancer or other cells from bulk gene expression data. +It is based on reference gene expression profiles for the main non-malignant +cell types and it predicts the proportion of these cells and of the remaining +"other cells" (that are mostly cancer cells) for which no reference profile is +given. ## Usage -The main function in this package is `EPIC`. It needs as input a matrix of the raw gene expression counts from the samples for which to estimate cell proportions. One can also define the reference cells to use +The main function in this package is `EPIC`. It needs as input a matrix of the +TPM (or RPKM) gene expression from the samples for which to estimate cell +proportions. One can also define the reference cells to use ```{r, eval = FALSE} out <- EPIC(bulk = bulkSamplesMatrix) out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList) diff --git a/data/BRef.rda b/data/BRef.rda index 526af21..021bef0 100644 Binary files a/data/BRef.rda and b/data/BRef.rda differ diff --git a/data/BRef.tpm.rda b/data/BRef.tpm.rda deleted file mode 100644 index 5815ea0..0000000 Binary files a/data/BRef.tpm.rda and /dev/null differ diff --git a/data/BRef_s.rda b/data/BRef_s.rda deleted file mode 100644 index 540bbf5..0000000 Binary files a/data/BRef_s.rda and /dev/null differ diff --git a/data/BRef_s.tpm.rda b/data/BRef_s.tpm.rda deleted file mode 100644 index 51aa7e4..0000000 Binary files a/data/BRef_s.tpm.rda and /dev/null differ diff --git a/data/TRef.rda b/data/TRef.rda new file mode 100644 index 0000000..f803061 Binary files /dev/null and b/data/TRef.rda differ diff --git a/data/TRef.tpm.rda b/data/TRef.tpm.rda deleted file mode 100644 index a47ef7f..0000000 Binary files a/data/TRef.tpm.rda and /dev/null differ diff --git a/data/TRef_s.tpm.rda b/data/TRef_s.tpm.rda deleted file mode 100644 index 264f336..0000000 Binary files a/data/TRef_s.tpm.rda and /dev/null differ diff --git a/data/hoek_data.rda b/data/hoek_data.rda deleted file mode 100644 index c9fd0b1..0000000 Binary files a/data/hoek_data.rda and /dev/null differ diff --git a/data/melanoma_data.rda b/data/melanoma_data.rda new file mode 100644 index 0000000..c6eecdc Binary files /dev/null and b/data/melanoma_data.rda differ diff --git a/man/BRef.Rd b/man/BRef.Rd index be379f0..40b2988 100644 --- a/man/BRef.Rd +++ b/man/BRef.Rd @@ -3,12 +3,11 @@ \docType{data} \name{BRef} \alias{BRef} -\alias{BRef.tpm} \title{Reference profiles from circulating immune cells.} \format{A list of 3 elements: \describe{ \item{$refProfiles, - $refProfiles.var}{Matrices (nGenes x nRefCells) of the normalized gene - expression from the reference cells and the variability of this gene - expression for each gene and each cell type} \item{$sigGenes}{A list of 100 + $refProfiles.var}{Matrices (nGenes x nRefCells) of the gene expression (in + TPM counts) from the reference cells and the variability of this gene + expression for each gene and each cell type} \item{$sigGenes}{A list of signature genes used to deconvolve the cell proportions} }} \source{ \enumerate{ @@ -19,25 +18,15 @@ } \usage{ BRef - -BRef.tpm } \description{ A dataset containing the reference profiles obtained from immune cell -samples of \emph{B-}, \emph{NK-}, \emph{T-cells}, \emph{Monocytes} -and \emph{Neutrophils} purified from PBMC or whole blood. +samples of \emph{B cells}, \emph{CD4 T cells}, \emph{CD8 T cells}, +\emph{Monocytes}, \emph{NK cells} and \emph{Neutrophils}, purified from +PBMC or whole blood. } \details{ The original samples were obtained from healthy donors and donors after influenza vaccination or with diabetes, sepsis or multiple sclerosis. } -\section{Similar datasets}{ - - \code{BRef} (main reference profiles, using data from sources 1-3 below) - - -\code{BRef.tpm} (reference profiles based on same data as \code{BRef}, - but given in TPM counts instead of raw counts) -} - \keyword{datasets} diff --git a/man/EPIC.Rd b/man/EPIC.Rd index 13e8776..0f71ce3 100644 --- a/man/EPIC.Rd +++ b/man/EPIC.Rd @@ -5,36 +5,39 @@ \title{Estimate the proportion of immune and cancer cells.} \usage{ EPIC(bulk, reference = NULL, mRNA_cell = NULL, mRNA_cell_sub = NULL, - sigGenes = NULL) + sigGenes = NULL, scaleExprs = TRUE, withOtherCells = TRUE, + constrainedSum = TRUE, rangeBasedOptim = FALSE) } \arguments{ -\item{bulk}{A matrix (\code{nGenes} x \code{nSamples}) of the raw genes -expression from each bulk sample. This matrix needs to have rownames -telling the gene names. In principle given as raw read counts, but it -could be given in tpm as well if using corresponding reference profiles -normalized in tpm as well.} +\item{bulk}{A matrix (\code{nGenes} x \code{nSamples}) of the genes +expression from each bulk sample (the counts should be given in TPM +or RPKM when using the prebuilt reference profiles). This matrix needs to +have rownames telling the gene names (corresponds to the gene symbol in +the prebuilt reference profiles (e.g. CD8A, MS4A1) - no conversion of IDs +is performed at the moment).} \item{reference}{(optional): A string or a list defining the reference cells. It can take multiple formats, either: \itemize{ \item\code{NULL}: to use the default reference profiles and genes - signature \code{\link{BRef}}. - \item a char: one of \emph{"BRef"}, \emph{"BRef.tpm"} or \emph{"TRef.tpm"} + signature \code{\link{TRef}}. + \item a char: either \emph{"BRef"} or \emph{"TRef"} to use the reference cells and genes signature of the corresponding - datasets (see \code{\link{BRef}}, \code{\link{BRef.tpm}} and - \code{\link{TRef.tpm}}). + datasets (see \code{\link{BRef}} and \code{\link{TRef}}). \item a list. When a list it should include: \describe{ \item{\code{$refProfiles}}{a matrix (\code{nGenes} x \code{nCellTypes}) of the reference cells genes expression (without the cancer cell type); the rownames needs to be defined as well as the colnames giving the - names of each reference cell types; + names of each gene and reference cell types respectively; } \item{\code{$sigGenes}}{a character vector of the gene names to use as signature - sigGenes can also be given as a direct input to EPIC function;} \item{\code{$refProfiles.var}}{(optional): a matrix (\code{nGenes} x \code{nCellTypes}) of the variability of each gene expression for each - cell type (if absent, we assume an identical variability for all genes - in all cells) - it needs to have the same dimnames than refProfiles;} + cell type, which is used to define weights on each gene for the + optimization (if this is absent, we assume an identical variability + for all genes in all cells) - it needs to have the same dimnames than + refProfiles;} } }} @@ -50,11 +53,8 @@ default=1), then if the refProfiles described Bcells, NKcells and Tcells, we would use a value of 3.5 for the "otherCells" that didn't have any reference profile and a default value of 1 for the Tcells when computing the cell fractions). -To note: if the data is given as raw counts, then mRNA per cell should -correspond to some weight of mRNA per cell (or to a total number of mRNA -nucleotides per cell); while if data is in tpm, then this mRNA per cell -would ideally correspond more to some number of transcripts per cell. -The default values correspond to data in raw counts.} +To note: if data is in tpm, this mRNA per cell would ideally correspond +to some number of transcripts per cell.} \item{mRNA_cell_sub}{(optional): This can be given instead of \code{mRNA_cell} (or in addition to it). It is also a named numeric vector, used to replace @@ -67,6 +67,32 @@ signature for the deconvolution. In principle this is given with the reference as the "reference$sigGenes" but if we give a value for this input variable, it is these signature genes that will be used instead of the ones given with the reference profile.} + +\item{scaleExprs}{(optional, default is TRUE): boolean telling if the bulk +samples and reference gene expression profiles should be rescaled based on +the list of genes in common between the them (such a rescaling is +recommanded).} + +\item{withOtherCells}{(optional, default is TRUE): if EPIC should allow for +an additional cell type for which no gene expression reference profile is +available or if the bulk is assumed to be composed only of the cells with +reference profiles.} + +\item{constrainedSum}{(optional, default is TRUE): tells if the sum of all +cell types should be constrained to be < 1. When +\code{withOtherCells=FALSE}, there is additionally a constrain the the sum +of all cell types with reference profiles must be > 0.99.} + +\item{rangeBasedOptim}{(optional): when this is FALSE (the default), the +least square optimization is performed as described in \cite{EPIC} +paper, which is recommanded. +When this variable is TRUE, EPIC uses the variability of each gene +from the reference profiles in another way: instead of defining weights +(based on the variability) for the fit of each gene, we define a range of +values accessible for each gene (based on the gene expression value in +the reference profile +/- the variability values). The +error that the optimization tries to minimize is by how much +the predicted gene expression is outside of this allowed range of values.} } \value{ A list of 3 matrices:\describe{ @@ -96,23 +122,20 @@ The names of the genes in the bulk samples, the reference samples and in the gene signature list need to be the same format (gene symbols are used in the predefined reference profiles). The full list of gene names don't need to be exactly the same between the reference and bulk samples: \emph{EPIC} -will use the intersect of the genes. +will use the intersection of the genes. } \examples{ -res1 <- EPIC(hoek_data$rawCounts) +res1 <- EPIC(melanoma_data$counts) res1$cellFractions -res2 <- EPIC(hoek_data$rawCounts, BRef) -res3 <- EPIC(bulk=hoek_data$rawCounts, reference=BRef) -res4 <- EPIC(hoek_data$rawCounts, reference="BRef") -res5 <- EPIC(hoek_data$rawCounts, mRNA_cell_sub=c(Bcells=1, otherCells=5)) -res6 <- EPIC(bulk=hoek_data$rawCounts, reference="BRef.tpm") +res2 <- EPIC(melanoma_data$counts, TRef) +res3 <- EPIC(bulk=melanoma_data$counts, reference=TRef) +res4 <- EPIC(melanoma_data$counts, reference="TRef") +res5 <- EPIC(melanoma_data$counts, mRNA_cell_sub=c(Bcells=1, otherCells=5)) # Various possible ways of calling EPIC function. res 1 to 4 should # give exactly the same outputs, and the elements res1$cellFractions # should be equal to the example predictions found in -# hoek_data$cellFractions.pred for these first 4 results. +# melanoma_data$cellFractions.pred for these first 4 results. # The values of cellFraction for res5 will be different due to the use of # other mRNA per cell values for the B and other cells. -# And res6 will also give different results and is not advised: the reference -# BRef.tpm corresponds to tpm while the bulk was given as raw counts. } diff --git a/man/EPIC.package.Rd b/man/EPIC.package.Rd index 34870d5..79593e3 100644 --- a/man/EPIC.package.Rd +++ b/man/EPIC.package.Rd @@ -7,9 +7,9 @@ \title{EPIC: a package to Estimate the Proportion of Immune and Cancer cells from tumor gene expression data.} \description{ -EPIC package provides the function and immune cell reference profiles to -estimate the proportion of immune and other cells from bulk gene expression -data. +EPIC package provides the function and cell reference profiles to +estimate the proportion of immune, stromal, endothelial and cancer or other +cells from bulk gene expression data. } \section{EPIC functions}{ @@ -19,14 +19,13 @@ data. \section{Included datasets}{ -\code{\link{BRef}}, \code{\link{BRef.tpm}}: reference profiles from - circulating immune cells. +\code{\link{BRef}}: reference profiles from circulating immune cells. - \code{\link{TRef.tpm}}: reference profiles from immune cells obtained - from single cell data of tumor infiltrating cells from melanoma patients. + \code{\link{TRef}}: reference profiles from tumor infiltrating non-malignant + cells obtained from single cell data of melanoma patients. -\code{\link{hoek_data}}: example dataset containing data from Hoek et al, - 2015, PLoS One. +\code{\link{melanoma_data}}: example dataset containing data from lymph nodes + from patients with metastatic melanoma. \code{\link{mRNA_cell_default}}: values of mRNA per cell for the main cell types. diff --git a/man/TRef.Rd b/man/TRef.Rd new file mode 100644 index 0000000..2af2b18 --- /dev/null +++ b/man/TRef.Rd @@ -0,0 +1,35 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_descr.R +\docType{data} +\name{TRef} +\alias{TRef} +\title{Reference profiles obtained from single cell data of tumor infiltrating +cells.} +\format{A list of 3 elements: \describe{ \item{$refProfiles, + $refProfiles.var}{Matrices (nGenes x nRefCells) of the gene expression (in + TPM counts) from the reference cells and the variability of this gene + expression for each gene and each cell type} \item{$sigGenes}{A list of + signature genes used to deconvolve the cell proportions} }} +\source{ +\url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72056} +} +\usage{ +TRef +} +\description{ +A dataset containing the reference profiles obtained from various +tumor infiltrating non-malignant cell types: \emph{B cells}, +\emph{cancer-associated fibroblasts}, \emph{CD4 T cells}, \emph{CD8 T cells}, +\emph{endothelial cells}, \emph{Macrophages} and \emph{NK cells}. +} +\details{ +These were obtained from single-cell RNA-seq data from 9 donors from +the publication of \cite{Tirosh et al., 2016, Science}. The samples +come from melanoma tumors (extracted from primary tumors and non-lymphoid +tissue metastases). The classification for each sample with +respect to each cell type is the one given by Tirosh et al., except for +the CD4 T cells and CD8 T cells, that were identified from the T cells based +on the expression of CD4, CD8A and CD8B as described in \cite{EPIC} +publication. +} +\keyword{datasets} diff --git a/man/TRef.tpm.Rd b/man/TRef.tpm.Rd deleted file mode 100644 index f1511c9..0000000 --- a/man/TRef.tpm.Rd +++ /dev/null @@ -1,30 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/EPIC_descr.R -\docType{data} -\name{TRef.tpm} -\alias{TRef.tpm} -\title{Reference profiles obtained from single cell data of tumor infiltrating -cells.} -\format{A list of 3 elements: \describe{ \item{$refProfiles, - $refProfiles.var}{Matrices (nGenes x nRefCells) of the normalized gene - expression from the reference cells and the variability of this gene - expression for each gene and each cell type} \item{$sigGenes}{A list of 80 - signature genes used to deconvolve the cell proportions} }} -\source{ -\url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72056} -} -\usage{ -TRef.tpm -} -\description{ -A dataset containing the reference profiles given in TPM from various cell -types: \emph{B-}, \emph{NK-}, \emph{T-cells} and \emph{Macrophages}. -} -\details{ -These were obtained from single-cell RNA-seq data from 9 donors from -the publication of Tirosh et al., 2016, Science. The samples -come from cancer metastases of melanoma (extracted from primary tumors -and non-lymphoid tissue metastases). The classification for each sample with -respect to each cell type is the one given by Tirosh et al. -} -\keyword{datasets} diff --git a/man/hoek_data.Rd b/man/hoek_data.Rd deleted file mode 100644 index 4f6f9bb..0000000 --- a/man/hoek_data.Rd +++ /dev/null @@ -1,33 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/EPIC_descr.R -\docType{data} -\name{hoek_data} -\alias{hoek_data} -\title{Example dataset containing data from Hoek et al, 2015, PLoS One.} -\format{This is a list of 3 elements: \describe{ - \item{$rawCounts}{(matrix of 51574 genes x 2 donors) The raw read counts - from the two donors. It has been obtained by mapping the original - fastq files to \emph{hg19} genome with help of \emph{tophat} and - \emph{htseq-count}.} - \item{$cellFractions.obs}{(matrix of 2 donors x 6 cell types) The - proportions of the different immune cells in the PBMC from the 2 donors, - as measured by FACS by Hoek et al.} - \item{$cellFractions.pred}{(matrix of 2 donors x 7 cell types) The - proportions of the different immune cells and of a potential additional - uncharacterized cell, as predicted by EPIC based on the rawCounts and - the profiles \code{reference=BRef}.} -}} -\source{ -The description of this data can be found here: - \href{http://dx.doi.org/10.1371/journal.pone.0118528}{link to paper} - and \href{https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-64655}{link - to data}. -} -\usage{ -hoek_data -} -\description{ -This dataset contains a subset of the full Hoek et al data. It contains only -the data from the two healthy donors PBMC before influenza vaccination. -} -\keyword{datasets} diff --git a/man/melanoma_data.Rd b/man/melanoma_data.Rd new file mode 100644 index 0000000..b8c4d2b --- /dev/null +++ b/man/melanoma_data.Rd @@ -0,0 +1,37 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_descr.R +\docType{data} +\name{melanoma_data} +\alias{melanoma_data} +\title{Example dataset containing data from lymph nodes from patients with +metastatic melanoma.} +\format{This is a list of 3 elements: \describe{ + \item{$counts}{(matrix of 49902 genes x 4 donors) The TPM normalized counts + from the four donors. It has been obtained by mapping RNA-seq data to + \emph{hg19} genome with help of \emph{RSEM}. Ensembl ID were then + converted to gene names, and genes with duplicated entries were + merged together by summing their counts.} + \item{$cellFractions.obs}{(matrix of 4 donors x 6 cell types) The + proportions of the different cell types measured by FACS (the + "other_cells" correspond to the live cells without any marker of the + other given cell types).} + \item{$cellFractions.pred}{(matrix of 4 donors x 8 cell types) The + proportions of the different cell types, as predicted by EPIC based on + the reference profiles \code{TRef}.} +}} +\source{ +The description of this data can be found here: + \href{http://www.biorxiv.org/content/early/2017/03/17/117788}{link to paper} + and \href{https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93722}{link + to data}. +} +\usage{ +melanoma_data +} +\description{ +This is the dataset obtained for \cite{EPIC} publication. It contains the +gene expression from lymph node samples from four patients with melanoma, and +it contains also the proportions of the main immune cell types and of +melanoma cells, as measured by FACS. +} +\keyword{datasets} diff --git a/tests/testthat/test_EPIC_fun.R b/tests/testthat/test_EPIC_fun.R index 120820c..abc17a1 100644 --- a/tests/testthat/test_EPIC_fun.R +++ b/tests/testthat/test_EPIC_fun.R @@ -4,19 +4,19 @@ test_that("Test of bad inputs", { a <- matrix(1:20, nrow=5) expect_error(EPIC(bulk=a), "There are only 0 signature") refStr <- "unknownRef" - expect_error(EPIC(bulk=hoek_data$rawCounts, reference=refStr), + expect_error(EPIC(bulk=melanoma_data$counts, reference=refStr), paste0(refStr, ".* not part of the allowed references")) tRef <- BRef; tRef$sigGenes <- NULL - expect_error(EPIC(bulk=hoek_data$rawCounts, reference=tRef), + expect_error(EPIC(bulk=melanoma_data$counts, reference=tRef), "needs to contain .* 'sigGenes'") tRef <- BRef; tRef$refProfiles.var <- tRef$refProfiles.var[ nrow(tRef$refProfiles.var):1,] - expect_error(EPIC(bulk=hoek_data$rawCounts, reference=tRef), + expect_error(EPIC(bulk=melanoma_data$counts, reference=tRef), "dimnames of .*refProfiles.*refProfiles.var.*same") }) -test_that("Test for correct result on hoek data with default input", { - testFract <- EPIC(hoek_data$rawCounts)$cellFractions - expect_equal(testFract, hoek_data$cellFractions.pred) +test_that("Test for correct result on melanoma data with default input", { + testFract <- EPIC(melanoma_data$counts)$cellFractions + expect_equal(testFract, melanoma_data$cellFractions.pred) }) diff --git a/vignettes/info.Rmd b/vignettes/info.Rmd index ae4a198..5987de5 100644 --- a/vignettes/info.Rmd +++ b/vignettes/info.Rmd @@ -10,14 +10,17 @@ vignette: > --- ## Description -Package implementing EPIC method to estimate the proportion of immune and cancer -cells from bulk gene expression data. It is based on reference gene expression -profiles for the main immune cell types and it predicts the proportion of these -cells and of the remaining "other cells" that are the remaining cells (cancer, -stromal, endothelial) for which no reference profile is given. +Package implementing EPIC method to estimate the proportion of immune, stromal, +endothelial and cancer or other cells from bulk gene expression data. +It is based on reference gene expression profiles for the main non-malignant +cell types and it predicts the proportion of these cells and of the remaining +"other cells" (that are mostly cancer cells) for which no reference profile is +given. ## Usage -The main function in this package is `EPIC`. It needs as input a matrix of the raw gene expression counts from the samples for which to estimate cell proportions. One can also define the reference cells to use +The main function in this package is `EPIC`. It needs as input a matrix of the +TPM (or RPKM) gene expression from the samples for which to estimate cell +proportions. One can also define the reference cells to use ```{r, eval = FALSE} out <- EPIC(bulk = bulkSamplesMatrix) out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList)