Large matrix errors (more than 2^31-1 non-zero entries) [on large datasets] #29
Comments
In my attempts, "Alternative C" seems to give more biologically meaningful results than "Alternative B". In my data, the expression level and the proportion of expressing cells for genes that are expressed or not expressed in particular cell types (and in the comparison between Normal and Tumor) are more accurate.
@JunZhao1990 @linqiaozhi
P.s.
I'm working with a matrix of over 200,000 cells and 36,000 genes.
I first tried the 'RunALRA' function in Seurat. Then I extracted the expression table, converted it into a matrix, and attempted to use ALRA (including alra.low.memory), but encountered the following error:
"Attempting to construct a sparseMatrix with at least 2^31-1 non-zero entries."
It appears that the dgCMatrix conversion fails because the large matrix exceeds this limit. Modifying the ALRA function code to use a general (dense) matrix format instead of dgCMatrix is possible, but running it in practice is difficult because the matrix is close to 100 GB in size.
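As a rough check on the numbers above, the following sketch works through the arithmetic. The dimensions are the approximate ones quoted in this issue, and the 8 bytes per entry assumes double-precision storage; the result covers a single dense copy only, and any working copies made during imputation would add to it.

```python
# Approximate dimensions quoted in the issue ("over 200,000 cells and 36,000 genes").
n_cells = 200_000
n_genes = 36_000

# 2^31 - 1, the limit named in the error message.
INT32_MAX = 2**31 - 1

total_entries = n_cells * n_genes

# Largest fraction of entries that can be non-zero before the count exceeds the limit.
max_density = INT32_MAX / total_entries

# Size of ONE dense copy, assuming 8-byte doubles (an assumption, not from the issue).
dense_gib = total_entries * 8 / 2**30

print(f"total entries:           {total_entries:,}")
print(f"exceeds 2^31-1 if dense: {total_entries > INT32_MAX}")
print(f"max non-zero fraction:   {max_density:.3f}")
print(f"one dense double copy:   {dense_gib:.1f} GiB")
```

Under these assumptions, the sparse representation stops fitting once roughly 30% of the entries are non-zero, which an imputed (and therefore much denser) matrix can plausibly exceed.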
If there is a function or method to address this issue with large datasets like mine, I would appreciate any suggestions. Below are the alternatives I am currently considering. I would be grateful if you could share your opinions on them as well.
Currently, I am considering the following three alternatives:
For Alternative A, imputation is performed on each sample separately and the results are then integrated into one dataset. However, based on the experiences other users have reported here, normalizing and imputing the integrated data seems to yield more accurate results.
For Alternative B, the integrated data is normalized and imputation is then performed on a reduced number of genes. However, the results may show different trends from imputation performed on the entire gene set.
For Alternative C (if cell type information is known), the integrated data is normalized first and then subset by cell type. Imputation is performed on each cell type separately and the results are then integrated again. I think this alternative has the advantage of allowing all genes to be used. In addition, certain genes may not be expressed at all, or may be expressed only in certain cell types, and I hope this approach can reflect the biological fact that a gene may be expressed only in particular cells. Since normalization is performed on the entire cell population, I believe it will still be possible to compare expression levels between cell types in the integrated data after running ALRA imputation on each cell type. If you have any suggestions for revising this plan, I would appreciate hearing them.
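The split, impute, and recombine flow of Alternative C can be sketched as follows. This is only an illustration of the bookkeeping, not ALRA: `fill_zeros_with_group_mean` is a hypothetical toy placeholder standing in for the real imputation step, and `impute_per_cell_type` is likewise a name invented for this sketch. The input is assumed to be already normalized across all cells together.

```python
from collections import defaultdict

def fill_zeros_with_group_mean(block):
    """Toy placeholder, NOT ALRA: replace each zero with the gene's mean
    non-zero value within this block of cells."""
    n_genes = len(block[0])
    filled = [row[:] for row in block]
    for g in range(n_genes):
        nonzero = [row[g] for row in block if row[g] != 0]
        if not nonzero:
            continue  # gene absent in this cell type: leave it at zero
        mean = sum(nonzero) / len(nonzero)
        for row in filled:
            if row[g] == 0:
                row[g] = mean
    return filled

def impute_per_cell_type(norm_expr, cell_types, impute=fill_zeros_with_group_mean):
    """Split rows (cells) by label, impute each block separately,
    and write the results back in the original cell order."""
    groups = defaultdict(list)
    for idx, label in enumerate(cell_types):
        groups[label].append(idx)
    out = [None] * len(norm_expr)
    for label, rows in groups.items():
        block = impute([norm_expr[i] for i in rows])
        for i, new_row in zip(rows, block):
            out[i] = new_row
    return out

# Toy data: 4 cells x 3 genes, assumed normalized over the whole population.
norm_expr = [
    [2.0, 0.0, 0.0],  # T cell
    [0.0, 0.0, 3.0],  # B cell
    [4.0, 0.0, 1.0],  # T cell
    [0.0, 0.0, 5.0],  # B cell
]
cell_types = ["T", "B", "T", "B"]

imputed = impute_per_cell_type(norm_expr, cell_types)
for label, row in zip(cell_types, imputed):
    print(label, row)
```

In this toy run the first gene stays at zero in the B cells because no B cell expresses it, which is the behaviour the per-cell-type approach is meant to preserve. Each per-type block would also need to stay under the 2^31-1 non-zero limit on its own.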