Large matrix errors (more than 2^31-1 non-zero entries) [on large datasets] #29

Open
ghost opened this issue Feb 28, 2024 · 3 comments

ghost commented Feb 28, 2024

I'm working with a matrix of over 200,000 cells and 36,000 genes.

I first tried the 'RunALRA' function in Seurat. Then I extracted the expression table, converted it into a matrix, and attempted to run ALRA directly (including alra.low.memory), but encountered the following error:

"Attempting to construct a sparseMatrix with at least 2^31-1 non-zero entries."

It appears that the conversion to dgCMatrix fails because the matrix exceeds the limit on the number of non-zero entries. Modifying the ALRA function code to use a general (dense) matrix format instead of dgCMatrix is possible, but running it is impractical because the matrix is nearly 100 GB in size.
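
As a rough check on the scale involved, the arithmetic can be sketched in a few lines of Python. The cell and gene counts are the round figures quoted in this issue (the real matrix has "over 200,000" cells, so these are lower bounds), and the 8 bytes per entry assumes double-precision storage for a single dense copy.

```python
# Rough size check using the round figures quoted in this issue.
N_CELLS = 200_000
N_GENES = 36_000
INT32_MAX = 2**31 - 1  # limit named in the error message

total_entries = N_CELLS * N_GENES

# Fraction of non-zero entries at which the limit is reached.
density_at_limit = INT32_MAX / total_entries

# One dense copy, assuming 8-byte doubles (an assumption, not from the issue).
dense_gb = total_entries * 8 / 1e9

print(f"total entries:      {total_entries:,}")
print(f"limit (2^31-1):     {INT32_MAX:,}")
print(f"density at limit:   {density_at_limit:.1%}")
print(f"one dense copy:     {dense_gb:.1f} GB")
```

With these figures the matrix has 7.2 billion entries, so the limit is reached once roughly 30% of them are non-zero, which an imputed matrix can plausibly exceed. Intermediate copies made during the computation would add to the single-copy estimate.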

If there is a function or method for handling large datasets like mine, I would appreciate any suggestions. I am currently considering the following three alternatives and would be grateful for your opinions on them:

Alternative A: impute each sample separately, then integrate the samples into one dataset. However, based on what other users have reported in the issues here, normalizing and imputing the integrated data seems to yield more accurate results.

Alternative B: normalize the integrated data, then impute on a reduced set of genes. However, the results may show different trends than imputation performed with all genes.

Alternative C (if cell type information is known): normalize the integrated data, then subset it by cell type, impute each cell type separately, and merge the results again. I think this alternative has the advantage that any gene can be used. It also fits the biology: some genes are not expressed at all, or are expressed only in certain cell types, and imputing within each cell type should preserve that. Because normalization is performed on the entire cell population before splitting, I believe expression levels can still be compared between cell types in the merged data after running ALRA imputation per cell type. If you have any suggestions for revising this reasoning, I would appreciate hearing them.
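
The split, impute, and merge pattern behind Alternatives A and C can be sketched as follows. This is a minimal illustration in plain Python, not ALRA itself: `impute_by_group` and `fill_zeros_with_column_mean` are hypothetical names, and the column-mean fill is only a toy stand-in for the real imputation step. Rows are cells and columns are genes; the labels would be sample IDs for Alternative A or cell types for Alternative C.

```python
from collections import defaultdict
from typing import Callable, Hashable, Sequence

Matrix = list[list[float]]  # rows are cells, columns are genes


def impute_by_group(
    matrix: Matrix,
    labels: Sequence[Hashable],
    impute_fn: Callable[[Matrix], Matrix],
) -> Matrix:
    """Apply impute_fn to each group of rows and restore the original row order."""
    if len(matrix) != len(labels):
        raise ValueError("need exactly one label per row")

    # Collect the row indices belonging to each group.
    groups: dict[Hashable, list[int]] = defaultdict(list)
    for row_idx, label in enumerate(labels):
        groups[label].append(row_idx)

    result: Matrix = [[] for _ in matrix]
    for row_indices in groups.values():
        block = [matrix[i] for i in row_indices]
        imputed = impute_fn(block)
        if len(imputed) != len(block):
            raise ValueError("impute_fn must return one row per input row")
        # Write each imputed row back to its original position.
        for i, row in zip(row_indices, imputed):
            result[i] = list(row)
    return result


def fill_zeros_with_column_mean(block: Matrix) -> Matrix:
    """Toy stand-in for ALRA (NOT a low-rank method): zeros become the column mean."""
    n_cells, n_genes = len(block), len(block[0])
    means = [sum(row[j] for row in block) / n_cells for j in range(n_genes)]
    return [[v if v != 0 else means[j] for j, v in enumerate(row)] for row in block]


# Toy data: 4 cells x 2 genes, already normalized on the full population.
normalized = [[1.0, 0.0], [0.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
cell_types = ["T", "B", "T", "B"]

merged = impute_by_group(normalized, cell_types, fill_zeros_with_column_mean)
print(merged)
```

In this toy run the second gene stays at zero in every "T" cell because no "T" cell expresses it, while the zero in the "B" group is filled from the other "B" cell. That is the behavior Alternative C is relying on, though how ALRA itself behaves on real subsets would need to be checked on the data.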

ghost commented Feb 28, 2024

In my tests, "Alternative C" reflects the biology more faithfully than "Alternative B". In my data, for genes that are either expressed or not expressed in certain cell types (and also in the Normal vs. Tumor comparison), both the expression levels and the proportion of expressing cells are more accurate with Alternative C.
With "Alternative B", expression levels and the proportion of expressing cells tend to be exaggerated in my data.
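
One way to make this kind of comparison concrete is to compute, per cell type, the fraction of cells in which a gene is detected, once for each imputed matrix. The sketch below assumes rows are cells and columns are genes; `fraction_expressing` is a hypothetical helper, and both small matrices are made-up toy values, not results from the dataset in this issue.

```python
def fraction_expressing(matrix, labels, gene_idx, threshold=0.0):
    """Per-label fraction of cells whose value for one gene exceeds threshold."""
    counts = {}  # label -> (number of expressing cells, total cells)
    for row, label in zip(matrix, labels):
        expressing, total = counts.get(label, (0, 0))
        counts[label] = (expressing + (row[gene_idx] > threshold), total + 1)
    return {label: e / t for label, (e, t) in counts.items()}


# Toy values only: 4 cells x 1 gene under two hypothetical imputation runs.
cell_types = ["T", "T", "B", "B"]
imputed_b = [[0.4], [0.2], [1.1], [0.9]]  # gene appears in every cell
imputed_c = [[0.0], [0.0], [1.2], [0.0]]  # gene stays absent in "T" cells

print("run B:", fraction_expressing(imputed_b, cell_types, gene_idx=0))
print("run C:", fraction_expressing(imputed_c, cell_types, gene_idx=0))
```

Comparing these fractions against marker genes with known cell type specificity gives a simple check on whether an imputation run is inflating expression.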

ghost closed this as completed Feb 28, 2024
ghost reopened this Feb 28, 2024

ghost commented Mar 5, 2024

@JunZhao1990 @linqiaozhi
Firstly, I want to express my gratitude for developing ALRA. It's an incredibly useful tool. However, while analyzing large datasets in R, I've encountered issues related to large matrices. Is there by any chance a way to utilize ALRA within a Python-based environment like Scanpy?

ghost commented Mar 5, 2024

@JunZhao1990 @linqiaozhi

P.S.
At the following link, I found a Python port of ALRA written six years ago. Given the updates to ALRA since then, do you think this port can still be used reliably?
https://github.com/pavlin-policar/ALRA
