Save space at merge step #200
Comments
Dear @zyh4482, Andre
@akahles It's great to hear that! Many thanks for your kind advice! Would you mind sharing some code or hints on how to accomplish this kind of task?
Sure - what you would need to do to combine the quantifications of a single merged graph across many bam-file batches is to join the count matrices. Let's have a look at the structure of the count file. If you have
This is a file that is delivered with SplAdder as part of the test suite. The output should look like the following:
This test case is run over a small cohort of 20 test samples. That is, for any matrix where you see a dimension of 20, you would need to concatenate across your batches along this axis. The remaining dimensions should be the same for all of your quantification files, as they are determined by the underlying graph, which is the same for every batch-wise quantification.
Sorry that this is a bit complicated, but building graphs on >100 TB of input data is not something that is usually done and counts as a "special case" :)
Great! Thanks for your kind help. I'll run some tests and post my feedback later. Best regards
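The joining step described above can be sketched roughly as follows. This is only a minimal illustration under several assumptions: the count matrices have already been loaded into NumPy arrays (for instance with h5py), the dataset names `counts` and `gene_ids` are hypothetical placeholders rather than the names SplAdder actually writes, and the sample axis is identified purely by its length matching the batch size. If a graph-determined dimension happens to equal the number of samples in a batch, the sketch raises an error instead of guessing.

```python
import numpy as np


def find_sample_axis(shape, n_samples):
    """Return the one axis whose length equals n_samples, or None if there is none."""
    axes = [i for i, dim in enumerate(shape) if dim == n_samples]
    if len(axes) > 1:
        raise ValueError(f"ambiguous sample axis in shape {shape}")
    return axes[0] if axes else None


def merge_batches(batches, batch_sizes):
    """Join per-batch matrices that were quantified on the same merged graph.

    batches     -- list of dicts mapping dataset name -> array, one dict per batch
    batch_sizes -- number of samples in each batch, in the same order
    """
    merged = {}
    for name in batches[0]:
        arrays = [np.asarray(batch[name]) for batch in batches]
        axes = {find_sample_axis(a.shape, n) for a, n in zip(arrays, batch_sizes)}
        if len(axes) != 1:
            raise ValueError(f"inconsistent sample axis for dataset {name!r}")
        axis = axes.pop()
        if axis is None:
            # No sample dimension: the data is determined by the graph,
            # so it must be identical in every batch.
            if any(not np.array_equal(arrays[0], a) for a in arrays[1:]):
                raise ValueError(f"dataset {name!r} differs between batches")
            merged[name] = arrays[0]
        else:
            # Sample dimension found: concatenate the batches along it.
            merged[name] = np.concatenate(arrays, axis=axis)
    return merged


# Toy example with hypothetical dataset names: batches of 2 and 3 samples.
batch_a = {"counts": np.ones((2, 5)), "gene_ids": np.arange(4)}
batch_b = {"counts": np.zeros((3, 5)), "gene_ids": np.arange(4)}
merged = merge_batches([batch_a, batch_b], batch_sizes=[2, 3])
print(merged["counts"].shape)  # (5, 5)
```

Checking that the graph-determined datasets are identical across batches is a deliberate safeguard: if they differ, the batches were probably not quantified against the same merged graph and should not be joined.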
@akahles Hello! I've run some tests and it works! Thank you very much. However, another issue has come up. My dataset largely overlaps with the one in your previous work published in Cancer Cell (2018), so I tried to check my results against your benchmarks. The GFF3 file was downloaded from GDC: merge_graphs_alt_3prime_C2.confirmed.gff3. The HDF5 file was downloaded from the same source. I checked the alt_3prime.confirmed HDF5 file, but some of the results are confusing.
Here is the event_pos for gene ENSG00000187634.6. The hg19 reference exon coordinates can be found here. According to the GFF3, there are 21 confirmed alt_3prime events. For example:
However, based on the description of hdf5
In addition, taking alt_3prime.5 as an example: The exon coordinates of this minigene cannot be matched to any event_pos.
May I ask:
Hello!
I'm working on a large cohort with thousands of samples (> 100 TB). However, due to storage limitations, it is not possible to keep all aligned reads at the same time. I found that no matter which mode I choose, event quantification still requires me to pass all the BAM files to SplAdder. Am I correct? In my case, is it impossible to use SplAdder to detect events from thousands of BAM files?