CelFiE with normal lung tissue samples #2

Open
seifudd opened this issue Feb 21, 2020 · 5 comments
@seifudd
seifudd commented Feb 21, 2020

Hi Christa,

Thank you for developing CelFiE. I am trying to run CelFiE on 2 samples (adult normal lung tissues-WGBS). After getting the coverage file from Bismark, I ran your prepare_bismark.sh script and created the input to CelFiE with the reference TIMs you provided. I ran CelFiE using the following command for both samples:

python em.py SRR3269863_reference_file_tims.txt 00-deconvolution_with_celfie 1 1000 0 1 0.001 1
python em.py SRR3274240.2_reference_file_tims.txt 00-deconvolution_with_celfie 1 1000 0 1 0.001 100

However, the cell_proportions from CelFiE for both samples indicate >90% placenta which should not be the case.

I am attaching the input files here for each sample as well as the output from CelFiE. I assumed that the output array from the pickle file is ordered in the same way as your reference_file_key.txt.
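The ordering assumption above can be checked explicitly by attaching names to the columns. This is a minimal sketch, not part of CelFiE. It assumes the pickle holds a 2-D samples x tissues array, that the tissue names are read in the same order as reference_file_key.txt, and that any unknown components are trailing columns. The helper names and the two example tissue labels are hypothetical.

```python
import pickle


def load_alpha(path):
    """Load pickled proportions (assumed: a 2-D samples x tissues array)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def label_proportions(alpha, tissue_names):
    """Pair every column of each sample row with a tissue name.

    Columns beyond the named tissues are labelled unknown_1, unknown_2, ...
    (assumption: unknown components are stored as trailing columns).
    """
    labelled = []
    for row in alpha:
        values = [float(v) for v in row]
        extra = len(values) - len(tissue_names)
        if extra < 0:
            raise ValueError("fewer columns than tissue names; key may not match")
        names = list(tissue_names) + [f"unknown_{i + 1}" for i in range(extra)]
        labelled.append(dict(zip(names, values)))
    return labelled


def top_tissue(labelled_row):
    """Return the name with the largest estimated proportion."""
    return max(labelled_row, key=labelled_row.get)


if __name__ == "__main__":
    # Illustrative values only, not real CelFiE output.
    demo = label_proportions([[0.05, 0.95]], ["lung", "placenta"])
    print(top_tissue(demo[0]), demo[0])
```

If the number of columns is smaller than the number of names in the key, the key and the array almost certainly do not correspond, which is why the sketch raises instead of guessing.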

SRR3269863_reference_file_tims.txt
SRR3274240.2_reference_file_tims.txt
CelFiE_deconvolution_results.xlsx

Any help will be appreciated.

Thanks, Fayaz

@dhanya-sudhakaran
Dear Fayaz

I was facing the same issue when I tried using celfie for my dataset. With no unknowns, I get >90% as placenta contribution in non-pregnant samples.

I was wondering if you were able to solve your issue and if you tried any other tool for your samples. Appreciate your time!

Thank you
Dhanya

@seifudd
Author

seifudd commented May 27, 2020

Hi Dhanya,

I apologize for the late reply. I ended up using meth_atlas for deconvolution: https://github.com/nloyfer/meth_atlas

Hope this helps.

Thanks, Fayaz

@christacaggiano
Owner

Hi Dhanya and Fayaz, sorry for the late reply. CelFiE is really designed to fit multiple samples simultaneously. When you fit one sample at a time, which seems to be the case with the data that Fayaz provided (I don't know about Dhanya's data), the EM will tend to learn a "custom" unknown for that sample. This makes sense intuitively: if I were blindly trying to describe one sample without knowing much about the reference (which is an assumption of the model, namely that our reference tissues aren't perfect, since both ENCODE and BLUEPRINT are noisy), then the best I could do is describe what I have in front of me.

I have found that CelFiE performs best with more than 10 samples fit at once (see figure 3 of our preprint).

Let me know if that helps to clear things up.

@seifudd
Author

seifudd commented Aug 8, 2020

Hi Christa,

Thank you for your reply. We have 16 samples (COVID19, cfDNA, WGBS) and I tried CelFiE, but I'm getting the same results as I did with the individual samples, i.e. it predicts that most of the cfDNA originates from the "placenta."

I am attaching the results here.

I am also attaching the input data.

One question I had: are the reference TIMs you provide on hg38 or hg19? Maybe a genome build mismatch is causing the issue?

Any help will be appreciated.

Thanks, fs

covid19_samples_reference_file_tims.txt

covid19_cell_proportions.xlsx

@seifudd
Author

seifudd commented Aug 8, 2020

Hi Christa,

For your reference, this is the command that I used. There is a warning about division by zero at some point in the script but this could be due to no coverage. I changed the number of unknowns from 0 to 20 compared to the previous run.

Again, any help will be appreciated.

Thanks, fs

python /data/NHLBI_BCB/Sean_MethylSeq/10-tissue_of_origin_methylation_project/celfie/EM/em.py \
    /data/NHLBI_BCB/Sean_MethylSeq/14_MKJ5249/02_methylseq_analysis_pipeline/02_tissue_of_origin_prediction/04_deconvolution_with_celfie/covid19_samples_reference_file_tims.txt \
    /data/NHLBI_BCB/Sean_MethylSeq/14_MKJ5249/02_methylseq_analysis_pipeline/02_tissue_of_origin_prediction/04_deconvolution_with_celfie \
    16 1000 20 1 0.001 100

writing to /data/NHLBI_BCB/Sean_MethylSeq/14_MKJ5249/02_methylseq_analysis_pipeline/02_tissue_of_origin_prediction/04_deconvolution_with_celfie/
finshed reading /data/NHLBI_BCB/Sean_MethylSeq/14_MKJ5249/02_methylseq_analysis_pipeline/02_tissue_of_origin_prediction/04_deconvolution_with_celfie/covid19_samples_reference_file_tims.txt

beginning generation of /data/NHLBI_BCB/Sean_MethylSeq/14_MKJ5249/02_methylseq_analysis_pipeline/02_tissue_of_origin_prediction/04_deconvolution_with_celfie/1_alpha.pkl

/data/NHLBI_BCB/Sean_MethylSeq/10-tissue_of_origin_methylation_project/celfie/EM/em.py:159: RuntimeWarning: invalid value encountered in true_divide
add_pseduocounts(1, np.nan_to_num(y/y_depths), y, y_depths)
/data/NHLBI_BCB/Sean_MethylSeq/10-tissue_of_origin_methylation_project/celfie/EM/em.py:160: RuntimeWarning: invalid value encountered in true_divide
add_pseduocounts(0, np.nan_to_num(y/y_depths), y, y_depths)
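
For context on the warning above: NumPy emits "invalid value encountered" when it computes 0/0, which yields NaN, so the message is consistent with sites that have zero methylated reads and zero depth. The sketch below is not CelFiE code. It only reproduces the same `np.nan_to_num(y/y_depths)` pattern on made-up counts and adds a hypothetical helper for measuring how many sites have no coverage.

```python
import numpy as np


def methylation_fraction(y, y_depths):
    """Methylated reads / depth, with 0/0 sites mapped to 0.0."""
    y = np.asarray(y, dtype=float)
    y_depths = np.asarray(y_depths, dtype=float)
    # Silence the RuntimeWarning locally; 0/0 still evaluates to NaN here.
    with np.errstate(divide="ignore", invalid="ignore"):
        fraction = y / y_depths
    # nan_to_num replaces NaN with 0.0, as in the line quoted in the warning.
    return np.nan_to_num(fraction)


def zero_depth_share(y_depths):
    """Fraction of entries (sites x samples) with no read coverage."""
    return float(np.mean(np.asarray(y_depths) == 0))


if __name__ == "__main__":
    # Made-up counts: two sites in one sample, the first with no coverage.
    print(methylation_fraction([0, 3], [0, 6]))
    print(zero_depth_share([0, 6, 0, 2]))
```

A large share of zero-depth entries would mean many sites contribute a methylation value of 0 after `nan_to_num` rather than a measured value, so it may be worth checking before interpreting the estimated proportions.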


3 participants