
Reproducing separation #121

Closed
zalky opened this issue Mar 30, 2018 · 3 comments
@zalky

zalky commented Mar 30, 2018

Hi Greg,

Thanks for all the work in putting this together. I'm trying to reproduce some of the results, and can't quite seem to get the separation that is shown in the paper.

At its simplest, I'm loading the existing results saved in encoded_rnaseq_onehidden_warmup_batchnorm.tsv and then plotting:

plt.figure(figsize=(6, 6))
plt.scatter(encoded_rnaseq_df.iloc[:, 1], encoded_rnaseq_df.iloc[:, 2])
plt.xlabel('Latent Feature 1')
plt.ylabel('Latent Feature 2');

(assuming line 4 above should be plt.ylabel), which produces:

[Screenshot: scatter plot of latent feature 1 vs. latent feature 2]

This seems to have less separation than the sanity check in the IPython notebook. Similarly, when I plot feature 53 against feature 66:

[Screenshot: scatter plot of latent feature 53 vs. latent feature 66]

You can more or less make out the cluster, but there is less separation than in Fig. 3B of the paper. When I train the model from scratch to produce the embeddings, I get separation similar to the saved results, not the clear separation shown in the paper.

Am I missing something?

@gwaybio
Collaborator

gwaybio commented Mar 31, 2018

Hi @zalky

Thanks for your interest in the paper and code! The code to reproduce Fig. 3B is in scripts/viz/feature_activation_plots.R and is discussed in #65. Essentially, I believe that the size of the points plotted above obscures the separation.
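
For example, re-plotting with smaller, semi-transparent markers (just a sketch reusing the variables from the snippet above, not the settings from the R script) makes overlapping points easier to distinguish:

import matplotlib.pyplot as plt

# smaller markers (s) and some transparency (alpha) keep dense clusters
# from blurring into a single blob
plt.figure(figsize=(6, 6))
plt.scatter(encoded_rnaseq_df.iloc[:, 1], encoded_rnaseq_df.iloc[:, 2],
            s=4, alpha=0.3)
plt.xlabel('Latent Feature 1')
plt.ylabel('Latent Feature 2')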

At a more fundamental level, the sample activation scores (for example, the ones given in encoded_rnaseq_onehidden_warmup_batchnorm.tsv) are inherently unstable: the results depend on the random initialization prior to training. We are currently working on this issue, but for now it would be incorrect to train a different Tybalt model and assume that the numerical designation of the latent encodings is consistent.
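
(One way to make an individual training run repeatable is to fix the random seeds up front; this is only a sketch assuming a NumPy/TensorFlow-backed Keras setup, and it still does not make encoding numbers comparable across separately initialized models.)

import random
import numpy as np
import tensorflow as tf

# fixed seeds make one training run repeatable, but latent feature k from two
# independently trained models is still not guaranteed to mean the same thing
random.seed(123)
np.random.seed(123)
tf.set_random_seed(123)  # TensorFlow 1.x; use tf.random.set_seed in 2.x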

Thanks!
Greg

@zalky
Author

zalky commented Apr 4, 2018

Hi Greg, thanks for the response!

I agree that, given the non-deterministic training, you would expect some variation in the results, but I was a bit surprised by the degree of difference. However, I think I have a lead on what is going on:

I was loading the encoded data in encoded_rnaseq_onehidden_warmup_batchnorm.tsv with:

encoded_df = pd.read_table(encoded_file, index_col=0)

as is generally done throughout the Python code. Unfortunately, I could not get this to reproduce the figures. However, if you load the encoded data without specifying the index column:

encoded_rnaseq_df = pd.read_table(encoded_file)

then I can reproduce Fig. 3B exactly (sans the clinical colours) for encodings 53 and 66:

[Screenshot: scatter plot of encodings 53 vs. 66, matching Fig. 3B]

But without index_col=0, the sample labels are kept as the first column and effectively treated as the first encoding. When plotting one encoding against another by position, the encoding numbers are therefore off by one.

Going back, specifying index_col=0 and re-plotting encoding 52 vs. 65 reproduces Fig. 3B. Re-training the model from scratch also produces figures much more in line with the paper, as long as you account for the off-by-one encoding numbers.
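
As a quick sanity check (a sketch only, reusing encoded_file from above and purely positional indexing, since I haven't verified the header names), loading the file both ways shows the one-column shift:

import pandas as pd

raw_df = pd.read_table(encoded_file)                   # sample IDs stay as column 0
indexed_df = pd.read_table(encoded_file, index_col=0)  # sample IDs move into the index

# every positional column shifts left by one once the IDs become the index,
# so position 53 in raw_df matches position 52 in indexed_df
print((raw_df.iloc[:, 53].values == indexed_df.iloc[:, 52].values).all())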

I haven't pursued this any further, but is it possible that somewhere, maybe in the R code, the sample labels are being loaded as the first encoding column, so that the encoding labels end up off by one?

@gwaybio
Collaborator

gwaybio commented Apr 7, 2018

Glad this was figured out! I agree that this issue is a potential pitfall in analyses (see #86). I will bump it up in priority. Thanks again!

gwaybio closed this as completed Apr 20, 2018