Replies: 4 comments
-
Thanks!

Learning rate: The default being used is 200. Thanks for bringing this up; I completely overlooked it. The SG-tSNE implementation does not expose this parameter. I have created an issue here and hope that the developers fix it soon. Once that is in, I will set the learning rate dynamically based on the number of cells, as has been suggested.

SG part: You are mostly correct; the rows of the affinity matrix each sum to 1. The value of

Initialization: Thank you for bringing this to our notice. There is a major error in the explanation and it will be corrected in the preprint asap. So, we perform PCA reduction (
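Once the parameter is exposed, the plan would be something along these lines (just a sketch; the `run_sg_tsne` wrapper and its `eta` argument are placeholders, not the actual SG-tSNE API):

```python
def choose_learning_rate(n_cells, early_exaggeration=12.0):
    """Learning-rate heuristic from the cited papers: n / early_exaggeration,
    floored at the classic default of 200."""
    return max(200.0, n_cells / early_exaggeration)

# Hypothetical wrapper call once SG-tSNE exposes the parameter:
# embedding = run_sg_tsne(affinity_graph, eta=choose_learning_rate(n_cells))
```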
-
Thanks for the replies.

Re learning rate -- great! Some remarks here: if the SG-tSNE implementation has the factor 4 in the gradient (as sklearn does), then the heuristic should be n/48 and not n/12. Also, the 12 here comes from the early exaggeration, so I'd suggest you consider changing the default alpha to 12 and early_iter to 250 too, but if you prefer to keep your defaults, then maybe the learning rate should be n/10 (or n/40). In any case, for n = 4 million you will change the learning rate from 200 to something like 100000, which is a HUGE change, so you should definitely run your benchmarks and see if everything still holds. I have no experience with the SG-tSNE implementation, so I don't know if there are any caveats here.

SG part: yes, I noticed that you set lambda to 1. This is what makes the affinities sum to 1, if I understood correctly. The UMAP weights will not sum to 1, so you need to fix that before running t-SNE, and it seems that SG-tSNE will do that for you, i.e. choose some gamma exponents so that the weights in each row sum to 1. Correct me if I am wrong.

Re initialization -- this makes perfect sense and is actually what I suspected you are doing :-) As we show in https://www.nature.com/articles/s41587-020-00809-z, it's important to have an informative initialization, but there are many possible choices for what counts as informative. I can see two caveats here: (1) when you perform the PCA on And (2) the way you do it, you will have lots of points that are exactly overlapping in the initialization. I found this to cause a lot of numerical problems, especially for Barnes-Hut but also for FFT the way it's implemented in FIt-SNE. So it's actually beneficial to add a tiny amount of noise to the initialization. But I don't know if it matters in the SG-tSNE implementation. In any case, your results look reasonable, so maybe none of these caveats matters for you.
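To be concrete about the jitter: something as simple as this (plain numpy, not tied to any particular implementation) is usually enough:

```python
import numpy as np

def jitter_initialization(init, scale=1e-4, seed=42):
    """Add a tiny amount of Gaussian noise to an (n, 2) initialization so that
    no two points coincide exactly (avoids numerical issues in BH / FIt-SNE)."""
    init = np.asarray(init, dtype=float)
    rng = np.random.default_rng(seed)
    spread = init.std() if init.std() > 0 else 1.0
    return init + scale * spread * rng.standard_normal(init.shape)
```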
-
Learning rate: These are some fantastic suggestions, thanks! It will be interesting to see how the embedding benefits from an increased learning rate. I expect that the runtime will increase when using higher values of the learning rate. It will be interesting to compare the results between a low learning rate + more iterations and a high learning rate + fewer iterations. From the articles you shared, it seems that setting a high learning rate solves issues that a large number of iterations (within a reasonable limit) might not be able to solve. It will also be interesting to see how well the local neighborhood is preserved in these comparisons.

SG part: Yes, lambda is set to 1, but this is not what makes the row affinities sum to 1. That happens here in the SG-tSNE code, as a pre-processing step. The lambda rescaling happens here. I think the lambda rescaling step is akin to smooth_knn_dist. Hence, by default, the lambda rescaling is turned off (by setting the value to 1). In UMAP, the equivalent to lambda would be

Initialization: Again, thank you for the very constructive feedback here.
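To illustrate what I mean by the pre-processing step, here is a toy version of it (my own sketch, not the actual SG-tSNE code):

```python
import numpy as np
from scipy import sparse

def row_normalize(P):
    """Toy version of the pre-processing step: divide every row of the sparse
    affinity matrix by its sum so that each row adds up to 1."""
    P = sparse.csr_matrix(P, dtype=float)
    row_sums = np.asarray(P.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0              # leave empty rows untouched
    return sparse.diags(1.0 / row_sums) @ P

# Once the rows sum to 1, gamma = 1 already satisfies sum_j p_ij**gamma = 1,
# so the lambda rescaling with lambda = 1 leaves the matrix unchanged.
```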
-
Yes, but what a high learning rate solves is the incomplete early exaggeration phase, because increasing the number of iterations without increasing the length of the early exaggeration phase won't help it. I don't think this will matter much for local neighborhood preservation though... So I am actually very curious to see how the learning rate will affect your local neighborhood metric.
Hmm, I am not so sure. The first link in your comment goes to a place where the entire P matrix is normalized to sum to 1, but not the individual rows... Equations 5-6 in the original paper suggest to me that the rows are normalized to 1 before that, via the lambda rescaling. But I see now that, due to Equation 6, they will sum to 1 independent of the value of lambda.
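For reference, here is a toy version of how I read the Equation 5-6 rescaling (just my reading of the paper, not the SG-t-SNE-Pi code): for each row, find an exponent gamma such that the rescaled weights sum to lambda.

```python
import numpy as np

def rescale_row(weights, lam=1.0, tol=1e-10):
    """Toy lambda rescaling for a single row: find gamma such that
    sum_j weights_j ** gamma == lam, assuming 0 < weights_j < 1 and lam <= len(weights)."""
    w = np.asarray(weights, dtype=float)
    lo, hi = 0.0, 1.0
    while np.sum(w ** hi) > lam:     # grow the bracket until the sum drops below lam
        hi *= 2.0
    while hi - lo > tol:             # bisection: sum(w**gamma) is decreasing in gamma
        mid = 0.5 * (lo + hi)
        if np.sum(w ** mid) > lam:
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    return w ** gamma, gamma

# rescale_row([0.5, 0.3, 0.2], lam=1.0) returns gamma ~= 1:
# a row that already sums to 1 is left unchanged when lambda = 1.
```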
-
Great paper and great package. Amazing work!
I am specifically interested in your UMAP/t-SNE comparisons and benchmarks and am now trying to figure out the SG-t-SNE-Pi default parameters that you use. As far as I understood your API, your defaults are
max_iter=500, early_iter=200, alpha=10
where alpha denotes the early exaggeration coefficient. I noticed that 10 and 200 are slightly different from the default values in most existing t-SNE implementations (12 and 250); I wonder why. But they are pretty close, so it does not really matter. What is not mentioned, though, is the learning rate. The learning rate can have a huge influence on t-SNE embeddings and the speed of convergence. See https://www.nature.com/articles/s41467-019-13056-x and https://www.nature.com/articles/s41467-019-13055-y, which recommend setting the learning rate to n/12, where n is the sample size. What learning rate is used by the SG-t-SNE-Pi implementation that you use?

Unrelated: if I understood the "SG" part and your implementation correctly, you construct a kNN graph using k=10, then assign UMAP weights to the edges, and then, when running t-SNE, SG-t-SNE will normalize each row of the affinity matrix to sum to 1, then symmetrize and run t-SNE as usual. Right? If I understood correctly, then this is pretty much exactly how it should be implemented in Scanpy soon, see pending scverse/scanpy#1561 by @pavlin-policar. Nice.
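For concreteness, the pipeline as I understand it would look roughly like this (my own sketch using sklearn/scipy; the UMAP-style weights are only schematic, not the exact smooth_knn_dist calibration):

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import NearestNeighbors

def knn_affinities(X, k=10):
    """kNN graph -> UMAP-like edge weights -> row normalization -> symmetrization,
    i.e. the affinity matrix that would then be handed to t-SNE."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]               # drop self-neighbors
    rho = dist[:, :1]                                 # distance to the nearest neighbor
    sigma = np.maximum(dist.mean(axis=1, keepdims=True), 1e-12)  # crude stand-in for smooth_knn_dist
    w = np.exp(-np.maximum(dist - rho, 0.0) / sigma)  # schematic UMAP-style weights
    n = X.shape[0]
    rows = np.repeat(np.arange(n), k)
    P = sparse.csr_matrix((w.ravel(), (rows, idx.ravel())), shape=(n, n))
    P = sparse.diags(1.0 / np.asarray(P.sum(axis=1)).ravel()) @ P  # each row sums to 1
    return (P + P.T) / 2                              # symmetrize before running t-SNE
```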
Finally, I am not entirely sure I understood your initialization approach. It's great that you use the same initialization for t-SNE and UMAP (another relevant paper here: https://www.nature.com/articles/s41587-020-00809-z). But I am confused by the following bit:
Is this a binary matrix that has a 1 in position ij if cell i belongs to cluster j? If so, I'm not quite sure what the point of running PCA on such a matrix would be. I'm probably misunderstanding.
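To spell out my confusion: if it is a one-hot cluster-membership matrix, then PCA of it assigns identical coordinates to all cells of the same cluster, e.g. (purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Purely illustrative: 6 cells in 3 clusters, one-hot membership matrix.
labels = np.array([0, 0, 1, 1, 2, 2])
onehot = np.eye(3)[labels]        # (n_cells, n_clusters); 1 at (i, j) iff cell i is in cluster j

init = PCA(n_components=2).fit_transform(onehot)
print(init)
# Cells from the same cluster end up with exactly the same 2D coordinates,
# so the initialization contains many exactly overlapping points.
```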