Add nuts-rs's metric adaptation #311
There's also a work-in-progress paper here explaining the theory behind the method: https://github.com/aseyboldt/covadapt-paper/blob/main/main.pdf
Would be great to have a second implementation of this!
This is no longer up to date unfortunately, sorry about that. I guess the ground truth really is just the Rust code right now, but roughly:
The logic for the mass matrix update is here: https://github.com/pymc-devs/nuts-rs/blob/main/src/adapt_strategy.rs#L202-L220 If something is unclear (or could be improved :-) ) feel free to ask.
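For readers who don't want to dig through the Rust, here is a rough Python sketch of the update rule as I read the linked code: keep exponentially weighted running variances of the draws and of the gradients, and set the diagonal of the inverse mass matrix to the square root of their ratio. The class and function names, the weighting scheme, and the clamping bounds below are my own guesses, not nuts-rs's API; the Rust source linked above is authoritative.

```python
import numpy as np


class RunningVariance:
    """Exponentially weighted mean/variance estimate (hypothetical helper,
    standing in for whatever running-variance estimator the Rust code uses)."""

    def __init__(self, dim, alpha=0.02):
        self.alpha = alpha          # weight given to each new observation
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)

    def update(self, value):
        delta = value - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta**2)


def update_inv_mass_diag(draw_var, grad_var, draw, grad):
    """One adaptation step: feed the new draw and its gradient into the
    running variances, then estimate the diagonal of the inverse mass
    matrix as sqrt(var(draws) / var(grads)).  For a Gaussian with std
    sigma, var(draws) ~ sigma^2 and var(grads) ~ 1/sigma^2, so the square
    root of the ratio recovers sigma^2."""
    draw_var.update(draw)
    grad_var.update(grad)
    inv_mass_diag = np.sqrt(draw_var.var / grad_var.var)
    return np.clip(inv_mass_diag, 1e-10, 1e10)  # clamp to a sane range
```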
Thanks, @aseyboldt for the pointers! I'll check back in if I get lost. Btw, I was scanning your draft of the paper, and it seems like a really nice contribution. I'm excited to read the finished version.
Why this particular initialization choice? Sometimes …
That's a good question; initially I was using the square of the gradient. From some experiments I'm pretty sure that just doesn't work in practice, though (but verification of that would be great ;-) ). So unfortunately I don't really have a good theoretical answer, more of a "that's what turned out to work well"...
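For context (my own note, not from the thread): the Gaussian identity that presumably motivates the gradient-squared estimate is

```math
x_i \sim \mathcal{N}(\mu_i, \sigma_i^2)
\;\Longrightarrow\;
\partial_i \log p(x) = -\frac{x_i - \mu_i}{\sigma_i^2},
\qquad
\mathbb{E}\big[(\partial_i \log p)^2\big] = \frac{1}{\sigma_i^2},
```

so $1/(\partial_i \log p)^2$ evaluated at a single point is a single-sample, and hence very noisy, estimate of $\sigma_i^2$.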
@sethaxen Someone also asked on the Stan slack, and I made a small notebook for comparing those with a diagonal normal posterior: https://gist.github.com/aseyboldt/46ce8a0967026fd3b7ae11a53c884975
@aseyboldt that was me 👍 I wonder whether the amount of regularization has to depend on the number of dimensions, or whether one would want to use some hierarchical model, or whether one might want to sample a few additional points (from the initialization distribution) to get a better "feel" for the distribution of the gradients.
Maybe it would make sense to use pathfinder to get to the typical set first, then do the metric adaptation starting from points in the typical set?
@nsiccha I still think looking at a standard normal is a bit misleading. I think it is quite normal that the initial mass matrix gets worse with increasing dimension: the condition number looks at the max and min eigenvalues, and with more dimensions there are just more variables that you could scale incorrectly. This also happens if you initialize using the identity and vary the true (diagonal) covariance. With the code from the notebook … both initialization methods get worse as n_dim increases, but in both cases using the gradient to initialize is better. I think what matters more is what happens in real-world posteriors. There I observed that initializing with the gradient avoids a phase very early in sampling where the step size is very small and tree_size very big, i.e. where we do a lot of gradient evaluations for very little change in position.
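A rough Python sketch (mine, not the linked gist) of this kind of comparison, assuming the gradient-based initialization takes the form $1/|\partial_i \log p(x_0)|$ and that the quality measure is the condition number of the diagonally preconditioned covariance:

```python
import numpy as np

rng = np.random.default_rng(0)


def condition_number(est_var, true_var):
    """Condition number of the diagonally preconditioned covariance: with
    estimated variances d and true variances s^2, the eigenvalues are
    s_i^2 / d_i, so take the max over the min of that ratio."""
    scaled = true_var / est_var
    return scaled.max() / scaled.min()


for n_dim in (10, 100, 1000):
    # True diagonal covariance with scales spread over a few orders of magnitude.
    true_var = np.exp(rng.normal(0.0, 2.0, size=n_dim))
    # Initial point drawn from the usual N(0, 1) initialization distribution.
    x0 = rng.normal(size=n_dim)
    grad = -x0 / true_var  # gradient of the diagonal-normal log density at x0
    identity_init = np.ones(n_dim)      # identity inverse mass matrix
    grad_init = 1.0 / np.abs(grad)      # gradient-based guess (assumed form)
    print(
        n_dim,
        round(condition_number(identity_init, true_var), 1),
        round(condition_number(grad_init, true_var), 1),
    )
```

Both condition numbers grow with n_dim, but the gradient-based column stays well below the identity column, consistent with the observation above.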
Yes, running pathfinder or ADVI (which is also pretty straightforward to do using the Fisher distance, by the way) might improve things further, I think. There is also quite a bit of relatively low-hanging fruit in generalizing this to low-rank updates to the mass matrix (see for instance an experiment that I think could be simplified quite a bit here: https://github.com/aseyboldt/covadapt) or non-linear transformations of the posterior (normalizing flows etc).
Absolutely. I've also realized that the dependence on the dimension is lower than I thought, or rather is less important for posteriors further from a standard normal. The above experiments still give you the "pseudo-theoretical" justification for why to initialize the way you do: Depending on the posterior, the grad-initialization performs best in expectation. 🎉 |
Sounds like a job for posteriordb |
Thanks @aseyboldt and @nsiccha for this discussion! When I have a chance I'd like to play with some of these metric initialization experiments to build some intuition for why this works.
Yeah it feels like there's something here. Pathfinder similarly uses a low-rank-update matrix for the metric, but it tends to not do well when the Hessian of the negative log density is not everywhere positive definite. I wonder if some of the ideas of both methods could be unified/generalized.
💯 This is the way to go for benchmarking. FYI, there's a Julia wrapper: https://github.com/sethaxen/PosteriorDB.jl. BridgeStan.jl should be registered soon, and then I have some code I'll release in a package that converts a …
@mike-lawrence That's where the plot in the description comes from :-) |
I'm excited 👀
Discussion on Stan Slack for any of the Julia folks not in the Stan Slack. |
nuts-rs is a Rust implementation of NUTS that is wrapped by nutpie for use in PyMC. It has a novel metric adaptation approach, described as:
On Mastodon, @aseyboldt shared the following plot:
On Slack, he explained:
The models were selected from posteriordb. In general, this seems to outperform Stan's metric adaptation and also work well for more models. It also seems to allow for much shorter warm-up phases.