Caching improvements #3989
base: master
* Disk cache generated code
* Only generate loopy code on one rank and broadcast to others
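For readers unfamiliar with the "generate on one rank and broadcast" pattern, here is a rough sketch. It is not the code from this PR: `generate_once` and `FakeComm` are hypothetical names, and `FakeComm` is only a single-process stand-in exposing `rank` and `bcast(obj, root=0)` in the style of an mpi4py communicator so that the example runs without MPI.

```python
class FakeComm:
    """Hypothetical single-process stand-in for an MPI communicator.

    `root_value` is what the root rank would have broadcast; it is only
    consulted when this object pretends to be a non-root rank.
    """

    def __init__(self, rank=0, root_value=None):
        self.rank = rank
        self._root_value = root_value

    def bcast(self, obj, root=0):
        # The root returns its own object, every other rank receives the root's.
        return obj if self.rank == root else self._root_value


def generate_once(comm, generate, root=0):
    """Run the expensive `generate` on one rank only and share the result."""
    code = generate() if comm.rank == root else None
    # bcast is collective: every rank must reach this call.
    return comm.bcast(code, root=root)


calls = []


def generate():
    calls.append("generated")
    return "kernel source"


on_root = generate_once(FakeComm(rank=0), generate)
on_other = generate_once(FakeComm(rank=1, root_value="kernel source"), generate)
```

Only the root pays for code generation here, and the other ranks wait in the broadcast, which is why the call has to be made collectively.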
…haviour of SPMD strict mode to raise an error when called non-collectively
Needs to be collective. Therefore the SPMD_STRICT partition does not make much sense.
value = local_cache.get(key, CACHE_MISS)
def parallel_cache(
Previously we duplicated this logic to avoid a comm.allgather. I have determined that the allgather is essential to avoid deadlocks (a cache hit on only some of the ranks will otherwise deadlock). Since both code paths are now the same apart from a handful of debug() statements, I have removed the fork.
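To illustrate why the allgather matters, here is a minimal sketch of a lookup in which all ranks agree on hit or miss before anyone recomputes. It is an approximation, not the `parallel_cache` implementation from this PR: `collective_get` and `FakeComm` are hypothetical, and `FakeComm` is a single-process stand-in whose `allgather` mimics the mpi4py method of the same name.

```python
CACHE_MISS = object()  # sentinel, named after the one in the diff


class FakeComm:
    """Hypothetical single-process stand-in for an MPI communicator.

    `others` lists what the remaining ranks would contribute to allgather.
    """

    def __init__(self, others=()):
        self._others = list(others)

    def allgather(self, obj):
        return [obj, *self._others]


def collective_get(comm, local_cache, key, compute):
    """Return the cached value unless any rank missed, then recompute everywhere."""
    value = local_cache.get(key, CACHE_MISS)
    # Every rank joins the allgather, hit or miss, so no rank waits alone.
    hits = comm.allgather(value is not CACHE_MISS)
    if not all(hits):
        # At least one rank missed, so all ranks enter `compute` together.
        # This matters when `compute` is itself collective.
        value = compute()
        local_cache[key] = value
    return value


calls = []


def compute():
    calls.append("computed")
    return "fresh"


cache = {"k": "cached"}
all_hit = collective_get(FakeComm(others=[True]), cache, "k", compute)
calls_after_all_hit = list(calls)
partial_hit = collective_get(FakeComm(others=[False]), cache, "k", compute)
```

Without the agreement step, a rank with a hit would return immediately while a rank with a miss blocks inside a collective `compute`, which is the deadlock described above.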
I thought we were caching on comm. How can we get to the situation where we have cache_hit on some and cache_miss on the others?
It happens when we disk cache things, which uses the same code path.
Though I do not actually understand how, apart from having race conditions, we could actually have a hit on some ranks but not others. Ideas welcome.
So it turns out that the problem was actually that the keys were not consistent between ranks. I have enabled a check for this when PYOP2_SPMD_STRICT is enabled (which it now is in CI) and this has thrown up a number of places where we are doing the wrong thing and risking deadlocks.
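A key consistency check of this kind could look roughly like the sketch below. This is an assumption about the shape of the check, not the code in the PR: `checked_key` and `FakeComm` are hypothetical, the `strict` argument stands in for however PYOP2_SPMD_STRICT is actually read, and `FakeComm` is a single-process stand-in whose `allgather` mimics the mpi4py method.

```python
class FakeComm:
    """Hypothetical single-process stand-in for an MPI communicator.

    `others` lists the cache keys the remaining ranks would contribute.
    """

    def __init__(self, others=()):
        self._others = list(others)

    def allgather(self, obj):
        return [obj, *self._others]


def checked_key(comm, key, strict):
    """Return `key`, first verifying in strict mode that all ranks agree on it."""
    if strict:
        # The extra allgather costs time, hence only doing it in strict mode.
        keys = comm.allgather(key)
        if any(k != keys[0] for k in keys):
            raise ValueError(f"Cache keys differ between ranks: {keys!r}")
    return key


same = checked_key(FakeComm(others=["mesh-1"]), "mesh-1", strict=True)

try:
    checked_key(FakeComm(others=["mesh-2"]), "mesh-1", strict=True)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# With strict mode off the mismatch goes unnoticed, which is the deadlock risk.
unchecked = checked_key(FakeComm(others=["mesh-2"]), "mesh-1", strict=False)
```

Failing loudly on the first mismatched key turns a hang that is hard to diagnose into an error that points at the inconsistent key.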
Needs firedrakeproject/fiat#132
PYOP2_SPMD_STRICT in CI, which includes some extra checks
Thanks to @pbrubeck for flagging this. Apparently these changes make a big difference.