
Caching improvements #3989

Open
wants to merge 24 commits into
base: master

connorjward (Contributor) commented Jan 23, 2025

Needs firedrakeproject/fiat#132

  • Disk cache generated code
  • Only generate loopy code on one rank and broadcast to others
  • Use PYOP2_SPMD_STRICT in CI, which includes some extra checks
  • Various unsafe hashing bug fixes
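The first two items can be pictured with a small sketch. This is not the code from this PR: `disk_cached_on_root` and `SerialComm` are hypothetical names, the key is hashed with `hashlib` purely for illustration, and `SerialComm` only mimics the `rank` and `bcast` parts of an mpi4py-style communicator so the example runs in a single process.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class SerialComm:
    """Hypothetical single-process stand-in for an mpi4py-style communicator."""

    rank = 0
    size = 1

    def bcast(self, obj, root=0):
        # With one process, a broadcast just hands the object back.
        return obj


def disk_cached_on_root(comm, cache_dir, key, generate):
    """Return the value for `key`, generating it on rank 0 only.

    Rank 0 consults the on-disk cache and calls `generate` on a miss.
    Every rank then receives the result through a collective broadcast.
    """
    value = None
    if comm.rank == 0:
        # A content hash gives a file name that is stable between runs,
        # unlike the built-in hash() of a str.
        digest = hashlib.sha256(repr(key).encode()).hexdigest()
        path = Path(cache_dir) / f"{digest}.pickle"
        if path.exists():
            value = pickle.loads(path.read_bytes())
        else:
            value = generate()
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(pickle.dumps(value))
    # Every rank calls bcast, whether or not rank 0 hit the cache.
    return comm.bcast(value, root=0)
```

Because the broadcast is unconditional, the set of collective calls does not depend on whether the disk cache was warm, which is the property the rest of this PR is concerned with.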

Thanks to @pbrubeck for flagging this. Apparently these changes make a big difference.

@connorjward connorjward requested a review from ksagiyam January 23, 2025 16:21

github-actions bot commented Jan 23, 2025

Tests: Firedrake real. 8119 ran, 7402 passed ✅, 717 skipped ⏭️, 0 failed ❌


github-actions bot commented Jan 23, 2025

Tests: Firedrake complex. 8105 ran, 6559 passed ✅, 1546 skipped ⏭️, 0 failed ❌

pyop2/global_kernel.py: 2 outdated review threads, resolved

value = local_cache.get(key, CACHE_MISS)

def parallel_cache(
Contributor Author

Previously we duplicated this logic to avoid a comm.allgather. I have determined that the allgather is essential to avoid deadlocks: a hit on only some of the ranks will otherwise deadlock. Since both code paths are now the same apart from a handful of debug() statements, I have removed the fork.
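To illustrate why the allgather matters, here is a minimal sketch of a collective cache lookup. It is an assumption about the general shape of the logic, not the body of `parallel_cache`: `lookup_collectively` and `FakeComm` are hypothetical, and `FakeComm` simply pretends that other ranks contributed the given flags to an mpi4py-style `allgather`.

```python
CACHE_MISS = object()  # sentinel, mirroring the name in the snippet above


class FakeComm:
    """Hypothetical stand-in for a communicator, for single-process runs.

    `others` holds the hit flags that the remaining ranks would contribute.
    """

    def __init__(self, others=()):
        self.others = list(others)

    def allgather(self, obj):
        return [obj] + self.others


def lookup_collectively(comm, local_cache, key, generate):
    """Look up `key`, regenerating on every rank if any rank missed."""
    value = local_cache.get(key, CACHE_MISS)
    # Every rank reports whether it hit, so all ranks take the same branch.
    hits = comm.allgather(value is not CACHE_MISS)
    if not all(hits):
        # `generate` may contain collectives (for example a broadcast), so
        # it must be entered by all ranks or by none.
        value = generate()
        local_cache[key] = value
    return value
```

Without the agreement step, a rank that hit would skip `generate` while a rank that missed would wait inside it for a collective that never completes.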

Contributor

I thought we were caching on comm. How can we get to the situation where we have cache_hit on some and cache_miss on the others?

Contributor Author

It happens when we disk cache things, which uses the same code path.

Contributor Author

Though I do not actually understand how, apart from having race conditions, we could actually have a hit on some ranks but not others. Ideas welcome.

Contributor Author

So it turns out that the problem was actually that the keys were not consistent between ranks. I have enabled a check for this when PYOP2_SPMD_STRICT is enabled (which it now is in CI), and this has thrown up a number of cases where we are doing the wrong thing and risking deadlocks.
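A key consistency check of this kind could look roughly like the sketch below. The names are hypothetical and this is not the check enabled by PYOP2_SPMD_STRICT: `check_key_consistent` compares each rank's key against the one held by rank 0, and `FakeComm` stands in for an mpi4py-style `bcast` by returning a preset root key.

```python
class FakeComm:
    """Hypothetical stand-in whose bcast returns the key held by rank 0."""

    def __init__(self, root_value):
        self.root_value = root_value

    def bcast(self, obj, root=0):
        return self.root_value


def check_key_consistent(comm, key):
    """Raise if this rank's cache key differs from the key on rank 0.

    Simplification: only the mismatching rank raises here. A real check
    would probably also agree on the outcome collectively so that all
    ranks fail together.
    """
    root_key = comm.bcast(key, root=0)
    if root_key != key:
        raise ValueError(
            f"cache key differs between ranks: {key!r} != {root_key!r}"
        )
```

One common source of such mismatches is a key built from something process-specific, such as `id()` of an object or the built-in `hash()` of a string, which Python randomizes per process unless `PYTHONHASHSEED` is fixed.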


2 participants