Caching improvements #3989

connorjward · 2025-01-23T16:21:34Z

* Disk cache generated code
* Only generate loopy code on one rank and broadcast to others
* Use PYOP2_SPMD_STRICT in CI, which includes some extra checks
* Various unsafe hashing bug fixes

Thanks to @pbrubeck for flagging this. Apparently these changes make a big difference.
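For readers unfamiliar with the pattern, the first two items above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `cached_codegen`, `SerialComm`, the cache directory layout, and the string key are all hypothetical. `SerialComm` is a single-process stand-in that only mimics the `rank` attribute and `bcast` method of an mpi4py communicator, so the sketch runs without MPI.

```python
import hashlib
import os
import tempfile


class SerialComm:
    """Single-process stand-in for the small part of the mpi4py
    communicator interface used below (``rank`` and ``bcast``)."""
    rank = 0

    def bcast(self, obj, root=0):
        return obj


def cached_codegen(comm, cachedir, key, generate):
    """Return generated code, reading from or populating a disk cache.

    Hypothetical helper: only rank 0 touches the disk or calls
    ``generate``; every rank takes part in the broadcast.
    """
    code = None
    if comm.rank == 0:
        # Derive a file name from the key that is stable across processes.
        name = hashlib.sha256(key.encode()).hexdigest()
        path = os.path.join(cachedir, name)
        try:
            with open(path) as f:
                code = f.read()
        except FileNotFoundError:
            code = generate()
            # Write to a temporary file and rename it into place, so a
            # concurrent reader never sees a partially written entry.
            fd, tmp = tempfile.mkstemp(dir=cachedir)
            with os.fdopen(fd, "w") as f:
                f.write(code)
            os.replace(tmp, path)
    # bcast is collective: all ranks must reach this line.
    return comm.bcast(code, root=0)
```

With a real communicator, ranks other than 0 would skip the disk access and code generation entirely and receive the result from the broadcast.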

github-actions · 2025-01-23T16:42:24Z

	Tests	Passed ✅	Skipped ⏭️	Failed ❌
Firedrake real	8119 ran	7402 passed	717 skipped	0 failed

github-actions · 2025-01-23T16:51:26Z

	Tests	Passed ✅	Skipped ⏭️	Failed ❌
Firedrake complex	8105 ran	6559 passed	1546 skipped	0 failed

pyop2/global_kernel.py

Needs to be collective. Therefore the SPMD_STRICT partition does not make much sense.

connorjward · 2025-01-28T15:58:47Z

pyop2/caching.py

-
-                    value = local_cache.get(key, CACHE_MISS)
-
+def parallel_cache(


Previously we duplicated this logic to avoid a comm.allgather. I have determined that the allgather is essential to avoid deadlocks (a hit on only some of the ranks will otherwise deadlock). Since both code paths are now the same apart from a handful of debug() statements, I have removed the fork.
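To illustrate why the allgather matters, here is a rough sketch of a lookup in which every rank learns whether any rank missed. It is an assumption-laden illustration, not the actual `parallel_cache` implementation: `collective_get` and `FakeComm` are hypothetical names, and `FakeComm` only imitates one rank's view of `allgather` by appending canned answers from imaginary other ranks.

```python
CACHE_MISS = object()


class FakeComm:
    """Stand-in for one rank's view of a communicator. ``allgather``
    returns this rank's contribution followed by canned contributions
    from the other (imaginary) ranks."""

    def __init__(self, others=()):
        self.others = list(others)

    def allgather(self, obj):
        return [obj, *self.others]


def collective_get(comm, cache, key, compute):
    """Look up ``key``; recompute on all ranks if any rank missed."""
    value = cache.get(key, CACHE_MISS)
    # Every rank reports its hit or miss, so all ranks take the same branch.
    hits = comm.allgather(value is not CACHE_MISS)
    if not all(hits):
        # ``compute`` may itself be collective, so a rank that hit
        # locally must still call it when another rank missed.
        value = compute()
        cache[key] = value
    return value
```

Without the allgather, a rank with a local hit would return early while the ranks that missed wait inside a collective `compute`, which is the deadlock described above.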

I thought we were caching on comm. How can we get to the situation where we have cache_hit on some and cache_miss on the others?

It happens when we disk cache things, which uses the same code path.

Though I do not actually understand how, apart from having race conditions, we could actually have a hit on some ranks but not others. Ideas welcome.

So it turns out that the problem was actually that the keys were not consistent between ranks. I have enabled a check for this when PYOP2_SPMD_STRICT is enabled (which it now is in CI), and this has thrown up a number of places where we are doing the wrong thing and risking deadlocks.
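A check of this kind might look something like the sketch below. The names `check_keys_match` and `FakeComm` are hypothetical, the `strict` flag is passed explicitly here rather than read from PYOP2_SPMD_STRICT, and the real check in PyOP2 may be implemented differently.

```python
class FakeComm:
    """Stand-in for one rank's view of a communicator. ``allgather``
    returns this rank's key followed by canned keys from the other
    (imaginary) ranks."""

    def __init__(self, others=()):
        self.others = list(others)

    def allgather(self, obj):
        return [obj, *self.others]


def check_keys_match(comm, key, strict):
    """Raise if ranks disagree on the cache key (strict mode only)."""
    if not strict:
        # Skip the extra communication outside of strict mode.
        return
    keys = comm.allgather(key)
    if any(k != keys[0] for k in keys):
        raise ValueError(f"Cache keys differ between ranks: {keys}")
```

Keys built from something that varies per process, such as an object's `id()` or Python's randomised `hash()` of a string, would trip this check, whereas a key derived from a content digest would not.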

scripts/firedrake-run-split-tests

firedrake/tsfc_interface.py

Caching improvements

8d413cc

* Disk cache generated code * Only generate loopy code on one rank and broadcast to others

connorjward requested a review from ksagiyam January 23, 2025 16:21

always spit out log files

5ed1a91

ksagiyam requested changes Jan 23, 2025

pyop2/global_kernel.py (outdated, resolved)

pyop2/global_kernel.py (outdated, resolved)

connorjward added 7 commits January 24, 2025 00:13

Use a timeout method that preserves more information

8deaead

more cleaning

4d07082

fixupgs

8631b8c

fixup

d297a7e

Try and avoid race conditions

13168c0

Use strict SPMD behaviour to try and track these down. Also change behaviour of SPMD strict mode to raise an error when called non-collectively

fb57af4

Refactor parallel_cache decorator

84a8ad7

Needs to be collective. Therefore the SPMD_STRICT partition does not make much sense.

connorjward commented Jan 28, 2025

scripts/firedrake-run-split-tests (outdated, resolved)

connorjward commented Jan 28, 2025

scripts/firedrake-run-split-tests (resolved)

connorjward added 10 commits January 28, 2025 15:59

Apply suggestions from code review

b8235f8

experimenting

6a869ae

fixup

37ae507

debugging

845b01c

-s to build

ed4ef2b

more print

318cfd4

improvements, hopefully fixed?

4c5bbe4

avoid race conditions, does this fix things?

2dd2da6

Merge branch 'master' into connorjward/more-cache-fixes

1aed273

Merge remote-tracking branch 'origin/master' into connorjward/more-cache-fixes

ef5d03b

JHopeCollins mentioned this pull request Jan 30, 2025

CompilationError when using spatial parallel firedrakeproject/asQ#214

Open

connorjward added 3 commits January 30, 2025 15:14

Add extra SPMD_STRICT check

5c1b9f5

Is is not fixed?

1da4d5c

Fix bad hashing

81e8cb4

connorjward commented Jan 31, 2025

firedrake/tsfc_interface.py (resolved)

connorjward added 2 commits January 31, 2025 13:31

Point to FIAT branch

22054ed

linting

6592cdf

connorjward mentioned this pull request Jan 31, 2025

Fix quadrature rule hash firedrakeproject/fiat#132

Open

connorjward requested review from ksagiyam and JHopeCollins January 31, 2025 15:30

connorjward Jan 28, 2025

ksagiyam Jan 28, 2025

connorjward Jan 28, 2025

connorjward Jan 28, 2025

connorjward Jan 31, 2025
