Non-uniform processor allocation in domain-decomposed simulations #246

Draft · wants to merge 19 commits into dev
Conversation

alexandermote
Contributor

Opening a draft PR to get fresh eyes on my new DD code. We can now handle decomposed mesh tallies of varying sizes, and the dd_slab_reed test should pass when run with 4 processors. Currently working on getting dd_slab_reed and dd_cooper to pass when run with multiple processors per subdomain; from there, we should be able to add non-uniform work ratios fairly easily.
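(For reference, a minimal, self-contained sketch of one way variable-sized subdomain tallies can be collected with mpi4py. The Gatherv-based approach and all names here are illustrative assumptions, not necessarily what this PR does.)

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank owns a tally slab whose length varies by subdomain (illustrative).
local_size = 10 + rank
local_tally = np.full(local_size, float(rank))

# Rank 0 needs every rank's size to build receive counts and displacements.
sizes = comm.gather(local_size, root=0)
if rank == 0:
    sizes = np.array(sizes, dtype=int)
    displs = np.zeros_like(sizes)
    displs[1:] = np.cumsum(sizes)[:-1]
    recvbuf = np.empty(sizes.sum(), dtype=np.float64)
    comm.Gatherv(local_tally, [recvbuf, sizes, displs, MPI.DOUBLE], root=0)
else:
    comm.Gatherv(local_tally, None, root=0)
```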

@alexandermote
Contributor Author

Apparently I wasn't running on the most up-to-date version of dev; as a result, it looks like the DD code isn't working. I'll look into it and see if I can figure out where the issue is.

@alexandermote
Contributor Author

The tally values I get from this version are identical to the ones I was getting on my branch, which made me wonder whether the answer.h5 file had somehow changed. Running the dd_slab_reed problem without domain decomposition and using that output as the answer.h5 for the regression test causes the test to pass again. It's possible that the changes I made to the dd_slab_reed input caused its output to diverge from the existing answer file, but since the new output matches a non-DD simulation, I believe it is correct.
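(A quick way to spot-check this kind of comparison; the dataset paths below are assumptions for illustration, not the repo's exact file layout.)

```python
import h5py
import numpy as np

# Compare a freshly generated non-DD output against the stored answer.h5.
with h5py.File("output.h5", "r") as out, h5py.File("answer.h5", "r") as ans:
    new = out["tally/flux/mean"][()]   # assumed dataset path
    ref = ans["tally/flux/mean"][()]   # assumed dataset path
    assert np.allclose(new, ref, rtol=1e-12), "tally mismatch vs. answer.h5"
```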

@alexandermote
Contributor Author

The dd_slab_reed test succeeds in Python and Numba modes on my Dane build; the tests fail on GitHub because the CI job is trying to run with only 1 processor. Not sure why that's happening; my only guess is that it's a difference between launching with --mpiexec and --srun?
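(A small diagnostic that can help rule out the DD logic itself; just a sketch assuming the test is launched under MPI, with the rank count of 4 matching the 1-processor-per-domain setup discussed above.)

```python
from mpi4py import MPI

# Print the launched world size so CI logs show how many ranks actually ran.
size = MPI.COMM_WORLD.Get_size()
if MPI.COMM_WORLD.Get_rank() == 0:
    print(f"MPI world size = {size}")
assert size >= 4, "dd_slab_reed expects at least 4 ranks (1 per subdomain)"
```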

@ilhamv
Member

ilhamv commented Jan 6, 2025

@alexandermote, I made some updates. In particular, I changed how we distribute source particles locally when there are multiple processors per domain.
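(For context, a minimal sketch of an even local split of a domain's source particles among the ranks assigned to that domain; the function name and scheme are illustrative assumptions, not the actual code.)

```python
def local_particle_count(n_source, n_ranks_in_domain, local_rank):
    """Evenly divide a domain's source particles, spreading the remainder."""
    base, extra = divmod(n_source, n_ranks_in_domain)
    return base + (1 if local_rank < extra else 0)

# e.g., 10 particles over 3 ranks -> 4, 3, 3
counts = [local_particle_count(10, 3, r) for r in range(3)]
assert sum(counts) == 10
```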

Let's use test/regression/slab_reed_dd for testing. It runs fine with 4 processors (1 processor per domain). It also seems to run OK with 8 processors (2 processors per domain); however, it does not pass the mesh tally merging at the end of the simulation, which is possibly the only issue left. I see that you already put in some machinery that handles the multiple-processors-per-domain tally merging with MPI rank grouping, but it seems to need checking, as that is where the error is located (run with 8 processors to reproduce it).
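(A hedged sketch of the rank-grouping idea for the per-domain tally merge; the rank-to-domain mapping and names are assumptions for illustration, not the PR's exact mechanics.)

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N_DOMAIN = 4                        # assumed number of subdomains
domain_id = rank % N_DOMAIN         # assumed rank-to-domain mapping

# Split COMM_WORLD into one communicator per domain, then sum each group's
# partial tallies onto the group's lead rank.
domain_comm = comm.Split(color=domain_id, key=rank)
local_tally = np.random.rand(100)   # stand-in for this rank's partial tally
merged = np.zeros_like(local_tally)
domain_comm.Reduce(local_tally, merged, op=MPI.SUM, root=0)
```

Running it with 8 ranks gives two ranks per domain group, which mirrors the 2-processors-per-domain case that currently fails.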

@alexandermote
Contributor Author

@ilhamv
I believe I have fixed the tally merging issue. I ran dd_slab_reed with 4, 8, and 16 processors, and achieved identical results across all 3.
I have noticed that when I run dd_slab_reed without domain decomposition enabled, it now produces a different result than we get from the simulations with domain decomposition active. If you have time to see if you can reproduce that issue, I'd appreciate it. I will work on getting non-uniform processor allocation working next.

@ilhamv
Member

ilhamv commented Jan 7, 2025

@alexandermote, the test slab_reed is the one without domain decomposition, and it passes reproducibly. So I think we are good with the multiple-processors-per-domain case.

Do you plan to test and include dd_cooper as the multidimensional domain decomposition test?

@alexandermote
Contributor Author

@ilhamv:
I get different results between slab_reed and dd_slab_reed; not sure why. I built a 3D version of Reed's problem that I used to verify 3D domain decomposition in my M&C paper. I'll add it to this repo in the next push.
I noticed a discrepancy when running with work_ratio active, which has led me to rewrite a couple of pieces of the code, including the local sourcing code you added. I'm working on making sure the standard deviation values are equivalent across different processor allocations, and then I will push the changes. That push will also add support for non-uniform processor allocation.
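(One possible shape for the non-uniform allocation, shown only as a sketch under the assumption that work_ratio is a per-domain weight list; the helper name is hypothetical.)

```python
import numpy as np

def allocate_ranks(n_ranks, work_ratio):
    """Assign MPI ranks to domains in proportion to work_ratio."""
    work_ratio = np.asarray(work_ratio, dtype=float)
    ideal = n_ranks * work_ratio / work_ratio.sum()
    alloc = np.floor(ideal).astype(int)
    # Hand leftover ranks to the domains with the largest fractional parts.
    remainder = ideal - alloc
    for i in np.argsort(remainder)[::-1][: n_ranks - alloc.sum()]:
        alloc[i] += 1
    return alloc

print(allocate_ranks(8, [1, 1, 2, 4]))   # -> [1 1 2 4]
```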
