Replies: 1 comment
In principle, the multi-box ParallelFor version has some additional overhead from needing to choose the correct box to use and from setting up the data structure for that. I have never benchmarked this in detail; however, I would expect the multi-box version to be better if there are a lot of tiny boxes, and the normal version to be faster if there is only a single box. Maybe the main reason it is not used everywhere is historical: it was added later and not everything was converted. Technically, if there are multiple ParallelFors in an MFIter loop, the normal version could take better advantage of CPU/GPU cache thanks to the tiling done by MFIter. However, again, I am not sure whether this effect is actually measurable.
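For reference, the routines that do use the fused launch typically guard it with a runtime check. Here is a minimal sketch of that pattern, assuming the `FabArrayBase::isFusingCandidate()` and multi-box `ParallelFor` overloads available in recent AMReX versions; the function name and the per-cell work are placeholders, not code from any particular routine:

```cpp
#include <AMReX_MultiFab.H>

// Sketch only: 'apply' and the '+= 1' body stand in for a real per-cell
// operation; the structure mirrors the GPU branch seen in routines such
// as average_down.
void apply (amrex::MultiFab& mf)
{
    using namespace amrex;

    if (Gpu::inLaunchRegion() && mf.isFusingCandidate()) {
        // Many small boxes: a single fused kernel launch over all of them.
        auto const& ma = mf.arrays();
        ParallelFor(mf, IntVect(0),
        [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k)
        {
            ma[box_no](i,j,k) += Real(1.0); // placeholder work
        });
        Gpu::streamSynchronize(); // keep 'ma' alive until the kernel finishes
    } else {
        // Few/large boxes (or CPU builds with tiling): one launch per box/tile.
        for (MFIter mfi(mf, TilingIfNotGPU()); mfi.isValid(); ++mfi) {
            const Box& bx = mfi.tilebox();
            Array4<Real> const& a = mf.array(mfi);
            ParallelFor(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k)
            {
                a(i,j,k) += Real(1.0); // placeholder work
            });
        }
    }
}
```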
I've been looking through some of the code for common AMReX operations (average_down, crse_init, fine_add, etc.) and noticed that in some instances the ParallelFor is inside an MFIter loop, while in other cases the GPU branch launches a single ParallelFor that iterates over the boxes (both styles are sketched at the end of this post).
At first glance, it seems reasonable that the second option would result in fewer kernel launches when there are many boxes, and thus reduce kernel launch overhead, which is appealing. However, not all of the above operations use that ParallelFor; fineAdd is one example.
So I have two questions: is there an algorithmic reason that the ParallelFor over boxes isn't used everywhere for GPU? And would I expect to see a speedup on GPU in my code if I replaced my MFIter loops with the ParallelFor over boxes in the case where there are many boxes?
Currently I have lowered the gridding efficiency on GPU builds to avoid having too many boxes and the associated launch overhead, but this can result in large areas of refinement that aren't strictly needed for the problem.
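For concreteness, the two launch styles I'm comparing look roughly like this. This is a minimal sketch with made-up function names and a trivial placeholder operation, not code taken from any specific AMReX routine:

```cpp
#include <AMReX_MultiFab.H>

// Style 1: ParallelFor inside an MFIter loop -- one kernel launch per box
// (per tile on CPU builds).
void scale_per_box (amrex::MultiFab& mf, amrex::Real factor)
{
    using namespace amrex;
    for (MFIter mfi(mf, TilingIfNotGPU()); mfi.isValid(); ++mfi) {
        const Box& bx = mfi.tilebox();
        Array4<Real> const& a = mf.array(mfi);
        ParallelFor(bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k)
        {
            a(i,j,k) *= factor;
        });
    }
}

// Style 2: a single multi-box ParallelFor -- one fused launch over all boxes,
// with box_no selecting the Array4 out of mf.arrays().
void scale_fused (amrex::MultiFab& mf, amrex::Real factor)
{
    using namespace amrex;
    auto const& ma = mf.arrays();
    ParallelFor(mf, IntVect(0),
    [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k)
    {
        ma[box_no](i,j,k) *= factor;
    });
    Gpu::streamSynchronize(); // keep 'ma' alive until the kernel finishes
}
```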