# `build program` times increasing with rank count on Mac when caching is enabled (#731)

## Comments
Here's a breakdown of what's happening inside
The slowdown appears to be coming from these calls. Timing the two separately, it looks like the second one specifically is to blame.
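A minimal sketch of the kind of side-by-side timing described here; the two functions below are hypothetical stand-ins for the actual calls, which aren't named in this excerpt:

```python
import time

def first_suspect_call():   # hypothetical placeholder for the first call
    pass

def second_suspect_call():  # hypothetical placeholder for the second call
    pass

t0 = time.perf_counter()
first_suspect_call()
t1 = time.perf_counter()
second_suspect_call()
t2 = time.perf_counter()

print(f"first call:  {t1 - t0:.6f} s")
print(f"second call: {t2 - t1:.6f} s")
```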
I think those
Based on @matthiasdiener's comment and our discussion this morning, I made some more measurements, this time on the whole compile time. Specifically, I compared the first-step time of the grudge wave example for:

If I understand correctly, the main time difference between these should come down to the cache writing time. Here's what I see (same setup as before, with rank-local cache dirs; also, I am manually applying the changes from #716, which don't seem to have made it to the version on conda yet):

The scaling is not good, but could be due to DAG splat. Additionally, it seems as if the cache writing is taking a lot of time. However, if I add an (unused) call to which suggests that most of the time is coming from the
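For concreteness, a rough sketch of how such a first-step timing could be toggled between cached and uncached builds; `run_first_step()` is a hypothetical stand-in for the grudge wave driver, and `PYOPENCL_NO_CACHE` is the environment variable already used elsewhere in this issue:

```python
import os
import time

# Toggle pyopencl's kernel cache before anything is compiled.
# (Comment this out to measure the caching-enabled case.)
os.environ["PYOPENCL_NO_CACHE"] = "1"

def run_first_step():
    # Hypothetical stand-in for setting up the grudge wave example and
    # taking its first time step (where compilation happens).
    pass

start = time.perf_counter()
run_first_step()
print(f"first step took {time.perf_counter() - start:.2f} s")
```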
It sure looks that way. It might require duplicate compilation in pocl? (I'm not sure where, but your second graph is enough for me.) Based on this, I think we should definitely turn off pyopencl's CL binary caching for pocl. PR? It might also be worthwhile to understand what pocl is doing under the hood.
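As a sketch of what such a change could look like on the application side (assuming pocl reports its platform name as "Portable Computing Language", and reusing the `PYOPENCL_NO_CACHE` environment variable mentioned in this issue; an actual PR would presumably do this inside pyopencl itself):

```python
import os
import pyopencl as cl

ctx = cl.create_some_context()

# Skip pyopencl's on-disk kernel cache when running on pocl, where the
# cache appears to add build time rather than save it.  This must run
# before the first Program(...).build() call.
if any("Portable Computing Language" in dev.platform.name
       for dev in ctx.devices):
    os.environ["PYOPENCL_NO_CACHE"] = "1"
```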
I think what happens is the following:

Example pyopencl code:

```python
import numpy as np

import pyopencl as cl
import pyopencl.array as cl_array

rng = np.random.default_rng()
a = rng.random(50000, dtype=np.float32)
b = rng.random(50000, dtype=np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl_array.to_device(queue, a)
b_dev = cl_array.to_device(queue, b)
dest_dev = cl_array.empty_like(a_dev)

# The body statement is repeated many times to inflate the kernel source,
# making the build cost easy to observe.
prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
                      __global const float *b,
                      __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    """ + "c[gid] = a[gid] + b[gid];"*1000 + "}").build()

knl = prg.sum  # Use this Kernel object for repeated calls
knl(queue, a.shape, None, a_dev.data, b_dev.data, dest_dev.data)

assert np.allclose(dest_dev.get(), a + b)
```
I haven't found a way to disable this behavior.
Thanks for doing more digging here, @matthiasdiener! While we didn't decode that a "generic" kernel was being built, we did track down where the time was going. Important question: are all these conclusions still valid for the Nvidia target? They seem device-unspecific, but I don't know how a generic kernel would be different from a size-specific one in the GPU case. At any rate, at least for CPU, we can probably save time by skipping pyopencl's binary cache if we're working with pocl.
Seems like the time spent in
## Issue description

The times reported by the

```
build program: kernel '<name>' was part of a lengthy source build resulting from a binary cache miss (<time>)
```

output appear to increase fairly dramatically with rank count on my Mac with caching enabled, even when using rank-local cache directories. For example, when running the `wave-op-mpi` example in grudge with 4 ranks and caching disabled via `PYOPENCL_NO_CACHE=1`, I see:

With caching enabled (and empty cache) I see:
(Note: `rhs` is missing from the first output, presumably because the time is below the output threshold. The lack of `frozen_nodes0_2d` in the second output is confusing though.)

If I increase to 16 ranks, with no caching I see:
(again no `rhs`). And with caching I see:

(full `build program` output here).

If I profile with `pyinstrument`, I see an increase in time spent in `Program.build` inside grudge's `_DistributedCompiledFunction.__call__`. Here's the profiling output without caching:

and with:
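For reference, a sketch of how a `pyinstrument` trace like this can be collected programmatically; `run_example()` is a placeholder for the actual wave-op-mpi driver:

```python
from pyinstrument import Profiler

def run_example():
    # Placeholder for the grudge wave-op-mpi driver.
    pass

profiler = Profiler()
profiler.start()
run_example()
profiler.stop()

# Program.build shows up in this call tree when compilation dominates.
print(profiler.output_text(unicode=True, color=False))
```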
Here's the script I'm using to run the example:

(run with `rm -rf .cache && mpiexec -n 4 bash run.sh`.)

I haven't been able to try running this on Lassen yet to see if I get the same behavior there; I'm currently running into some environment issues.
cc @matthiasdiener