step_size Default Value in preprocess #1275
-
I would argue that 10_000 is very small. 100_000 or 200_000 are typical good step sizes, unless you end up reading all the branches somehow, but that would be bad anyway. I'm not sure the default should specify something like that, though. One may have a lot of RAM and just want to check if files are there only with […]
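For a rough sense of the tradeoff being discussed here, a back-of-the-envelope sketch (the event count, branch count, and bytes-per-value are made-up illustration numbers, not measurements from any real sample):

```python
# Back-of-the-envelope: how step_size trades task count against per-step memory.
# All numbers are hypothetical; substitute values from your own files.
n_events_in_file = 1_000_000   # events in a single input file
n_branches_read = 50           # branches the processor actually touches
bytes_per_value = 8            # assume flat 64-bit branches for simplicity

for step_size in (10_000, 100_000, 200_000):
    n_steps = -(-n_events_in_file // step_size)  # ceiling division
    step_mb = step_size * n_branches_read * bytes_per_value / 1e6
    print(f"step_size={step_size:>7,}: {n_steps:>4} steps per file, "
          f"~{step_mb:.0f} MB of raw branch data per step")
```

Smaller steps keep each worker's peak memory low at the cost of more tasks and more scheduling overhead, which is roughly the tension in this thread.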
-
I think that the best way forward is to make it so that […]. Is that a reasonable compromise?
-
That sounds good to me!
-
Hello, I was working with a newer user and discovered that the default behavior for `preprocess` when `step_size` is unset appears to be that each chunk is a whole file. (It's very possible I'm misunderstanding what's going on in `get_steps` and its associated `map_partitions` call.) I haven't seen this cause any tension at the `preprocess` level, but if you try to compute with this preprocessed fileset on a distributed cluster, I think this may lead to workers quickly running out of memory.

In our case, we are computing on a Condor cluster, where worker memory requests are usually in the 2 GB range, especially for simpler, early-stage analysis code. With default kwargs for `preprocess`, we were seeing workers run out of memory with a rather simple processor (kinematic cuts -> make a plot). Once we set `step_size` to 10,000, we were back up and running.

It seems like a common suggestion for running on e.g. a Condor cluster is that `step_size` be around 10,000. Is there any disadvantage to this being the new default?

I recognize that my group's workflow isn't the only workflow, so I'm happy to leave the default as-is and just document things better within my own group. But I at least thought I'd ask and see what the community opinion is :)
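For concreteness, a minimal sketch of the workaround described above, assuming the `coffea.dataset_tools.preprocess` entry point; the dataset name, file URLs, and tree name are placeholders, and the exact keyword names and return values may differ between coffea versions:

```python
from coffea.dataset_tools import preprocess

# Placeholder fileset: dataset name, file URLs, and tree name are made up.
fileset = {
    "my_dataset": {
        "files": {
            "root://xrootd.example.org//store/user/someone/file1.root": "Events",
            "root://xrootd.example.org//store/user/someone/file2.root": "Events",
        }
    }
}

# Explicitly request ~10k-event steps so each task stays within a ~2 GB
# Condor worker; with step_size left unset, a step can span a whole file.
dataset_runnable, dataset_updated = preprocess(
    fileset,
    step_size=10_000,
    skip_bad_files=True,  # optional; shown only as a commonly used knob
)
```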
Thanks!