step_size Default Value in preprocess #1275
-
I would argue that 10_000 is very small. 100_000 or 200_000 are typical good step sizes, unless you end up reading all the branches somehow, but that would be bad anyway. I'm not sure the default should specify something like that, though. One may have a lot of RAM and just want to check if files are there only with […]
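For a rough sense of the tradeoff being discussed here, a back-of-the-envelope sketch (the event count, branch count, and bytes-per-value are made-up illustration numbers, not measurements from any real sample):

```python
# Back-of-the-envelope: how step_size trades task count against per-step memory.
# All numbers are hypothetical; substitute values from your own files.
n_events_in_file = 1_000_000   # events in a single input file
n_branches_read = 50           # branches the processor actually touches
bytes_per_value = 8            # assume flat 64-bit branches for simplicity

for step_size in (10_000, 100_000, 200_000):
    n_steps = -(-n_events_in_file // step_size)  # ceiling division
    step_mb = step_size * n_branches_read * bytes_per_value / 1e6
    print(f"step_size={step_size:>7,}: {n_steps:>4} steps per file, "
          f"~{step_mb:.0f} MB of raw branch data per step")
```

Smaller steps keep each worker's peak memory low at the cost of more tasks and more scheduling overhead, which is roughly the tension in this thread.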
-
I think that the best way forward is to make it so that […]. Is that a reasonable compromise?
-
That sounds good to me!
-
Hello, I was working with a newer user and discovered that the default behavior for `preprocess` when `step_size` is unset appears to be that each chunk is a whole file. (It's very possible I'm misunderstanding what's going on in `get_steps` and its associated `map_partitions` call.) I haven't seen this cause any tension at the `preprocess` level, but if you try to compute with this preprocessed fileset on a distributed cluster, I think this may lead to workers quickly running out of memory.

In our case, we are computing on a Condor cluster, where worker memory requests are usually in the 2 GB range, especially for simpler, early-stage analysis code. With default kwargs for `preprocess`, we were seeing workers run out of memory with a rather simple processor (kinematic cuts -> make a plot). Once we set `step_size` to 10,000, we were back up and running.

It seems like a common suggestion for running on e.g. a Condor cluster is that `step_size` be around 10,000. Is there any disadvantage to this being the new default?

I recognize that my group's workflow isn't the only workflow, so I'm happy to leave the default as-is and just document things better within my own group. But I at least thought I'd ask and see what the community opinion is :)
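For concreteness, a minimal sketch of the workaround described above, assuming the `coffea.dataset_tools.preprocess` entry point; the dataset name, file URLs, and tree name are placeholders, and the exact keyword names and return values may differ between coffea versions:

```python
from coffea.dataset_tools import preprocess

# Placeholder fileset: dataset name, file URLs, and tree name are made up.
fileset = {
    "my_dataset": {
        "files": {
            "root://xrootd.example.org//store/user/someone/file1.root": "Events",
            "root://xrootd.example.org//store/user/someone/file2.root": "Events",
        }
    }
}

# Explicitly request ~10k-event steps so each task stays within a ~2 GB
# Condor worker; with step_size left unset, a step can span a whole file.
dataset_runnable, dataset_updated = preprocess(
    fileset,
    step_size=10_000,
    skip_bad_files=True,  # optional; shown only as a commonly used knob
)
```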
Thanks!