New feature
Hello Nextflow maintainers,
thank you very much for developing Nextflow and for the continuous effort in maintaining it.
I would like to propose a new process-level control to prevent large intermediate outputs from accumulating in `work/` when an upstream process is much faster than the downstream one.
In short: a per-process cap on "in-flight" outputs (outputs already produced by a process but not yet fully consumed/drained by downstream), to provide backpressure and bound peak disk usage.
Use case
The main scenario is HPC (or any quota-limited environment) where `work/` is on a filesystem with a strict disk quota.
Typical pattern:
- Process A: fast, produces a very large output per sample (e.g. 10–100 GB)
- Process B: slow (or low parallelism), consumes that output
- With ~100 samples, Process A can finish quickly and produce many huge outputs while Process B has completed only a few.
- These outputs must stay in `work/` until downstream is finished, so temporary disk usage grows substantially and can exceed the quota (a minimal sketch of the pattern follows this list).
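For concreteness, here is a minimal sketch of the fast-producer/slow-consumer pattern in Nextflow DSL2 (process names, sizes, and commands are hypothetical placeholders):

```groovy
// Hypothetical minimal reproduction: a fast producer and a slow consumer.
process PROCESS_A {
    input:
    val sample

    output:
    path "${sample}.big"

    script:
    """
    # fast step emitting a large intermediate (placeholder: 1 GiB of zeros)
    dd if=/dev/zero of=${sample}.big bs=1M count=1024
    """
}

process PROCESS_B {
    maxForks 2  // low parallelism: the slow consumer

    input:
    path big

    output:
    path "${big}.done"

    script:
    """
    sleep 300          # stand-in for a long-running analysis
    touch ${big}.done
    """
}

workflow {
    Channel.of(1..100) | PROCESS_A | PROCESS_B
}
```

Here PROCESS_A can finish all 100 tasks long before PROCESS_B drains them, so up to ~100 large outputs sit in `work/` at once.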
I also use `nf-boost` to minimise `work/`, but it cannot remove these intermediates early because they are still required by downstream tasks, so the peak-usage problem remains.
Why existing features do not solve it:
- `executor.queueSize` is global; it does not enforce fairness between processes (a fast upstream process can fill it first).
- Reducing the upstream process's `maxForks` is possible, but it often requires manual tuning and can reduce throughput; it also does not address submission starvation in a general way.
- `nf-boost` helps only once files are no longer needed; it cannot prevent accumulation while downstream is still slow.
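For reference, the closest workaround today is throttling the producer directly via standard configuration, e.g. (illustrative value):

```groovy
// Current partial workaround: cap the producer's parallelism.
// This bounds in-flight outputs only indirectly, needs manual tuning
// per pipeline/filesystem, and throttles A even when disk is plentiful.
process {
    withName: PROCESS_A {
        maxForks = 4
    }
}
```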
Related discussions:
- Yet another "huge intermediate files" issue #2468 (huge intermediate files; merging processes often suggested but not always desirable)
- Remove single processed files before the end of the whole process #1754 (ideas about earlier cleanup / downstream start; same pain point but different direction)
Suggested implementation
Add a per-process directive (name is flexible) that limits how many outputs from that process can be “in flight” at the same time.
Example (illustrative):

```groovy
process {
    withName: PROCESS_A {
        maxInFlight = 5
    }
}
```
Proposed semantics:
- For a process with `maxInFlight = N`, Nextflow should allow at most N tasks' worth of outputs to be produced and remain "pending" for downstream.
- If the limit is reached, new tasks of that process should not start / not be submitted until downstream has progressed enough that some previously produced outputs are no longer pending (e.g. all dependent downstream tasks for those outputs have completed successfully).
- This is not asking to delete files unsafely: it only restricts ahead-of-time production to keep disk usage bounded (see the back-of-the-envelope bound after this list).
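With the example numbers from the use case above, the effect on peak disk usage is easy to quantify (a hypothetical illustration, assuming 100 GB per output):

```groovy
// Illustrative arithmetic using this issue's example numbers.
def samples      = 100   // total samples
def outputSizeGb = 100   // size of PROCESS_A's output per sample
def maxInFlight  = 5     // proposed cap

def uncappedPeak = samples     * outputSizeGb  // up to 10,000 GB pending
def cappedPeak   = maxInFlight * outputSizeGb  // bounded at 500 GB
println "uncapped peak: ${uncappedPeak} GB, capped peak: ${cappedPeak} GB"
```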
Implementation building blocks:
- Maintain an "in-flight" counter per process (or per output channel edge); increment it when a task of that process completes and materialises its outputs.
- Decrement it when all downstream tasks that consume that output have completed (for a simple linear A→B this is just when the corresponding B task finishes; for branching, when all dependants finish).
- Integrate this check into the task dispatch/submission logic so that producers are paused when they outrun consumers (a minimal sketch follows).
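A minimal sketch of the counting logic, assuming a plain semaphore per capped process (this is not Nextflow's internal API; the class and method names are hypothetical):

```groovy
import java.util.concurrent.Semaphore

// Hypothetical sketch: one limiter per process carrying a maxInFlight directive.
class InFlightLimiter {
    private final Semaphore permits

    InFlightLimiter(int maxInFlight) {
        permits = new Semaphore(maxInFlight)
    }

    /** Called before submitting a producer task; blocks once N outputs are pending. */
    void beforeSubmit() {
        permits.acquire()
    }

    /** Called once every downstream task consuming a given output has finished. */
    void onOutputDrained() {
        permits.release()
    }
}
```

Acquiring at submission time (rather than at task completion) also counts tasks that are already running, so the cap cannot be overshot by a burst of in-progress producers. For branching outputs, a per-output reference count of pending dependants would decide when `onOutputDrained()` fires.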
Environment:
- Nextflow version: v25.10.2
- Executor: slurm
- nf-boost version: v0.6.0
Thanks again for your time.