fix: Fix spurious missing output error when using AWS Batch executor #58
base: main
Conversation
📝 Walkthrough

Adjusted S3 path directory-detection cache initialization: replaced an explicit `None` check for `_is_dir` with a truthiness check, changing when the cache is populated and potentially returning `None` from `is_dir()` instead of a boolean.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant S3Path as S3Path
    participant S3 as S3 API
    Caller->>S3Path: is_dir()
    alt Cache check (new)
        note over S3Path: if not self._is_dir
        S3Path->>S3: get_subkeys()
        S3-->>S3Path: subkeys list
        S3Path->>S3Path: self._is_dir = any(subkeys)
    else Cache assumed truthy/False
        note over S3Path: Skips population if self._is_dir is None (falsy?) or False handling differs
    end
    S3Path-->>Caller: self._is_dir (may be None/True/False)
```
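To make the behavioral difference concrete, here is a minimal sketch of the before/after cache logic (simplified names; `get_subkeys` stands in for the plugin's real S3 listing call):

```python
from typing import Callable, Iterable, Optional

class S3PathSketch:
    """Toy model of the plugin's directory-detection cache."""

    def __init__(self, get_subkeys: Callable[[], Iterable[str]]):
        self.get_subkeys = get_subkeys
        self._is_dir: Optional[bool] = None  # None = never checked

    def is_dir_before(self) -> bool:
        # Pre-PR: query S3 exactly once; a cached False is never revisited,
        # so a prefix created later (e.g., by a Batch job) stays invisible.
        if self._is_dir is None:
            self._is_dir = any(self.get_subkeys())
        return self._is_dir

    def is_dir_after(self) -> bool:
        # Post-PR: any falsy cache value (None or False) triggers a re-query,
        # so the stale-False case self-heals at the cost of extra S3 calls.
        if not self._is_dir:
            self._is_dir = any(self.get_subkeys())
        return self._is_dir
```

Note that in both variants `any()` assigns a plain boolean before returning, so the diagram's "may return None" case would only arise if `_is_dir` were reset to `None` elsewhere between the check and the return.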
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_storage_plugin_s3/__init__.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.
Files:
snakemake_storage_plugin_s3/__init__.py
```python
if not self._is_dir:
    self._is_dir = any(self.get_subkeys())
return self._is_dir
```
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Rechecking S3 on every call when the directory doesn't exist creates performance overhead.
The change from `if self._is_dir is None:` to `if not self._is_dir:` solves the AWS Batch stale-cache problem by rechecking S3 whenever `_is_dir` is False. However, this means that for non-existent or empty directories, every subsequent `is_dir()` call will query S3 again, which is wasteful.
The root cause is that `_is_dir` is cached across remote state changes (e.g., when a Batch job creates files). Better alternatives might include:
- Invalidate the cache when appropriate: reset `_is_dir = None` after job completion or when `--latency-wait` expires.
- Time-based cache expiration: track when `_is_dir` was last populated and recheck after a timeout.
- Separate "checked" from "is directory" state: use a separate flag to track whether the check has been performed vs. what the result was.

Additionally, mixing a None-check (uninitialized state) with a truthiness check (value comparison) makes the caching strategy unclear. Consider explicitly handling the three states: uninitialized (`None`), not a directory (`False`), and is a directory (`True`).
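A minimal sketch of that explicit three-state handling, assuming an external invalidation point exists (`invalidate_dir_cache` is an invented hook, not an existing plugin method):

```python
from typing import Callable, Iterable, Optional

class DirCacheSketch:
    """Three-state directory cache: None = unchecked, True/False = cached."""

    def __init__(self, get_subkeys: Callable[[], Iterable[str]]):
        self.get_subkeys = get_subkeys
        self._is_dir: Optional[bool] = None

    def is_dir(self) -> bool:
        # Query S3 only in the uninitialized state; cached results stick
        # until explicitly invalidated.
        if self._is_dir is None:
            self._is_dir = any(self.get_subkeys())
        return self._is_dir

    def invalidate_dir_cache(self) -> None:
        # Hypothetical hook: call when remote state may have changed (e.g.,
        # after an AWS Batch job completes) so the next is_dir() re-queries
        # S3 instead of trusting a stale False.
        self._is_dir = None
```

Compared to the PR's truthiness check, this keeps empty prefixes cheap between invalidations while still recovering from the Batch scenario, at the cost of needing a reliable place to call the invalidation hook.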
To verify the performance impact, run this script to identify how often `is_dir()` is called and whether the rechecking behavior is triggered frequently:
🏁 Script executed:

```bash
#!/bin/bash
# Description: Search for is_dir() calls and usage patterns to assess performance impact

# Find all calls to is_dir() in the codebase
echo "=== Locations calling is_dir() ==="
rg -n 'is_dir\(\)' -A 2 -B 2

# Check if there are any cache invalidation patterns
echo -e "\n=== Patterns that might invalidate _is_dir cache ==="
rg -n '_is_dir\s*=' -A 1 -B 1

# Look for any retry or latency-wait mechanisms
echo -e "\n=== Latency wait and retry mechanisms ==="
rg -n 'latency.wait|retry' -A 2 -B 2
```

Length of output: 4565
Separate cache initialization from value and prevent repeated S3 queries
Changing `if self._is_dir is None` to `if not self._is_dir` forces every `is_dir()` call on an empty or non-existent prefix (used in `exists`, `mtime`, `size`, `retrieve_object`, `remove`) to requery S3, causing substantial overhead. Instead:
- Keep the three-state logic: uninitialized (`None`), directory (`True`), not-a-directory (`False`).
- Invalidate or expire the cache explicitly (e.g., reset `_is_dir = None` after job completion or use a time-based TTL).
- Or track a separate "checked" flag versus the actual boolean result.

Refactor the cache logic accordingly to avoid S3 thrashing; the TTL variant is sketched below.
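One way the time-based alternative might look (a sketch only; `DIR_CACHE_TTL` and `_checked_at` are invented names, and the TTL value is arbitrary):

```python
import time
from typing import Callable, Iterable, Optional

DIR_CACHE_TTL = 30.0  # seconds; would need tuning against --latency-wait

class TTLDirCache:
    def __init__(self, get_subkeys: Callable[[], Iterable[str]]):
        self._get_subkeys = get_subkeys
        self._is_dir: Optional[bool] = None
        self._checked_at = 0.0

    def is_dir(self) -> bool:
        expired = (time.monotonic() - self._checked_at) > DIR_CACHE_TTL
        # Trust a cached True (a prefix doesn't vanish mid-run once created);
        # re-check only an uninitialized or stale-False cache.
        if self._is_dir is None or (self._is_dir is False and expired):
            self._is_dir = any(self._get_subkeys())
            self._checked_at = time.monotonic()
        return self._is_dir
```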
🤖 Prompt for AI Agents
In snakemake_storage_plugin_s3/__init__.py around lines 329 to 331, the current
check uses "if not self._is_dir" which collapses the three-state cache into a
boolean and causes repeated S3 queries for empty/non-existent prefixes; change
the logic to preserve the uninitialized state by checking "if self._is_dir is
None" before calling self.get_subkeys(), set self._is_dir to True/False based on
the result, and implement explicit cache invalidation (reset self._is_dir =
None) after job completion or add a separate checked flag or a TTL timestamp to
avoid thrashing S3 while keeping correct cached values.
I made another fix that is slightly better regarding the cache invalidation issue you mentioned: #60.
I was hit with the same issue and did some digging: #61. More details are in that issue thread, but basically the reason it works when executing locally is that snakemake just uses local file paths when checking for existing/missing outputs, even if remote storage is used. This is because all files for remote storage are initially created locally, so snakemake simply checks whether the local path exists. Would be nice if they merge in a fix soon! Clearly this issue is snagging quite a few people.
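A rough illustration of that conjectured behavior (hypothetical function and parameter names, not the actual snakemake API):

```python
from pathlib import Path

def output_exists(local_path: str, storage_object) -> bool:
    # Local runs: files destined for remote storage are created locally first,
    # so this plain filesystem check succeeds and the S3 cache is never hit.
    if Path(local_path).exists():
        return True
    # Batch runs: the submitting host has no local copy, so the check falls
    # through to the storage object's directory test and its stale cache --
    # the failure mode described above.
    return storage_object.is_dir()
```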
When using this plugin in conjunction with the snakemake-executor-plugin-aws-batch and a `directory` output from a job executed on AWS Batch, I receive a spurious missing-output error. However, the expected output objects within this prefix are created as expected by the job executed on Batch. Increasing the `--latency-wait` parameter does not change the behavior.

I'm uncertain if this is really the "source" of the error, since it does not happen when running locally (still using the S3 storage plugin), but this change to recompute `_is_dir` of the `StorageObject` when it is falsy instead of just `None` seems to resolve the issue. Someone more familiar with the internals of snakemake-interface-storage-plugins/snakemake-interface-executor-plugins/snakemake may be able to point to a more appropriate place to address this issue. Querying S3 more than necessary is obviously not the most elegant solution.

I'm including a Snakefile that recreates the issue; a working AWS Batch queue/compute environment and a default S3 storage prefix are required to reproduce it. I am using the latest (as of 14 Oct 2025) versions of snakemake and the mentioned plugins, Python 3.13, on an Ubuntu 24.04 based Docker image.