2 changes: 1 addition & 1 deletion snakemake_storage_plugin_s3/__init__.py
@@ -326,7 +326,7 @@ def retrieve_object(self):
         self.s3obj().download_file(self.local_path())
 
     def is_dir(self):
-        if self._is_dir is None:
+        if not self._is_dir:
             self._is_dir = any(self.get_subkeys())
         return self._is_dir
Comment on lines +329 to 331

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Re-checking S3 on every call when the directory doesn't exist creates performance overhead.

The change from `if self._is_dir is None:` to `if not self._is_dir:` solves the AWS Batch stale-cache problem by re-checking S3 whenever `_is_dir` is `None` or `False`. However, this means that for non-existent or empty directories, every subsequent `is_dir()` call will query S3 again, which is wasteful.

The root cause is that `_is_dir` is cached across remote state changes (e.g., when a Batch job creates files). Better alternatives might include:

  1. Invalidate the cache when appropriate: Reset `_is_dir = None` after job completion or when `--latency-wait` expires.
  2. Time-based cache expiration: Track when `_is_dir` was last populated and recheck after a timeout.
  3. Separate "checked" from "is directory" state: Use a separate flag to track whether the check has been performed vs. what the result was.

Additionally, mixing a `None` check (uninitialized state) with a truthiness check (value comparison) makes the caching strategy unclear. Consider explicitly handling the three states: uninitialized (`None`), not a directory (`False`), and is a directory (`True`).
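
As a rough illustration of that three-state approach, here is a minimal sketch, assuming the only probe needed is the existing `get_subkeys()` listing; the `DirCacheSketch` class, the injected `list_subkeys` callable, and the `invalidate()` hook are hypothetical names, not the plugin's actual API:

```python
class DirCacheSketch:
    def __init__(self, list_subkeys):
        self._list_subkeys = list_subkeys  # stands in for the real S3 listing call
        self._is_dir = None                # None = unchecked, True/False = cached result

    def is_dir(self):
        if self._is_dir is None:           # query S3 only while the cache is uninitialized
            self._is_dir = any(self._list_subkeys())
        return self._is_dir

    def invalidate(self):
        # Reset after a remote state change (e.g. a Batch job wrote new keys)
        # so the next is_dir() call re-checks S3.
        self._is_dir = None


# An empty prefix is cached as False and not re-queried until invalidated.
obj = DirCacheSketch(lambda: iter([]))
assert obj.is_dir() is False
obj.invalidate()  # e.g. after --latency-wait expires, force a fresh check
```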

To verify the performance impact, run this script to identify how often `is_dir()` is called and whether the rechecking behavior is triggered frequently:


🏁 Script executed:

#!/bin/bash
# Description: Search for is_dir() calls and usage patterns to assess performance impact

# Find all calls to is_dir() in the codebase
echo "=== Locations calling is_dir() ==="
rg -n 'is_dir\(\)' -A 2 -B 2

# Check if there are any cache invalidation patterns
echo -e "\n=== Patterns that might invalidate _is_dir cache ==="
rg -n '_is_dir\s*=' -A 1 -B 1

# Look for any retry or latency-wait mechanisms
echo -e "\n=== Latency wait and retry mechanisms ==="
rg -n 'latency.wait|retry' -A 2 -B 2

Length of output: 4565


Separate cache initialization from value and prevent repeated S3 queries

Changing `if self._is_dir is None` to `if not self._is_dir` forces every `is_dir()` call on an empty or non-existent prefix (used in `exists`, `mtime`, `size`, `retrieve_object`, `remove`) to re-query S3, causing substantial overhead. Instead:

  • Keep the three-state logic: uninitialized (`None`), directory (`True`), not-a-directory (`False`).
  • Invalidate or expire the cache explicitly (e.g., reset `_is_dir = None` after job completion or use a time-based TTL; see the sketch after this list).
  • Or track a separate “checked” flag versus the actual boolean result.

Refactor the cache logic accordingly to avoid S3 thrashing.
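
For the time-based option, a minimal sketch of a TTL cache might look like the following; the 60-second TTL, the `TTLDirCacheSketch` class, and the `_checked_at` attribute are illustrative assumptions, not part of the plugin:

```python
import time


class TTLDirCacheSketch:
    TTL_SECONDS = 60.0  # illustrative; tune to the expected remote-update latency

    def __init__(self, list_subkeys):
        self._list_subkeys = list_subkeys  # stands in for the real S3 listing call
        self._is_dir = None
        self._checked_at = 0.0

    def is_dir(self):
        now = time.monotonic()
        stale = (now - self._checked_at) > self.TTL_SECONDS
        if self._is_dir is None or stale:  # re-check S3 once the cached answer ages out
            self._is_dir = any(self._list_subkeys())
            self._checked_at = now
        return self._is_dir
```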

🤖 Prompt for AI Agents
In snakemake_storage_plugin_s3/__init__.py around lines 329 to 331, the current
check uses "if not self._is_dir" which collapses the three-state cache into a
boolean and causes repeated S3 queries for empty/non-existent prefixes; change
the logic to preserve the uninitialized state by checking "if self._is_dir is
None" before calling self.get_subkeys(), set self._is_dir to True/False based on
the result, and implement explicit cache invalidation (reset self._is_dir =
None) after job completion or add a separate checked flag or a TTL timestamp to
avoid thrashing S3 while keeping correct cached values.
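
The separate-“checked”-flag variant mentioned above could look roughly like this; again, every name besides `_is_dir` and the `any(...)` pattern is hypothetical:

```python
class CheckedFlagDirCacheSketch:
    def __init__(self, list_subkeys):
        self._list_subkeys = list_subkeys  # stands in for the real S3 listing call
        self._is_dir = False               # the cached answer
        self._checked = False              # whether S3 has been queried at all

    def is_dir(self):
        if not self._checked:              # a cached False no longer triggers a re-query
            self._is_dir = any(self._list_subkeys())
            self._checked = True
        return self._is_dir

    def reset(self):
        # Explicit invalidation point, e.g. after job completion.
        self._checked = False
```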

