fix: Fix spurious missing output error when using AWS Batch executor #58
base: main
Conversation
📝 Walkthrough

Adjusted S3 path directory-detection cache initialization: replaced an explicit `None` check for `_is_dir` with a truthiness check, changing when the cache is populated and potentially returning `None` from `is_dir()` instead of a boolean.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant S3Path as S3Path
    participant S3 as S3 API
    Caller->>S3Path: is_dir()
    alt Cache check (new)
        note over S3Path: if not self._is_dir
        S3Path->>S3: get_subkeys()
        S3-->>S3Path: subkeys list
        S3Path->>S3Path: self._is_dir = any(subkeys)
    else Cache assumed truthy/False
        note over S3Path: Skips population if self._is_dir is None (falsy?) or False handling differs
    end
    S3Path-->>Caller: self._is_dir (may be None/True/False)
```
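To make the behavioral difference concrete, here is a minimal sketch of the before/after cache logic (simplified names; `get_subkeys` stands in for the plugin's real S3 listing call):

```python
from typing import Callable, Iterable, Optional

class S3PathSketch:
    """Toy model of the plugin's directory-detection cache."""

    def __init__(self, get_subkeys: Callable[[], Iterable[str]]):
        self.get_subkeys = get_subkeys
        self._is_dir: Optional[bool] = None  # None = never checked

    def is_dir_before(self) -> bool:
        # Pre-PR: query S3 exactly once; a cached False is never revisited,
        # so a prefix created later (e.g., by a Batch job) stays invisible.
        if self._is_dir is None:
            self._is_dir = any(self.get_subkeys())
        return self._is_dir

    def is_dir_after(self) -> bool:
        # Post-PR: any falsy cache value (None or False) triggers a re-query,
        # so the stale-False case self-heals at the cost of extra S3 calls.
        if not self._is_dir:
            self._is_dir = any(self.get_subkeys())
        return self._is_dir
```

Note that in both variants `any()` assigns a plain boolean before returning, so the diagram's "may return None" case would only arise if `_is_dir` were reset to `None` elsewhere between the check and the return.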
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_storage_plugin_s3/__init__.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.
Files:
snakemake_storage_plugin_s3/__init__.py
```python
if not self._is_dir:
    self._is_dir = any(self.get_subkeys())
return self._is_dir
```
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Rechecking S3 on every call when the directory doesn't exist creates performance overhead.
The change from `if self._is_dir is None:` to `if not self._is_dir:` solves the AWS Batch stale-cache problem by rechecking S3 whenever `_is_dir` is False. However, this means that for non-existent or empty directories, every subsequent `is_dir()` call will query S3 again, which is wasteful.
The root cause is that `_is_dir` is cached across remote state changes (e.g., when a Batch job creates files). Better alternatives might include:
- Invalidate the cache when appropriate: reset `_is_dir = None` after job completion or when `--latency-wait` expires.
- Time-based cache expiration: track when `_is_dir` was last populated and recheck after a timeout.
- Separate "checked" from "is directory" state: use a separate flag to track whether the check has been performed vs. what the result was.

Additionally, mixing a None-check (uninitialized state) with a truthiness check (value comparison) makes the caching strategy unclear. Consider explicitly handling the three states: uninitialized (`None`), not a directory (`False`), and is a directory (`True`).
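A minimal sketch of that explicit three-state handling, assuming an external invalidation point exists (`invalidate_dir_cache` is an invented hook, not an existing plugin method):

```python
from typing import Callable, Iterable, Optional

class DirCacheSketch:
    """Three-state directory cache: None = unchecked, True/False = cached."""

    def __init__(self, get_subkeys: Callable[[], Iterable[str]]):
        self.get_subkeys = get_subkeys
        self._is_dir: Optional[bool] = None

    def is_dir(self) -> bool:
        # Query S3 only in the uninitialized state; cached results stick
        # until explicitly invalidated.
        if self._is_dir is None:
            self._is_dir = any(self.get_subkeys())
        return self._is_dir

    def invalidate_dir_cache(self) -> None:
        # Hypothetical hook: call when remote state may have changed (e.g.,
        # after an AWS Batch job completes) so the next is_dir() re-queries
        # S3 instead of trusting a stale False.
        self._is_dir = None
```

Compared to the PR's truthiness check, this keeps empty prefixes cheap between invalidations while still recovering from the Batch scenario, at the cost of needing a reliable place to call the invalidation hook.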
To verify the performance impact, run this script to identify how often `is_dir()` is called and whether the rechecking behavior is triggered frequently:
🏁 Script executed:

```bash
#!/bin/bash
# Description: Search for is_dir() calls and usage patterns to assess performance impact

# Find all calls to is_dir() in the codebase
echo "=== Locations calling is_dir() ==="
rg -n 'is_dir\(\)' -A 2 -B 2

# Check if there are any cache invalidation patterns
echo -e "\n=== Patterns that might invalidate _is_dir cache ==="
rg -n '_is_dir\s*=' -A 1 -B 1

# Look for any retry or latency-wait mechanisms
echo -e "\n=== Latency wait and retry mechanisms ==="
rg -n 'latency.wait|retry' -A 2 -B 2
```

Length of output: 4565
Separate cache initialization from value and prevent repeated S3 queries
Changing `if self._is_dir is None` to `if not self._is_dir` forces every `is_dir()` call on an empty or non-existent prefix (used in `exists`, `mtime`, `size`, `retrieve_object`, `remove`) to requery S3, causing substantial overhead. Instead:
- Keep the three-state logic: uninitialized (`None`), directory (`True`), not-a-directory (`False`).
- Invalidate or expire the cache explicitly (e.g., reset `_is_dir = None` after job completion or use a time-based TTL).
- Or track a separate "checked" flag versus the actual boolean result.

Refactor the cache logic accordingly to avoid S3 thrashing; the TTL variant is sketched below.
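One way the time-based alternative might look (a sketch only; `DIR_CACHE_TTL` and `_checked_at` are invented names, and the TTL value is arbitrary):

```python
import time
from typing import Callable, Iterable, Optional

DIR_CACHE_TTL = 30.0  # seconds; would need tuning against --latency-wait

class TTLDirCache:
    def __init__(self, get_subkeys: Callable[[], Iterable[str]]):
        self._get_subkeys = get_subkeys
        self._is_dir: Optional[bool] = None
        self._checked_at = 0.0

    def is_dir(self) -> bool:
        expired = (time.monotonic() - self._checked_at) > DIR_CACHE_TTL
        # Trust a cached True (a prefix doesn't vanish mid-run once created);
        # re-check only an uninitialized or stale-False cache.
        if self._is_dir is None or (self._is_dir is False and expired):
            self._is_dir = any(self._get_subkeys())
            self._checked_at = time.monotonic()
        return self._is_dir
```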
🤖 Prompt for AI Agents
In snakemake_storage_plugin_s3/__init__.py around lines 329 to 331, the current
check uses "if not self._is_dir" which collapses the three-state cache into a
boolean and causes repeated S3 queries for empty/non-existent prefixes; change
the logic to preserve the uninitialized state by checking "if self._is_dir is
None" before calling self.get_subkeys(), set self._is_dir to True/False based on
the result, and implement explicit cache invalidation (reset self._is_dir =
None) after job completion or add a separate checked flag or a TTL timestamp to
avoid thrashing S3 while keeping correct cached values.
I made another fix that is slightly better regarding the cache invalidation issue you mentioned: #60.
I was hit with the same issue and did some digging: #61. More details are in that issue thread, but basically the reason it works when executing locally is that snakemake just uses local file paths when checking for existing/missing outputs, even if remote storage is used. This is because all files for remote storage are initially created locally, so snakemake simply checks whether the local path exists. Would be nice if they merge in a fix soon! Clearly this issue is snagging quite a few people.
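A rough illustration of that conjectured behavior (hypothetical function and parameter names, not the actual snakemake API):

```python
from pathlib import Path

def output_exists(local_path: str, storage_object) -> bool:
    # Local runs: files destined for remote storage are created locally first,
    # so this plain filesystem check succeeds and the S3 cache is never hit.
    if Path(local_path).exists():
        return True
    # Batch runs: the submitting host has no local copy, so the check falls
    # through to the storage object's directory test and its stale cache --
    # the failure mode described above.
    return storage_object.is_dir()
```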
When using this plugin in conjunction with the snakemake-executor-plugin-aws-batch and a `directory` output from a job executed on AWS Batch, I receive a spurious missing-output error. However, the expected output objects within this prefix are created as expected by the job executed on Batch. Increasing the `--latency-wait` parameter does not change the behavior.

I'm uncertain if this is really the "source" of the error, since it does not happen when running locally (still using the S3 storage plugin), but this change to recompute `_is_dir` of the `StorageObject` when it is falsy instead of just `None` seems to resolve the issue. Someone more familiar with the internals of snakemake-interface-storage-plugins/snakemake-interface-executor-plugins/snakemake may be able to point to a more appropriate place to address this issue. Querying S3 more than necessary is obviously not the most elegant solution.

I'm including a Snakefile that recreates the issue; a working AWS Batch queue/compute environment and a default S3 storage prefix are required to reproduce it. I am using the latest (as of 14 Oct 2025) versions of snakemake and the mentioned plugins, Python 3.13, on an Ubuntu 24.04 based Docker image.