2 changes: 1 addition & 1 deletion snakemake_storage_plugin_s3/__init__.py
@@ -326,7 +326,7 @@ def retrieve_object(self):
         self.s3obj().download_file(self.local_path())
 
     def is_dir(self):
-        if self._is_dir is None:
+        if not self._is_dir:
             self._is_dir = any(self.get_subkeys())
         return self._is_dir
Comment on lines +329 to 331

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Re-checking S3 on every call when the directory doesn't exist creates performance overhead.

The change from `if self._is_dir is None:` to `if not self._is_dir:` solves the AWS Batch stale-cache problem by re-checking S3 whenever `_is_dir` is `None` or `False`. However, this means that for non-existent or empty directories, every subsequent `is_dir()` call will query S3 again, which is wasteful.

The root cause is that `_is_dir` is cached across remote state changes (e.g., when a Batch job creates files). Better alternatives might include:

  1. Invalidate the cache when appropriate: Reset `_is_dir = None` after job completion or when `--latency-wait` expires.
  2. Time-based cache expiration: Track when `_is_dir` was last populated and recheck after a timeout.
  3. Separate "checked" from "is directory" state: Use a separate flag to track whether the check has been performed vs. what the result was.

Additionally, mixing a `None` check (uninitialized state) with a truthiness check (value comparison) makes the caching strategy unclear. Consider explicitly handling the three states: uninitialized (`None`), not a directory (`False`), and is a directory (`True`).
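
As a rough illustration of that three-state approach, here is a minimal sketch, assuming the only probe needed is the existing `get_subkeys()` listing; the `DirCacheSketch` class, the injected `list_subkeys` callable, and the `invalidate()` hook are hypothetical names, not the plugin's actual API:

```python
class DirCacheSketch:
    def __init__(self, list_subkeys):
        self._list_subkeys = list_subkeys  # stands in for the real S3 listing call
        self._is_dir = None                # None = unchecked, True/False = cached result

    def is_dir(self):
        if self._is_dir is None:           # query S3 only while the cache is uninitialized
            self._is_dir = any(self._list_subkeys())
        return self._is_dir

    def invalidate(self):
        # Reset after a remote state change (e.g. a Batch job wrote new keys)
        # so the next is_dir() call re-checks S3.
        self._is_dir = None


# An empty prefix is cached as False and not re-queried until invalidated.
obj = DirCacheSketch(lambda: iter([]))
assert obj.is_dir() is False
obj.invalidate()  # e.g. after --latency-wait expires, force a fresh check
```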

To verify the performance impact, run this script to identify how often `is_dir()` is called and whether the rechecking behavior is triggered frequently:


🏁 Script executed:

#!/bin/bash
# Description: Search for is_dir() calls and usage patterns to assess performance impact

# Find all calls to is_dir() in the codebase
echo "=== Locations calling is_dir() ==="
rg -n 'is_dir\(\)' -A 2 -B 2

# Check if there are any cache invalidation patterns
echo -e "\n=== Patterns that might invalidate _is_dir cache ==="
rg -n '_is_dir\s*=' -A 1 -B 1

# Look for any retry or latency-wait mechanisms
echo -e "\n=== Latency wait and retry mechanisms ==="
rg -n 'latency.wait|retry' -A 2 -B 2

Length of output: 4565


Separate cache initialization from value and prevent repeated S3 queries

Changing `if self._is_dir is None` to `if not self._is_dir` forces every `is_dir()` call on an empty or non-existent prefix (used in `exists`, `mtime`, `size`, `retrieve_object`, `remove`) to re-query S3, causing substantial overhead. Instead:

  • Keep the three-state logic: uninitialized (`None`), directory (`True`), not-a-directory (`False`).
  • Invalidate or expire the cache explicitly (e.g., reset `_is_dir = None` after job completion or use a time-based TTL; see the sketch after this list).
  • Or track a separate “checked” flag versus the actual boolean result.

Refactor the cache logic accordingly to avoid S3 thrashing.
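
For the time-based option, a minimal sketch of a TTL cache might look like the following; the 60-second TTL, the `TTLDirCacheSketch` class, and the `_checked_at` attribute are illustrative assumptions, not part of the plugin:

```python
import time


class TTLDirCacheSketch:
    TTL_SECONDS = 60.0  # illustrative; tune to the expected remote-update latency

    def __init__(self, list_subkeys):
        self._list_subkeys = list_subkeys  # stands in for the real S3 listing call
        self._is_dir = None
        self._checked_at = 0.0

    def is_dir(self):
        now = time.monotonic()
        stale = (now - self._checked_at) > self.TTL_SECONDS
        if self._is_dir is None or stale:  # re-check S3 once the cached answer ages out
            self._is_dir = any(self._list_subkeys())
            self._checked_at = now
        return self._is_dir
```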

🤖 Prompt for AI Agents
In snakemake_storage_plugin_s3/__init__.py around lines 329 to 331, the current
check uses "if not self._is_dir" which collapses the three-state cache into a
boolean and causes repeated S3 queries for empty/non-existent prefixes; change
the logic to preserve the uninitialized state by checking "if self._is_dir is
None" before calling self.get_subkeys(), set self._is_dir to True/False based on
the result, and implement explicit cache invalidation (reset self._is_dir =
None) after job completion or add a separate checked flag or a TTL timestamp to
avoid thrashing S3 while keeping correct cached values.
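
The separate-“checked”-flag variant mentioned above could look roughly like this; again, every name besides `_is_dir` and the `any(...)` pattern is hypothetical:

```python
class CheckedFlagDirCacheSketch:
    def __init__(self, list_subkeys):
        self._list_subkeys = list_subkeys  # stands in for the real S3 listing call
        self._is_dir = False               # the cached answer
        self._checked = False              # whether S3 has been queried at all

    def is_dir(self):
        if not self._checked:              # a cached False no longer triggers a re-query
            self._is_dir = any(self._list_subkeys())
            self._checked = True
        return self._is_dir

    def reset(self):
        # Explicit invalidation point, e.g. after job completion.
        self._checked = False
```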

