fix: directory detection with remote executor #60

gilbsgilbs · 2025-11-13T15:03:37Z

Previously, directories created during a Snakemake rule were initially cached as non-directories due to their nascent state in storage. Using a local executor, this would not be an issue because the cache would be invalidated during upload. However, this is not the case with remote executors. Therefore, Snakemake's would incorrectly report these directories as non-existent, causing rule failure for no reason.

This commit resolves the problem by deferring the caching of file types until their existence and type (file or directory) are definitive. This ensures that the plugin always returns a definitive state.

Summary by CodeRabbit

New Features
- Added a direct check to determine whether remote S3 objects are files.
Bug Fixes
- More reliable detection of file vs. directory for S3 storage objects.
- Improved object existence checks for better validation and fewer false negatives.
Refactor
- Internal status tracking streamlined for consistent behavior across store/retrieve operations.

coderabbitai · 2025-11-13T15:04:17Z

📝 Walkthrough

Walkthrough

The PR replaces the boolean _is_dir flag in StorageObject with an internal _FileType enum (UNKNOWN/FILE/DIRECTORY), adds is_file(), refactors is_dir() and exists() to use the enum and S3 probes, and updates store_object() and retrieve_object() to maintain the new type cache.

Changes

Cohort / File(s)	Summary
File type state management `snakemake_storage_plugin_s3/__init__.py`	Added `_FileType` enum (UNKNOWN/FILE/DIRECTORY) and replaced `_is_dir` with `_file_type` initialized to `_FileType.UNKNOWN` in `__post_init__()`
Type detection methods `snakemake_storage_plugin_s3/__init__.py`	Added `is_file()` method; reworked `is_dir()` to use `_file_type`, subkey/bucket checks and update cache; simplified `exists()` to return `is_file() or is_dir()`
Storage operations `snakemake_storage_plugin_s3/__init__.py`	`store_object()` now sets `_file_type` to DIRECTORY or FILE based on local path; `retrieve_object()` uses `is_file()`/`is_dir()` flow and updated type-tracking

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant StorageObject
    participant S3 as S3 API

    note over StorageObject: New type-discriminator flow

    Caller->>StorageObject: exists()
    StorageObject->>StorageObject: is_file() or is_dir()
    alt cached as FILE
        StorageObject-->>Caller: True (file)
    else cached as DIRECTORY
        StorageObject-->>Caller: True (directory)
    else UNKNOWN
        StorageObject->>S3: probe via s3obj().load() / list subkeys
        S3-->>StorageObject: object exists / subkeys present / 404
        StorageObject->>StorageObject: update _file_type (FILE/DIRECTORY/UNKNOWN)
        StorageObject-->>Caller: True/False
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas needing extra attention:

is_file() probing and exception handling around s3obj().load()
is_dir() logic for subkey listing and bucket-existence checks
Consistency of _file_type updates across store_object() and retrieve_object()

Possibly related PRs

fix: fix retrieval of nested directories #27 — Modifies retrieve_object and directory-handling logic; closely related to the new type-discriminator and nested-directory retrieval behavior.

Suggested reviewers

johanneskoester

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'fix: directory detection with remote executor' directly aligns with the main objective of fixing directory detection issues specifically when using remote executors.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d067f40 and 7e23eec.

📒 Files selected for processing (1)

snakemake_storage_plugin_s3/__init__.py (6 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

snakemake_storage_plugin_s3/__init__.py

🪛 Ruff (0.14.4)

snakemake_storage_plugin_s3/__init__.py

326-331: Avoid specifying long messages outside the exception class

(TRY003)

335-335: Use raise without specifying exception name

Remove exception name

(TRY201)

snakemake_storage_plugin_s3/__init__.py

Previously, directories created during a Snakemake rule were initially cached as non-directories due to their nascent state in storage. Using a local executor, this would not be an issue because the cache would be invalidated during upload. However, this is not the case with remote executors. Therefore, Snakemake's would incorrectly report these directories as non-existent, causing rule failure for no reason. This commit resolves the problem by deferring the caching of file types until their existence and type (file or directory) are definitive. This ensures that the plugin always returns a definitive state.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

snakemake_storage_plugin_s3/__init__.py (1)
335-335: Use bare raise to preserve traceback.

Replace raise e with bare raise to preserve the full exception traceback for debugging.

As per coding guidelines and static analysis hints.
             else:
-                raise e
+                raise

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e23eec and 7c8acfe.

📒 Files selected for processing (1)

snakemake_storage_plugin_s3/__init__.py (6 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

snakemake_storage_plugin_s3/__init__.py

🪛 Ruff (0.14.4)

snakemake_storage_plugin_s3/__init__.py

326-331: Avoid specifying long messages outside the exception class

(TRY003)

335-335: Use raise without specifying exception name

Remove exception name

(TRY201)

🔇 Additional comments (5)

snakemake_storage_plugin_s3/__init__.py (5)

1-1: LGTM! Clean enum-based type tracking.

The introduction of _FileType enum with UNKNOWN/FILE/DIRECTORY states provides a cleaner three-state type system that enables deferred caching as intended by the fix.

Also applies to: 211-215

236-236: LGTM! Proper initialization.

Initializing _file_type to UNKNOWN correctly defers type determination until the first definitive check.

288-288: LGTM! Clean delegation to type-checking methods.

The simplified implementation correctly leverages short-circuit evaluation and the internal cache to minimize S3 API calls.

340-348: LGTM! Correct directory detection logic.

The method correctly identifies S3 directories by checking for subkeys with the object's key as a prefix, and properly caches the result when definitive.

374-383: LGTM! Proper cache updates after upload.

Correctly caches the file type as DIRECTORY or FILE after upload, when the type is definitively known. This aligns with the PR's objective to cache only when the state is definitive.

coderabbitai · 2025-11-13T15:13:04Z

snakemake_storage_plugin_s3/__init__.py

+        try:
+            self.s3obj().load()
+        except botocore.exceptions.ClientError as e:
+            err_code = e.response["Error"]["Code"]
+            if err_code == "400":
+                raise WorkflowError(
+                    f"Bad request for S3 object {self.query}. This can happen if "
+                    "the query contains unsupported characters (e.g. '..'), "
+                    "if your credentials are invalid, or if your permissions are "
+                    "insufficient."
+                ) from e
+            elif err_code == "404":
+                return False
+            else:
+                raise e


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify boto3 error codes for s3obj().load().

The error code checks compare against "400" and "404" strings. However, boto3's ClientError.response["Error"]["Code"] typically returns semantic error codes (e.g., "NoSuchKey", "BadRequest") rather than HTTP status codes. Please verify that s3obj().load() actually returns these numeric string codes, or update to check the appropriate error codes.

🏁 Script executed:

#!/bin/bash # Description: Search for boto3 error code handling patterns in the codebase # to verify the expected error code format rg -nP --type=py -C3 'response\["Error"\]\["Code"\]'

Length of output: 610

🌐 Web query:

What error codes does boto3 S3 Object.load() return for 404 not found and 400 bad request?

💡 Result:

404 Not Found: boto3 raises a botocore.exceptions.ClientError whose e.response['Error']['Code'] is usually "404" (and S3 may also return "NoSuchKey"). [1][2]

400 Bad Request: boto3 raises ClientError with e.response['Error']['Code'] equal to "400" (or a more specific S3 error code such as "BadRequest", "InvalidArgument", "AuthorizationHeaderMalformed", etc., depending on the exact problem). [2][3]

Sources:
[1] Stack Overflow: examples showing e.response['Error']['Code'] == '404' or 'NoSuchKey'.
[2] AWS S3 HeadObject / Error responses (lists 404/400 behaviors and NoSuchKey / various 4xx error codes).
[3] Boto3 head_object docs (HEAD can return generic 400/403/404 codes).

Handle "NoSuchKey" error code for missing S3 objects.

While boto3 does return "404" and "400" numeric strings as shown in the code, S3 also returns semantic error codes. Most critically, missing objects typically return "NoSuchKey" instead of "404". The current code only checks for numeric codes, so it won't catch the "NoSuchKey" case and will incorrectly re-raise the exception when the object doesn't exist.

Add a check for "NoSuchKey" alongside the "404" check:

elif err_code in ("404", "NoSuchKey"): return False

Additionally, line 335 uses raise e which should be just raise to preserve the original traceback.

🧰 Tools

🪛 Ruff (0.14.4)

326-331: Avoid specifying long messages outside the exception class

(TRY003)

335-335: Use raise without specifying exception name

Remove exception name

(TRY201)

🤖 Prompt for AI Agents

In snakemake_storage_plugin_s3/__init__.py around lines 321 to 335, the exception handling for missing S3 objects only checks numeric error codes and re-raises exceptions with "raise e", which loses the original traceback; update the logic to treat both "404" and the semantic "NoSuchKey" as non-existent objects (i.e., return False when err_code in ("404", "NoSuchKey")), and change the final re-raise to use a bare "raise" so the original traceback is preserved.

gilbsgilbs · 2025-11-13T15:42:47Z

Fixes the same issue as #58 .

coderabbitai bot reviewed Nov 13, 2025

View reviewed changes

snakemake_storage_plugin_s3/__init__.py Show resolved Hide resolved

gilbsgilbs force-pushed the fix-directories-detection branch from 7e23eec to 7c8acfe Compare November 13, 2025 15:08

coderabbitai bot reviewed Nov 13, 2025

View reviewed changes

gilbsgilbs mentioned this pull request Nov 13, 2025

fix: Fix spurious missing output error when using AWS Batch executor #58

Open

deto mentioned this pull request Nov 21, 2025

fix: Remove _is_dir caching, use .snakemake-timestamp instead #63

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: directory detection with remote executor #60

fix: directory detection with remote executor #60

Uh oh!

gilbsgilbs commented Nov 13, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 13, 2025 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 13, 2025

Uh oh!

gilbsgilbs commented Nov 13, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fix: directory detection with remote executor #60

Are you sure you want to change the base?

fix: directory detection with remote executor #60

Uh oh!

Conversation

gilbsgilbs commented Nov 13, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 13, 2025

Choose a reason for hiding this comment

Uh oh!

gilbsgilbs commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gilbsgilbs commented Nov 13, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 13, 2025 •

edited

Loading

gilbsgilbs commented Nov 13, 2025 •

edited

Loading