Skip to content

Conversation

@gilbsgilbs
Copy link

@gilbsgilbs gilbsgilbs commented Nov 13, 2025

Previously, directories created during a Snakemake rule were initially cached as non-directories due to their nascent state in storage. Using a local executor, this would not be an issue because the cache would be invalidated during upload. However, this is not the case with remote executors. Therefore, Snakemake's would incorrectly report these directories as non-existent, causing rule failure for no reason.

This commit resolves the problem by deferring the caching of file types until their existence and type (file or directory) are definitive. This ensures that the plugin always returns a definitive state.

Summary by CodeRabbit

  • New Features

    • Added a direct check to determine whether remote S3 objects are files.
  • Bug Fixes

    • More reliable detection of file vs. directory for S3 storage objects.
    • Improved object existence checks for better validation and fewer false negatives.
  • Refactor

    • Internal status tracking streamlined for consistent behavior across store/retrieve operations.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 13, 2025

📝 Walkthrough

Walkthrough

The PR replaces the boolean _is_dir flag in StorageObject with an internal _FileType enum (UNKNOWN/FILE/DIRECTORY), adds is_file(), refactors is_dir() and exists() to use the enum and S3 probes, and updates store_object() and retrieve_object() to maintain the new type cache.

Changes

Cohort / File(s) Summary
File type state management
snakemake_storage_plugin_s3/__init__.py
Added _FileType enum (UNKNOWN/FILE/DIRECTORY) and replaced _is_dir with _file_type initialized to _FileType.UNKNOWN in __post_init__()
Type detection methods
snakemake_storage_plugin_s3/__init__.py
Added is_file() method; reworked is_dir() to use _file_type, subkey/bucket checks and update cache; simplified exists() to return is_file() or is_dir()
Storage operations
snakemake_storage_plugin_s3/__init__.py
store_object() now sets _file_type to DIRECTORY or FILE based on local path; retrieve_object() uses is_file()/is_dir() flow and updated type-tracking

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant StorageObject
    participant S3 as S3 API

    note over StorageObject: New type-discriminator flow

    Caller->>StorageObject: exists()
    StorageObject->>StorageObject: is_file() or is_dir()
    alt cached as FILE
        StorageObject-->>Caller: True (file)
    else cached as DIRECTORY
        StorageObject-->>Caller: True (directory)
    else UNKNOWN
        StorageObject->>S3: probe via s3obj().load() / list subkeys
        S3-->>StorageObject: object exists / subkeys present / 404
        StorageObject->>StorageObject: update _file_type (FILE/DIRECTORY/UNKNOWN)
        StorageObject-->>Caller: True/False
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas needing extra attention:

  • is_file() probing and exception handling around s3obj().load()
  • is_dir() logic for subkey listing and bucket-existence checks
  • Consistency of _file_type updates across store_object() and retrieve_object()

Possibly related PRs

Suggested reviewers

  • johanneskoester

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix: directory detection with remote executor' directly aligns with the main objective of fixing directory detection issues specifically when using remote executors.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d067f40 and 7e23eec.

📒 Files selected for processing (1)
  • snakemake_storage_plugin_s3/__init__.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • snakemake_storage_plugin_s3/__init__.py
🪛 Ruff (0.14.4)
snakemake_storage_plugin_s3/__init__.py

326-331: Avoid specifying long messages outside the exception class

(TRY003)


335-335: Use raise without specifying exception name

Remove exception name

(TRY201)

Previously, directories created during a Snakemake rule were initially
cached as non-directories due to their nascent state in storage. Using a
local executor, this would not be an issue because the cache would be
invalidated during upload. However, this is not the case with remote
executors. Therefore, Snakemake's would incorrectly report these
directories as non-existent, causing rule failure for no reason.

This commit resolves the problem by deferring the caching of file types
until their existence and type (file or directory) are definitive. This
ensures that the plugin always returns a definitive state.
@gilbsgilbs gilbsgilbs force-pushed the fix-directories-detection branch from 7e23eec to 7c8acfe Compare November 13, 2025 15:08
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
snakemake_storage_plugin_s3/__init__.py (1)

335-335: Use bare raise to preserve traceback.

Replace raise e with bare raise to preserve the full exception traceback for debugging.

As per coding guidelines and static analysis hints.

             else:
-                raise e
+                raise
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e23eec and 7c8acfe.

📒 Files selected for processing (1)
  • snakemake_storage_plugin_s3/__init__.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • snakemake_storage_plugin_s3/__init__.py
🪛 Ruff (0.14.4)
snakemake_storage_plugin_s3/__init__.py

326-331: Avoid specifying long messages outside the exception class

(TRY003)


335-335: Use raise without specifying exception name

Remove exception name

(TRY201)

🔇 Additional comments (5)
snakemake_storage_plugin_s3/__init__.py (5)

1-1: LGTM! Clean enum-based type tracking.

The introduction of _FileType enum with UNKNOWN/FILE/DIRECTORY states provides a cleaner three-state type system that enables deferred caching as intended by the fix.

Also applies to: 211-215


236-236: LGTM! Proper initialization.

Initializing _file_type to UNKNOWN correctly defers type determination until the first definitive check.


288-288: LGTM! Clean delegation to type-checking methods.

The simplified implementation correctly leverages short-circuit evaluation and the internal cache to minimize S3 API calls.


340-348: LGTM! Correct directory detection logic.

The method correctly identifies S3 directories by checking for subkeys with the object's key as a prefix, and properly caches the result when definitive.


374-383: LGTM! Proper cache updates after upload.

Correctly caches the file type as DIRECTORY or FILE after upload, when the type is definitively known. This aligns with the PR's objective to cache only when the state is definitive.

Comment on lines +321 to +335
try:
self.s3obj().load()
except botocore.exceptions.ClientError as e:
err_code = e.response["Error"]["Code"]
if err_code == "400":
raise WorkflowError(
f"Bad request for S3 object {self.query}. This can happen if "
"the query contains unsupported characters (e.g. '..'), "
"if your credentials are invalid, or if your permissions are "
"insufficient."
) from e
elif err_code == "404":
return False
else:
raise e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify boto3 error codes for s3obj().load().

The error code checks compare against "400" and "404" strings. However, boto3's ClientError.response["Error"]["Code"] typically returns semantic error codes (e.g., "NoSuchKey", "BadRequest") rather than HTTP status codes. Please verify that s3obj().load() actually returns these numeric string codes, or update to check the appropriate error codes.


🏁 Script executed:

#!/bin/bash
# Description: Search for boto3 error code handling patterns in the codebase
# to verify the expected error code format

rg -nP --type=py -C3 'response\["Error"\]\["Code"\]'

Length of output: 610


🌐 Web query:

What error codes does boto3 S3 Object.load() return for 404 not found and 400 bad request?

💡 Result:

  • 404 Not Found: boto3 raises a botocore.exceptions.ClientError whose e.response['Error']['Code'] is usually "404" (and S3 may also return "NoSuchKey"). [1][2]

  • 400 Bad Request: boto3 raises ClientError with e.response['Error']['Code'] equal to "400" (or a more specific S3 error code such as "BadRequest", "InvalidArgument", "AuthorizationHeaderMalformed", etc., depending on the exact problem). [2][3]

Sources:
[1] Stack Overflow: examples showing e.response['Error']['Code'] == '404' or 'NoSuchKey'.
[2] AWS S3 HeadObject / Error responses (lists 404/400 behaviors and NoSuchKey / various 4xx error codes).
[3] Boto3 head_object docs (HEAD can return generic 400/403/404 codes).


Handle "NoSuchKey" error code for missing S3 objects.

While boto3 does return "404" and "400" numeric strings as shown in the code, S3 also returns semantic error codes. Most critically, missing objects typically return "NoSuchKey" instead of "404". The current code only checks for numeric codes, so it won't catch the "NoSuchKey" case and will incorrectly re-raise the exception when the object doesn't exist.

Add a check for "NoSuchKey" alongside the "404" check:

elif err_code in ("404", "NoSuchKey"):
    return False

Additionally, line 335 uses raise e which should be just raise to preserve the original traceback.

🧰 Tools
🪛 Ruff (0.14.4)

326-331: Avoid specifying long messages outside the exception class

(TRY003)


335-335: Use raise without specifying exception name

Remove exception name

(TRY201)

🤖 Prompt for AI Agents
In snakemake_storage_plugin_s3/__init__.py around lines 321 to 335, the
exception handling for missing S3 objects only checks numeric error codes and
re-raises exceptions with "raise e", which loses the original traceback; update
the logic to treat both "404" and the semantic "NoSuchKey" as non-existent
objects (i.e., return False when err_code in ("404", "NoSuchKey")), and change
the final re-raise to use a bare "raise" so the original traceback is preserved.

@gilbsgilbs
Copy link
Author

gilbsgilbs commented Nov 13, 2025

Fixes the same issue as #58 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant