-
Notifications
You must be signed in to change notification settings - Fork 16
fix: directory detection with remote executor #60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughThe PR replaces the boolean Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant StorageObject
participant S3 as S3 API
note over StorageObject: New type-discriminator flow
Caller->>StorageObject: exists()
StorageObject->>StorageObject: is_file() or is_dir()
alt cached as FILE
StorageObject-->>Caller: True (file)
else cached as DIRECTORY
StorageObject-->>Caller: True (directory)
else UNKNOWN
StorageObject->>S3: probe via s3obj().load() / list subkeys
S3-->>StorageObject: object exists / subkeys present / 404
StorageObject->>StorageObject: update _file_type (FILE/DIRECTORY/UNKNOWN)
StorageObject-->>Caller: True/False
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Areas needing extra attention:
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_storage_plugin_s3/__init__.py(6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
Files:
snakemake_storage_plugin_s3/__init__.py
🪛 Ruff (0.14.4)
snakemake_storage_plugin_s3/__init__.py
326-331: Avoid specifying long messages outside the exception class
(TRY003)
335-335: Use raise without specifying exception name
Remove exception name
(TRY201)
Previously, directories created during a Snakemake rule were initially cached as non-directories due to their nascent state in storage. Using a local executor, this would not be an issue because the cache would be invalidated during upload. However, this is not the case with remote executors. Therefore, Snakemake's would incorrectly report these directories as non-existent, causing rule failure for no reason. This commit resolves the problem by deferring the caching of file types until their existence and type (file or directory) are definitive. This ensures that the plugin always returns a definitive state.
7e23eec to
7c8acfe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
snakemake_storage_plugin_s3/__init__.py (1)
335-335: Use bareraiseto preserve traceback.Replace
raise ewith bareraiseto preserve the full exception traceback for debugging.As per coding guidelines and static analysis hints.
else: - raise e + raise
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_storage_plugin_s3/__init__.py(6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
Files:
snakemake_storage_plugin_s3/__init__.py
🪛 Ruff (0.14.4)
snakemake_storage_plugin_s3/__init__.py
326-331: Avoid specifying long messages outside the exception class
(TRY003)
335-335: Use raise without specifying exception name
Remove exception name
(TRY201)
🔇 Additional comments (5)
snakemake_storage_plugin_s3/__init__.py (5)
1-1: LGTM! Clean enum-based type tracking.The introduction of
_FileTypeenum with UNKNOWN/FILE/DIRECTORY states provides a cleaner three-state type system that enables deferred caching as intended by the fix.Also applies to: 211-215
236-236: LGTM! Proper initialization.Initializing
_file_typetoUNKNOWNcorrectly defers type determination until the first definitive check.
288-288: LGTM! Clean delegation to type-checking methods.The simplified implementation correctly leverages short-circuit evaluation and the internal cache to minimize S3 API calls.
340-348: LGTM! Correct directory detection logic.The method correctly identifies S3 directories by checking for subkeys with the object's key as a prefix, and properly caches the result when definitive.
374-383: LGTM! Proper cache updates after upload.Correctly caches the file type as
DIRECTORYorFILEafter upload, when the type is definitively known. This aligns with the PR's objective to cache only when the state is definitive.
| try: | ||
| self.s3obj().load() | ||
| except botocore.exceptions.ClientError as e: | ||
| err_code = e.response["Error"]["Code"] | ||
| if err_code == "400": | ||
| raise WorkflowError( | ||
| f"Bad request for S3 object {self.query}. This can happen if " | ||
| "the query contains unsupported characters (e.g. '..'), " | ||
| "if your credentials are invalid, or if your permissions are " | ||
| "insufficient." | ||
| ) from e | ||
| elif err_code == "404": | ||
| return False | ||
| else: | ||
| raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify boto3 error codes for s3obj().load().
The error code checks compare against "400" and "404" strings. However, boto3's ClientError.response["Error"]["Code"] typically returns semantic error codes (e.g., "NoSuchKey", "BadRequest") rather than HTTP status codes. Please verify that s3obj().load() actually returns these numeric string codes, or update to check the appropriate error codes.
🏁 Script executed:
#!/bin/bash
# Description: Search for boto3 error code handling patterns in the codebase
# to verify the expected error code format
rg -nP --type=py -C3 'response\["Error"\]\["Code"\]'Length of output: 610
🌐 Web query:
What error codes does boto3 S3 Object.load() return for 404 not found and 400 bad request?
💡 Result:
-
404 Not Found: boto3 raises a botocore.exceptions.ClientError whose e.response['Error']['Code'] is usually "404" (and S3 may also return "NoSuchKey"). [1][2]
-
400 Bad Request: boto3 raises ClientError with e.response['Error']['Code'] equal to "400" (or a more specific S3 error code such as "BadRequest", "InvalidArgument", "AuthorizationHeaderMalformed", etc., depending on the exact problem). [2][3]
Sources:
[1] Stack Overflow: examples showing e.response['Error']['Code'] == '404' or 'NoSuchKey'.
[2] AWS S3 HeadObject / Error responses (lists 404/400 behaviors and NoSuchKey / various 4xx error codes).
[3] Boto3 head_object docs (HEAD can return generic 400/403/404 codes).
Handle "NoSuchKey" error code for missing S3 objects.
While boto3 does return "404" and "400" numeric strings as shown in the code, S3 also returns semantic error codes. Most critically, missing objects typically return "NoSuchKey" instead of "404". The current code only checks for numeric codes, so it won't catch the "NoSuchKey" case and will incorrectly re-raise the exception when the object doesn't exist.
Add a check for "NoSuchKey" alongside the "404" check:
elif err_code in ("404", "NoSuchKey"):
return FalseAdditionally, line 335 uses raise e which should be just raise to preserve the original traceback.
🧰 Tools
🪛 Ruff (0.14.4)
326-331: Avoid specifying long messages outside the exception class
(TRY003)
335-335: Use raise without specifying exception name
Remove exception name
(TRY201)
🤖 Prompt for AI Agents
In snakemake_storage_plugin_s3/__init__.py around lines 321 to 335, the
exception handling for missing S3 objects only checks numeric error codes and
re-raises exceptions with "raise e", which loses the original traceback; update
the logic to treat both "404" and the semantic "NoSuchKey" as non-existent
objects (i.e., return False when err_code in ("404", "NoSuchKey")), and change
the final re-raise to use a bare "raise" so the original traceback is preserved.
|
Fixes the same issue as #58 . |
Previously, directories created during a Snakemake rule were initially cached as non-directories due to their nascent state in storage. Using a local executor, this would not be an issue because the cache would be invalidated during upload. However, this is not the case with remote executors. Therefore, Snakemake's would incorrectly report these directories as non-existent, causing rule failure for no reason.
This commit resolves the problem by deferring the caching of file types until their existence and type (file or directory) are definitive. This ensures that the plugin always returns a definitive state.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor