Skip to content

Conversation

@johanneskoester
Copy link
Contributor

@johanneskoester johanneskoester commented Oct 31, 2025

fixes #34

Summary by CodeRabbit

  • New Features

    • Added local storage footprint reporting to show size for files and directories (recursively) and zero for on-demand eligible items.
  • Bug Fixes

    • Improved file transfer behavior with more explicit transfer options for reliable recursive copying, timestamps, symlink handling, and permissions/ownership preservation.

…opied (avoiding that they point to non-existing files)
@coderabbitai
Copy link

coderabbitai bot commented Oct 31, 2025

📝 Walkthrough

Walkthrough

Adds StorageObject.local_footprint() to report zero for on‑demand eligible objects, directory recursive size for directories, or object size otherwise; and replaces compact rsync flags with explicit --recursive --times --copy-links in both retrieve_object() and store_object().

Changes

Cohort / File(s) Summary
StorageObject implementation
snakemake_storage_plugin_fs/__init__.py
Added local_footprint(self) -> int: returns 0 if on‑demand eligible, total recursive size for directories, or the object’s size for files. Changed retrieve_object() rsync invocation from -av to explicit flags --recursive --times --copy-links. Changed store_object() rsync invocation from -av + permission flags to --recursive --times --copy-links.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant StorageObject
    participant Rsync as "rsync (local/remote)"

    Caller->>StorageObject: request retrieve_object(path)
    alt on-demand eligible
        StorageObject->>Caller: return (no local transfer)
    else not on-demand
        StorageObject->>Rsync: run rsync --recursive --times --copy-links
        Rsync-->>StorageObject: transfer result
        StorageObject-->>Caller: return result
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify correctness and edge cases of local_footprint() (symlinks, permissions, large directories).
  • Confirm rsync flag changes preserve intended permission/ownership behavior in target environments.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "fix: be more explicit about rsync args and ensure that symlinks are copied (avoiding that they point to non-existing files)" directly and clearly summarizes the primary change in the changeset. The changes show that rsync options were switched from compact -av format to explicit options (--recursive, --times, --copy-links) in both retrieve_object and store_object methods, and the --copy-links option specifically addresses the symlink handling mentioned in the title. While the changeset also includes the addition of a new local_footprint method, the title appropriately focuses on the central fix being addressed, which is consistent with the principle that titles don't need to cover every detail of a changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/rsync-args

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eda4270 and 1450666.

📒 Files selected for processing (1)
  • snakemake_storage_plugin_fs/__init__.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • snakemake_storage_plugin_fs/__init__.py
🧠 Learnings (1)
📚 Learning: 2025-03-20T08:35:40.800Z
Learnt from: johanneskoester
Repo: snakemake/snakemake-storage-plugin-fs PR: 32
File: snakemake_storage_plugin_fs/__init__.py:202-210
Timestamp: 2025-03-20T08:35:40.800Z
Learning: The `is_ondemand_eligible` attribute used in `snakemake_storage_plugin_fs` is defined and initialized in the base classes from `snakemake_interface_storage_plugins.storage_object` package, so it doesn't need to be explicitly initialized in the derived classes.

Applied to files:

  • snakemake_storage_plugin_fs/__init__.py
🔇 Additional comments (2)
snakemake_storage_plugin_fs/__init__.py (2)

226-232: Verify that the removal of permission/ownership preservation is intentional.

The explicit rsync options correctly implement --copy-links to address the PR objective. However, the previous -a flag (archive mode) also preserved permissions, group, and owner (-p, -g, -o), which are now absent.

For store_object() there's an explicit comment explaining why permissions should NOT be preserved (to inherit setgid from destination), but retrieve_object() has no such justification. Please confirm whether omitting permission preservation during retrieval is intentional, or if --perms should be added to preserve the source file permissions.


255-259: LGTM: Explicit options align with documented intent.

The explicit rsync options correctly implement the behavior described in the comments above (lines 247-249, 253-254): allowing destination permissions and ownership to be inherited (e.g., setgid) while using --copy-links to ensure symlinks are resolved to actual content.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55fe634 and eda4270.

📒 Files selected for processing (1)
  • snakemake_storage_plugin_fs/__init__.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • snakemake_storage_plugin_fs/__init__.py
🧠 Learnings (1)
📚 Learning: 2025-03-20T08:35:40.800Z
Learnt from: johanneskoester
Repo: snakemake/snakemake-storage-plugin-fs PR: 32
File: snakemake_storage_plugin_fs/__init__.py:202-210
Timestamp: 2025-03-20T08:35:40.800Z
Learning: The `is_ondemand_eligible` attribute used in `snakemake_storage_plugin_fs` is defined and initialized in the base classes from `snakemake_interface_storage_plugins.storage_object` package, so it doesn't need to be explicitly initialized in the derived classes.

Applied to files:

  • snakemake_storage_plugin_fs/__init__.py
🔇 Additional comments (2)
snakemake_storage_plugin_fs/__init__.py (2)

225-237: LGTM! The explicit rsync options align with PR objectives.

The switch to explicit options, particularly --copy-links, ensures that symlinks are dereferenced and their content is copied, preventing broken symlinks at the destination. This directly addresses the PR's stated goal.

Note: The -D flag (preserve device and special files) from the original -a option is not included. This is likely intentional for safety and simplicity, but verify if device/special file preservation is needed for your use case.


253-264: LGTM! The explicit options correctly preserve destination permissions.

The explicit rsync options are well-chosen:

  • --copy-links dereferences symlinks (aligns with PR objective)
  • Omission of --perms, --group, --owner allows inheritance from destination (correctly implements the setgid behavior mentioned in the comment)
  • --recursive and --times provide essential functionality

Comment on lines +204 to +208
if self.query_path.is_dir():
# If the object is a directory, we have to count all files in it.
return sum(
f.stat().st_size for f in self.query_path.rglob("*") if f.is_file()
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Consider symlink handling in directory traversal.

The rglob("*") call follows symlinks by default, which could lead to:

  • Counting files outside the intended directory tree
  • Infinite loops if circular symlinks exist
  • Inconsistent footprint calculations if symlinks point to large external directories

Since the PR explicitly addresses symlink handling, consider whether this method should follow symlinks or not. If symlinks should not be followed, apply this diff:

         if self.query_path.is_dir():
             # If the object is a directory, we have to count all files in it.
             return sum(
-                f.stat().st_size for f in self.query_path.rglob("*") if f.is_file()
+                f.stat(follow_symlinks=False).st_size 
+                for f in self.query_path.rglob("*") 
+                if f.is_file() and not f.is_symlink()
             )

Alternatively, if the intent is to follow symlinks but avoid infinite loops, consider using os.walk with followlinks=False or tracking visited inodes.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if self.query_path.is_dir():
# If the object is a directory, we have to count all files in it.
return sum(
f.stat().st_size for f in self.query_path.rglob("*") if f.is_file()
)
if self.query_path.is_dir():
# If the object is a directory, we have to count all files in it.
return sum(
f.stat().st_size for f in self.query_path.rglob("*") if f.is_file() and not f.is_symlink()
)
🤖 Prompt for AI Agents
In snakemake_storage_plugin_fs/__init__.py around lines 204 to 208, the
directory size calculation uses self.query_path.rglob("*") which follows
symlinks by default and can count files outside the tree or cause loops; update
the traversal to avoid following symlinks by skipping symlink entries when
summing sizes (e.g., only include entries where f.is_file() and not
f.is_symlink()), or replace rglob with os.walk(..., followlinks=False) and sum
regular file sizes, and ensure you also handle broken symlinks safely to prevent
exceptions.

@johanneskoester johanneskoester changed the title fix: be more explicit about rsync args and ensure that symlinks are copied (avoiding that they point to non-existing files) fix: be more explicit about rsync args and ensure that symlinks are copied (avoiding that they point to non-existing files); to not copy permissions and ownership (thanks @landerini) Oct 31, 2025
@johanneskoester johanneskoester merged commit 505d868 into main Oct 31, 2025
7 of 8 checks passed
@johanneskoester johanneskoester deleted the fix/rsync-args branch October 31, 2025 08:18
johanneskoester pushed a commit that referenced this pull request Oct 31, 2025
🤖 I have created a release *beep* *boop*
---


##
[1.1.3](v1.1.2...v1.1.3)
(2025-10-31)


### Bug Fixes

* be more explicit about rsync args and ensure that symlinks are copied
(avoiding that they point to non-existing files); to not copy
permissions and ownership (thanks
[@landerini](https://github.com/landerini))
([#38](#38))
([505d868](505d868))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Please (optionally?) ignore user and group of the original file in the shared file system when retrieving

2 participants