Transfer files from DWD HTTPS server to Source Co-Op using rclone copyurl --urls #406

Merged
aldenks merged 19 commits into dynamical-org:main from JackKelly:rclone-https
Feb 3, 2026

Conversation

@JackKelly
Contributor

@JackKelly JackKelly commented Jan 31, 2026

This PR should be ready for testing in production. I've tested it on an EC2 VM, copying to an S3 bucket: it sustained about 50 MiB/s, and checking an entire NWP run takes only about 30 seconds 🙂

Aims

  • Speed up every part of the file transfer from DWD to Source Co-Op.
  • Provide a solution that can handle DWD's planned directory format.

Implementation notes

The approach in this PR differs from the previous approach (PRs #360 and #370) in that this PR:

  • Downloads files from DWD's HTTPS server instead of the FTP server.
    • DWD's FTP server doesn't allow more than 10 concurrent connections from one IP address. This lack of concurrency was killing transfer performance, given the ~150 ms latency over the 8,000 km from Germany to us-west-2, the tiny size of the grib files, and the "chatty" nature of the FTP protocol. FTP requires about 7 round trips per file, which at ~150 ms each is about 1 second of overhead per file, even though the actual payload might take only a fraction of a second to transfer. In contrast, DWD's HTTPS server doesn't appear to have an upper limit on concurrent connections. HTTPS is also less "chatty" than FTP: once the connection is established, HTTPS needs only 1 round trip per file.
  • Uses rclone copyurl --urls src_and_dst_paths.csv instead of rclone copy.
    • rclone copyurl is a gem hidden deep within rclone's docs, and I only discovered it by chance a few days ago. The docs don't make it clear that copyurl can solve our problems, to the extent that Gemini firmly believes that copyurl cannot do what this PR does with it! (UPDATE: I've submitted a PR to rclone to update the docs.) AFAICT, copyurl is the only rclone command that lets us provide a list of mappings from full source URL to full destination path. In contrast, rclone copy --files-from only lets us provide a list of source files, so rclone copy can't modify the structure of the destination path (e.g. extracting the NWP init datetime from the base filename of the grib, and using that datetime in an early part of the destination path). rclone copyurl is wonderful because it accepts arbitrary mappings from source URL to destination path.
  • Uses Python to figure out which files have already been transferred.
    • The previous design relied on rclone copy to figure out which files already exist on the destination. This was slow.
    • In contrast, rclone copyurl doesn't check the destination before copying. It just blindly copies.
    • So the Python code in this PR first asks rclone to list all the files on the HTTPS server and the relevant files in the destination path, and then computes the set difference. This guarantees that the check happens only once per run, and it's fast: it takes only about 30 seconds to check an entire NWP run, whereas the old approach took several minutes to check the larger NWP variable directories.
  • Simplifies logging.
    • Instead of writing fiddly Python code to parse JSON logs from rclone, this PR takes a simpler approach: we persuade rclone to output useful logs as strings, and log those strings. No more JSON parsing, and no more adding up TransferSummary objects. We still get a regular log saying how much data has been transferred, and the throughput. We do lose a tiny bit of functionality, in that the current version doesn't say how much is left to transfer. We can add that back in if needs be.
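
The set-difference step and the source-to-destination mapping described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `hypothetical_dst_path`, `build_copyurl_rows`, `rows_to_csv`, the example host and the destination layout are all assumptions, and the two-column layout (source URL, then destination path) is assumed from the description of `rclone copyurl --urls`.

```python
import csv
import io
import re
from pathlib import PurePosixPath


def hypothetical_dst_path(src_path: str) -> str:
    """Illustrative mapping only: put the NWP init datetime first in the path."""
    path = PurePosixPath(src_path)
    match = re.search(r"_(\d{10})(?=_)", path.name)
    if match is None:
        raise ValueError(f"No init datetime in {path.name!r}")
    return f"{match.group(1)}/{path.parent.name}/{path.name}"


def build_copyurl_rows(
    src_root_url: str, src_paths: list[str], files_already_on_dst: set[str]
) -> list[tuple[str, str]]:
    """Return (source URL, destination path) pairs for files missing on the destination."""
    dst_to_src = {hypothetical_dst_path(src): src for src in src_paths}
    # The set difference is computed once, in Python, rather than per file by rclone.
    missing_dst = dst_to_src.keys() - files_already_on_dst
    return [(f"{src_root_url}/{dst_to_src[dst]}", dst) for dst in sorted(missing_dst)]


def rows_to_csv(rows: list[tuple[str, str]]) -> str:
    """Serialise the pairs as CSV text (assumed layout: URL, destination path)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

The resulting text would be written to a file such as src_and_dst_paths.csv and handed to rclone, so rclone only ever sees files that still need copying.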

TODO

  • Put into a src/reformatters/dwd/archive_gribs folder, with an __init__.py.
  • Split into several Python files.
  • Benchmark on an OCF AWS EC2 VM, copying from DWD HTTPS to S3. DONE. It's fast! (~50 MiB/s). Logs here.
  • Write unit tests

Related

@JackKelly JackKelly self-assigned this Jan 31, 2026
@JackKelly JackKelly added the enhancement New feature or request label Jan 31, 2026
@JackKelly JackKelly requested review from aldenks and removed request for aldenks January 31, 2026 11:42
@JackKelly JackKelly marked this pull request as ready for review January 31, 2026 22:08
@aldenks
Member

aldenks commented Feb 2, 2026

🎉 Will review soon!

@JackKelly JackKelly requested a review from aldenks February 2, 2026 08:27
@JackKelly
Contributor Author

Thank you! No huge rush - I'll be busy this week with other stuff. And sorry for this being the third re-write of this functionality! "Third time's a charm", right?! I'm feeling optimistic that this approach is the simplest yet 🙂

for src_path in src_paths_starting_with_nwp_var:
    dst_path = convert_src_path_to_dst_path(src_path)
    if dst_path not in files_already_on_dst:
        full_src_path = f"{src_host_and_root_path}/{src_path}"
Contributor Author

I'd be tempted to use urlpath.URL here, to ensure we join URLs correctly, if you're happy for me to add urlpath as a dependency?

Member

Would the benefit be to make sure we don't have double slashes? If that's the specific benefit, I think I'd rather ensure that with a unit test or assert, since we don't have many cases to deal with here. But maybe there's more?
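
A rough sketch of the "assert" option, without a new dependency: normalise the slash at the join point and assert that no double slash remains after the scheme. `join_url` is a hypothetical helper, not code from this PR, and the example URL in the test is illustrative only.

```python
def join_url(base_url: str, relative_path: str) -> str:
    """Join a base URL and a relative path with exactly one slash between them."""
    url = f"{base_url.rstrip('/')}/{relative_path.lstrip('/')}"
    scheme, separator, rest = url.partition("://")
    # Guard against accidental double slashes anywhere after the scheme.
    assert separator and "//" not in rest, f"Malformed URL: {url!r}"
    return url
```

A unit test can then cover the handful of cases that matter here: trailing slash on the base, leading slash on the path, both, and neither.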

Member

@aldenks aldenks left a comment

Awesome! Thanks for the iterations on this -- "easy" things often aren't so easy at some scale 🙃

If you want to merge and do follow-ups in a separate PR, that's good too; they are all minor. I'll let you know when I can get this up and testing. Not sure exactly when, but in the next week for sure.

Comment on lines +100 to +101
_log_error_from_called_process_error(e)
raise
Member

what about a

try:
    _log_err...
finally:
    raise

to make sure the nice logging doesn't get in the way of raising?
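
One way to get that guarantee, sketched under assumptions: wrap the logging call in its own try/except inside the handler, so a bug in the logging helper is reported but the original CalledProcessError is still the one re-raised. `run_checked` and `log_called_process_error` are hypothetical stand-ins, not the PR's functions, and this variant uses a nested except rather than the finally form suggested above.

```python
import logging
import subprocess
from collections.abc import Callable

logger = logging.getLogger(__name__)


def log_called_process_error(e: subprocess.CalledProcessError) -> None:
    """Hypothetical stand-in for the PR's `_log_error_from_called_process_error`."""
    logger.error("Command %s failed with exit code %d: %s", e.cmd, e.returncode, e.stderr)


def run_checked(
    cmd: list[str],
    log_error: Callable[[subprocess.CalledProcessError], None] = log_called_process_error,
) -> subprocess.CompletedProcess[str]:
    """Run a command, log any failure, and always re-raise the original error."""
    try:
        return subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        try:
            log_error(e)
        except Exception:
            # A failure in the logging helper must not mask the original error.
            logger.exception("Could not log the failed command")
        raise  # Re-raises the original CalledProcessError.
```

Passing the logging function as a parameter is only there to make the behaviour easy to test with a deliberately broken logger.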

def extract_nwp_init_datetime_from_grib_filename(
    grib_filename: str,
) -> datetime:
    dwd_nwp_init_date_regex: Final[re.Pattern[str]] = re.compile(r"_(\d{10})(?=_)")
Member

It's probably not material perf-wise, but the re.compile is happening on each call. We could put the compiled pattern in a constant.
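
A minimal sketch of that suggestion, with the pattern hoisted to a constant that is compiled once at import time. The constant name, the error handling, and the assumption that the ten digits encode YYYYMMDDHH are illustrative and not taken from the PR; the returned datetime is naive.

```python
import re
from datetime import datetime
from typing import Final

# Compiled once at import time rather than on every call.
DWD_NWP_INIT_DATETIME_REGEX: Final[re.Pattern[str]] = re.compile(r"_(\d{10})(?=_)")


def extract_nwp_init_datetime_from_grib_filename(grib_filename: str) -> datetime:
    """Parse the NWP init datetime embedded in a grib filename."""
    match = DWD_NWP_INIT_DATETIME_REGEX.search(grib_filename)
    if match is None:
        raise ValueError(f"No NWP init datetime found in {grib_filename!r}")
    # Assumption: the ten digits are YYYYMMDDHH.
    return datetime.strptime(match.group(1), "%Y%m%d%H")
```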

# It would've made more sense for `dst_root` to be a `PurePosixPath` but Typer doesn't
# handle `PurePosixPath`, so we use a `str` to keep Typer happy.
- dst_root: str = grib_archive_path,
+ dst_root_path: str = grib_archive_path,
Member

Will you make grib_archive_path Final? And maybe it should be called dynamical_grib_archive_rclone_root, to make it clear it's our archive and rclone-specific.

@aldenks
Member

aldenks commented Feb 3, 2026

Actually, @JackKelly I'm going to merge this so I can wrap up the ty switch.

@aldenks aldenks merged commit 38e4ecf into dynamical-org:main Feb 3, 2026
2 checks passed
@JackKelly
Contributor Author

Awesome, thanks! I'll implement your suggestions in a follow-up PR!


Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Speed up rclone copy from DWD's FTP server to Source Co-Op

2 participants