Transfer files from DWD HTTPs server to Source Co-Op using rclone copyurl --urls#406
Transfer files from DWD HTTPs server to Source Co-Op using rclone copyurl --urls#406aldenks merged 19 commits intodynamical-org:mainfrom
rclone copyurl --urls#406Conversation
found out about copyurls, which is much better.
of the heavy lifting.
|
🎉 Will review soon! |
|
Thank you! No huge rush - I'll be busy this week with other stuff. And sorry for this being the third re-write of this functionality! "Third time's a charm", right?! I'm feeling optimistic that this approach is the simplest yet 🙂 |
| for src_path in src_paths_starting_with_nwp_var: | ||
| dst_path = convert_src_path_to_dst_path(src_path) | ||
| if dst_path not in files_already_on_dst: | ||
| full_src_path = f"{src_host_and_root_path}/{src_path}" |
There was a problem hiding this comment.
I'd be tempted to use urlpath.URL here, to ensure we join URLs correctly, if you're happy for me to add urlpath as a dependency?
There was a problem hiding this comment.
would the benefit be to make sure we don't have double slashes? i think id rather ensure that with a unit test or assert since we don't have many cases to deal with here if that's the specific benefit, but maybe there's more?
aldenks
left a comment
There was a problem hiding this comment.
Awesome! Thanks for the iterations on this -- "easy" things often aren't so easy at some scale 🙃
If you want to merge and do follow ups in a separate PR thats good too, they are all minor. I'll let you know when i can get this up and testing, not sure exactly but in the next week for sure.
| for src_path in src_paths_starting_with_nwp_var: | ||
| dst_path = convert_src_path_to_dst_path(src_path) | ||
| if dst_path not in files_already_on_dst: | ||
| full_src_path = f"{src_host_and_root_path}/{src_path}" |
There was a problem hiding this comment.
would the benefit be to make sure we don't have double slashes? i think id rather ensure that with a unit test or assert since we don't have many cases to deal with here if that's the specific benefit, but maybe there's more?
| _log_error_from_called_process_error(e) | ||
| raise |
There was a problem hiding this comment.
what about a
try:
_log_err...
finally:
raise
to make sure the nice logging doesn't get in the way of raising?
| def extract_nwp_init_datetime_from_grib_filename( | ||
| grib_filename: str, | ||
| ) -> datetime: | ||
| dwd_nwp_init_date_regex: Final[re.Pattern[str]] = re.compile(r"_(\d{10})(?=_)") |
There was a problem hiding this comment.
its probably not material perf wise, but the .compile is happening on each call. We could put in a constant
| # It would've made more sense for `dst_root` to be a `PurePosixPath` but Typer doesn't | ||
| # handle `PurePosixPath`, so we use a `str` to keep Typer happy. | ||
| dst_root: str = grib_archive_path, | ||
| dst_root_path: str = grib_archive_path, |
There was a problem hiding this comment.
Will you make the grib_archive_path Final? and maybe it should be called dynamical_grib_archive_rclone_root to make it clear its our archive and rclone specific
|
Actually, @JackKelly im going to merge this so i can wrap up the |
|
Awesome, thanks! I'll implement your suggestions in a follow-up PR! |
This PR should be ready for testing in production. I've tested it on an EC2 VM, copying to an S3 bucket, and it was getting 50 MiB/s, and it only takes about 30 secs to check an entire NWP run 🙂
Aims
Implementation notes
The approach in this PR differs from the previous approach (PR #360 & #370) in that this PR:
us-west-2, and given the tiny size of the grib files, and given the "chatty" nature of the FTP protocol. FTP requires about 7 round trips per file, which, over 8,000 km, is 1 second of overhead per file, even though the actual payload might only take a fraction of a second to transfer. In contrast, DWD's HTTPS server doesn't appear to have an upper limit on concurrent connections. And HTTP is less "chatty" than FTP: once the connection is established, HTTPS only needs 1 back-and-forth per file.rclone copyurl --urls src_and_dst_paths.csvinstead ofrclone copy.rclone copyurlis a gem hidden deep withinrclone's docs, and I only discovered it by chance a few days ago (and the docs don't make it clear thatcopyurlis capable of solving our problems, to the extent that Gemini firmly believes thatcopyurlcannot do what this PR does withcopyurl! (UPDATE: I've submitted a PR torcloneto update the docs)). AFAICT,copyurlis the onlyrclonecommand that allows us to provide a list of mappings from full source URL to full destination path. In contrast,rclone copy --files-fromonly allows us to provide a list of source files;rclone copycan't modify the structure of the destination path (e.g. extracting the NWP init datetime from the base filename of the grib, and using that datetime in an early part of the destination path).rclone copyurlis wonderful because it allows us to provide a list of arbitrary mappings from source URL to destination path.rclone copyto figure out which files already exist on the destination. This was slow.rclone copyurldoesn't check the destination before copying. It just blindly copies.rcloneto list all the files on the HTTPS server, and list the relevant files in the destination path, and then our Python code computes the set difference. So we can guarantee that this check only happens once per run. And it's fast. It only takes about 30 seconds to check an entire NWP run, whereas the old approach took several minutes to check the larger NWP variable directories.rclone, this PR instead takes a simpler approach: We just persuadercloneto output useful logs as strings, and log those strings. No more JSON parsing; and no more adding upTransferSummaryobjects. We still get a regular log saying how much data has been transferred, and the throughput. We do lose a tiny bit of functionality, in that the current version doesn't say how much is left to transfer. We can add that back in if needs be.TODO
src/reformatters/dwd/archive_gribsfolder, with an__init__.py.Related
rclonecopy from DWD's FTP server to Source Co-Op #392rclone copyfrom DWD's FTP server to Source Co-Op #390rclone#360