
WIP: use less memory by downloading to sparse file #9

Merged
philandstuff merged 4 commits into main from sparse-file-download on Aug 8, 2023

Conversation

philandstuff
Contributor

pget has a problem: it needs to download everything into memory, along each of its parallel connections, before finally saving anything to disk. This has two consequences:

  • pget needs to consume at least as much memory as the size of the downloaded file
  • pget cannot begin writing to disk until all the bytes have been downloaded from source

I had a thought that we could precreate a sparse file of the correct length by using the os.Truncate(size) call. Then each goroutine can write to different offsets within the same file, in parallel.
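For illustration only, here is a minimal sketch of that approach; the names, chunking, and error handling are simplified rather than being pget's actual code, and it assumes the server supports Range requests, Go 1.20+ for io.NewOffsetWriter, and golang.org/x/sync/errgroup:

```go
package download

import (
	"fmt"
	"io"
	"net/http"
	"os"

	"golang.org/x/sync/errgroup"
)

// download preallocates dest at its final size, then has each goroutine
// fetch one byte range and write it at its own offset in the same file.
func download(url, dest string, size, chunkSize int64) error {
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	// Extend the file to `size` bytes without writing data yet (sparse on
	// most filesystems), so every offset is valid to write at.
	if err := f.Truncate(size); err != nil {
		return err
	}

	var g errgroup.Group
	for start := int64(0); start < size; start += chunkSize {
		start := start // per-iteration copy for the closure (pre-Go 1.22)
		end := start + chunkSize - 1
		if end > size-1 {
			end = size - 1
		}
		g.Go(func() error {
			req, err := http.NewRequest("GET", url, nil)
			if err != nil {
				return err
			}
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			// Stream straight to disk at this chunk's offset; memory use
			// stays at roughly one copy buffer per goroutine.
			_, err = io.Copy(io.NewOffsetWriter(f, start), resp.Body)
			return err
		})
	}
	return g.Wait()
}
```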

I tested this locally on macOS and I was able to download a 308MB file with pget's memory usage only reaching 26MB.

This commit is a WIP because I've commented out the untar stuff to keep the proof of concept simple. We should restore (or deprecate?) that functionality before merging this. We should also do more extensive performance testing.

@technillogue
Contributor

very cute use of truncate

@philandstuff
Contributor Author

Another thing we might want to do here before it gets merged is write to a .partial file and only rename to the final dest filename once we've downloaded all the bytes, to avoid leaving a half-downloaded sparse file in the wrong place if pget gets interrupted.

This also does the download to a temp partial file.

If the user does not pass `-x`, we finish by renaming the temp file to
the target file.  This means that if pget is interrupted, we don't
leave an incomplete file with the target filename.

If the user does pass `-x`, we pass the temp file as input to
TarReader, then delete it at the end.
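A rough sketch of that finishing step (the function name and signature are illustrative, not pget's actual code; the extract callback stands in for the tar handling):

```go
package download

import "os"

// finishDownload sketches the finishing step: either extract the temp file
// or rename it into place. The extract callback is a hypothetical stand-in
// for pget's tar extraction, not the real API.
func finishDownload(tmp *os.File, dest string, extract func(*os.File, string) error) error {
	if extract != nil {
		// -x was passed: feed the temp file to the extractor, then delete it.
		if err := extract(tmp, dest); err != nil {
			return err
		}
		return os.Remove(tmp.Name())
	}
	// No -x: rename the completed .partial file onto the target name, so an
	// interrupted run never leaves a half-written file at dest.
	return os.Rename(tmp.Name(), dest)
}
```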
Flags have to go before arguments. From https://pkg.go.dev/flag :

> Flag parsing stops just before the first non-flag argument ("-" is a non-flag argument) or after the terminator "--".
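A toy example of that behaviour with the standard flag package (illustrative only, not pget's real flag definitions):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	extract := flag.Bool("x", false, "extract tar after download")
	flag.Parse()
	fmt.Println(*extract, flag.Args())
	// "prog -x URL DEST" prints: true [URL DEST]
	// "prog URL DEST -x" prints: false [URL DEST -x]  (-x is treated as an argument)
}
```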
so v0.0.3-sparse can be released and tested without it being a full
release
@philandstuff marked this pull request as ready for review on August 7, 2023 16:37
@philandstuff
Contributor Author

I've reimplemented extract-tar and done the .partial thing. Ready for review.

@tempusfrangit
Member


One question, nothing blocking for sure.

I also wonder if it's worth maintaining a "toBuffer" form at all? For the record, I really like the "scribble to disk to avoid OOMing pget" approach.

fmt.Printf("Error getting cwd: %v\n", err)
os.Exit(1)
}
destTemp, err := os.CreateTemp(cwd, dest+".partial")
@tempusfrangit
Member
Aug 7, 2023

Does it make sense to use the * notation so CreateTemp creates a unique temp file? That way, if something is somehow left on disk after a failure of pget before cleanup, a new temp file would be created instead of trying to scribble over the old one.

Suggested change
- destTemp, err := os.CreateTemp(cwd, dest+".partial")
+ destTemp, err := os.CreateTemp(cwd, fmt.Sprintf("%s-*.partial", dest))

Or even os.CreateTemp(cwd, "*.partial"), which is more in line with how browsers do it now, so that filenames are really obscured and not accidentally used until complete. Since we either extract from the temp file's handle or rename it, the name doesn't need to include "dest" if we don't want it.

@tempusfrangit
Member

Also, related but not blocking, we should implement testing :)

@philandstuff merged commit 22f5552 into main on Aug 8, 2023
1 check passed
@philandstuff deleted the sparse-file-download branch on August 8, 2023 12:49