WIP: use less memory by downloading to sparse file #9
Conversation
pget has a problem: it needs to download everything into memory on each of the parallel connections before finally saving it all to disk. This has two consequences:

- pget needs to consume at least as much memory as the size of the downloaded file
- pget cannot begin writing to disk until all the bytes have been downloaded from the source

I had a thought that we could precreate a sparse file of the correct length using the os.Truncate(size) call. Then each goroutine can write to different offsets within the same file, in parallel (see the sketch below).

I tested this locally on macOS and was able to download a 308MB file with pget's memory usage only reaching 26MB.

This commit is a WIP because I've commented out the untar stuff to keep the proof of concept simple. We should restore (or deprecate?) that functionality before merging this. We should also do more extensive performance testing.
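For concreteness, here is a minimal sketch of the idea, not pget's actual code: the URL, chunk count, and function names are illustrative, and it assumes the server honors HTTP Range requests (io.NewOffsetWriter needs Go 1.20+):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"sync"
)

// downloadSparse pre-creates a sparse file of the final size with
// Truncate, then lets each goroutine stream its byte range straight to
// disk at its own offset. os.File's WriteAt is safe for concurrent use,
// so the chunks need no locking.
func downloadSparse(url, dest string, size, chunks int64) error {
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()

	// Extend the file to its full length without allocating blocks.
	if err := f.Truncate(size); err != nil {
		return err
	}

	chunkSize := size / chunks
	var wg sync.WaitGroup
	errs := make(chan error, chunks)
	for i := int64(0); i < chunks; i++ {
		start, end := i*chunkSize, (i+1)*chunkSize-1
		if i == chunks-1 {
			end = size - 1 // last chunk absorbs the remainder
		}
		wg.Add(1)
		go func(start, end int64) {
			defer wg.Done()
			req, err := http.NewRequest(http.MethodGet, url, nil)
			if err != nil {
				errs <- err
				return
			}
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				errs <- err
				return
			}
			defer resp.Body.Close()
			// Stream the body to disk at this chunk's offset instead of
			// buffering it in memory.
			if _, err := io.Copy(io.NewOffsetWriter(f, start), resp.Body); err != nil {
				errs <- err
			}
		}(start, end)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err
	}
	return nil
}

func main() {
	// Placeholder URL and size; a real caller would HEAD for Content-Length.
	if err := downloadSparse("https://example.com/file.bin", "file.bin", 308<<20, 4); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```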
very cute use of truncate
Another thing we might want to do here before it gets merged is write to a temporary file and rename it to the target once the download completes.
This also does the download to a temp partial file. If the user does not pass `-x`, we finish by renaming the temp file to the target file. This means that if pget is interrupted, we don't leave an incomplete file with the target filename. If the user does pass `-x`, we pass the temp file as input to TarReader, then delete it at the end. (A sketch of this flow follows below.)
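A minimal sketch of that flow, with illustrative names (`extractTar` is a stand-in for the real tar code path, not pget's actual function):

```go
package main

import "os"

// extractTar is a placeholder for the real tar-extraction code path.
func extractTar(path string) error { return nil }

// finalize completes a download that was written to tmpPath. A rename
// within the same directory is atomic on POSIX filesystems, so an
// interrupted run never leaves a partial file under the target name.
func finalize(tmpPath, dest string, extract bool) error {
	if extract {
		// -x: feed the temp file to the tar reader, then delete it.
		if err := extractTar(tmpPath); err != nil {
			return err
		}
		return os.Remove(tmpPath)
	}
	// No -x: atomically move the temp file to the target filename.
	return os.Rename(tmpPath, dest)
}

func main() {
	_ = finalize("file.bin.partial", "file.bin", false)
}
```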
Flags have to go before arguments. From https://pkg.go.dev/flag:

> Flag parsing stops just before the first non-flag argument ("-" is a non-flag argument) or after the terminator "--".
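A quick illustration of the rule (hypothetical program; only the argument ordering matters):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	x := flag.Bool("x", false, "extract tar after download")
	flag.Parse()
	fmt.Println("x:", *x, "args:", flag.Args())
}

// $ prog -x URL dest   -> x: true  args: [URL dest]
// $ prog URL dest -x   -> x: false args: [URL dest -x]
```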
so v0.0.3-sparse can be released and tested without it being a full release
I've reimplemented extract-tar and done the …
One question, nothing blocking for sure.
I also wonder if it's worth maintaining a "toBuffer" form at all? For the record, I really like the "scribble to disk to avoid OOMing pget" approach.
```go
	fmt.Printf("Error getting cwd: %v\n", err)
	os.Exit(1)
}
destTemp, err := os.CreateTemp(cwd, dest+".partial")
```
Does it make sense to use the `*` notation to allow CreateTemp to create a unique file for the temp? That way, if something is somehow left on disk after a failure of pget before removal, a new temp file would be created instead of trying to scribble over the top of the old one.
```diff
-destTemp, err := os.CreateTemp(cwd, dest+".partial")
+destTemp, err := os.CreateTemp(cwd, fmt.Sprintf("%s-*.partial", dest))
```
Or even `os.CreateTemp(cwd, "*.partial")`, which is more in line with how browsers do it now, so that filenames are really obscured and not accidentally used until complete. Since we either extract from the reference object for the tempfile or rename it, the name doesn't need to include `dest` if we don't want it to.
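For reference, a tiny sketch of the `*` behavior (the output names are made up):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// CreateTemp replaces the last "*" in the pattern with a random
	// string, so repeated runs get distinct names instead of scribbling
	// over a leftover from an earlier failed run.
	f1, _ := os.CreateTemp(".", "*.partial")
	f2, _ := os.CreateTemp(".", "*.partial")
	fmt.Println(f1.Name(), f2.Name()) // e.g. ./239848716.partial ./501172394.partial
}
```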
Also, related but not blocking: we should implement testing :)