
WIP: use less memory by downloading to sparse file #9

Merged
philandstuff merged 4 commits into main from sparse-file-download on Aug 8, 2023

Conversation

philandstuff
Contributor

pget has a problem: it needs to download everything into memory, along each of its parallel connections, before finally saving anything to disk. This has two consequences:

  • pget needs to consume at least as much memory as the size of the downloaded file
  • pget cannot begin writing to disk until all the bytes have been downloaded from source

I had a thought that we could precreate a sparse file of the correct length by using the os.Truncate(size) call. Then each goroutine can write to different offsets within the same file, in parallel.
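For illustration only, here is a minimal sketch of that approach; the names, chunking, and error handling are simplified rather than being pget's actual code, and it assumes the server supports Range requests, Go 1.20+ for io.NewOffsetWriter, and golang.org/x/sync/errgroup:

```go
package download

import (
	"fmt"
	"io"
	"net/http"
	"os"

	"golang.org/x/sync/errgroup"
)

// download preallocates dest at its final size, then has each goroutine
// fetch one byte range and write it at its own offset in the same file.
func download(url, dest string, size, chunkSize int64) error {
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	// Extend the file to `size` bytes without writing data yet (sparse on
	// most filesystems), so every offset is valid to write at.
	if err := f.Truncate(size); err != nil {
		return err
	}

	var g errgroup.Group
	for start := int64(0); start < size; start += chunkSize {
		start := start // per-iteration copy for the closure (pre-Go 1.22)
		end := start + chunkSize - 1
		if end > size-1 {
			end = size - 1
		}
		g.Go(func() error {
			req, err := http.NewRequest("GET", url, nil)
			if err != nil {
				return err
			}
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			// Stream straight to disk at this chunk's offset; memory use
			// stays at roughly one copy buffer per goroutine.
			_, err = io.Copy(io.NewOffsetWriter(f, start), resp.Body)
			return err
		})
	}
	return g.Wait()
}
```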

I tested this locally on macOS and I was able to download a 308MB file with pget's memory usage only reaching 26MB.

This commit is a WIP because I've commented out the untar stuff to keep the proof of concept simple. We should restore (or deprecate?) that functionality before merging this. We should also do more extensive performance testing.

@technillogue
Contributor

very cute use of truncate

@philandstuff
Contributor Author

Another thing we might want to do here before it gets merged is write to a .partial file and only rename to the final dest filename once we've downloaded all the bytes, to avoid leaving a half-downloaded sparse file in the wrong place if pget gets interrupted.

This also does the download to a temp partial file.

If the user does not pass `-x`, we finish by renaming the temp file to
the target file.  This means that if pget is interrupted, we don't
leave an incomplete file with the target filename.

If the user does pass `-x`, we pass the temp file as input to
TarReader, then delete it at the end.
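A rough sketch of that finishing step (the function name and signature are illustrative, not pget's actual code; the extract callback stands in for the tar handling):

```go
package download

import "os"

// finishDownload sketches the finishing step: either extract the temp file
// or rename it into place. The extract callback is a hypothetical stand-in
// for pget's tar extraction, not the real API.
func finishDownload(tmp *os.File, dest string, extract func(*os.File, string) error) error {
	if extract != nil {
		// -x was passed: feed the temp file to the extractor, then delete it.
		if err := extract(tmp, dest); err != nil {
			return err
		}
		return os.Remove(tmp.Name())
	}
	// No -x: rename the completed .partial file onto the target name, so an
	// interrupted run never leaves a half-written file at dest.
	return os.Rename(tmp.Name(), dest)
}
```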
Flags have to go before arguments. From https://pkg.go.dev/flag :

> Flag parsing stops just before the first non-flag argument ("-" is a non-flag argument) or after the terminator "--".
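A toy example of that behaviour with the standard flag package (illustrative only, not pget's real flag definitions):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	extract := flag.Bool("x", false, "extract tar after download")
	flag.Parse()
	fmt.Println(*extract, flag.Args())
	// "prog -x URL DEST" prints: true [URL DEST]
	// "prog URL DEST -x" prints: false [URL DEST -x]  (-x is treated as an argument)
}
```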
so v0.0.3-sparse can be released and tested without it being a full
release
@philandstuff marked this pull request as ready for review on August 7, 2023 16:37
@philandstuff
Contributor Author

I've reimplemented extract-tar and done the .partial thing. Ready for review.

@tempusfrangit
Member


One question, nothing blocking for sure.

I also wonder if it's worth maintaining a "toBuffer" form at all? For the record, I really like the "scribble to disk to avoid OOMing pget" approach.

fmt.Printf("Error getting cwd: %v\n", err)
os.Exit(1)
}
destTemp, err := os.CreateTemp(cwd, dest+".partial")
@tempusfrangit
Member
Aug 7, 2023

Does it make sense to use the * notation so CreateTemp creates a unique temp file? That way, if something is somehow left on disk after a failure of pget before cleanup, a new temp file would be created instead of trying to scribble over the old one.

Suggested change
- destTemp, err := os.CreateTemp(cwd, dest+".partial")
+ destTemp, err := os.CreateTemp(cwd, fmt.Sprintf("%s-*.partial", dest))

Or even os.CreateTemp(cwd, "*.partial"), which is more in line with how browsers do it now, so that filenames are really obscured and not accidentally used until complete. Since we either extract from the temp file's handle or rename it, the name doesn't need to include "dest" if we don't want it.

@tempusfrangit
Member

Also, related but not blocking, we should implement testing :)

@philandstuff merged commit 22f5552 into main on Aug 8, 2023
1 check passed
@philandstuff deleted the sparse-file-download branch on August 8, 2023 12:49