Today, there are two options for downloading a file:
File.content(), which requests the entire file at once, and loads it all into memory.
File.download_to(), which requests the file in chunks, and dumps it into a stream (file handle or another instance of io.IOBase).
The first option always loads the full contents into memory. The second option always loads the full contents into memory except in the case where the stream is a file handle. Additionally, both options always download the full file before handing control back to the caller.
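For reference, here is roughly what the two existing options look like in use (a minimal sketch; my_file is assumed to be a File instance obtained from the client):

```python
# Option 1: File.content() buffers the entire file in memory.
file_bytes = my_file.content()

# Option 2: File.download_to() writes chunks into a stream; it only avoids
# buffering the whole file when the stream is an on-disk file handle.
with open('report.pdf', 'wb') as output_file:
    my_file.download_to(output_file)
```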
There should be a third option for download: one that downloads and loads only individual chunks at a time, and is a generator, so it hands control back to the caller for each chunk.
I've implemented this both locally and in a comment on #93, and it looks something like this:
```python
from contextlib import contextmanager, closing


class File(Item):
    ...

    # A contextmanager so that we can force the HTTP stream to be closed when we're done with it.
    @contextmanager
    def content_chunks(self):
        url = self.get_url('content')
        with closing(self._session.get(url, expect_json_response=False, stream=True)) as box_response:
            yield box_response.network_response.response_as_stream().stream(decode_content=True)

    # We can redefine download_to() in terms of content_chunks().
    def download_to(self, writeable_stream):
        with self.content_chunks() as chunks:
            for chunk in chunks:
                writeable_stream.write(chunk)


# Usage example 1
with my_file.content_chunks() as chunks:
    for chunk in chunks:
        ...  # Handle chunk

# Usage example 2
with my_file.content_chunks() as chunks:
    ...  # Handle chunks
```
This could be added as-is, but I haven't created a PR yet because I wanted to think more about #87. I was wondering if there might be a better way to expose streams in the abstract Network interface. Right now we have a generic response_as_stream() method on NetworkResponse, but it isn't really generic because:
It requires stream=True to be passed to request() on Network, which isn't documented because it is specific to the requests library.
After calling response_as_stream(), you must call its stream() method, which isn't documented because it is specific to the requests library.
We could just add this as-is, because it doesn't technically add any additional dependencies on requests that weren't already added via download_to().
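One possible direction (just a sketch; the iter_content_chunks name is hypothetical and not part of the current Network interface) would be for NetworkResponse to expose a chunk iterator directly, so callers never need to know about the requests-specific stream=True flag or the stream() method:

```python
from abc import ABC, abstractmethod


class NetworkResponse(ABC):
    """Abstract response from the Network interface (simplified)."""

    @abstractmethod
    def iter_content_chunks(self, chunk_size=1024 * 1024):
        """Hypothetical replacement for response_as_stream().stream(...):
        yield successive byte chunks without exposing the underlying HTTP
        library's streaming mechanics."""


class RequestsNetworkResponse(NetworkResponse):
    """A requests-backed implementation (sketch)."""

    def __init__(self, requests_response):
        self._requests_response = requests_response

    def iter_content_chunks(self, chunk_size=1024 * 1024):
        # requests.Response.iter_content() already yields chunks lazily when
        # the request was made with stream=True.
        return self._requests_response.iter_content(chunk_size=chunk_size)
```

With something like that in place, content_chunks() could yield from the response's iterator without mentioning requests at all.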
The observation was that all download methods would load the entire file into memory unless you pass a file handle to download_to. Perhaps that is fine. But if you wanted to download a file without writing it to disk and without loading the entire contents into memory at once, there was no way to do that.
If you wanted to:

- perform stream processing on a large file (reactively, as it downloads, without writing it to an on-disk file); or
- be able to manually control when the bytes for the file are downloaded,

then something like what I proposed would possibly be necessary; stream processing of that sort might look like the sketch below.
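For example (a sketch that assumes the content_chunks() method proposed above), a large file could be hashed incrementally without ever holding the full contents in memory:

```python
import hashlib


def sha1_of_box_file(box_file):
    # Each chunk is consumed as it arrives and then discarded,
    # so memory usage stays bounded regardless of file size.
    digest = hashlib.sha1()
    with box_file.content_chunks() as chunks:
        for chunk in chunks:
            digest.update(chunk)
    return digest.hexdigest()
```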
It looks like I opened that issue originally in response to this user comment: #56 (comment)
This request is logged in our Backlog as ticket SDK-307.
Since StreamingWritesBytesIO overrides the write method and doesn't call super, the content isn't retained in memory (unless the implementation of on_write happens to do so).
The approach also has a few other benefits, like allowing for canceling the download (by raising an exception), but it doesn't give full control to the caller (like exposing a generator might).
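Roughly, I understand that wrapper to work something like this (a sketch based on the description above, not the SDK's actual implementation):

```python
import io


class StreamingWritesBytesIO(io.BytesIO):
    """Sketch: forward each written chunk to a callback instead of buffering it."""

    def __init__(self, on_write):
        super(StreamingWritesBytesIO, self).__init__()
        self._on_write = on_write

    def write(self, data):
        # Deliberately does not call super().write(), so nothing accumulates
        # in the underlying buffer; the callback decides what to do with each
        # chunk, and it can raise an exception to cancel the download.
        self._on_write(data)
        return len(data)


# Usage with the existing download_to(): chunks reach handle_chunk as they
# arrive, without the whole file being retained in memory.
# my_file.download_to(StreamingWritesBytesIO(on_write=handle_chunk))
```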