Is there a way to prevent st_read from reading the full Excel file/sheet into memory? #243
Replies: 3 comments 11 replies
-
Hi! Note that in general it is not always possible to incrementally read an Excel file due to how the Excel file format is constructed, particularly if you have a lot of unique text cells, as every unique string is (usually) stored in a separate part of the file that needs to be more or less read up front. In the future I want to create a separate Excel reader extension with our own implementation that can be more memory efficient, but I don't have a timeline on that.
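To illustrate the implication, here is a minimal sketch of the query pattern in question, assuming the spatial extension is installed and using a placeholder file name. Even with a `LIMIT`, the workbook's shared-strings part generally still has to be read up front, so peak memory can approach the size of the whole sheet.

```sql
INSTALL spatial;
LOAD spatial;

-- 'large_sheet.xlsx' is a hypothetical file name for illustration.
-- The LIMIT restricts the rows returned, but the xlsx shared-strings
-- table is typically still read in full before any row is produced.
SELECT *
FROM st_read('large_sheet.xlsx')
LIMIT 1000;
```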
-
@tboddyspargo Have you tried using
-
@tboddyspargo I suggest trying to read this with duckdb using the GDAL virtual filesystem instead of the duckdb one; see https://gdal.org/user/virtual_file_systems.html. For example, this takes a few seconds on my machine:
I'm just using the current (or maybe an old) version of the spatial extension. I'm mostly an R user, so I use a little duckdb R wrapper for this:

```r
# remotes::install_github("cboettig/duckdbfs")
bench::bench_time({
  df <- duckdbfs::open_dataset("/vsicurl/https://github.com/duckdb/duckdb_spatial/files/14105978/PPP_Aid_to_Restaurants.xlsx")
})
#> process    real
#>   7.04s    7.04s
```
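For reference, the same `/vsicurl/` trick should work from plain DuckDB SQL as well; a rough sketch, assuming the spatial extension is installed and loaded:

```sql
INSTALL spatial;
LOAD spatial;

-- Prefixing the URL with /vsicurl/ lets GDAL's virtual filesystem
-- handle the HTTP access instead of DuckDB's own filesystem.
CREATE TABLE ppp AS
SELECT *
FROM st_read('/vsicurl/https://github.com/duckdb/duckdb_spatial/files/14105978/PPP_Aid_to_Restaurants.xlsx');
```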
-
After performing some profiling tests, I've concluded that using `st_read` with an Excel file is not memory safe in the way that `read_csv_auto` is (e.g. when using `LIMIT` and `OFFSET`). I'd like to make sure that, even when loading a very large Excel sheet, I can read it in chunks and not need to worry about having enough memory to store the entire file. Is that possible today? If not, is there a plan for it, or any alternative approaches that might allow me to continue using duckdb for this use case but achieve partial reading of an Excel file as a duckdb relation object?
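For concreteness, the kind of chunked access being described looks roughly like the sketch below for CSV (file name and chunk bounds are placeholders); the question is whether anything equivalent can be made memory safe for xlsx:

```sql
-- Hypothetical file name and chunk size, for illustration only.
-- With read_csv_auto, each chunk can be fetched without having to hold
-- the entire file in memory at once.
SELECT *
FROM read_csv_auto('very_large_file.csv')
LIMIT 100000 OFFSET 200000;
```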