Is there a way to prevent st_read from reading the full Excel file/sheet into memory? #243
Replies: 3 comments 11 replies
-
Hi! Note that in general it is not always possible to incrementally read an Excel file due to how the Excel file format is constructed, particularly if you have a lot of unique text cells, as every unique string is (usually) stored in a separate part of the file that needs to be more or less read up front. In the future I want to create a separate Excel reader extension with our own implementation that can be more memory efficient, but I don't have a timeline on that.
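To illustrate the implication, here is a minimal sketch of the query pattern in question, assuming the spatial extension is installed and using a placeholder file name. Even with a `LIMIT`, the workbook's shared-strings part generally still has to be read up front, so peak memory can approach the size of the whole sheet.

```sql
INSTALL spatial;
LOAD spatial;

-- 'large_sheet.xlsx' is a hypothetical file name for illustration.
-- The LIMIT restricts the rows returned, but the xlsx shared-strings
-- table is typically still read in full before any row is produced.
SELECT *
FROM st_read('large_sheet.xlsx')
LIMIT 1000;
```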
-
@tboddyspargo Have you tried using
-
@tboddyspargo I suggest trying to read this with duckdb using the GDAL virtual filesystem instead of the duckdb one; see https://gdal.org/user/virtual_file_systems.html. For example, this takes a few seconds on my machine:
I'm just using the current (or maybe an old) version of the spatial extension. I'm mostly an R user, so I use a little duckdb R wrapper for this:

```r
# remotes::install_github("cboettig/duckdbfs")
bench::bench_time({
  df <- duckdbfs::open_dataset("/vsicurl/https://github.com/duckdb/duckdb_spatial/files/14105978/PPP_Aid_to_Restaurants.xlsx")
})
#> process    real
#>   7.04s    7.04s
```
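For reference, the same `/vsicurl/` trick should work from plain DuckDB SQL as well; a rough sketch, assuming the spatial extension is installed and loaded:

```sql
INSTALL spatial;
LOAD spatial;

-- Prefixing the URL with /vsicurl/ lets GDAL's virtual filesystem
-- handle the HTTP access instead of DuckDB's own filesystem.
CREATE TABLE ppp AS
SELECT *
FROM st_read('/vsicurl/https://github.com/duckdb/duckdb_spatial/files/14105978/PPP_Aid_to_Restaurants.xlsx');
```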
-
After performing some profiling tests, I've concluded that using `st_read` with an Excel file is not memory safe in the way that `read_csv_auto` is (e.g. when using `LIMIT` and `OFFSET`). I'd like to make sure that, even when loading a very large Excel sheet, I can read it in chunks and not need to worry about having enough memory to store the entire file. Is that possible today? If not, is there a plan for it, or any alternative approaches that might allow me to continue using duckdb for this use case but achieve partial reading of an Excel file as a duckdb relation object?
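For concreteness, the kind of chunked access being described looks roughly like the sketch below for CSV (file name and chunk bounds are placeholders); the question is whether anything equivalent can be made memory safe for xlsx:

```sql
-- Hypothetical file name and chunk size, for illustration only.
-- With read_csv_auto, each chunk can be fetched without having to hold
-- the entire file in memory at once.
SELECT *
FROM read_csv_auto('very_large_file.csv')
LIMIT 100000 OFFSET 200000;
```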