
dplyr::copy_to() performance with odbc 1.5.0 #862

Open
DavisVaughan opened this issue Nov 11, 2024 · 3 comments

@DavisVaughan
See tidyverse/dplyr#7103

@simonpcouch
Collaborator

Looping in @nathanhaigh—thanks for the issue. Related to #774 and #391. We've been iterating on this default over time and looks like the most recent changes introduce a slowdown in that context (Snowflake, larger data) you've noted.

Maybe we could introduce a batch_rows() helper with DBMS-specific methods.
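
(Purely as a sketch of that idea: batch_rows(), the stand-in classes, and the per-backend values below are all hypothetical, not anything odbc provides today.)

```r
# Hypothetical sketch: a generic that backend-specific connection classes
# could override to supply a sensible default insert batch size.
setGeneric("batch_rows", function(con, values, ...) standardGeneric("batch_rows"))

# Stand-in classes; in practice these would be odbc's own connection classes.
setClass("OdbcConnection", contains = "VIRTUAL")
setClass("SnowflakeConnection", contains = "OdbcConnection")

# Conservative fallback mirroring the current 1.5.0 default.
setMethod("batch_rows", "OdbcConnection", function(con, values, ...) 1024L)

# A backend known to handle bulk inserts well could send everything at once.
setMethod("batch_rows", "SnowflakeConnection", function(con, values, ...) nrow(values))

batch_rows(new("SnowflakeConnection"), mtcars)  # 32
```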

@detule
Collaborator

detule commented Nov 12, 2024

Hey @simonpcouch

I see that @nathanhaigh has discovered how he can use the package options to adjust the batch size. Following along with his (excellent) write-up, it seems like he is mostly curious why the change to the default was made.
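
(For anyone landing here from search, the package-option route being referred to looks roughly like this; `con`, `flights`, and the 50000 value are placeholders, not recommendations.)

```r
library(dplyr)

# Raise odbc's insert batch size globally before copying. `con` is an
# existing odbc connection and `flights` a data frame; both are placeholders.
options(odbc.batch_rows = 50000)
copy_to(con, flights, "flights")
```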

I'll just chime in with the motivation. I know you are aware of this since you addressed the issue originally, but I'm posting here for general benefit. I think the pull request came as a result of at least one person complaining that, with the previous set of parameters, RAM usage was excessive. It makes sense: when inserting a large data set in a single batch, we populate potentially very big buffers and pass them on to the driver.

We should probably have a more sophisticated default than 1024. For example, if inserting a skinny table of numerical data, we can probably afford to grab many (or all) rows at once. If, on the other hand, we have a wide table with LONG fields, we may need to scale the number of rows down according to some gradient. This type of analysis would in and of itself come with a performance hit, but it sounds like it may be small compared to what some users (those that are less memory constrained) are experiencing.
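
Purely to illustrate the kind of gradient I mean (none of this is in odbc, and the per-type byte estimates are crude guesses), something like:

```r
# Estimate the bytes one row will occupy in the insert buffers: fixed-width
# columns are cheap, character/LONG columns dominate.
estimate_row_bytes <- function(df) {
  sum(vapply(df, function(col) {
    if (is.character(col)) mean(nchar(col, type = "bytes"), na.rm = TRUE) + 8
    else 8  # numeric/integer/logical/date: fixed-width approximation
  }, numeric(1)))
}

# Scale the batch so each one targets roughly `target_bytes` of buffer.
adaptive_batch_rows <- function(df, target_bytes = 8 * 1024^2) {
  max(1L, as.integer(target_bytes / max(estimate_row_bytes(df), 1)))
}

adaptive_batch_rows(data.frame(x = rnorm(1e5)))           # skinny numeric: huge batches
adaptive_batch_rows(data.frame(txt = strrep("a", 5000)))  # wide/LONG-ish: much smaller
```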

@nathanhaigh

nathanhaigh commented Nov 13, 2024

I am curious about the change as it seems a very low default. Perhaps that says more about the data science space I operate in, but I would have thought tables of tens to hundreds of thousands of rows wouldn't be considered particularly big. Inserting in batches of 1024 rows would then require tens to hundreds of insert statements. In the end, 1024 rows may only represent a very small amount of data, so a more intelligent approach based on data size (e.g. chunking to x MB) might be better. Perhaps one option would be to use the number of bytes occupied by the table (object.size(my_table)) together with the number of rows to figure out an appropriate value for batch_rows.
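
A minimal sketch of that idea, assuming a data frame `my_table` and an arbitrary 8 MB target per batch:

```r
# Derive batch_rows from the average in-memory row size reported by
# object.size(), then hand the result to odbc via the existing option.
batch_rows_for <- function(my_table, chunk_mb = 8) {
  bytes_per_row <- as.numeric(utils::object.size(my_table)) / max(nrow(my_table), 1L)
  max(1L, as.integer(chunk_mb * 1024^2 / bytes_per_row))
}

options(odbc.batch_rows = batch_rows_for(my_table))
```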

@detule The issue you link to above seems to be more about a memory leak than about memory usage. In that sense, wouldn't batch_rows be more about limiting memory usage, perhaps with the side effect of reducing memory leakage?

From a dplyr user's perspective, setting the odbc.batch_rows option was not an obvious solution, as it is buried a few layers down, and it took me some time to discover it as a potential fix for the performance issues I was observing. @simonpcouch - how would the following statement manifest itself to the user?

we could introduce a batch_rows() helper with DBMS-specific methods
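
(For completeness: as far as I can tell, the closest existing per-call knob is the batch_rows argument on odbc's dbWriteTable() method, which copy_to() doesn't expose, so the global option is the practical lever for dplyr users today. `con` and `flights` are placeholders.)

```r
# Per-call batch size via odbc's dbWriteTable() method, bypassing copy_to().
DBI::dbWriteTable(con, "flights", flights, batch_rows = 50000)
```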
