dplyr::copy_to() performance with odbc 1.5.0 #862
Looping in @nathanhaigh. Thanks for the issue. Related to #774 and #391. We've been iterating on this default over time, and it looks like the most recent changes introduce a slowdown in the context (Snowflake, larger data) you've noted. Maybe we could introduce a […]
Hey @simonpcouch, I see that @nathanhaigh has discovered how he can use the package options to adjust the batch size. Following along his (excellent) write-up, it seems like he is mostly curious why the change to the default was made. I'll just chime in with the motivation. I know you are aware of this since you addressed the issue originally, but I'm posting here for general benefit. I think the pull request came as a result of at least one person complaining that, with the previous set of parameters, RAM usage was excessive. That makes sense: when inserting a large data set in a single batch, we populate potentially very big buffers and pass them on to the driver.

We should probably have a more sophisticated default than 1024. For example, if inserting a skinny table of numerical data, we can probably afford to grab many (or all) rows at once. If, on the other hand, we have a wide table with LONG fields, we may need to adjust the number of rows according to some gradient. This type of analysis would in and of itself come with a performance hit, but it sounds like it may be small compared to what some users (who are less memory constrained) are experiencing.
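For anyone landing here from search, a minimal sketch of the option-based workaround being discussed, assuming the `odbc.batch_rows` option referenced in the thread; the DSN, table name, and `my_large_df` are placeholders for your own setup:

```r
library(DBI)
library(dplyr)

# Hypothetical Snowflake connection; swap in your own DSN / driver details.
con <- dbConnect(odbc::odbc(), dsn = "snowflake")

# Raise the per-batch row count from the 1024-row default so that the
# upload makes far fewer round trips. 50000 is an arbitrary example value.
options(odbc.batch_rows = 50000)

# `my_large_df` stands in for whatever local data frame is being uploaded.
copy_to(con, my_large_df, name = "MY_TABLE", temporary = FALSE)
```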
I am curious about the change, as it seems a very low default. Perhaps that says more about the data science space I operate in, but I would have thought tables of tens to hundreds of thousands of rows wouldn't be considered particularly big, and inserting them in batches of 1024 rows would then require tens to hundreds of insert statements. In the end, 1024 rows may represent only a very small amount of data, so a more intelligent approach based on data size (e.g. chunking to x MB) might be better. Perhaps one option might be to determine the number of bytes […]

@detule The issue you link to above seems to be more about a memory leak rather than memory usage. In that sense, would […]

From a dplyr user perspective, setting the […]
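For what it's worth, a rough sketch of the size-based heuristic being floated here; the helper name, `target_mb`, and the use of `object.size()` are illustrative choices on my part, not anything odbc currently implements:

```r
# Pick a batch size that targets roughly `target_mb` of data per batch,
# rather than a fixed 1024 rows regardless of how wide the rows are.
batch_rows_for <- function(df, target_mb = 8) {
  bytes_per_row <- as.numeric(object.size(df)) / max(nrow(df), 1L)
  max(1024, floor(target_mb * 1024^2 / bytes_per_row))
}

# Example: set the option from the data before uploading.
options(odbc.batch_rows = batch_rows_for(my_large_df))
```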
See tidyverse/dplyr#7103