Document EnableQueryResultDownload for databricks() #856

Open
MCMaurer opened this issue Oct 17, 2024 · 7 comments

@MCMaurer

Without setting EnableQueryResultDownload='0' in the Databricks connection string, queries returning more than roughly 40k rows will fail with no data and a relatively cryptic error message:

[Simba][Hardy] (35) Error from server: error code: '0' error message: '[Simba][Hardy] (134) File fe48a1be-0a4e-482e-aa09-3ec21215756d: A retriable error occurred while attempting to download a result file from the cloud store but the retry limit had been exceeded. Error Detail: File fe48a1be-0a4e-482e-aa09-3ec21215756d: An error had occurred while attempting to download the result file.The error is: Couldn't resolve host name. Since the connection has been configured to consider all result file download as retriable errors, we will attempt to retry the download.'.

However, all it takes to solve this is adding EnableQueryResultDownload='0' to the connection string or the dbConnect() call. Then all rows will be returned.
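For reference, here's a minimal sketch of what that looks like with odbc; the httpPath and table name are placeholders, and I'm assuming extra named arguments to dbConnect() get passed through to the connection string:

```r
library(DBI)

# Minimal sketch; the warehouse path and table name below are placeholders.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/<warehouse-id>",
  # Disables Cloud Fetch, so large results come back through the driver
  # rather than being downloaded from cloud storage.
  EnableQueryResultDownload = "0"
)

# A query returning well over ~40k rows now comes back complete.
res <- dbGetQuery(con, "SELECT * FROM some_large_table")
```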

This isn't found anywhere within Databricks' documentation, and figuring it out involves a fair bit of Stack Overflow sleuthing. It's also very different behavior compared to many other DBI connections, which could lead to extra confusion. I think it would be helpful to include this tip in the documentation for odbc::databricks().

@atheriel (Collaborator)

Are there any potential downsides to setting this by default?

@MCMaurer (Author)

I couldn't find much about the setting, given that it's not even in Databricks' own documentation. I would be hesitant to override Databricks' default behavior though, since it also applies to pyodbc and other ODBC connections. I think keeping the default behavior but adding it to the list of documented arguments would be a good balance.

@simonpcouch (Collaborator)

It looks like EnableQueryResultDownload toggles "Cloud Fetch", which is some sort of query optimization technique for BI tools. I think I agree that it's probably not optimal for us to set this by default.

I'd be tempted to try and catch + rethrow this error with a more informative message, telling users to set that parameter. There's not much in this error message that feels specific to this issue, but it does seem like "[Simba][Hardy] (134)" and "cloud store but the retry limit" are enough to narrow down the search results to only those with a solution pointing to EnableQueryResultDownload.
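Roughly something like this (fetch_with_hint() is just an illustrative wrapper, not a proposed API, and the exact substrings to match on are a guess):

```r
# Rough sketch of the catch-and-rethrow idea; not odbc's actual implementation.
fetch_with_hint <- function(con, sql) {
  tryCatch(
    DBI::dbGetQuery(con, sql),
    error = function(cnd) {
      msg <- conditionMessage(cnd)
      looks_like_cloud_fetch <-
        grepl("[Simba][Hardy] (134)", msg, fixed = TRUE) &&
        grepl("cloud store but the retry limit", msg, fixed = TRUE)
      if (looks_like_cloud_fetch) {
        stop(
          paste0(
            msg, "\n",
            "This looks like a Cloud Fetch download failure. ",
            "Try adding EnableQueryResultDownload = '0' to the connection."
          ),
          call. = FALSE
        )
      }
      stop(cnd)  # rethrow anything else unchanged
    }
  )
}
```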

@MCMaurer (Author)

I think throwing a more informative error and suggesting this setting would be great. You're right that with some good error message copying and googling, a relatively savvy user could figure it out, but I figure it wouldn't hurt to save folks that step.

I'm thinking just from a user perspective, the expectation is that the success of a query made via DBI/odbc won't depend on the number of rows returned, other than the possibility of running out of memory in R. The default for databricks() doesn't meet this expectation, which isn't the worst thing, as long as users can pretty quickly get to the expected behavior.

It's also weird that while Cloud Fetch is mentioned wrt Databricks ODBC in the Azure documentation:

Cloud Fetch is only used for query results larger than 1 MB. Smaller results are retrieved directly from Azure Databricks.

There is no mention of the parameter that needs to be set in the connection string.

@hadley (Member) commented Oct 18, 2024

We've pinged our internal Databricks contacts, so we'll wait to hear back from them before we implement anything.

@zacdav-db

@MCMaurer we don't recommend setting this by default, as it disables a significant performance improvement for larger result sets.
Additionally, it's not something we publicly document, as we consider this setting a last resort if the underlying issue cannot be resolved.

We'd love to help get to the bottom of the underlying cause for the data not appearing, which, from the error, is likely networking related.

... The error is: Couldn't resolve host name ...

If you have a dedicated Databricks contact already, please reach out and submit a ticket; otherwise we can try to arrange a way to get to the bottom of it.

@MCMaurer (Author) commented Oct 22, 2024

Thanks all, I just got contacted by our organization's Databricks rep. I'll update here if anything pertinent comes up in our conversation.
