Document EnableQueryResultDownload for databricks() #856

Open
MCMaurer opened this issue Oct 17, 2024 · 7 comments

@MCMaurer

Without setting EnableQueryResultDownload='0' in the Databricks connection string, queries returning more than roughly 40k rows will fail with no data and a relatively cryptic error message:

[Simba][Hardy] (35) Error from server: error code: '0' error message: '[Simba][Hardy] (134) File fe48a1be-0a4e-482e-aa09-3ec21215756d: A retriable error occurred while attempting to download a result file from the cloud store but the retry limit had been exceeded. Error Detail: File fe48a1be-0a4e-482e-aa09-3ec21215756d: An error had occurred while attempting to download the result file.The error is: Couldn't resolve host name. Since the connection has been configured to consider all result file download as retriable errors, we will attempt to retry the download.'.

However, all it takes to solve this is adding EnableQueryResultDownload='0' to the connection string or the dbConnect() call. Then all rows will be returned.
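For reference, here's a minimal sketch of what that looks like with odbc; the httpPath and table name are placeholders, and I'm assuming extra named arguments to dbConnect() get passed through to the connection string:

```r
library(DBI)

# Minimal sketch; the warehouse path and table name below are placeholders.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/<warehouse-id>",
  # Disables Cloud Fetch, so large results come back through the driver
  # rather than being downloaded from cloud storage.
  EnableQueryResultDownload = "0"
)

# A query returning well over ~40k rows now comes back complete.
res <- dbGetQuery(con, "SELECT * FROM some_large_table")
```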

This isn't found anywhere within Databricks' documentation, and figuring it out involves a fair bit of Stack Overflow sleuthing. It's also very different behavior compared to many other DBI connections, which could lead to extra confusion. I think it would be helpful to include this tip in the documentation for odbc::databricks().

@atheriel (Collaborator)

Are there any potential downsides to setting this by default?

@MCMaurer (Author)

I couldn't find much about the setting, given that it's not even in Databricks' own documentation. I would be hesitant to override Databricks' default behavior though, since it also applies to pyodbc and other ODBC connections. I think keeping the default behavior but adding it to the list of documented arguments would be a good balance.

@simonpcouch (Collaborator)

It looks like EnableQueryResultDownload toggles "Cloud Fetch", which is some sort of query optimization technique for BI tools. I think I agree that it's probably not optimal for us to set this by default.

I'd be tempted to try and catch + rethrow this error with a more informative message, telling users to set that parameter. There's not much in this error message that feels specific to this issue, but it does seem like "[Simba][Hardy] (134)" and "cloud store but the retry limit" are enough to narrow down the search results to only those with a solution pointing to EnableQueryResultDownload.
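Roughly something like this (fetch_with_hint() is just an illustrative wrapper, not a proposed API, and the exact substrings to match on are a guess):

```r
# Rough sketch of the catch-and-rethrow idea; not odbc's actual implementation.
fetch_with_hint <- function(con, sql) {
  tryCatch(
    DBI::dbGetQuery(con, sql),
    error = function(cnd) {
      msg <- conditionMessage(cnd)
      looks_like_cloud_fetch <-
        grepl("[Simba][Hardy] (134)", msg, fixed = TRUE) &&
        grepl("cloud store but the retry limit", msg, fixed = TRUE)
      if (looks_like_cloud_fetch) {
        stop(
          paste0(
            msg, "\n",
            "This looks like a Cloud Fetch download failure. ",
            "Try adding EnableQueryResultDownload = '0' to the connection."
          ),
          call. = FALSE
        )
      }
      stop(cnd)  # rethrow anything else unchanged
    }
  )
}
```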

@MCMaurer (Author)

I think throwing a more informative error and suggesting this setting would be great. You're right that with some good error message copying and googling, a relatively savvy user could figure it out, but I figure it wouldn't hurt to save folks that step.

I'm thinking just from a user perspective, the expectation is that the success of a query made via DBI/odbc won't depend on the number of rows returned, other than the possibility of running out of memory in R. The default for databricks() doesn't meet this expectation, which isn't the worst thing, as long as users can pretty quickly get to the expected behavior.

It's also weird that while Cloud Fetch is mentioned wrt Databricks ODBC in the Azure documentation:

Cloud Fetch is only used for query results larger than 1 MB. Smaller results are retrieved directly from Azure Databricks.

There is no mention of the parameter that needs to be set in the connection string.

@hadley (Member) commented Oct 18, 2024

We've pinged our internal Databricks contacts, so we'll wait to hear back from them before we implement anything.

@zacdav-db

@MCMaurer we don't recommend setting this by default, as it disables a significant performance improvement for larger result sets.
Additionally, it's not something we publicly document, as we consider this setting a last resort if the underlying issue cannot be resolved.

We'd love to help get to the bottom of the underlying cause for the data not appearing, which, from the error, is likely networking related.

... The error is: Couldn't resolve host name ...

If you have a dedicated Databricks contact already, please reach out and submit a ticket; otherwise we can try to arrange a way to get to the bottom of it.

@MCMaurer (Author) commented Oct 22, 2024

Thanks all, I just got contacted by our organization's Databricks rep. I'll update here if anything pertinent comes up in our conversation.
