Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snowflake: runtime driver config checks #857

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

detule
Copy link
Collaborator

@detule detule commented Oct 28, 2024

Hi @simonpcouch:

Following up on a conversation we had a few weeks ago where I noticed that the simba snowflake drivers on macOS suffer from similar issues as the ones you ran into with databricks. In particular, and this is one that bit me as well, the simba configuration file (.ini) needs to specify DriverManagerEncoding=UTF-16. The OEM SNOWFLAKE driver, for example, looks to be configured for iODBC, and has a different value that doesn't work for us.

In this PR I expanded some of the checks you had written for databricks to snowflake as well, with the following changes:

  • Rather than modifying the config by default, I made the snowflake code path throw a warning. Left the databricks default behavior unchanged.
  • Added an option to disable to config check altogether, for both.
  • Moved the config check ( whether warn or modify ) further down the call stack so as to be able to read the decided driver. This to avoid a situation where we may be warning about, or modifying one driver, but the user has specified a different one.

The last bullet in particular, I think is why the diff ended up being (much) larger than I originally would have anticipated.

Rather than duplicating methods, tried to make some of the methods "simba" generic ( tried to keep those in utils.R ), while others that were specific one of the two back ends I moved to driver-*.R. We are close to perhaps needing to make some of those specific methods S3 or S4, but at this point IMO, that seems like a bit of an overkill.

Cheers.

TODO:

  • Add news item

@hadley
Copy link
Member

hadley commented Oct 29, 2024

@atheriel can you take a look at this too please?

Copy link
Collaborator

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first pass through! I'm really happy we're aiming to resolve this issue for Snowflake users, too.

I said this in PM as well but wanted to reiterate once more here in case other folks support (feel free to ignore and I won't bug you any longer😜): I think the same issue should lead to the same outcome across drivers, and I'd argue that we ought to just edit the file as we do with Databricks. Especially given that this PR introduces odbc.no_config_override (on board!), I'm fine with the more invasive approach given that it allows for this connection to "just work" for users. No longer needing to manage the different behavior for difference DBMS', I think, would greatly simplify this PR as well.

R/utils.R Outdated Show resolved Hide resolved
R/utils.R Outdated Show resolved Hide resolved
R/utils.R Outdated Show resolved Hide resolved
R/utils.R Outdated Show resolved Hide resolved
R/utils.R Outdated
# performs.
# 3. If action == "modify" then we attempt to modify the config in-situ.
# 4. Otherwise we throw a warning asking the user to revise.
configure_simba <- function(locate_config_callback = (function() character()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
configure_simba <- function(locate_config_callback = (function() character()),
configure_simba <- function(locate_config_callback = character(),

Poking at this for a bit, it's unclear to me why this is a function rather than a character string? In this function, we just call the inputted function which is inlined as returning a character everywhere it's currently used, and then use that character result as usual. Maybe this function just takes the config file as a string and we pass e.g. snowflake_simba_config(args$driver) rather than function() snowflake_simba_config(args$driver)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Simon:

Wrote it as a callback because in may cases ( not macos, or no_config_override is set ) it may not get executed at all. Can move those checks outside of the method, but it felt like they belong within ( and they would need to be duplicated for each call ).

Let me know if you think it's worth trying to chase this down further / refactor more.

Copy link
Collaborator

@simonpcouch simonpcouch Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! Since those callbacks take some time (and could even raise some unanticipated error), you opt to wait to evaluate the code until you're sure you need its result.

We're actually lucky here in that R will evaluate function arguments lazily. So, if you pass a function call as an argument to f(), do some checks, and then reference the result of the function call after the checks, the function call will never actually evaluate if the check returns beforehand. e.g.:

fn_a <- function() {
  cli::cli_abort("Stoppp!")
}

fn_b <- function(x) {
  if (TRUE) {
    cli::cli_abort("{.fun fn_a} hasn't evaluated yet!")
  }
  
  x
}

fn_b(x = fn_a())
#> Error in `fn_b()`:
#> ! `fn_a()` hasn't evaluated yet!

Created on 2024-11-12 with reprex v2.1.1

...the implication here being that we can pass snowflake_simba_config(args$driver) directly as the function argument and it won't actually run until those Not macOS or No no_config_override checks have evaluated.

R/driver-databricks.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@atheriel atheriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think efforts to make drivers "just work" are worthwhile for sure, so in that sense I'm supportive. However, it's a little unclear to me from the code whether this only applies to macOS users or if it's designed to help Linux users, too.

If the former, I think the code should be more explicit about that. If the latter, I think the Linux paths are missing from spark_simba_config() and snowflake_simba_config().

tests/testthat/test-utils.R Show resolved Hide resolved
R/utils.R Outdated Show resolved Hide resolved
@detule
Copy link
Collaborator Author

detule commented Nov 10, 2024

A first pass through! I'm really happy we're aiming to resolve this issue for Snowflake users, too.

I said this in PM as well but wanted to reiterate once more here in case other folks support (feel free to ignore and I won't bug you any longer😜): I think the same issue should lead to the same outcome across drivers, and I'd argue that we ought to just edit the file as we do with Databricks. Especially given that this PR introduces odbc.no_config_override (on board!), I'm fine with the more invasive approach given that it allows for this connection to "just work" for users. No longer needing to manage the different behavior for difference DBMS', I think, would greatly simplify this PR as well.

Thanks Simon - appreciate you taking a look. Apologies for misinterpreting your earlier note on whether to modify in place or just warn. I changed snowflake so that the behavior is the same as databricks ( modify ).

I did leave the warn code in the method, in case we want to pursue that option in the future. However, happy to remove if you have concerns.

@detule
Copy link
Collaborator Author

detule commented Nov 10, 2024

I think efforts to make drivers "just work" are worthwhile for sure, so in that sense I'm supportive. However, it's a little unclear to me from the code whether this only applies to macOS users or if it's designed to help Linux users, too.

If the former, I think the code should be more explicit about that. If the latter, I think the Linux paths are missing from spark_simba_config() and snowflake_simba_config().

Hey @atheriel - appreciate the feedback.

It's the former - at least for now. Most of these issues are coming from the fact that vendors seem to be distributing MacOS drivers that, out-of-the-box, are configured to work when paired with iODBC, rather than unixODBC. Our package, can't switch between unixODBC and iODBC at run-time; rather, that's determined at build-time ( and is configured for unixODBC). I haven't seen these issues with the OEM drivers for Linux.

At some point, wouldn't mind doing some investigative work to determine what it would take to be able to adjust our posture relative to the driver managers more dynamically.

  • The configure_simba method opens with
if (!is_macos()) {
  return(invisible())
}
  • There is a mention in the user documentation I added, that these checks only happen on 'macOS'.

  • Based on your feedback, I also added comments prior to the invocation in driver-snowflake|databricks to remind the reader of the same.

Let me know if you had something else in mind.
Thanks again

lines[matching_lines_loc] <- replacement
}
lines
return(list("new_lines" = lines, "modified" = !found_ok))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return(list("new_lines" = lines, "modified" = !found_ok))
return(list(new_lines = lines, modified = !found_ok))

matching_lines <- lines[matching_lines_loc]
found_ok = length(matching_lines) != 0 &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
found_ok = length(matching_lines) != 0 &&
found_ok <- length(matching_lines) != 0 &&

cli::cli_warn(c(
i = "Detected potentially unsafe driver settings.
Please consider revising the {.arg ODBCInstLib} field in
{simba_config} and setting its value to {unixodbc_install}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{simba_config} and setting its value to {unixodbc_install}"
{.file {simba_config}} and setting its value to {unixodbc_install}."

To make the filename a clickable link in RStudio/Positron!

# driver argument could be an outright path, or a name
# of a driver specified in odbcinst.ini Try to discern
driver_spec <- subset(odbcListDrivers(), name == driver)
if (nrow(driver_spec) ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (nrow(driver_spec) ) {
if (nrow(driver_spec)) {

Warning:
i Detected potentially unsafe driver settings. Please consider revising the `ODBCInstLib` field in simba.sparkodbc.ini and setting its value to libodbcinst.dylib
Warning:
i Detected potentially unsafe driver settings. Please consider revising the `DriverManagerEncoding` field in simba.sparkodbc.ini and setting its value to 'UTF-16'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this warning is duplicated in practice?

R/utils.R Outdated
# performs.
# 3. If action == "modify" then we attempt to modify the config in-situ.
# 4. Otherwise we throw a warning asking the user to revise.
configure_simba <- function(locate_config_callback = (function() character()),
Copy link
Collaborator

@simonpcouch simonpcouch Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! Since those callbacks take some time (and could even raise some unanticipated error), you opt to wait to evaluate the code until you're sure you need its result.

We're actually lucky here in that R will evaluate function arguments lazily. So, if you pass a function call as an argument to f(), do some checks, and then reference the result of the function call after the checks, the function call will never actually evaluate if the check returns beforehand. e.g.:

fn_a <- function() {
  cli::cli_abort("Stoppp!")
}

fn_b <- function(x) {
  if (TRUE) {
    cli::cli_abort("{.fun fn_a} hasn't evaluated yet!")
  }
  
  x
}

fn_b(x = fn_a())
#> Error in `fn_b()`:
#> ! `fn_a()` hasn't evaluated yet!

Created on 2024-11-12 with reprex v2.1.1

...the implication here being that we can pass snowflake_simba_config(args$driver) directly as the function argument and it won't actually run until those Not macOS or No no_config_override checks have evaluated.

Copy link
Collaborator

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Leaving review as Comment one more time just to get eyes on the callback revision before sending this one in. Thanks for making this happen!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants