Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/query to arrow table #11

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

nrlugg
Copy link

@nrlugg nrlugg commented Nov 10, 2023

Description

databricks.sql allows queries to be fetched as pyarrow.Tables, which retain the exact data types as they exist in the datalake schema. These can then be easily converted to pandas.DataFrames, retaining as much type information is possible when converting to Pandas/Numpy.

Changes

  • added DatabricksClient.get_arrow_table()
  • this works basically the same as DatabricksClient.get_df() except:
    • it outputs a pyarrow.Table
    • it uses cursor.fetchall_arrow() interally (c.f. cursor.fetchall())

Changes -- rejected

  • I've added an optional argument format: Literal["python", "pandas", "pyarrow"] to DatabricksClient.query() to allow the query result to be output in the desired format.
  • The default is format="python" so that calls to DatabricksClient.query() with no format argument should return the same result as before the change.
  • I've left DatabricksClient.get_df() unchanged so that there are no unexpected breakages due to change in type (c.f. using query(..., format="pandas") which will use pyarrow.Table.to_pandas() to fetch the pandas.DataFrame)

@nrlugg nrlugg force-pushed the feature/query-to-arrow-table branch 2 times, most recently from 5fe6197 to 113c12b Compare November 10, 2023 07:24
Copy link
Contributor

@capitancambio capitancambio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I'm 100% percent against this, but I wonder if you could just convert to arrow on your side. also keep in mind that 100% of the data analysts will be able to understand what's going on with pandas but not so much with arrow, adding cognitive load to the rest of the team for a personal taste is something to keep in mind. If the client code if something that only you will be using then fine.

In any case, could you please add test coverage for the new functionality in the query method?

@nrlugg
Copy link
Author

nrlugg commented Nov 14, 2023

Understood about the cognitive load -- I tried to hide the ability to output PyArrow Tables and keep the existing method for getting Pandas DataFrames so that anyone unfamiliar with PyArrow does not have to use it at any point.

For reference, despite what I said in Slack, the main reason for suggesting this is not (entirely) personal taste: fetching the PyArrow tables retains the exact data types as they appear in the SQL server which I though might be beneficial. But if the (admittedly small) benefit doesn't outweigh the potential negatives of cognitive load, then I'm more than happy to just close this PR ;) In any case, I'll add the tests.

Alternatively, if you think it's less intrusive, I could just add a method like get_arrow_table()

@alexmalins
Copy link

Alternatively, if you think it's less intrusive, I could just add a method like get_arrow_table()

+1 for this way of doing things 👍 having separate methods for each return type is cleaner IMV than overloading and having a single method that can return different data type objects

@nrlugg nrlugg force-pushed the feature/query-to-arrow-table branch from 113c12b to f8a8df0 Compare November 14, 2023 07:10
@nrlugg nrlugg marked this pull request as ready for review November 14, 2023 07:14
@nrlugg
Copy link
Author

nrlugg commented Nov 14, 2023

I've completely refactored the PR which now simply implements a get_arrow_table() method (plus tests)

@nrlugg nrlugg requested a review from capitancambio November 16, 2023 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants