Feature/query to arrow table #11

nrlugg · 2023-11-10T06:28:09Z

Description

databricks.sql allows query results to be fetched as pyarrow.Tables, which retain the exact data types defined in the datalake schema. These can then be easily converted to pandas.DataFrames, retaining as much type information as possible when converting to Pandas/Numpy.

Changes

added DatabricksClient.get_arrow_table()
this works basically the same as DatabricksClient.get_df() except:
- it outputs a pyarrow.Table
- it uses cursor.fetchall_arrow() internally (cf. cursor.fetchall())
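
For readers unfamiliar with the connector, the difference described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it is written as a free function rather than a DatabricksClient method, the `connection` argument is assumed to be a databricks.sql connection (or any object with the same cursor interface), and the function body is an assumption about how such a method could look.

```python
from typing import Any


def get_arrow_table(connection: Any, query: str) -> Any:
    """Run `query` and return the full result set as a pyarrow.Table.

    `connection` is assumed to behave like a databricks.sql connection:
    `connection.cursor()` returns a cursor usable as a context manager.
    """
    with connection.cursor() as cursor:
        cursor.execute(query)
        # fetchall_arrow() returns a pyarrow.Table, whereas fetchall()
        # would return a list of Row objects.
        return cursor.fetchall_arrow()
```

The returned table can then be converted with pyarrow.Table.to_pandas() by callers who want a DataFrame.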

Changes -- rejected

I've added an optional argument format: Literal["python", "pandas", "pyarrow"] to DatabricksClient.query() to allow the query result to be output in the desired format.
The default is format="python" so that calls to DatabricksClient.query() with no format argument should return the same result as before the change.
I've left DatabricksClient.get_df() unchanged so that there are no unexpected breakages due to change in type (c.f. using query(..., format="pandas") which will use pyarrow.Table.to_pandas() to fetch the pandas.DataFrame)
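
For context, the rejected design amounts to dispatching on the format argument inside a single method. The sketch below is only an illustration of that idea under stated assumptions: `query_with_format` is a hypothetical free function (not the actual DatabricksClient.query() code), and `connection` is assumed to expose the databricks.sql cursor interface.

```python
from typing import Any, Literal

ResultFormat = Literal["python", "pandas", "pyarrow"]


def query_with_format(
    connection: Any,
    sql: str,
    format: ResultFormat = "python",  # name taken from the PR; shadows the builtin
) -> Any:
    """Hypothetical single entry point whose return type depends on `format`."""
    if format not in ("python", "pandas", "pyarrow"):
        raise ValueError(f"unsupported format: {format!r}")

    with connection.cursor() as cursor:
        cursor.execute(sql)
        if format == "python":
            # Default: same result as before the change (a list of rows).
            return cursor.fetchall()
        table = cursor.fetchall_arrow()  # pyarrow.Table
        if format == "pyarrow":
            return table
        # format == "pandas": convert via Arrow to keep type information.
        return table.to_pandas()
```

The return annotation has to be Any (or a union) because one function yields three different types, which is the overloading concern raised later in the review.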

capitancambio

Not that I'm 100% against this, but I wonder if you could just convert to arrow on your side. Also keep in mind that 100% of the data analysts will be able to understand what's going on with pandas, but not so much with arrow; adding cognitive load to the rest of the team for the sake of personal taste is something to keep in mind. If the client code is something that only you will be using, then fine.

In any case, could you please add test coverage for the new functionality in the query method?

nrlugg · 2023-11-14T00:07:03Z

Understood about the cognitive load -- I tried to hide the ability to output PyArrow Tables and keep the existing method for getting Pandas DataFrames so that anyone unfamiliar with PyArrow does not have to use it at any point.

For reference, despite what I said in Slack, the main reason for suggesting this is not (entirely) personal taste: fetching the PyArrow tables retains the exact data types as they appear in the SQL server, which I thought might be beneficial. But if the (admittedly small) benefit doesn't outweigh the potential negatives of cognitive load, then I'm more than happy to just close this PR ;) In any case, I'll add the tests.

Alternatively, if you think it's less intrusive, I could just add a method like get_arrow_table()

alexmalins · 2023-11-14T02:40:00Z

> Alternatively, if you think it's less intrusive, I could just add a method like get_arrow_table()

+1 for this way of doing things 👍 having separate methods for each return type is cleaner IMV than overloading and having a single method that can return different data type objects

nrlugg · 2023-11-14T07:15:30Z

I've completely refactored the PR, which now simply implements a get_arrow_table() method (plus tests).

nrlugg force-pushed the feature/query-to-arrow-table branch 2 times, most recently from 5fe6197 to 113c12b · November 10, 2023 07:24

capitancambio requested changes Nov 13, 2023

Nathan Lugg added 3 commits November 14, 2023 15:58

typing: use Row instead of tuple

a1a8c21

add get_arrow_table() method

0765cc0

add test

f8a8df0

nrlugg force-pushed the feature/query-to-arrow-table branch from 113c12b to f8a8df0 · November 14, 2023 07:10

formatting

a60cfd6

nrlugg marked this pull request as ready for review November 14, 2023 07:14

nrlugg requested a review from capitancambio November 16, 2023 00:38
