Feature/query to arrow table #11
base: master
Conversation
Not that I'm 100% against this, but I wonder if you could just convert to Arrow on your side. Also keep in mind that 100% of the data analysts will be able to understand what's going on with pandas, but not so much with Arrow; adding cognitive load to the rest of the team for a personal taste is something to keep in mind. If the client code is something that only you will be using, then fine.

In any case, could you please add test coverage for the new functionality in the `query` method?
Understood about the cognitive load -- I tried to hide the ability to output PyArrow Tables and keep the existing method for getting Pandas DataFrames, so that anyone unfamiliar with PyArrow does not have to use it at any point. For reference, despite what I said in Slack, the main reason for suggesting this is not (entirely) personal taste: fetching the PyArrow tables retains the exact data types as they appear on the SQL server, which I thought might be beneficial. But if the (admittedly small) benefit doesn't outweigh the potential negatives of cognitive load, then I'm more than happy to just close this PR ;) In any case, I'll add the tests. Alternatively, if you think it's less intrusive, I could just add a method like …
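For what it's worth, the requested test coverage might look roughly like the sketch below. This is not code from this branch: the import path, the client's `_conn` attribute, and the assumption that the cursor is used as a context manager are all guesses about the implementation, made only to illustrate the shape of such a test.

```python
from unittest.mock import MagicMock

import pyarrow as pa

from databricks_client import DatabricksClient  # hypothetical import path


def test_arrow_fetch_returns_table_from_fetchall_arrow():
    expected = pa.table({"id": [1, 2, 3]})

    # Stub the databricks.sql cursor so no real warehouse is needed.
    cursor = MagicMock()
    cursor.fetchall_arrow.return_value = expected

    # Build a client without running the real __init__ / connect logic.
    client = DatabricksClient.__new__(DatabricksClient)
    client._conn = MagicMock()  # assumed attribute name
    client._conn.cursor.return_value.__enter__.return_value = cursor

    result = client.get_arrow_table("SELECT id FROM some_table")

    cursor.execute.assert_called_once_with("SELECT id FROM some_table")
    assert result.equals(expected)
```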
+1 for this way of doing things 👍 having separate methods for each return type is cleaner IMV than overloading and having a single method that can return different data type objects
I've completely refactored the PR, which now simply implements a …
Description

`databricks.sql` allows queries to be fetched as `pyarrow.Table`s, which retain the exact data types as they exist in the datalake schema. These can then be easily converted to `pandas.DataFrame`s, retaining as much type information as possible when converting to Pandas/NumPy.
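To make the type-fidelity point concrete, here is a small standalone illustration (not code from this PR; the column names and types are invented): an Arrow table keeps the exact column types declared by the source, while converting to pandas can widen or coerce them.

```python
import decimal

import pyarrow as pa

# An Arrow table with the kinds of types a warehouse schema might declare.
table = pa.table({
    "id": pa.array([1, 2, None], type=pa.int64()),
    "amount": pa.array(
        [decimal.Decimal("1.10"), decimal.Decimal("2.25"), None],
        type=pa.decimal128(10, 2),
    ),
})

print(table.schema)   # id: int64, amount: decimal128(10, 2) -- exact types kept
df = table.to_pandas()
print(df.dtypes)      # id: float64 (nulls force a cast), amount: object
```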
Changes

- Added `DatabricksClient.get_arrow_table()`, which works like `DatabricksClient.get_df()` except that it:
  - returns a `pyarrow.Table`
  - uses `cursor.fetchall_arrow()` internally (c.f. `cursor.fetchall()`)
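As a rough sketch only -- the client's constructor and connection handling are not shown in this PR, so everything here apart from the two method names and the two fetch calls is an assumption -- the pair of methods could sit side by side like this:

```python
import pandas as pd
import pyarrow as pa
from databricks import sql


class DatabricksClient:
    """Illustrative stand-in for the project's client; constructor args assumed."""

    def __init__(self, server_hostname: str, http_path: str, access_token: str):
        self._conn = sql.connect(
            server_hostname=server_hostname,
            http_path=http_path,
            access_token=access_token,
        )

    def get_arrow_table(self, query: str) -> pa.Table:
        # New method: fetch Arrow data so the server-side column types survive.
        with self._conn.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall_arrow()

    def get_df(self, query: str) -> pd.DataFrame:
        # Existing method, unchanged: fetch Python rows and build a DataFrame.
        with self._conn.cursor() as cursor:
            cursor.execute(query)
            rows = cursor.fetchall()
            columns = [col[0] for col in cursor.description]
            return pd.DataFrame(rows, columns=columns)
```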
Changes -- rejected

- Added a `format: Literal["python", "pandas", "pyarrow"]` argument to `DatabricksClient.query()` to allow the query result to be output in the desired format.
- Defaulted to `format="python"` so that calls to `DatabricksClient.query()` with no `format` argument should return the same result as before the change.
- Kept `DatabricksClient.get_df()` unchanged so that there are no unexpected breakages due to a change in return type (c.f. using `query(..., format="pandas")`, which will use `pyarrow.Table.to_pandas()` to fetch the `pandas.DataFrame`).
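For comparison, the rejected design would have looked roughly like the following. This is a sketch under the same assumptions as above; `_fetch_arrow()` is a hypothetical internal helper, and the exact pre-change return shape of `query()` is not shown in this PR.

```python
from typing import Literal, Union

import pandas as pd
import pyarrow as pa


class DatabricksClient:
    """Sketch of the rejected design: one query() whose return type depends on `format`."""

    def _fetch_arrow(self, statement: str) -> pa.Table:
        # Stand-in for executing the statement and calling cursor.fetchall_arrow().
        raise NotImplementedError

    def query(
        self,
        statement: str,
        format: Literal["python", "pandas", "pyarrow"] = "python",
    ) -> Union[list, pd.DataFrame, pa.Table]:
        table = self._fetch_arrow(statement)
        if format == "pyarrow":
            return table
        if format == "pandas":
            return table.to_pandas()
        # format="python": plain Python objects (a list of dicts here; the
        # original return shape of query() is not shown in the PR).
        return table.to_pylist()
```

The union return type above is essentially the reviewer's objection: a single method whose result type varies with an argument is harder to reason about than one method per return type, which is what the refactored PR settles on.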