feat(datasets): Implement spark.GBQQueryDataset
for reading data from BigQuery as a Spark DataFrame using a SQL query
#971
base: main
Conversation
So this is a great first start - the credentials resolution looks complicated, but also well thought out. I think we'd need to see some tests for this to go in as is; you can take some inspiration from the pandas equivalent. That being said, we could also look at contributing this to the experimental part of `kedro-datasets`.
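For what it's worth, one way to test a Spark-backed dataset without a real GCP project is to patch the Spark entry point rather than a BigQuery client. A rough pytest sketch, where the import path, the patched helper name, and the constructor arguments are all assumptions for illustration, not the PR's actual code:

```python
from unittest import mock

# Hypothetical import path for the class added in this PR.
from kedro_datasets.spark import GBQQueryDataset


def test_load_uses_bigquery_connector():
    spark_mock = mock.MagicMock()
    # Assumed internal hook for obtaining the SparkSession; the real PR may differ.
    with mock.patch(
        "kedro_datasets.spark.gbq_dataset._get_spark", return_value=spark_mock
    ):
        dataset = GBQQueryDataset(
            sql="SELECT 1 AS dummy",                     # parameter name assumed
            materialization_dataset="test_mat_dataset",  # parameter name assumed
        )
        dataset.load()

    # A load is expected to go through the spark-bigquery connector.
    spark_mock.read.format.assert_called_once_with("bigquery")
```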
Thank you @abhi8893, this is very promising! We will look at this soon.
Thanks @datajoely, @astrojuanlu!

**Credentials handling**

Yes, the credentials handling is a bit different from the rest of the datasets. Earlier, I had been reading BigQuery tables using `spark.SparkDataset`:

```yaml
my_dataset:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: /tmp/my_table.parquet  # Just because it's a non-optional arg with type `str`
  load_args:
    table: "<my_project_id>.<my_dataset>.<my_table>"
```

with credentials supplied either through the environment:

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

or through the Spark configuration:

```yaml
spark.hadoop.google.cloud.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/credentials.json
```

With the above dataset, I wanted to allow passing credentials directly to the dataset. But it seems we may have to standardize this a little bit across the other Kedro datasets for the GCP case.

**Implementing tests**

Let me take a look at how tests can be implemented for this. Initial thoughts: since this dataset doesn't involve a BigQuery client, the method of mocking (as in the pandas equivalent) may not carry over directly.

**Moving to experimental**

For moving this to experimental, let me know and I'll lift and shift this to the experimental part of the repo.
Added some tests. Related to credentials, one thing that isn't implemented here is reading the SQL query from a text file, which may be placed anywhere, so reading the SQL text file itself may require credentials. This is already implemented in `pandas.GBQQueryDataset`. So this can necessitate providing two types of credentials? Or maybe just one could suffice, and use it to authenticate with BigQuery as well as any storage backend?
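For reference, the existing `pandas.GBQQueryDataset` accepts either inline `sql` or a `filepath` pointing at a SQL file that is fetched through fsspec, with storage options supplied separately. Roughly, based on my reading of `kedro_datasets.pandas` (worth double-checking against the installed version):

```python
from kedro_datasets.pandas import GBQQueryDataset

# Inline SQL: only BigQuery credentials are involved.
ds_inline = GBQQueryDataset(
    sql="SELECT * FROM `project.dataset.table`",
    project="my-project",
)

# SQL stored on a remote filesystem: fetching the file itself may need
# storage credentials, passed separately via fs_args.
ds_from_file = GBQQueryDataset(
    filepath="gs://my-bucket/queries/report.sql",
    project="my-project",
    fs_args={"token": "/path/to/gcs-credentials.json"},  # assumed option shape
)
```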
So, now I have implemented passing separate credentials for BigQuery and for the filesystem that holds the SQL file. I feel the credentials handling is a bit unstandardized across datasets; however, I have done what suits this case best. Found a few Kedro discussions on credentials covering part of what makes it tricky.

Happy to contribute to standardizing credentials handling in Kedro as we move forward. However, I do realise this would require quite a bit of research (and user surveys) to perfect.
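Concretely, the "separate credentials" arrangement might look like the sketch below; the nesting keys and parameter names are hypothetical illustrations of the idea, not the exact interface implemented in this PR:

```python
# Hypothetical shape for split credentials; key names are illustrative only.
from kedro_datasets.spark import GBQQueryDataset

dataset = GBQQueryDataset(
    filepath="gs://my-bucket/queries/report.sql",  # SQL file on a storage backend
    materialization_dataset="test_mat_dataset",    # assumed parameter
    credentials={
        "bigquery": {"file": "/path/to/bq-credentials.json"},  # for running the query
        "fs": {"token": "/path/to/gcs-credentials.json"},      # for fetching the SQL file
    },
)
```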
Description

- `spark.SparkDataset` does not support reading data from BigQuery using a SQL query.
- Adding SQL-query support to `spark.SparkDataset` may not comply with `kedro_datasets` design principles, as it would require making `filepath` an optional argument.
- Like `pandas.GBQQueryDataset`, the `spark.GBQQueryDataset` is also a read-only dataset, so this implementation is better suited to maintaining the overall design of the datasets.

Development notes
To test the dataset, the following is the manual way:

1. Set up a GCP project (`project_id`)
2. Create a test BigQuery dataset: `<project_id>.<test_dataset>`
3. Create a materialization BigQuery dataset: `<project_id>.<test_mat_dataset>`
4. Create a test BigQuery table: `<project_id>.<test_dataset>.<test_table>`
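For the setup steps above, the standard `google-cloud-bigquery` client can create the resources programmatically; a sketch using the placeholder names from the list (replace them with real identifiers):

```python
from google.cloud import bigquery

client = bigquery.Client(project="<project_id>")

# Create the source and materialization datasets (no-ops if they already exist).
client.create_dataset("<test_dataset>", exists_ok=True)
client.create_dataset("<test_mat_dataset>", exists_ok=True)

# Seed a small table to query against.
client.query(
    "CREATE OR REPLACE TABLE `<project_id>.<test_dataset>.<test_table>` "
    "AS SELECT 1 AS id, 'hello' AS message"
).result()
```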
Checklist

- Updated `jsonschema/kedro-catalog-X.XX.json` if necessary
- Updated the `RELEASE.md` file