Support SparkDataset authentication via Unity Catalog and Databricks external locations #836
Comments
Not directly, because I use external tables and `dynamicPartitionOverwrite`, which don't seem to be supported. I could probably create a custom dataset.

I think it would be great to have an opinionated way of easily integrating kedro with the latest Databricks features (Unity Catalog, workflows, external locations, databricks-connect, Databricks-hosted MLflow, etc.), as Databricks is the most common ML platform used with kedro (used by 43% of kedro users). If you have any ideas in mind, I can try to help with discussions or implementation.
I think this PR will potentially resolve this?
@MigQ2, I think #827 is a good direction for a few reasons:

If #827 is merged, would that be enough to solve your problem?
I agree, merging #827 would give me a working solution. Still, it would be nice to align both datasets in the future, but that wouldn't be a blocker.
Context
Currently, the preferred method of authenticating with a data lake or cloud storage from Databricks is via Unity Catalog and external locations, rather than authenticating to the storage directly.

If properly configured, Spark on Databricks (or via databricks-connect) can read from cloud storage without an explicit key or other direct authentication to the storage, which is safer, more auditable, and allows more granular access control.
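For illustration, this is roughly what Unity Catalog-based access would ideally look like in a kedro catalog entry (the dataset name and storage path below are hypothetical): there is no `credentials` key at all, because access is expected to come from the external location grant.

```yaml
# Hypothetical catalog.yml entry: no credentials block; the abfss://
# path is assumed to be covered by a Unity Catalog external location.
weather_data:
  type: spark.SparkDataset
  filepath: abfss://container@account.dfs.core.windows.net/path/to/data
  file_format: parquet
```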
Description
When using Azure and `abfss://` paths, the current `SparkDataset` implementation tries to connect to the storage directly, using fsspec and a credential, when initializing the dataset. It therefore forces me to give my kedro project a credential for the `abfss://` ADLS storage.

I want my kedro project to read and write with Spark using Unity Catalog external location authentication, without the project itself having direct access to the underlying storage.
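To make the complaint concrete, the protocol-sniffing that a filesystem-backed dataset performs at init time looks roughly like this (a simplified sketch, not the real kedro code): the mere act of constructing the dataset resolves the storage protocol so an fsspec filesystem, which needs credentials for `abfss://`, can be built.

```python
from urllib.parse import urlsplit


def infer_protocol(filepath: str) -> str:
    """Simplified sketch of what a filesystem-backed dataset does in
    __init__: split the protocol off the path. In the real flow this
    protocol is then passed to fsspec.filesystem(protocol, **credentials),
    which is exactly the step that demands a storage credential."""
    scheme = urlsplit(filepath).scheme
    return scheme or "file"
```

For an `abfss://` path this returns `"abfss"`, so dataset construction cannot proceed without a credential even if all actual I/O would have gone through Spark.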
I'm not clear on why `SparkDataset` needs to initialize the filesystem. It seems to be used later in `_load_schema_from_file()`, but I'm not clear on why that is needed.

Possible Implementation
Would it be possible to remove all fsspec interactions with the data entirely and do everything via Spark?
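As a rough sketch of that idea (not kedro's actual implementation; the class and its internals are invented here for illustration), a dataset that delegates all I/O to Spark and never touches fsspec could look like this. Storage authentication is then entirely the cluster's concern, via Unity Catalog.

```python
from typing import Optional


class SparkOnlyDataset:
    """Hypothetical sketch: every read and write is delegated to the
    active SparkSession, so storage access is governed by Unity
    Catalog / cluster auth and no fsspec filesystem is ever created."""

    def __init__(
        self,
        filepath: str,
        file_format: str = "parquet",
        load_args: Optional[dict] = None,
        save_args: Optional[dict] = None,
    ) -> None:
        # Note: no fsspec.filesystem(...) call here -- the path is
        # stored as-is and only ever handed to Spark.
        self._filepath = filepath
        self._file_format = file_format
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})

    @staticmethod
    def _get_spark():
        # Imported lazily so the dataset can be constructed (e.g. when
        # the catalog is built) without a running Spark session.
        from pyspark.sql import SparkSession

        return SparkSession.builder.getOrCreate()

    def load(self):
        # spark.read resolves abfss:// through the cluster's Unity
        # Catalog external location, not through a project credential.
        return (
            self._get_spark()
            .read.format(self._file_format)
            .options(**self._load_args)
            .load(self._filepath)
        )

    def save(self, df) -> None:
        df.write.format(self._file_format).options(**self._save_args).save(
            self._filepath
        )
```

The open question from above would still need answering: `_load_schema_from_file()` would either have to read the schema through Spark as well, or accept the schema inline, for fsspec to be dropped completely.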