
Support SparkDataset authentication via Unity Catalog and Databricks external locations #836

Open

MigQ2 opened this issue Sep 14, 2024 · 5 comments

Labels: Community Issue/PR opened by the open-source community

MigQ2 (Contributor) commented Sep 14, 2024

Context

Currently, the preferred method of authentication with a data lake or cloud storage when using Databricks is via Unity Catalog and external locations, not authenticating directly to the storage.

If properly configured, Databricks or databricks-connect lets Spark read from cloud storage without explicitly providing a key or any other direct authentication to the storage, which is safer, more auditable, and allows more granular access control.

Description

When using Azure and abfss:// paths, the current SparkDataset implementation tries to connect to the storage directly using fsspec and a credential when initializing the dataset.

This forces me to give my kedro project a credential for the abfss:// ADLS storage.
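
For illustration, here is a rough sketch of the pattern described above; it is not the actual kedro-datasets source, and the credential keys are hypothetical placeholders:

```python
# Rough sketch of the pattern described above, not the actual SparkDataset code.
# The credential keys below are hypothetical placeholders.
import fsspec

protocol = "abfss"  # parsed from the catalog filepath
credentials = {"account_name": "<storage-account>", "account_key": "<secret>"}

# Building the fsspec filesystem at dataset initialization is what forces the
# kedro project to hold a direct credential for the storage account.
fs = fsspec.filesystem(protocol, **credentials)
```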

I want my kedro project to read and write with Spark using Unity Catalog external location authentication, without having direct access to the underlying storage.

It's not clear to me why SparkDataset needs to initialize the filesystem at all. It seems to be used later in _load_schema_from_file(), but I don't see why that is needed.

Possible Implementation

Would it be possible to completely remove all fsspec interactions with the data and make it all via Spark?
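
For context, this is a minimal sketch of the Spark-only flow being asked for, assuming access is granted through a Unity Catalog external location; the abfss:// paths are placeholders:

```python
# Minimal sketch of a Spark-only read/write: no fsspec and no explicit storage
# credential, assuming access is granted via a Unity Catalog external location.
# The abfss:// paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load(
    "abfss://container@account.dfs.core.windows.net/path/to/table"
)

df.write.format("delta").mode("overwrite").save(
    "abfss://container@account.dfs.core.windows.net/path/to/output"
)
```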

noklam (Contributor) commented Sep 14, 2024

MigQ2 (Contributor, Author) commented Sep 15, 2024

Not directly, because I use external tables and dynamicPartitionOverwrite, which don't seem to be supported.

I could probably create a custom UnityCatalogTableDataset and make it work for me, but I feel my use case is common enough to make it worth building something everyone can use.
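
As a rough, hypothetical sketch (this class does not exist in kedro-datasets), such a dataset could go through Spark only, with no fsspec filesystem or storage credential:

```python
# Hypothetical sketch of a custom UnityCatalogTableDataset; not an existing
# kedro-datasets class. It reads and writes a Unity Catalog table purely
# through Spark, so no fsspec filesystem or storage credential is needed.
from kedro.io import AbstractDataset
from pyspark.sql import DataFrame, SparkSession


class UnityCatalogTableDataset(AbstractDataset[DataFrame, DataFrame]):
    def __init__(self, table: str, write_mode: str = "overwrite"):
        self._table = table              # e.g. "catalog.schema.my_table"
        self._write_mode = write_mode

    def _load(self) -> DataFrame:
        return SparkSession.builder.getOrCreate().read.table(self._table)

    def _save(self, data: DataFrame) -> None:
        spark = SparkSession.builder.getOrCreate()
        # With dynamic partition overwrite, only the partitions present in
        # `data` are replaced when writing with mode="overwrite".
        spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
        data.write.mode(self._write_mode).saveAsTable(self._table)

    def _describe(self) -> dict:
        return {"table": self._table, "write_mode": self._write_mode}
```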

I think it would be great to have an opinionated way of easily integrating kedro with the latest Databricks features (Unity Catalog, workflows, external locations, databricks-connect, Databricks-hosted mlflow, etc.), as it is the most common ML platform used with kedro (used by 43% of kedro users).

If you have any ideas in mind I can try to help with discussions or implementation

MinuraPunchihewa (Contributor) commented

I think this PR will potentially resolve this?
#827

noklam (Contributor) commented Oct 1, 2024

@MigQ2, I think #827 is a good direction for a few reasons:

  • Unity Catalog is still very much a Databricks-only thing, so it feels right to move it to the databricks datasets instead of modifying the generic SparkDataset. I agree there is room to align these datasets.
  • As I understand it, there are two requirements here: authentication via Unity Catalog, and external tables.

If #827 is merged, would that be enough to solve your problem?

MigQ2 (Contributor, Author) commented Oct 1, 2024

I agree, merging #827 would give me a working solution. It would still be nice to align both datasets in the future, but that wouldn't be a blocker.

astrojuanlu added the Community Issue/PR opened by the open-source community label on Nov 16, 2024