Skip to content

Conversation

@agreaves-ms
Copy link
Collaborator

@agreaves-ms agreaves-ms commented Dec 11, 2025

Description

Adds Azure AKS Workload Identity support, allowing pods to authenticate to Azure Blob Storage without connection strings or managed secrets. Uses DefaultAzureCredential for token-based auth and updates Helm charts to support workload identity configuration.

Related to Issue #153

Changes

Helm charts

  • all charts (service, router, backend-operator, web-ui) now support ServiceAccount configuration:
    • serviceAccount.create - toggle chart-managed ServiceAccount creation
    • serviceAccount.name - specify pre-provisioned SA name
    • serviceAccount.annotations - add workload identity annotations (e.g., azure.workload.identity/client-id)
  • added extraPodLabels to services for workload identity labels (azure.workload.identity/use: "true")
  • made imagePullSecrets optional when using workload identity

Azure storage backend

  • AzureBlobStorageClient supports both connection string and token-based auth
  • added DefaultAzureCredential support for workload identity
  • SAS token generation now uses user delegation keys instead of account keys
  • credential fields (access_key_id, access_key) are now optional

Credentials

  • BasicDataCredential fields are now optional
  • added get_access_key_value() for safe credential retrieval
  • updated callers to use new accessor pattern

PyInstaller

  • added azure.identity, msal, and msal_extensions to Azure hooks for CLI workload identity support

Tests

  • added test coverage for workload identity auth paths

Usage

serviceAccount:
  create: false
  name: "my-workload-identity-sa"
  annotations:
    azure.workload.identity/client-id: "<client-id>"

services:
  service:
    extraPodLabels:
      azure.workload.identity/use: "true"

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@agreaves-ms agreaves-ms requested a review from a team December 11, 2025 21:03
@fernandol-nvidia
Copy link
Contributor

This is awesome, @agreaves-ms! Thank you so much for looking into this.

The team will be going through this change and thinking through what else we might need to change. Some high level thoughts is we probably need to give you more "types" of Credentials (e.g. WorkloadIdentityCredential, DataAccessCredential, ConnectionStringCredential, etc).

Another thing we will look at as a team is changes necessary to our service and osmo-ctrl to support this. Please stay tuned! And again, really appreciate the great work so far!

@agreaves-ms
Copy link
Collaborator Author

This is awesome, @agreaves-ms! Thank you so much for looking into this.

The team will be going through this change and thinking through what else we might need to change. Some high level thoughts is we probably need to give you more "types" of Credentials (e.g. WorkloadIdentityCredential, DataAccessCredential, ConnectionStringCredential, etc).

Another thing we will look at as a team is changes necessary to our service and osmo-ctrl to support this. Please stay tuned! And again, really appreciate the great work so far!

This is a great idea, I wanted to try and make a PR that was small and focused without refactoring to much of your codebase. Mainly to get the ideas through. Having a new credential type would be the most ideal, I'm needing to also alter the client_config.py to support DataSet upload and also DataSet download from both the osmo cli (if I'm a user running locally with az login done and I have the proper roles to be able to access the Storage Accoutn and Blob container) and then also from the osmo service kicking off the workload (needing to download the DataSet input data or upload the output data to the DataSet).

I'll update this PR with that change as well as a separate commit and then point you at it.

I was trying to think through a more agnostic naming scheme for this since Workload Identity is really an Azure and GCP term whereas AWS uses something like IRSA. I was thinking of going with EnvironmentAuthCredential or something similar... Anyways, you'll see that naming in this commit that I have.

Also, feel free to not take this PR, and instead implement the logic however way makes the most sense for your teams architecture that you've planned out. Thank you again for looking into this!

@agreaves-ms
Copy link
Collaborator Author

@fernandol-nvidia I've pushed an extra commit to further address workload identity support for input and output datasets, both on upload from the osmo-cli and then running in the workflow. Please take a look at that as well when considering implementation. I've now fully tested this in my own environment and I'm able to use my authentication with az login before doing an osmo workflow submit and it uses my credentials and access to the storage account to upload the dataset, then in the running pod for the workflow it's using the workload identity that's there to pull in the dataset in the blob.

Feel free to use this work however your team needs. I'm happy to further test any implementation that you all make available as well.

@fernandol-nvidia
Copy link
Contributor

Hey @agreaves-ms, thanks for the update and apologies for the delay. I spent the past couple of days doing a refactor of how OSMO data operations work w.r.t to different credential strategies. I have a pending PR at #159.

The tl;dr is that we are differentiating DataCredential into StaticDataCredential (what we currently have) and DefaultDataCredential (which we delegate to the underlying SDK on credential resolution).

I believe this would give you the necessary abstractions/interface to add workload identity support for Azure. More specifically:

Defining the behavior at

def create_client(data_cred: credentials.DataCredential) -> blob.BlobServiceClient:
"""
Creates a new Azure Blob Storage client.
"""
match data_cred:
case credentials.StaticDataCredential():
return blob.BlobServiceClient.from_connection_string(
conn_str=data_cred.access_key.get_secret_value(),
)
case credentials.DefaultDataCredential():
raise NotImplementedError(
'Default data credentials are not supported yet')
case _ as unreachable:
assert_never(unreachable)

And overriding this method on how to decide if workload identity should be available for a particular backend:

@functools.cached_property
def resolved_data_credential(self) -> credentials.DataCredential:
"""
Resolve the data credential for the storage backend.
Subclasses should override this method to provide custom credential resolution logic.
Returns:
The resolved data credential.
Raises:
OSMOCredentialError: If the data credential is not found.
"""
data_cred = credentials.get_static_data_credential_from_config(self.profile)
if data_cred is None:
raise osmo_errors.OSMOCredentialError(
f'Data credential not found for {self.profile}')
return data_cred

Hopefully, this frees you from having to update so many unrelated files just to get Azure workload identity working :)

Please feel free to add any feedback/suggestions that would be helpful for you. I will keep you updated as things progress and can help you rebase this PR to absorb the new changes.

- implement service account creation with customizable name and annotations
- enhance service templates to support extra pod labels for various services
- update Azure backend to utilize DefaultAzureCredential for authentication
- add tests for Azure credential extraction and client creation
…Storage

- add function to extract AccountKey from connection string
- update AzureBlobStorageClient to handle different credential types
@agreaves-ms agreaves-ms force-pushed the feat/azure-aks-workload-identity branch from f2d996c to 6bd2877 Compare January 12, 2026 06:07
@agreaves-ms
Copy link
Collaborator Author

agreaves-ms commented Jan 12, 2026

@fernandol-nvidia I've update this PR with the changes that have since been added. I needed to make some additional changes to parts that would allow for DataCredential instead of StaticDataCredential (only) as well.

Please review, I built this and verified all the azure workload identity, RBAC, and local auth works with these changes. Feel free to take over this PR as well as needed.

…TaskGroup

- change StaticDataCredential to DataCredential in get_all_data_creds method
- update fetch_creds function signature to use DataCredential
Copy link
Contributor

@fernandol-nvidia fernandol-nvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy 2026 :)

Nice, this looks great! I've added some minor comments which will help slim down the changes necessary.

Given that you've already tested this, I feel pretty confident with the PR. Once comments are addressed, I will approve!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave this file unchanged because non-staticdatacredentials doesn't need to be persisted into DB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave this file unchanged for similar reasons. Non-staticdatacredential won't be encoded in the task.

raise ValueError('AccountKey not found in connection string')


def create_client(data_cred: credentials.DataCredential) -> blob.BlobServiceClient:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can pass in the params (storage_account, account_url) from the callers (instead of deriving it here). This is because the callers (i.e. an AzureBlobStorageBackend) will have:

  1. storage_account (https://github.com/NVIDIA/OSMO/blob/main/src/lib/data/storage/backends/backends.py#L822-L825)
  2. endpoint
    https://github.com/NVIDIA/OSMO/blob/main/src/lib/data/storage/backends/common.py#L164-L168
    https://github.com/NVIDIA/OSMO/blob/main/src/lib/data/storage/backends/backends.py#L865-L868

The callsites of create_client are:

https://github.com/NVIDIA/OSMO/blob/main/src/lib/data/storage/backends/backends.py#L911

and

https://github.com/NVIDIA/OSMO/blob/main/src/lib/data/storage/backends/azure.py#L838.

For the second one, we can add the necessary params to AzureBlobStorageClientFactory from (

return azure.AzureBlobStorageClientFactory(data_cred=data_cred)
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants