-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat(backend): Add HuggingFace Hub artifact import support. Fixes #12501 #12608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat(backend): Add HuggingFace Hub artifact import support. Fixes #12501 #12608
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @Siva-Sainath. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
bc81c53 to
0b4c31a
Compare
| pathParts := strings.Split(parts, "/") | ||
|
|
||
| if len(pathParts) < 2 { | ||
| return fmt.Errorf("invalid HuggingFace URI format: %q, expected huggingface://repo_id/revision", artifactURI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if I want to download a specific file, can I not specify huggingface://repo_id/file_name or if I want to download data set, then repo_type is dataset in that case
Can we add support for that?
Also, not sure if huggingface:// is a standard protocol prefix i.e not sure if its obvious to users that they need to specify uri with this prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest commit, added support for specific datasets or files using huggingface://repo_id/file_name this
or a pattern huggingface://repo_id?allow_patterns=*.safetensors
To let users know that they have to use the URI huggingface://added a error message that details:
return fmt.Errorf("Invalid HuggingFace URI format: %q, expected huggingface://repo_id[/revision]\n"+
"Examples:\n"+
" huggingface://gpt2\n"+
" huggingface://meta-llama/Llama-2-7b\n"+
" huggingface://bert-base-uncased/config.json\n"+
" huggingface://wikitext?repo_type=dataset",
parts[0])
huggingface:// is NOT a standard protocol prefix it was written to match gs://, s3:// URI scheme that was present in the code
| return nil | ||
| } | ||
|
|
||
| func (l *ImportLauncher) downloadHuggingFaceModel(ctx context.Context, repoID, revision, artifactURI string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if I have defined Workspace config with a PVC, will it download to that PVC or will it first upload to S3 and download to PVC from s3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No S3 intermediary, models download directly to the PVC when workspace is configured with it
| return fmt.Errorf("failed to create directory %q: %w", localPath, err) | ||
| } | ||
|
|
||
| cmd := exec.CommandContext(ctx, "python3", "-c", fmt.Sprintf(` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is huggingface_hub sdk getting pip installed, are we expecting that to come as part of packages_to_install. If so, then in the future there may be version conflicts and method signature in those versions could be different, hence snapshot_download can fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The huggingface-hub library is installed in the launcher container image at build time, pinned to version 1.2.4, present in the Dockerfile.launcher file. Tried to ensure version compatibility through dynamic parameter detection using inspect.signature(), which adapts the code to work with any future version of huggingface_hub without requiring code changes
nsingla
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need some tests to verify that this works:
- Python Pipeline code
- Compiled version
- Argo Compiled Version
- Local Test Runner showing this works
- Add this pipeline to e2e tests
6eca216 to
02a9d2e
Compare
|
@nsingla will work on writing the tests. I would like to clarify, by adding this pipeline to e2e tests, do you mean adding the HF artifact import tests to the existing e2e test runner, or is there a specific test suite/location where I would integrate this? should I commit any of these tests to the PR or simply run and verify if they work or not? |
Signed-off-by: Siva-Sainath <[email protected]>
- Rename artifactUri parameter to artifactURI (ST1003) - Fix error string capitalization (ST1005) Signed-off-by: Siva-Sainath <[email protected]>
Signed-off-by: Siva-Sainath <[email protected]>
02a9d2e to
dc7afd0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds support for importing HuggingFace Hub artifacts (models and datasets) into Kubeflow Pipelines using a new huggingface:// URI scheme. The implementation follows the existing pattern for cloud storage providers (S3, GCS, OCI) and includes secure token management through Kubernetes Secrets.
Changes:
- SDK layer registers the
huggingface://URI scheme and implements local path mapping for HuggingFace artifacts - Backend adds HuggingFace authentication token retrieval from Kubernetes Secrets and injection at runtime
- Import workflow implements HuggingFace-specific download logic using
huggingface_hubPython library with support for repositories, specific files, and query parameters
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/python/kfp/dsl/types/artifact_types.py | Adds huggingface:// URI scheme support and local path mapping between /huggingface/ and huggingface:// URIs |
| sdk/python/kfp/dsl/importer_node.py | Documents the huggingface:// URI scheme with examples and query parameter support |
| backend/src/v2/objectstore/config.go | Adds HuggingFaceParams struct and parsing function for HuggingFace configuration |
| backend/src/v2/objectstore/object_store.go | Implements token retrieval from Kubernetes Secrets and environment variable injection |
| backend/src/v2/component/importer_launcher.go | Implements HuggingFace artifact download logic with URI parsing, query parameter handling, and Python script execution |
| backend/Dockerfile.launcher | Changes base image from alpine:3.19 to python:3.12-slim and installs huggingface-hub==1.2.4 |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if token != "" { | ||
| os.Setenv("HF_TOKEN", token) |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using os.Setenv to set HF_TOKEN in a global environment variable (line 83) creates a side effect that persists beyond this function call and could affect other goroutines or operations in the same process. This is a potential concurrency issue in multi-threaded environments.
Instead of using os.Setenv, consider returning the token and having the caller set it in the command's Env field directly (as is done in lines 476 and 513 of importer_launcher.go). This approach is more explicit and avoids global state mutation.
| if token != "" { | |
| os.Setenv("HF_TOKEN", token) | |
| if token == "" { | |
| glog.Warning("Hugging Face provider selected but received an empty token") | |
| } else { | |
| // Token successfully retrieved. The caller is responsible for propagating it | |
| // (for example, by setting HF_TOKEN in a specific command's environment) | |
| // instead of relying on global environment mutation here. | |
| glog.V(1).Info("Hugging Face token retrieved; caller must propagate it explicitly") |
| FROM python:3.12-slim | ||
|
|
||
| RUN pip3 install --no-cache-dir huggingface-hub==1.2.4 && \ | ||
| useradd -M -s /usr/sbin/nologin appuser |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base Docker image was changed from alpine:3.19 to python:3.12-slim, which significantly increases the image size. The alpine:3.19 base image is approximately 7 MB, while python:3.12-slim is around 125-150 MB. This substantial size increase affects deployment times, storage requirements, and network transfer costs.
Consider using a multi-stage build where the Python dependencies are installed in one stage, and then only the necessary files are copied to a minimal runtime image. Alternatively, install Python in an Alpine-based image to maintain a smaller footprint.
| FROM python:3.12-slim | |
| RUN pip3 install --no-cache-dir huggingface-hub==1.2.4 && \ | |
| useradd -M -s /usr/sbin/nologin appuser | |
| FROM alpine:3.19 | |
| RUN apk add --no-cache python3 py3-pip && \ | |
| pip3 install --no-cache-dir huggingface-hub==1.2.4 && \ | |
| adduser -D -H -s /sbin/nologin appuser |
| elif path.startswith(_HUGGINGFACE_LOCAL_MOUNT_PREFIX): | ||
| return (RemotePrefix.HUGGINGFACE.value + | ||
| path[len(_HUGGINGFACE_LOCAL_MOUNT_PREFIX):]) |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new huggingface:// URI scheme support in convert_local_path_to_remote_path lacks test coverage. The existing test file (artifact_types_test.py) has parameterized tests for TestConvertLocalPathToRemotePath that cover gs://, minio://, s3://, and oci:// schemes, but the new huggingface:// scheme is not tested.
Add test cases for huggingface paths, such as:
- '/huggingface/gpt2' -> 'huggingface://gpt2'
- '/huggingface/meta-llama/Llama-2-7b' -> 'huggingface://meta-llama/Llama-2-7b'
| elif self.uri.startswith(RemotePrefix.HUGGINGFACE.value): | ||
| local_path = ( | ||
| _HUGGINGFACE_LOCAL_MOUNT_PREFIX + | ||
| self.uri[len(RemotePrefix.HUGGINGFACE.value):]) |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new huggingface:// URI scheme handling in the _get_path method lacks test coverage. While other URI schemes like gs://, s3://, minio://, and oci:// are covered by tests in artifact_types_test.py, there are no tests verifying that huggingface:// URIs are correctly converted to local paths.
Add test cases to verify the path generation for huggingface URIs, for example:
- artifact.uri = 'huggingface://gpt2' should map to '/huggingface/gpt2'
- artifact.uri = 'huggingface://meta-llama/Llama-2-7b/v1' should map to '/huggingface/meta-llama/Llama-2-7b/v1'
| case "huggingface": | ||
| token, err1 := getHuggingFaceToken(ctx, namespace, config.SessionInfo, k8sClient) | ||
| if err1 != nil { | ||
| return nil, err1 | ||
| } | ||
| if token != "" { | ||
| os.Setenv("HF_TOKEN", token) | ||
| } | ||
| return nil, nil |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The huggingface case in OpenBucket returns (nil, nil) which is inconsistent with other providers that return a bucket object. While this may work for the current use case where HuggingFace downloads are handled specially in importer_launcher.go, returning nil for both the bucket and error could lead to nil pointer dereferences if the caller expects a valid bucket object.
Consider either returning a sentinel error indicating that HuggingFace doesn't use the bucket interface, or documenting this special behavior more clearly. The current implementation could be error-prone if future code tries to use this function for HuggingFace URIs without understanding this special case.
| // GetHuggingFaceTokenFromSessionInfo extracts HF token from SessionInfo | ||
| func GetHuggingFaceTokenFromSessionInfo(ctx context.Context, k8sClient kubernetes.Interface, namespace string, sessionInfo *SessionInfo) (string, error) { | ||
| if sessionInfo == nil { | ||
| return "", nil | ||
| } | ||
| return getHuggingFaceToken(ctx, namespace, sessionInfo, k8sClient) | ||
| } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function name GetHuggingFaceTokenFromSessionInfo is exported (starts with capital letter) and documented as "extracts HF token from SessionInfo", but it's actually a wrapper around the private getHuggingFaceToken function that requires both a Kubernetes client and namespace. The name doesn't clearly indicate that it accesses Kubernetes secrets.
Consider renaming to something more descriptive like GetHuggingFaceTokenFromK8sSecret or documenting the function more thoroughly to indicate it retrieves the token from a Kubernetes Secret based on SessionInfo configuration.
| if queryStr != "" { | ||
| params := strings.Split(queryStr, "&") | ||
| for _, param := range params { | ||
| kv := strings.Split(param, "=") | ||
| if len(kv) == 2 { | ||
| switch kv[0] { | ||
| case "repo_type": | ||
| repoType = kv[1] | ||
| case "allow_patterns": | ||
| allowPatterns = kv[1] | ||
| case "ignore_patterns": | ||
| ignorePatterns = kv[1] | ||
| } | ||
| } | ||
| } | ||
| } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query parameter parsing uses a simple string split approach (lines 372-385) which doesn't handle URL encoding. If users provide query parameters with special characters (e.g., "allow_patterns=.bin%2C.safetensors" for ".bin,.safetensors"), the percent-encoded values won't be decoded properly.
Consider using a proper URL query parser like url.ParseQuery or document that query parameter values must not be URL-encoded.
| FROM python:3.12-slim | ||
|
|
||
| RUN pip3 install --no-cache-dir huggingface-hub==1.2.4 && \ | ||
| useradd -M -s /usr/sbin/nologin appuser |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The useradd command uses different flags between Alpine and Debian-based images. In Alpine (removed code), "-S" creates a system user, and "-s /usr/sbin/nologin" sets the shell. In Debian (new code line 33), "-M" prevents home directory creation, which is correct. However, the old Alpine version didn't prevent home directory creation explicitly with "-H".
The new command is correct for the python:3.12-slim (Debian-based) image. Ensure this change is intentional and properly tested.
…ms]" Co-authored-by: Copilot <[email protected]> Signed-off-by: Bandi Siva Sainath Reddy <[email protected]>
…supported by the installed version of huggingface-hub Co-authored-by: Copilot <[email protected]> Signed-off-by: Bandi Siva Sainath Reddy <[email protected]>
| RUN GO111MODULE=on CGO_ENABLED=0 GOOS=linux go build -tags netgo -ldflags '-extldflags "-static"' -o /bin/launcher-v2 ./backend/src/v2/cmd/launcher-v2/*.go | ||
|
|
||
| FROM alpine:3.19 | ||
| FROM python:3.12-slim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with copilot
Also 3.12 is not the officially supported python version even though it works. And we still need to support python 3.9 for backwards compatibility
Addresses the issue #12501,
Implements secure HuggingFace model artifact import via the standard DSL importer decorator pattern (artifact_uri='huggingface://repo-id'), consistent with S3/GCS/OCI implementations.
Extracts HuggingFace authentication token from Kubernetes Secret metadata (through HuggingFaceParams in pipeline spec), injects it at runtime as HF_TOKEN environment variable in pod context and downloads models using func huggingface_hub.snapshot_download() to location /huggingface//.
SDK Layer (artifact_types.py): Registers huggingface:// URI scheme for importer decorator
Backend Config (config.go): HuggingFaceParams struct with func StructuredHuggingFaceParams() for parsing secret metadata
Token Injection (object_store.go): getHuggingFaceToken() retrieves token from Secret and sets HF_TOKEN env variable
Import Workflow (importer_launcher.go): handleHuggingFaceImport() takes care of artifact download with injected credentials
Token remains in Kubernetes Secret and is injected at pod execution time—never embedded in pipeline code.
Checklist: