Draft
31 commits
5484ba0
feat: Implement elastic training cli arguments (#273)
yungwenh-aws Nov 15, 2025
648c083
Revert "feat: Implement elastic training cli arguments (#273)"
mollyheamazon Nov 15, 2025
99c4705
Add dev_space_constants.py (#255)
brianFruit Oct 21, 2025
6c21956
Add dev_space_access_constants.py (#256)
brianFruit Oct 22, 2025
47bdccc
Add space_admin_config_constants.py (#257)
brianFruit Oct 22, 2025
d2b76fa
Add template package only (#261)
brianFruit Oct 23, 2025
b8f7333
Add dev_space.py CLI command (#263)
brianFruit Oct 23, 2025
fd7e644
Add dev_space_utils.py to work with the dev space template model (#262)
brianFruit Oct 23, 2025
ca486b7
Add dev space CLI (#269)
aws-brianxia Oct 28, 2025
1568d31
Rename dev space to space (#272)
aws-brianxia Oct 28, 2025
5146307
Update the Space model and constants per latest operator (#275)
aws-brianxia Oct 30, 2025
046d4c8
Add space_admin_config.py CLI command (#260)
brianFruit Oct 30, 2025
3010df1
Implement CRUD operations for Space PySDK (#267)
aws-brianxia Nov 3, 2025
13f1c0c
Implement the pySDK for the Space Template (#282)
aws-brianxia Nov 4, 2025
8fc7277
Refactor Space CLI using the Space PySDK (#281)
aws-brianxia Nov 6, 2025
51a6415
Add dev_space_access.py CLI command (#259)
brianFruit Nov 7, 2025
24171e6
Listing space will filter out the spaces not created by the current u…
aws-brianxia Nov 7, 2025
c008e92
Refactor space template with PySDK (#286)
aws-brianxia Nov 12, 2025
bba82d7
Add additional Space parameters for resources including the fractiona…
aws-brianxia Nov 12, 2025
5dffc80
Implement validation for mig profiles for Spaces (#291)
aws-brianxia Nov 18, 2025
0afcec1
Parker GA issues (#296)
aws-brianxia Nov 20, 2025
75affc2
Fix the template ref regression (#300)
aws-brianxia Nov 20, 2025
84ff4b3
Update SageMaker Space documentation (#301)
aws-brianxia Nov 21, 2025
2ce7ab0
Implement Space integration tests (#298)
aws-brianxia Nov 21, 2025
caf618d
merge conflicts fixed
Nov 21, 2025
1098ff2
Update README for fractional gpu support (#294)
oyangz Nov 19, 2025
f9a6582
merge conflicts from js template and inference
Nov 21, 2025
a82841b
update changelog
Nov 21, 2025
0c1cc04
uncommented install req
Nov 21, 2025
8876288
uncommented
Nov 21, 2025
6455f6a
fixed uncomment
Nov 21, 2025
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Changelog

## v.3.4.0 (2025-11-20)

### Features

* HyperPod Dev Spaces template for data scientists to create, manage, and access interactive ML development environments with configurable resource allocation and namespace isolation
* Support for KVCaching, intelligent routing, tiered storage, MIG
* Support for fractional GPU
* Support for KVCache and Intelligent Routing in template version 1.1
* Users can modify the Jinja template to add parameters supported by the CRD through the init experience, for further CLI customization
* MIG support for model deployment on SageMaker HyperPod Inference


## v.3.3.1 (2025-10-30)

### Features
296 changes: 296 additions & 0 deletions README.md
@@ -21,10 +21,12 @@ Note: Old `hyperpod`CLI V2 has been moved to `release_v2` branch. Please refer [
- [Inference](#inference)
- [Jumpstart Endpoint](#jumpstart-endpoint-creation)
- [Custom Endpoint](#custom-endpoint-creation)
- [Space](#space)
- [SDK](#sdk)
- [Cluster Management](#cluster-management-sdk)
- [Training](#training-sdk)
- [Inference](#inference-sdk)
- [Space](#space-sdk)
- [Examples](#examples)


@@ -300,6 +302,37 @@ hyp create hyp-pytorch-job \
--volume name=training-output,type=pvc,mount_path=/data2,claim_name=my-pvc,read_only=false
```

**Example with accelerator partitions:**

```bash
hyp create hyp-pytorch-job \
--version 1.1 \
--job-name test-pytorch-job \
--image pytorch/pytorch:latest \
--command '[python, train.py]' \
--args '[--epochs=10, --batch-size=32]' \
--environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
--pull-policy "IfNotPresent" \
--instance-type ml.p4d.24xlarge \
--tasks-per-node 8 \
--label-selector '{"accelerator": "nvidia", "network": "efa"}' \
--deep-health-check-passed-nodes-only true \
--scheduler-type "kueue" \
--queue-name "training-queue" \
--priority "high" \
--max-retry 3 \
--accelerator-partition-type "mig-1g.5gb" \
--accelerator-partition-count 2 \
--accelerator-partition-limit 4 \
--vcpu 96.0 \
--memory 1152.0 \
--vcpu-limit 96.0 \
--memory-limit 1152.0 \
--preferred-topology "topology.kubernetes.io/zone=us-west-2a" \
--volume name=model-data,type=hostPath,mount_path=/data,path=/data \
--volume name=training-output,type=pvc,mount_path=/data2,claim_name=my-pvc,read_only=false
```

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--job-name` | TEXT | Yes | Unique name for the training job (1-63 characters, alphanumeric with hyphens) |
@@ -326,10 +359,21 @@ hyp create hyp-pytorch-job \
| `--accelerators-limit` | INTEGER | No | Limit for the number of accelerators a.k.a GPUs or Trainium Chips |
| `--vcpu-limit` | FLOAT | No | Limit for the number of vCPUs |
| `--memory-limit` | FLOAT | No | Limit for the amount of memory in GiB |
| `--accelerator-partition-type` | TEXT | No | Type of accelerator partition (e.g., mig-1g.5gb, mig-2g.10gb, mig-3g.20gb, mig-4g.20gb, mig-7g.40gb) |
| `--accelerator-partition-count` | INTEGER | No | Number of accelerator partitions to request (minimum: 1) |
| `--accelerator-partition-limit` | INTEGER | No | Limit for the number of accelerator partitions (minimum: 1) |
| `--preferred-topology` | TEXT | No | Preferred topology annotation for scheduling |
| `--required-topology` | TEXT | No | Required topology annotation for scheduling |
| `--debug` | FLAG | No | Enable debug mode (default: false) |
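
The three partition flags interact: the type names a MIG profile, and the count is a request that sits alongside an optional limit. The following is a minimal Python sketch of a pre-flight check, not the CLI's own validation. `check_partition_args` is a hypothetical helper, the profile pattern is inferred from the example names in the table, and the rule that the count should not exceed the limit is an assumption.

```python
import re

# Assumed shape of a MIG profile name, inferred from the examples in the table
# (mig-1g.5gb, mig-3g.20gb, ...). The CLI's real validation may differ.
MIG_PROFILE = re.compile(r"^mig-(\d+)g\.(\d+)gb$")


def check_partition_args(partition_type, count, limit=None):
    """Return a list of problems found; an empty list means the values look sane."""
    problems = []
    if not MIG_PROFILE.match(partition_type):
        problems.append(f"unrecognized partition type: {partition_type!r}")
    if count < 1:
        problems.append("partition count must be at least 1")
    if limit is not None:
        if limit < 1:
            problems.append("partition limit must be at least 1")
        elif count > limit:
            # Assumption: a request should not exceed its limit.
            problems.append("partition count exceeds partition limit")
    return problems


print(check_partition_args("mig-1g.5gb", 2, 4))
print(check_partition_args("mig-1g", 0))
```

A check like this only catches malformed input early; whether a given profile is actually available on the cluster is what `hyp list-accelerator-partition-type` reports.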

#### List Available Accelerator Partition Types

This command lists the available accelerator partition types on the cluster for a specific instance type.

```bash
hyp list-accelerator-partition-type --instance-type <instance-type>
```

#### List Training Jobs

```bash
@@ -614,6 +658,105 @@ hyp get-operator-logs hyp-custom-endpoint --since-hours 0.5
hyp delete hyp-custom-endpoint --name endpoint-custom
```

### Space

#### Create a Space

```bash
hyp create hyp-space \
--name myspace \
--namespace default \
--display-name "My Space"
```

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `--name` | TEXT | Yes | Space name |
| `--display-name` | TEXT | Yes | Display name of the space |
| `--namespace` | TEXT | No | Kubernetes namespace |
| `--image` | TEXT | No | Image specifies the container image to use |
| `--desired-status` | TEXT | No | DesiredStatus specifies the desired operational status |
| `--ownership-type` | TEXT | No | OwnershipType specifies who can modify the space. 'Public' means anyone with RBAC permissions can update/delete the space. 'OwnerOnly' means only the creator can update/delete the space. |
| `--node-selector` | TEXT | No | NodeSelector specifies node selection constraints for the space pod (JSON string) |
| `--affinity` | TEXT | No | Affinity specifies node affinity and anti-affinity rules for the space pod (JSON string) |
| `--tolerations` | TEXT | No | Tolerations specifies tolerations for the space pod to schedule on nodes with matching taints (JSON string) |
| `--lifecycle` | TEXT | No | Lifecycle specifies actions that the management system should take in response to container lifecycle events (JSON string) |
| `--app-type` | TEXT | No | AppType specifies the application type for this workspace |
| `--service-account-name` | TEXT | No | ServiceAccountName specifies the name of the ServiceAccount to use for the workspace pod |
| `--idle-shutdown` | TEXT | No | Idle shutdown configuration. Format: --idle-shutdown enabled=<bool>,idleTimeoutInMinutes=<int>,detection=<JSON string> |
| `--template-ref` | TEXT | No | TemplateRef references a WorkspaceTemplate to use as base configuration. Format: --template-ref name=<name>,namespace=<namespace> |
| `--container-config` | TEXT | No | Container configuration. Format: --container-config command=<cmd>,args=<arg1;arg2> |
| `--storage` | TEXT | No | Storage configuration. Format: --storage storageClassName=<class>,size=<size>,mountPath=<path> |
| `--volume` | TEXT | No | Volume configuration. Format: --volume name=<name>,mountPath=<path>,persistentVolumeClaimName=<pvc_name>. Use multiple --volume flags for multiple volumes. |
| `--accelerator-partition-count` | TEXT | No | Fractional GPU partition count, e.g. '1' |
| `--accelerator-partition-type` | TEXT | No | Fractional GPU partition type, e.g. 'mig-3g.20gb' |
| `--gpu-limit` | TEXT | No | GPU resource limit, e.g. '1' |
| `--gpu` | TEXT | No | GPU resource request, e.g. '1' |
| `--memory-limit` | TEXT | No | Memory resource limit, e.g. '2Gi' |
| `--memory` | TEXT | No | Memory resource request, e.g. '2Gi' |
| `--cpu-limit` | TEXT | No | CPU resource limit, e.g. '500m' |
| `--cpu` | TEXT | No | CPU resource request, e.g. '500m' |
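
Several options in the table (`--template-ref`, `--storage`, `--volume`, `--container-config`) take a comma-separated `key=value` string. As a rough illustration of how such a string maps to structured data, here is a small Python sketch. `parse_key_value_option` is a hypothetical helper and not the CLI's actual parser, and the storage values used (`gp3`, `10Gi`, `/data`) are made up for the example.

```python
def parse_key_value_option(raw):
    """Split 'k1=v1,k2=v2' into a dict, rejecting malformed pairs."""
    result = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key.strip()] = value.strip()
    return result


# Illustrative values only; use the storage class and size that exist on your cluster.
storage = parse_key_value_option("storageClassName=gp3,size=10Gi,mountPath=/data")
print(storage)
```

A plain comma split like this would not handle `--idle-shutdown`, whose `detection` value is a JSON string that may itself contain commas, so that option would need a more careful parser.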

#### List Spaces

```bash
hyp list hyp-space
```

#### Describe a Space

```bash
hyp describe hyp-space --name myspace
```

#### Update a Space

```bash
hyp update hyp-space \
--name myspace \
--display-name "Updated Space Name"
```

#### Start/Stop a Space

```bash
hyp start hyp-space --name myspace
hyp stop hyp-space --name myspace
```

#### Get Logs

```bash
hyp get-logs hyp-space --name myspace
```

#### Delete a Space

```bash
hyp delete hyp-space --name myspace
```
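
The Space commands above share one shape, `hyp <action> hyp-space --name <name> [options]`, which makes them easy to script. The sketch below builds the argument list in Python without running anything. `space_command` is a hypothetical helper, and it assumes every option is a simple `--flag value` pair, as in the examples shown here.

```python
import shlex

# Actions taken from the CLI examples above; "list" is left out because it takes no --name.
SPACE_ACTIONS = {"create", "describe", "update", "start", "stop", "get-logs", "delete"}


def space_command(action, name, **options):
    """Build the argv list for `hyp <action> hyp-space --name <name> ...`."""
    if action not in SPACE_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    argv = ["hyp", action, "hyp-space", "--name", name]
    for key, value in options.items():
        # Keyword display_name becomes the flag --display-name.
        argv += ["--" + key.replace("_", "-"), str(value)]
    return argv


cmd = space_command("create", "myspace", namespace="default", display_name="My Space")
print(shlex.join(cmd))
```

On a machine where `hyp` is installed and configured, the returned list could be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with values such as display names that contain spaces.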

#### Space Template Management

Create reusable space templates:

```bash
hyp create hyp-space-template --file template.yaml
hyp list hyp-space-template
hyp describe hyp-space-template --name <template-name>
hyp update hyp-space-template --name <template-name> --file updated-template.yaml
hyp delete hyp-space-template --name <template-name>
```

#### Space Access

Create remote access to spaces:

```bash
hyp create hyp-space-access --name myspace --connection-type vscode-remote
hyp create hyp-space-access --name myspace --connection-type web-ui
```

## SDK

Along with the CLI, we also have SDKs available that can perform the cluster management, training and inference functionalities that the CLI performs.
@@ -993,6 +1136,159 @@ from sagemaker.hyperpod.observability.utils import get_monitoring_config
monitor_config = get_monitoring_config()
```

### Space SDK

#### Creating a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace
from hyperpod_space_template.v1_0.model import SpaceConfig

# Create space configuration
space_config = SpaceConfig(
    name="myspace",
    namespace="default",
    display_name="My Space",
)

# Create and start the space
space = HPSpace(config=space_config)
space.create()
```

#### List Spaces

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# List all spaces in default namespace
spaces = HPSpace.list()
for space in spaces:
    print(f"Space: {space.config.name}, Status: {space.status}")

# List spaces in specific namespace
spaces = HPSpace.list(namespace="your-namespace")
```

#### Get a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get specific space
space = HPSpace.get(name="myspace", namespace="default")
print(f"Space name: {space.config.name}")
print(f"Display name: {space.config.display_name}")
```

#### Update a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space
space = HPSpace.get(name="myspace")

# Update space configuration
space.update(
    display_name="Updated Space Name",
)
```

#### Start/Stop a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space
space = HPSpace.get(name="myspace")

# Start the space
space.start()

# Stop the space
space.stop()
```

#### Get Space Logs

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get space and retrieve logs
space = HPSpace.get(name="myspace")

# Get logs from default pod and container
logs = space.get_logs()
print(logs)
```

#### List Space Pods

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get space and list associated pods
space = HPSpace.get(name="myspace")
pods = space.list_pods()
for pod in pods:
    print(f"Pod: {pod}")
```

#### Create Space Access

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space
space = HPSpace.get(name="myspace")

# Create VS Code remote access
vscode_access = space.create_space_access(connection_type="vscode-remote")
print(f"VS Code URL: {vscode_access['SpaceConnectionUrl']}")

# Create web UI access
web_access = space.create_space_access(connection_type="web-ui")
print(f"Web UI URL: {web_access['SpaceConnectionUrl']}")
```
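
The example above indexes the response directly with `['SpaceConnectionUrl']`, which raises a bare `KeyError` if the key is absent. A slightly more defensive variant might look like the sketch below. It assumes the response behaves like a dictionary, `connection_url` is a hypothetical helper, and the response used here is a stand-in rather than real SDK output.

```python
def connection_url(access_response):
    """Extract the connection URL from a create_space_access-style response."""
    url = access_response.get("SpaceConnectionUrl")
    if not url:
        raise KeyError("response has no SpaceConnectionUrl")
    return url


# Stand-in response for illustration; a real one would come from
# space.create_space_access(connection_type=...).
fake_response = {"SpaceConnectionUrl": "https://example.invalid/space"}
print(connection_url(fake_response))
```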

#### Delete a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space
space = HPSpace.get(name="myspace")

# Delete the space
space.delete()
```

#### Space Template Management

```python
from sagemaker.hyperpod.space.hyperpod_space_template import HPSpaceTemplate

# Create space template from YAML file
template = HPSpaceTemplate(file_path="template.yaml")
template.create()

# List all space templates
templates = HPSpaceTemplate.list()
for template in templates:
    print(f"Template: {template.name}")

# Get specific space template
template = HPSpaceTemplate.get(name="my-template")
print(template.to_yaml())

# Update space template
template.update(file_path="updated-template.yaml")

# Delete space template
template.delete()
```

## Examples
#### Cluster Management Example Notebooks

12 changes: 10 additions & 2 deletions doc/cli/cli_index.rst
@@ -10,10 +10,11 @@ Complete reference for the SageMaker HyperPod Command Line Interface.
cluster_management/cli_cluster_management
training/cli_training
inference/cli_inference
space/cli_space

.. container::

.. grid:: 1 1 3 3
.. grid:: 1 1 4 4
:gutter: 3

.. grid-item-card:: Cluster Management CLI
@@ -35,4 +36,11 @@ Complete reference for the SageMaker HyperPod Command Line Interface.
:link-type: doc
:class-card: sd-border-secondary

Inference CLI commands, options and parameters.
Inference CLI commands, options and parameters.

.. grid-item-card:: Space CLI
:link: space/cli_space
:link-type: doc
:class-card: sd-border-secondary

Space management commands, options and parameters.