[Bug]: Provisioning continues even if shim failed to pull image #2020

Open
un-def opened this issue Nov 21, 2024 · 1 comment
Labels
bug (Something isn't working), no-stale

Comments

Collaborator

un-def commented Nov 21, 2024

Steps to reproduce

Start a run with small disk storage and a large Docker image, e.g.:

type: task
name: tgi-llama32
image: ghcr.io/huggingface/text-generation-inference:sha-b1f9044

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct

model: Llama-3.2-3B-Instruct

resources:
  gpu: nvidia
  disk: 10GB..

Actual behaviour

dstack apply -f tgi-llama32.dstack.yml --yes
 ...
 #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
 1  gcp      us-central1  n1-standard-1  1xCPU, 4GB, 1xT4 (16GB), 20.0GB (disk)  no    $0.397712  

tgi-llama32 provisioning completed (failed)
Run failed with error code CREATING_CONTAINER_ERROR.
Error: createContainer error: Error response from daemon: No such image: ghcr.io/huggingface/text-generation-inference:sha-b1f9044
Check CLI, server, and run logs for more details.

shim.log:

2024/11/21 11:04:49 Preparing volumes
2024/11/21 11:04:49 Pulling image
2024/11/21 11:06:26 Error pulling ghcr.io/huggingface/text-generation-inference:sha-b1f9044: failed to register layer: write /opt/conda/lib/libmkl_avx.so.2: no space left on device
2024/11/21 11:06:26 Image Pull interrupted: downloaded 7007466602 bytes out of 7007466602 (68.71MB/s)
2024/11/21 11:06:26 Creating container
2024/11/21 11:06:26 Cleanup routine: Cannot stop container: Error response from daemon: No such container: tgi-llama32-0-0
2024/11/21 11:06:26 Cleanup routine: Cannot remove container: Error response from daemon: No such container: tgi-llama32-0-0
2024/11/21 11:06:26 Creating container tgi-llama32-0-0:
config: &{ ... }
2024/11/21 11:06:26 createContainer error: Error response from daemon: No such image: ghcr.io/huggingface/text-generation-inference:sha-b1f9044
failed Run Error response from daemon: No such image: ghcr.io/huggingface/text-generation-inference:sha-b1f9044

Expected behaviour

No response

dstack version

0.18.26

Server logs

No response

Additional information

No response

un-def added the bug label Nov 21, 2024

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Dec 22, 2024
un-def added the no-stale label and removed the stale label Dec 22, 2024