Could not import SGMV kernel from Punica, falling back to loop. #2465

Open
2 of 4 tasks
ksajan opened this issue Aug 28, 2024 · 3 comments

ksajan commented Aug 28, 2024

System Info

text-generation-launcher --env:

2024-08-28T05:17:36.254761Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: 21187c27c90acbec7f912b8af4feaec154de960f
Docker label: N/A
nvidia-smi:
N/A
xpu-smi:
N/A
2024-08-28T05:17:36.254797Z  INFO text_generation_launcher: Args {
    model_id: "bigscience/bloom-560m",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 3000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
}

No GPU; using the CPU version.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Installed Rust and created a virtual env with Python 3.9
  2. Installed Protoc
  3. Cloned the GitHub repo
  4. Ran the commands:
cd text-generation-inference/
BUILD_EXTENSIONS=True make install-cpu
  5. Then tried the example of running TGI locally using the falcon-7b model (a launch roughly like the sketch below), but after the download it fails to load with the error: Could not import SGMV kernel from Punica, falling back to loop.
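
The launch attempt was along these lines (a minimal sketch; the model id and port are illustrative, not copied from the failing run):

text-generation-launcher --model-id tiiuae/falcon-7b --port 3000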

Expected behavior

It should download the model and serve it without any errors.

@ErikKaum
Member

Hi @ksajan 👋

Thanks for filing the issue. I think the problem is that you're running on a CPU, and falcon-7b in TGI is only supported with kernels that require a GPU.

If you want to run TGI locally on CPU to test, I'd recommend choosing a smaller model that doesn't rely on special kernels. Or if your requirements are to use something like falcon-7b, then unfortunately you'll need a GPU machine.
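
For example, a minimal CPU-only launch with a small model could look like this (the model id and port here are only illustrative):

text-generation-launcher --model-id bigscience/bloom-560m --port 3000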

Let me know if I can help in any other way 🙌


ksajan commented Aug 29, 2024

@ErikKaum I tried running lmsys/vicuna-7b-v1.3 as well, which I can run using llama_cpp. I was actually trying to train the Medusa head described in the TGI documentation, but I was unable to run it in Google Colab with a GPU; it failed with a similar error.


ErikKaum commented Sep 5, 2024

Yeah, so the llama.cpp version probably uses different kernels that don't require GPUs.

When you built this for a GPU, did you use BUILD_EXTENSIONS=True make install-cpu or BUILD_EXTENSIONS=True make?

I'd nonetheless recommend using the Docker image to avoid building from source; it's usually a lot less hassle 👍
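
A rough sketch of the documented Docker invocation (the image tag, volume path, and model id below are illustrative, and --gpus all assumes a GPU host):

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  --model-id tiiuae/falcon-7b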
