
Bug: SYCL builds >= b4069 have half the context limit of previous builds #10421

Open
0xDEADFED5 opened this issue Nov 20, 2024 · 4 comments
Labels: bug-unconfirmed, critical severity

Comments


0xDEADFED5 commented Nov 20, 2024

What happened?

found 1 SYCL devices:
|ID|Device Type       |Name                   |Version|Max compute units|Max work group|Max sub group|Global mem size|Driver version|
|--|------------------|-----------------------|-------|-----------------|--------------|-------------|---------------|--------------|
| 0|[level_zero:gpu:0]|Intel Arc A770 Graphics|    1.5|              512|          1024|           32|         16704M|     1.3.31093|
llama_kv_cache_init:      SYCL0 KV buffer size =  3937.50 MiB
llama_new_context_with_model: KV self size  = 3937.50 MiB, K (f16): 1968.75 MiB, V (f16): 1968.75 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =    27.82 MiB
ggml_backend_sycl_buffer_type_alloc_buffer: can't malloc 3278637056 Bytes memory on deviceggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 7573604352
ggml_backend_sycl_buffer_type_alloc_buffer: can't malloc 3278637056 Bytes memory on deviceggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 7573604352
llama_new_context_with_model: failed to allocate compute buffers
common_init_from_params: failed to create context with model 'C:\LLM\Qwen2.5-3B-Instruct_Q4_1.gguf'
srv    load_model: failed to load model, 'C:\LLM\Qwen2.5-3B-Instruct_Q4_1.gguf'
main: exiting due to model loading error

command line: llama-server.exe -t 16 --threads-http 8 --mlock -ngl 99 -m C:\LLM\Qwen2.5-3B-Instruct_Q4_1.gguf --port 8888 --ctx-size 112000 -np 48 --sampling-seq mt --min-p 0.1 --temp 1.5 -dt .1

The command line works fine on builds < b4069.

With b4069 I have to lower the context all the way to 60k.

GPU: Intel Arc A770 (16GB)
OS: Windows

Name and Version

ZE_LOADER_DEBUG_TRACE:Using Loader Library Path:
ZE_LOADER_DEBUG_TRACE:Tracing Layer Library Path: ze_tracing_layer.dll
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
version: 4069 (2e82ffa)
built with MSVC 19.41.34123.0 for

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

0xDEADFED5 added the bug-unconfirmed and critical severity labels on Nov 20, 2024
@NeoZhangJianyu
Collaborator

These parameters lead to a higher memory requirement.
Please reduce the --ctx-size and --batch-size values.

Also make sure no other processes are using GPU memory.

@qnixsynapse
Contributor

qnixsynapse commented Nov 21, 2024

@NeoZhangJianyu I think this error message needs to be more user-friendly, rather than using very technical terms such as malloc. I am thinking of opening a PR.

edit: It also has to be changed over to the GGML_LOG system.
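
For illustration, a minimal sketch of the kind of change being described, assuming the GGML_LOG_ERROR macro from ggml-impl.h; the function and variable names below are made up for the example and are not the actual ggml-sycl.cpp code:

```cpp
// Hypothetical fragment: report a failed device allocation through the
// GGML_LOG system instead of a raw print without a trailing newline.
#include "ggml-impl.h" // provides GGML_LOG_ERROR inside the ggml source tree
#include <cstddef>

static void * example_sycl_alloc(size_t size, int device) { // hypothetical helper
    void * ptr = nullptr; // imagine the sycl::malloc_device() call failing here
    if (ptr == nullptr) {
        GGML_LOG_ERROR("%s: failed to allocate %.2f MiB on SYCL device %d (out of device memory?)\n",
                       __func__, size / 1024.0 / 1024.0, device);
    }
    return ptr;
}
```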

@gghfez

gghfez commented Nov 22, 2024

Same issue here on Linux. I ended up reverting to b4068 for now, since the latest version fails to build (most likely because of the oneAPI upgrade requirement, 2024 -> 2025).

@0xDEADFED5
Author

0xDEADFED5 commented Nov 25, 2024

> These parameters lead to a higher memory requirement. Please reduce the --ctx-size and --batch-size values.
>
> Also make sure no other processes are using GPU memory.

The batch size was a mistake; I don't use that parameter anymore (though it was close to the default of 2048 anyway, for whatever strange reason).

I do need that much context, however, as I'm running highly parallel. I have handled many millions of requests at that context size so far, and it works fine in builds < b4069.

I could live with slightly less context, but the new context limit is about 50% of what previous builds allowed, so obviously something is wrong, no?

0xDEADFED5 changed the title from "Bug: SYCL builds >= b4069 fail to allocate SYCL0 buffer" to "Bug: SYCL builds >= b4069 have half the context limit of previous builds" on Nov 26, 2024