Add benchmark results for GLM-4.6 quantized models #4813

Open
vicruz99 wants to merge 1 commit into Aider-AI:main from vicruz99:benchmark-glm4.6-results-quantized

Conversation


@vicruz99 vicruz99 commented Feb 7, 2026

Benchmark Results: GLM-4.6 (Unsloth Quantized Versions)

This PR adds benchmark results for the Unsloth quantized versions of the GLM-4.6 model.

Model Details

  • Model: GLM-4.6 (Unsloth GGUF quantizations)
  • Quantizations tested:
    • Q5_K_XL (5-bit)
    • Q3_K_XL (3-bit)
    • Q2_K_XL (2-bit)
  • Inference engine: llama.cpp server
  • Hardware: 4x A100 GPUs (local)
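
For reference, Unsloth's dynamic (UD) GGUF quants of this kind are typically fetched with huggingface-cli. This is a minimal sketch only: the Hugging Face repo name unsloth/GLM-4.6-GGUF and the local download path are assumptions for illustration, not taken from this PR.

# Download one UD quant from Hugging Face (repo name and path assumed)
huggingface-cli download unsloth/GLM-4.6-GGUF \
    --include "*UD-Q3_K_XL*" \
    --local-dir /scratch/vicstorage/UD-Q3_K_XL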

Configuration

Models were served locally using llama.cpp with the following settings:

./llama.cpp/bin_guadiana/llama-server \
    --model /scratch/vicstorage/UD-Q3_K_XL/GLM-4.6-UD-Q3_K_XL-00001-of-00004.gguf \
    --jinja -ngl 99 --threads -1 --ctx-size 65536 \
    --temp 1.0 --top-p 0.95 --top-k 40 --prio 3 \
    --host 0.0.0.0 --port 8080 \
    -kvu --cache-ram 0
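
Once the server is up, llama.cpp exposes an OpenAI-compatible HTTP API on the configured host and port. A quick sanity check (these are standard llama-server routes, not specific to this PR):

# Probe the server, then issue a one-shot chat completion
curl http://localhost:8080/health
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":16}'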

Key parameters:

  • Context size: 65536 tokens
  • Temperature: 1.0
  • Top-p: 0.95
  • Top-k: 40
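
With the server running, runs like these are driven by aider's benchmark harness pointed at the local endpoint (usually from inside the benchmark Docker container). A hedged sketch; the run name is illustrative, and openai/glm-4.6 assumes the generic OpenAI-compatible provider prefix:

# Point the benchmark harness at the local llama.cpp server
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=sk-local-placeholder  # llama.cpp ignores the key
./benchmark/benchmark.py glm-4.6-ud-q3-k-xl \
    --model openai/glm-4.6 --edit-format diff --threads 1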

Note on Timeout

These benchmarks were run with an increased timeout setting. A companion PR (#4650) adds an option to set custom request timeouts in benchmark.py. For these runs, the timeout was set to 10000 ms in the LiteLLM configuration.
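
Since #4650 is not part of this PR, the exact option name may differ; as a sketch only, the intended usage looks something like the following, where --timeout is hypothetical:

# Hypothetical flag from PR #4650; value mirrors the 10000 ms used for these runs
./benchmark/benchmark.py glm-4.6-ud-q3-k-xl \
    --model openai/glm-4.6 --edit-format diff \
    --timeout 10000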

Results

The benchmark results are included in the data files updated in this PR.

@CLAassistant

CLAassistant commented Feb 7, 2026

CLA assistant check
All committers have signed the CLA.

