Add OpenChat, Alpaca, Vicuna chat templates #6397

Merged
merged 16 commits into from
Apr 3, 2024

Contributor

@kaizau kaizau commented Mar 30, 2024

This PR adds chat templates for some of the more popular non-ChatML models (that I know of, at least!).

Named openchat, vicuna, and alpaca respectively.

I based OpenChat's on the official Jinja template, and Vicuna's on the one from text-generation-web-ui (couldn't find it in any model's tokenizer_config.json, but it matches what I saw in model cards and HF discussions). Alpaca was done using DeepSeek's template, since the original also predates Jinja chat templates.

Caveat: Because none of the Vicuna models I've tested seem to include a chat template string, there doesn't seem to be a good way to heuristically detect the Orca variant. I've worked around this by creating a vicuna-orca template that's also handled by vicuna. Open to alternatives here.
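
A minimal sketch of the detection problem described in this caveat, written in Python for brevity rather than the project's C++. It assumes detection works by looking for marker substrings in the model's Jinja template string. The marker strings, the `MARKERS` table, and the `detect_template` helper are all hypothetical illustrations, not the PR's actual checks. A variant with no distinguishing template string, such as vicuna-orca, can only be selected by its explicit name.

```python
from typing import Optional

# Hypothetical marker table: a template matches when ALL of its markers
# appear in the Jinja chat template string. These markers are assumptions
# made for illustration, not the checks used in llama.cpp.
MARKERS = {
    "openchat": ["GPT4 Correct "],
    "deepseek": ["### Instruction:", "<|EOT|>"],
    "vicuna": ["USER: ", "ASSISTANT: "],
}

# Names that have no reliable marker and must be requested explicitly.
EXPLICIT_ONLY = {"vicuna-orca"}


def detect_template(tmpl: str) -> Optional[str]:
    """Return a template name for an explicit name or a Jinja string."""
    # An explicit name always wins over heuristics.
    if tmpl in MARKERS or tmpl in EXPLICIT_ONLY:
        return tmpl
    # Otherwise fall back to substring heuristics on the Jinja string.
    for name, markers in MARKERS.items():
        if all(marker in tmpl for marker in markers):
            return name
    return None


print(detect_template("vicuna-orca"))
print(detect_template("{{ 'GPT4 Correct ' + message['role'] }}"))
print(detect_template("{{ messages }}"))
```

With this shape, a model that ships no template string at all never reaches the heuristic branch, which is why an explicit name is the only workable route for such variants.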

New to C++ and this project, so please don't hesitate to mention any details I may have missed!

@kaizau kaizau changed the title from "Add OpenChat, Starling, Vicuna chat template support" to "Add OpenChat, Alpaca, Vicuna chat templates" Mar 30, 2024
@ggerganov ggerganov requested a review from ngxson March 30, 2024 12:27
Collaborator

@ngxson ngxson left a comment

Could you also test whether the output really matches the Python version of these templates? You can use the Python code here: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template

Please also let me know what must be added to the wiki page.

llama.cpp Outdated
} else if (role == "user") {
ss << "### Instruction:\n" << message->content << "\n\n";
} else if (role == "assistant") {
ss << "### Response:\n" << message->content << "\n\n";
Collaborator

The alpaca and deepseek templates look similar at first glance, but the main difference is that the alpaca template is only used for instruction-response (one turn), not for multiple turns like modern chat templates.

deepseek extends the notion of instruction-response into multi-turn by placing <|EOT|> token between each turn, so the formatted chat should look like:

### Instruction:
who are you?
### Response:
I am assistant
<|EOT|>
### Instruction:
1+1 is
### Response:
equal to 2
<|EOT|>

So what's missing here is the <|EOT|>

Collaborator

The chat above is produced by python code + jinja template, it doesn't seem to have "\n\n" at the end of each message, so I think the "\n\n" should be replaced by "\n"
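
The multi-turn layout described in the two comments above can be sketched in Python as follows. This is an illustration of the format, not the PR's C++ code: `format_deepseek` is a hypothetical helper name, and emitting the system message verbatim with no prefix is an assumption. It uses a single "\n" after each message and closes every assistant turn with <|EOT|>.

```python
from typing import Dict, List


def format_deepseek(chat: List[Dict[str, str]],
                    add_generation_prompt: bool = False) -> str:
    """Format a chat in the DeepSeek instruction/response style."""
    parts = []
    for message in chat:
        role, content = message["role"], message["content"]
        if role == "system":
            # Assumption: the system prompt is emitted as-is, without a prefix.
            parts.append(content)
        elif role == "user":
            parts.append("### Instruction:\n" + content + "\n")
        elif role == "assistant":
            # <|EOT|> marks the end of each assistant turn.
            parts.append("### Response:\n" + content + "\n<|EOT|>\n")
    if add_generation_prompt:
        # Leave the prompt open for the model to write the next response.
        parts.append("### Response:\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "who are you?"},
    {"role": "assistant", "content": "I am assistant"},
    {"role": "user", "content": "1+1 is"},
    {"role": "assistant", "content": "equal to 2"},
]
print(format_deepseek(chat))
```

Run on the two-turn chat from the comment, this prints the same layout as the example shown there.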

Contributor Author

Thanks for the python script! Included the Jinja output of OpenChat and DeepSeek below. And as you mentioned, the other two fail due to not having templates in tokenizer_config.json.

Will add <|EOT|> for DeepSeek when I have a moment tomorrow.

openchat/openchat-3.5-0106
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<s>GPT4 Correct System: You are a helpful assistant<|end_of_turn|>GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi there<|end_of_turn|>GPT4 Correct User: Who are you<|end_of_turn|>GPT4 Correct Assistant:    I am an assistant   <|end_of_turn|>GPT4 Correct User: Another question<|end_of_turn|>
------------------------------
deepseek-ai/deepseek-coder-33b-instruct
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|begin▁of▁sentence|>You are a helpful assistant### Instruction:
Hello
### Response:
Hi there
<|EOT|>
### Instruction:
Who are you
### Response:
   I am an assistant   
<|EOT|>
### Instruction:
Another question

------------------------------

Collaborator

Looks OK to me. Just a quick note: <|begin▁of▁sentence|> is not needed, because BOS is always added by the server.

@Jeximo
Contributor

Jeximo commented Mar 30, 2024

Mistral Instruct may be good for templating, https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

Contributor

github-actions bot commented Apr 1, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 492 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9577.92ms p(90)=26590.07ms fails=0, finish reason: stop=492 truncated=0
  • Prompt processing (pp): avg=244.14tk/s p(90)=741.1tk/s total=195.69tk/s
  • Token generation (tg): avg=99.73tk/s p(90)=272.37tk/s total=131.07tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=021c6f50e1d354cb95a8187d4f6dd5b40f7e329f
Time series (charts omitted; llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 492 iterations): prompt_tokens_seconds, predicted_tokens_seconds

Details (charts omitted): kv_cache_usage_ratio, requests_processing

@kaizau
Contributor Author

kaizau commented Apr 1, 2024

@ngxson @Jeximo Three updates and a question:

  1. I tweaked the python script to match the history and add_generation_prompt settings from tests/test-chat-template.cpp. Also included a shortcut for copying the output as a test.

    Updated script
    from transformers import AutoTokenizer
    
    VARIANTS_TO_TEST = [
        #'teknium/OpenHermes-2.5-Mistral-7B',
        # 'mistralai/Mistral-7B-Instruct-v0.2',
        # 'TheBloke/FusionNet_34Bx2_MoE-AWQ',
        # 'bofenghuang/vigogne-2-70b-chat',
        # 'mlabonne/AlphaMonarch-7B',
        # 'google/gemma-7b-it',
        # 'OrionStarAI/Orion-14B-Chat',
        # 'openbmb/MiniCPM-2B-dpo-fp32',
        'openchat/openchat-3.5-0106',
        'deepseek-ai/deepseek-coder-33b-instruct',
    ]
    
    HISTORY = [
        { 'role': 'system', 'content': 'You are a helpful assistant' },
        { 'role': 'user', 'content': 'Hello' },
        { 'role': 'assistant', 'content': 'Hi there' },
        { 'role': 'user', 'content': 'Who are you' },
        { 'role': 'assistant', 'content': '   I am an assistant   ' },
        { 'role': 'user', 'content': 'Another question' },
    ]
    
    for variant in VARIANTS_TO_TEST:
        history = [m for m in HISTORY] # copy
        if 'Mistral' in variant or 'gemma' in variant:
            history.pop(0) # no system prompt for mistral and gemma
        if 'gemma' in variant:
            # GemmaTokenizer is quite buggy, let's hard code the template here
            GEMMA_TMLP = "{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
            print('[Gemma]')
            output = AutoTokenizer.from_pretrained(VARIANTS_TO_TEST[0]).apply_chat_template(history, tokenize=False, add_generation_prompt=True, chat_template=GEMMA_TMLP)
            print(output)
            print(output.replace("\n", "\\n"))
            print('-' * 30)
        else:
            print('[' + variant + ']')
            tokenizer = AutoTokenizer.from_pretrained(variant)
            output = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
            print(output)
            print(output.replace("\n", "\\n"))
            print('-' * 30)
  2. Used this output as the ground truth tests for updating the templates. And yup, found differences — fixed them.

  3. Replaced alpaca entirely with deepseek, re: above comments.

Question is about vicuna and vicuna-orca, and more generally, any models where automated detection isn't feasible. Would it make sense to support them if only through the --chat-template server flag? Or would you prefer I just cut them from this PR — maybe try to figure out an alternative later?

@Jeximo
Contributor

Jeximo commented Apr 1, 2024

Would it make sense to support them if only through the --chat-template server flag?

When I search Orca, it's --chat-ml, but I also saw the older-styled templates for Nous/Tess. I like more options for chat templates personally, but I understand not wanting to complicate other development, so I'll leave it up to you and @ngxson.

@ngxson
Collaborator

ngxson commented Apr 1, 2024

Thanks for the effort @kaizau. IMO chat/prompt templates have always been quite a messy topic (a rabbit hole, as you said). You can see that there was a discussion about that at the beginning of #4216.

After some more research, I think it's OK to keep vicuna/vicuna-orca. While they do not have an official jinja template, I think we can maybe ask the model's author to add one (or whoever converts it to gguf). One thing I feared was that some templates did not have multi-turn capability from the beginning, like alpaca for example, but people try to retrofit it. It turns out that's not the case for vicuna, so it's safe to assume that all vicuna-based models support multi-turn.

@kaizau
Contributor Author

kaizau commented Apr 2, 2024

@ngxson Makes sense. Any other code / formatting changes you'd like to see here? I'll draft up a readme update shortly.

Relatedly, a quirk I've noticed in using the OpenChat and Vicuna templates is that the first character of every assistant message is now always " ".

This is because these 3 templates all use ": " as the role separator — yet all of the official / reference add_generation_prompt examples exclude the space after the colon.

I can't tell if this is an oversight or as intended. Adding the space after the colon in each add_ass block gets rid of the problem — which is what I would lean towards (what I expect most users would prefer).

Did you encounter anything similar with previous chat templates?

@Jeximo
Contributor

Jeximo commented Apr 2, 2024

first character of every assistant message is now always " "

readme under prompt states If the prompt is a string or an array with the first element given as a string, a bos token is inserted in the front like main does

I'm not sure of the correct solution - I had a similar experience with CLI in main: --in-prefix "GPT4 Correct User: " --in-suffix "<|end_of_turn|>GPT4 Correct Assistant:"

I included the space for User, and excluded it for Assistant in order to strictly adhere to the template. I think it's intentional, but I may be wrong.

@kaizau
Contributor Author

kaizau commented Apr 2, 2024

@ngxson Was about to paste the readme update here, but realized I already had edit access to the page?

Either way, added the 4 templates: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template

I also added a "how to add a template" section that hopefully makes it incrementally easier for others. It includes the updated version of your script that outputs in a format identical to test-chat-template.cpp, to reduce room for human error.

@SamuelTallet
Contributor

SamuelTallet commented Apr 2, 2024

@kaizau
Thanks for your efforts.

The part <s>GPT4 Correct System: in the Wiki seems incorrect.

OpenChat author said the system prompt should be appended without prefix.

Source: https://huggingface.co/openchat/openchat_3.5/discussions/5#65448109b4a3f3a2f486fd9d

@ngxson
Collaborator

ngxson commented Apr 2, 2024

Relatedly, a quirk I've noticed in using the OpenChat and Vicuna templates is that the first character of every assistant message is now always " ".

That's because tokenizers tend to encode both the word and the space into the same token. For example, using https://platform.openai.com/tokenizer :

image

Adding a trailing space in the assistant prompt GPT4 Correct System: will make the model perceive the sentence differently, because the trailing space is now encoded on its own and not attached to the next word:

image

Sadly there's no other way to get rid of this problem. The root cause in fact is because this class of template does not have special tokens like <|im_start|>, they rely on common characters like : or space which is dependent on the next word.
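
The effect can be illustrated with a toy longest-match tokenizer. This is a sketch under loud assumptions: the `VOCAB` list and the `tokenize` helper are hypothetical, and real tokenizers learn their vocabulary from data and use different algorithms. It only shows why a prompt ending in ": " leads to a different token sequence than a prompt ending in ":".

```python
from typing import List

# Hypothetical toy vocabulary. Note that " Hi" (with a leading space)
# is a single entry, as is common in real vocabularies.
VOCAB = ["Assistant", ":", " Hi", "Hi", " "]


def tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        candidates = (v for v in VOCAB if text.startswith(v, i))
        match = max(candidates, key=len, default=None)
        if match is None:
            match = text[i]  # unknown character: emit it on its own
        tokens.append(match)
        i += len(match)
    return tokens


# Text as it would appear in training data: the space rides on " Hi".
print(tokenize("Assistant: Hi"))

# Prompt ends with ":"; the model can continue with the usual " Hi" token.
print(tokenize("Assistant:") + tokenize(" Hi"))

# Prompt ends with ": "; the space is now a separate token and the
# continuation has to be the less typical "Hi" token.
print(tokenize("Assistant: ") + tokenize("Hi"))
```

The first two sequences agree, while the third contains a standalone space token, which is the mismatch described above.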

@ngxson
Collaborator

ngxson commented Apr 2, 2024

I also added a "how to add a template" section that hopefully makes it incrementally easier for others. It includes the updated version of your script that outputs in a format identical to test-chat-template.cpp, to reduce room for human error.

Nice, thanks! That looks good to me. I don't know how the permission system for the wiki works, but I'm glad to know that you have write access to the wiki.

llama.cpp Outdated
for (auto message : chat) {
std::string role(message->role);
if (message == chat.front()) {
ss << "<|begin▁of▁sentence|>";
Collaborator

The only thing left to do is to remove this <|begin▁of▁sentence|>, because the server already adds the BOS token to the input prompt by default.

@ggerganov
Owner

I don't know how the permission system in wiki page works, but I glad to know that you have write access to wiki.

It might be a good idea to restrict wiki access to collaborators, agree?

image

@kaizau
Contributor Author

kaizau commented Apr 3, 2024

The part <s>GPT4 Correct System: in the Wiki seems incorrect.

OpenChat author said the system prompt should be appended without prefix.

Source: https://huggingface.co/openchat/openchat_3.5/discussions/5#65448109b4a3f3a2f486fd9d

@SamuelTallet Thanks! I saw that thread too and originally implemented the unprefixed version.

But running the actual Jinja template from the model's tokenizer_config.json produces <s>GPT4 Correct System: . So any implementation that actually uses the Jinja template would include a prefix...The readme also references the tokenizer.chat_template as the correct one.

This is unfortunately the state of templates right now. 🥲

I've left a comment asking for clarification, but will default to the unprefixed.
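
To make the two options concrete, here is a small Python sketch of the OpenChat layout, not the PR's C++ code. `format_openchat` and its `prefix_system` switch are hypothetical names. The prefixed form follows the Jinja output quoted earlier in this thread (without the leading <s>). The unprefixed form assumes the system prompt is still terminated by <|end_of_turn|>, which is my assumption rather than something stated in the thread.

```python
from typing import Dict, List


def format_openchat(chat: List[Dict[str, str]],
                    add_generation_prompt: bool = False,
                    prefix_system: bool = False) -> str:
    """Format a chat in the OpenChat 'GPT4 Correct <Role>: ' style."""
    parts = []
    for message in chat:
        role, content = message["role"], message["content"]
        if role == "system" and not prefix_system:
            # Unprefixed variant: system prompt without "GPT4 Correct System: ".
            parts.append(content + "<|end_of_turn|>")
        else:
            parts.append("GPT4 Correct " + role.capitalize() + ": "
                         + content + "<|end_of_turn|>")
    if add_generation_prompt:
        # No trailing space after the colon, matching the reference examples.
        parts.append("GPT4 Correct Assistant:")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
]
print(format_openchat(chat, prefix_system=True))
print(format_openchat(chat, add_generation_prompt=True))
```

The BOS token is deliberately left out of both variants, since the reviewers note that the server adds it.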

@kaizau
Contributor Author

kaizau commented Apr 3, 2024

@ngxson Thanks for the explanation.

Just removed prefixes for both OpenChat and DeepSeek.

If the BOS token is automatically added, then my python script update probably oversells the extent to which copy-and-pasting the output as a test will work. The special tokens would have to be manually removed. But I can clarify that in the next wiki update.
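
The manual cleanup mentioned here could be sketched as a small post-processing step in Python. `strip_bos` and `as_test_literal` are hypothetical helpers, not part of the wiki script. The two BOS strings are the ones that appear in the Jinja outputs quoted earlier in this thread; other models may use different ones.

```python
# BOS strings seen in the Jinja outputs quoted in this thread.
# Assumption: only a leading BOS needs to be removed.
BOS_TOKENS = ("<s>", "<|begin▁of▁sentence|>")


def strip_bos(output: str) -> str:
    """Remove one leading BOS token, if present."""
    for bos in BOS_TOKENS:
        if output.startswith(bos):
            return output[len(bos):]
    return output


def as_test_literal(output: str) -> str:
    """Strip the BOS and escape newlines for pasting into a test string."""
    return strip_bos(output).replace("\n", "\\n")


print(as_test_literal("<s>GPT4 Correct User: Hello<|end_of_turn|>"))
print(as_test_literal("<|begin▁of▁sentence|>### Instruction:\nHello\n"))
```

Applying this to the tokenizer output before copying it would keep the expected strings in line with what the server produces once it has added the BOS itself.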

Aside: I was also surprised to find I could edit the wiki directly — was fully expecting a "your edits are pending approval" screen when I hit save. 😅

@kaizau kaizau requested a review from ngxson April 3, 2024 14:41
@ngxson
Collaborator

ngxson commented Apr 3, 2024

@ggerganov Yes, it is important to restrict write access to the wiki. Ideally, IMO, we would allow only a list of people (not all contributors), but I'm not sure if this option is possible on GitHub. The reason is that changes to the wiki do not require review. Bad actors may be able to exploit a contributor's write access to change content on the wiki.

Collaborator

@ngxson ngxson left a comment

LGTM. Thank you!

@ngxson ngxson merged commit 1ff4d9f into ggerganov:master Apr 3, 2024
5 checks passed
@ggerganov
Owner

Updated the wiki access:

image

Note, these are only the collaborators that have write access (i.e. not all contributors). Still, if we want to make this even stricter, it should be moved as doc files and committed to the repo

@Folko-Ven
Contributor

@kaizau Hello, I apologize for disturbing you, but is there any hope for the addition of Mistral templates?

@ngxson
Collaborator

ngxson commented Apr 4, 2024

@ggerganov Thanks. That's ok for now I think. We can consider moving wiki to doc files later. Personally, I still feel like the UI of wiki page is more simple to navigate.

@Folko-Ven Mistral uses llama2 template. Maybe we can add mistral as an alias for llama2 to clarify that.

@Folko-Ven
Contributor

@ngxson It seemed to me that they are slightly different, aren’t they? Usually, I look at chat templates here - [link]

@ngxson
Collaborator

ngxson commented Apr 4, 2024

We do support 3 variants of llama2. Mistral uses the variant with spaces around message content. As long as the model has the correct jinja template, it will be auto-detected and the correct template will be used.

@Folko-Ven
Contributor

@ngxson Got it, thanks for explaining!

@wtarreau
Contributor

wtarreau commented Apr 6, 2024

Regarding limiting wiki access to a list of people, the only solution we've found in haproxy was to create a dedicated project for the wiki and send invites to those who want to contribute. The main project's wiki is simply redirected to the wiki project, and that solved the issues. But it's indeed annoying.

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
* Add openchat chat template

* Add chat template test for openchat

* Add chat template for vicuna

* Add chat template for orca-vicuna

* Add EOS for vicuna templates

* Combine vicuna chat templates

* Add tests for openchat and vicuna chat templates

* Add chat template for alpaca

* Add separate template name for vicuna-orca

* Remove alpaca, match deepseek with jinja output

* Regenerate chat template test with add_generation_prompt

* Separate deepseek bos from system message

* Match openchat template with jinja output

* Remove BOS token from templates, unprefix openchat
7 participants