Generic Chat templating code with text/json file based config; main chat updated to drive its in-prefix, in-suffix and reverse-prompt from same; chat-apply-template equivalent c-api to allow use by other codes also #6834

Draft: wants to merge 219 commits into base: master

hanishkvc
Contributor

@hanishkvc hanishkvc commented Apr 22, 2024

*** Updated to match latest commit ***

Overview

Helps chat with models, by tagging chat messages based on the specified
chat-handshake-template-standard. This uses a generic tagging code driven
by a json meta data file, which specifies the handshake template details.

This can be used by

  • main, to build on existing interactive flow and its in-prefix, in-suffix
    and antiprompt/reverse-prompt

  • server, by replacing its existing llama_chat_apply_template with the
    equivalent helper here.

The common pattern

As a convention, the tagging used by LLMs to differentiate between the
different parts when chatting with them normally follows a general pattern of

  • <BeginOfSentenceIfAny> <RolePrefixIfAny> <TheContent> <RoleSuffixIfAny> <EndOfSentenceIfAny>

  • The Roles could include System, User and Assistant (ie the Model)

  • A chat normally consists of

    • a System message/prompt followed by

    • multiple user message/query - model message/response pairs

The different models will normally have all or some subset of the tagging mentioned above.
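A minimal sketch of the pattern above, written in Python for brevity (the PR's own code is C++). The helper name `tag_message` and the tag strings are made up for illustration and are not taken from any real model.

```python
# Hypothetical helper: wrap one message as begin + prefix + content + suffix + end.
# A missing key is treated as an empty string, ie the model does not use that tag.
def tag_message(tags: dict, content: str) -> str:
    return (
        tags.get("begin", "")
        + tags.get("prefix", "")
        + content
        + tags.get("suffix", "")
        + tags.get("end", "")
    )


# Placeholder tags for a user role (illustrative only).
user_tags = {"begin": "<s>", "prefix": "[USER] ", "suffix": " [/USER]", "end": ""}
tagged = tag_message(user_tags, "hello")
print(tagged)
```

A model that uses only some of the tags is then described by leaving the others empty.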

You may also notice some common patterns like

  • Because a user message is normally followed by a model/assistant response, in most models

    • user messages won't have an EndOfSentenceTag and

    • the following model response won't have a BeginOfSentenceTag

  • Because a system message will normally be immediately followed by a user query,

    • in many models, there won't be an EndOfSentenceTag following the system message, nor a
      BeginOfSentenceTag wrt the 1st user message following the system message.

    • in some models there won't even be a RoleSuffixTag following the system message,
      nor a RolePrefixTag wrt the 1st user message following the system message.

    • however in many of these models, the subsequent user messages will have the
      BeginOfSentenceTag and/or RolePrefixTag.

The Strategy

The template meta data json file allows the user to specify the above mentioned tags wrt
each of the roles. Depending on whether a given model uses a given tag or not, you either
specify the required tag or else an empty string.

A tag could be a single word or multiple words, and may include a newline char specified
using \n and so on. The tag is always demarcated using double quotes, which also allows
spaces at the beginning or end of the tag, if needed.

In order to account for the conditionality of tags between the system message and the 1st
user message, flags are provided to explicitly control whether each of these possible tags
is used by a specific model or not, as part of its template info.

The roles are identified in the json file using "system", "user" and "assistant". However
the model may use different words to identify these roles, in which case set up RolePrefix
and/or RoleSuffix appropriately.

To detect that the model has finished generating its response to a user query, one needs
to set the reverse-prompt to either the assistant's suffix or end tag, or to the user's
begin or prefix tag, depending on what the model's handshake template standard makes it
generate at the end of its response.

The JSON File

Can contain the template info wrt multiple models/handshake-standards. In turn each
unique template is identified by a unique template id string.

The fields that make up a given chat-handshake-template-standard include

  • global-> begin & end

  • system -> begin, prefix, suffix & end

  • user -> begin, prefix, suffix & end

  • assistant -> begin, prefix, suffix & end

  • reverse-prompt

  • systemuser-system-has-suffix, systemuser-system-has-end,
    systemuser-1st-user-has-begin and systemuser-1st-user-has-prefix
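For illustration, one entry with these fields could look roughly like the following. The key layout follows the field list above, but the template id `example-tmpl` and all tag values are placeholders and are not copied from the PR's chaton_meta.json. The sketch uses Python only to load and sanity check the json.

```python
import json

# Illustrative only: key names follow the field list above; the tag values
# are made-up placeholders, not any real model's tags.
META_JSON = r'''
{
  "example-tmpl": {
    "global":    {"begin": "", "end": ""},
    "system":    {"begin": "", "prefix": "[SYS] ", "suffix": " [/SYS]\n", "end": ""},
    "user":      {"begin": "<s>", "prefix": "[USER] ", "suffix": " [/USER]\n", "end": ""},
    "assistant": {"begin": "", "prefix": "[BOT] ", "suffix": "", "end": "</s>"},
    "reverse-prompt": "</s>",
    "systemuser-system-has-suffix": true,
    "systemuser-system-has-end": false,
    "systemuser-1st-user-has-begin": false,
    "systemuser-1st-user-has-prefix": true
  }
}
'''

meta = json.loads(META_JSON)
tmpl = meta["example-tmpl"]

# Check that every role block carries all four tag keys.
ROLES = ("system", "user", "assistant")
PARTS = ("begin", "prefix", "suffix", "end")
missing = [f"{r}.{p}" for r in ROLES for p in PARTS if p not in tmpl[r]]
print(missing)
```

Tags a model does not use are simply left as empty strings, so every entry keeps the same shape.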

Usage

One needs to load the json file containing the template meta data and in turn call the
other helper functions as needed.

One can use the helper functions to either extract a given tag, or to apply all
tags specified wrt a given role to the passed message, or to apply tags as needed for
a bunch of messages in one go.

The individual message tagging helper will apply all tags specified wrt that role.

The multiple messages tagging helper, chaton-tmpl-apply, will look at the boolean flags
when tagging the passed messages. Here the system suffix, system end, user begin and
user prefix get included only if the corresponding flag is set.
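A rough sketch of how such flag driven multi message tagging could work, in Python for brevity. This is an assumption based on the description here, not the PR's actual C++ logic; `apply_template`, the template contents and the exact flag handling are illustrative.

```python
# Placeholder template: tag values are made up, flag names follow the field list.
TMPL = {
    "global": {"begin": "", "end": ""},
    "system": {"begin": "<s>", "prefix": "[SYS] ", "suffix": " [/SYS]", "end": "</s>"},
    "user": {"begin": "<s>", "prefix": "[USER] ", "suffix": " [/USER]", "end": ""},
    "assistant": {"begin": "", "prefix": "", "suffix": "", "end": "</s>"},
    "systemuser-system-has-suffix": True,
    "systemuser-system-has-end": False,
    "systemuser-1st-user-has-begin": False,
    "systemuser-1st-user-has-prefix": True,
}


def apply_template(tmpl: dict, messages: list) -> str:
    """Tag a list of (role, content) pairs, honouring the systemuser-* flags."""
    out = tmpl["global"]["begin"]
    seen_system = False
    users_after_system = 0
    for role, content in messages:
        begin, prefix, suffix, end = (
            tmpl[role][k] for k in ("begin", "prefix", "suffix", "end")
        )
        if role == "system":
            seen_system = True
            if not tmpl["systemuser-system-has-suffix"]:
                suffix = ""
            if not tmpl["systemuser-system-has-end"]:
                end = ""
        elif role == "user" and seen_system:
            users_after_system += 1
            if users_after_system == 1:  # only the 1st user msg after system
                if not tmpl["systemuser-1st-user-has-begin"]:
                    begin = ""
                if not tmpl["systemuser-1st-user-has-prefix"]:
                    prefix = ""
        out += begin + prefix + content + suffix + end
    return out + tmpl["global"]["end"]


chat = [("system", "be brief"), ("user", "hi"), ("assistant", "hello"), ("user", "bye")]
tagged = apply_template(TMPL, chat)
print(tagged)
```

With these placeholder flags the system end tag and the 1st user begin tag are dropped, while the later user message gets its full begin+prefix.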

Both the single and multi messages tagging helpers provide two versions.

  • one which returns a single string which contains the tagged message(s)
  • one which returns
    • [tagged msg] the string containing the tagged message(s)
    • [parts lengths] an array of integers, which specifies the part lengths,
      which divides the returned string into parts.
    • [parts types] a string where each character indicates whether the corresponding
      part is a normal part which needs to be tokenized without parse_special
      or is a special part which needs to be tokenized with parse_special.
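To make the second version concrete, here is a small Python sketch of producing and consuming the three outputs. The type characters 's' (special) and 'n' (normal) and both helper names are assumptions for illustration, and the lengths here count Python characters rather than bytes.

```python
def tag_with_parts(tags: dict, content: str):
    """Return (tagged msg, parts lengths, parts types) for one message."""
    pieces = [
        ("s", tags["begin"] + tags["prefix"]),  # tags: tokenize with parse_special
        ("n", content),                         # user text: tokenize without it
        ("s", tags["suffix"] + tags["end"]),
    ]
    pieces = [(t, s) for t, s in pieces if s]   # drop empty parts
    tagged = "".join(s for _, s in pieces)
    lengths = [len(s) for _, s in pieces]
    types = "".join(t for t, _ in pieces)
    return tagged, lengths, types


def split_parts(tagged: str, lengths: list, types: str) -> list:
    """Cut the tagged string back into (text, parse_special) pairs."""
    parts, pos = [], 0
    for n, t in zip(lengths, types):
        parts.append((tagged[pos:pos + n], t == "s"))
        pos += n
    return parts


user_tags = {"begin": "<s>", "prefix": "[USER] ", "suffix": " [/USER]", "end": ""}
tagged, lengths, types = tag_with_parts(user_tags, "say <s> twice")
parts = split_parts(tagged, lengths, types)
print(parts)
```

The "<s>" typed by the user ends up in a normal part, so a caller tokenizing part by part would not treat it as a special token.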

example/main

The interactive commandline program under example/main, uses

  • the system role related tags to tag the system prompt
    • the system prompt includes contents of -p if any
    • followed by contents of file specified using -f if any
  • the user begin+prefix to map to in-prefix
  • the user suffix+end to map to in-suffix
  • the reverse-prompt to map to antiprompt
  • wrt tokenization
    • the user specified system prompt is tokenized with parse_special flag.
    • however the user messages are tokenized without parse_special flag.

Currently main doesn't use chaton-tmpl-apply, but only

  • chaton-tmpl-apply-single (for system prompt) and
  • chaton-tmpl-role-kv which maps the user prefix, suffix and reverse-prompt
    to in-prefix, in-suffix and antiprompt of main.
    These always add any role specific begin+prefix and suffix+end around
    the passed message.
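The mapping used by main, as described in this section, can be sketched as follows. This is Python pseudologic under assumed names (`main_settings` and the placeholder template are hypothetical), not the PR's C++ code.

```python
def main_settings(tmpl: dict, system_prompt: str) -> dict:
    """Derive main-style settings from one template entry (hypothetical helper)."""
    system, user = tmpl["system"], tmpl["user"]
    return {
        # system prompt gets all of the system role tags around it
        "system-prompt": system["begin"] + system["prefix"] + system_prompt
        + system["suffix"] + system["end"],
        "in-prefix": user["begin"] + user["prefix"],
        "in-suffix": user["suffix"] + user["end"],
        "antiprompt": [tmpl["reverse-prompt"]],
    }


# Placeholder template, tag values made up for illustration.
tmpl = {
    "system": {"begin": "<s>", "prefix": "[SYS] ", "suffix": " [/SYS]", "end": ""},
    "user": {"begin": "<s>", "prefix": "[USER] ", "suffix": " [/USER]", "end": "\n"},
    "reverse-prompt": "</s>",
}
settings = main_settings(tmpl, "be brief")
print(settings)
```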

Adding support for new model / chat-handshake-template-standard

  1. Add suitable entries in json for that model/standard
  2. Try to reuse the generic flow in chaton-tmpl-apply as much as possible,
    before trying to add custom logic.
    If you update the generic flow, cross check whether existing json files will
    need to be updated or not.

Notes

Look at the sample chaton_meta.json in examples folder for how the above may apply

  • llama2, llama3, gemma, chatml, zephyr, deepseek(normal and coder), monarch, mistral

@hanishkvc hanishkvc changed the title from "main chatting using a simple json based template which drives in-prefix, in-suffix and reverse-prompt and a generic chat-apply-template helper driven by flags from json" to "main chat using simple json based template which drives in-prefix, in-suffix and reverse-prompt and a generic chat-apply-template helper driven by flags from same json" Apr 22, 2024
@teleprint-me
Contributor

This is interesting. The only issue I see with this is that it doesn't account for FIM (Fill-in-the-Middle). Other than that, it seems alright.

Something to note is that this, in practice, plays out a bit differently though and should be considered. For example, do we want to use only the file and/or the CLI options. I personally prefer simply using the file because it centralizes the template structure, exposes it to the API, and simplifies calling it.

There are always going to be injection risks, so maybe handle those separately. I'm just thinking out loud at the moment. Take this input with a grain of salt.

@hanishkvc
Contributor Author

hanishkvc commented Apr 22, 2024

> This is interesting. The only issue I see with this is that it doesn't account for FIM (Fill-in-the-Middle). Other than that, it seems alright.

> Something to note is that this, in practice, plays out a bit differently though and should be considered. For example, do we want to use only the file and/or the CLI options. I personally prefer simply using the file because it centralizes the template structure, exposes it to the API, and simplifies calling it.

> There are always going to be injection risks, so maybe handle those separately. I'm just thinking out loud at the moment. Take this input with a grain of salt.

If by fill-in-the-middle you mean that the user message has special-token related tags in it, which in turn get treated as special tokens when tokenised and can thus mess with things, then if you look at the flow wrt main, the user message is tokenized without the parse_special flag.

However my generic chat-apply-template currently doesn't handle this, because it would require returning a vector of strings rather than a single string, as noted in the PR comment. If I am not wrong, that would be different from how others expect chat-apply-template to work, so I haven't decided on it yet. Nor have I looked into other libraries' chat-apply-template in detail, so I am guessing a bit here.

However if you mean something else, please do explain a bit, so I can see if I can do something about it. Do note that I am not a big user of the current crop of LLMs for various reasons, though I do look at them once in a while to see where things are, so I am not that tuned in with the conventions / concept-names etc.

I wanted a simple program with minimal inter-dependencies to use on my resource-limited machine, and I had some issues with ollama and llama3, so I just hacked this in with mostly guess work and crude generalisation, by looking at the existing flow to some extent and at what I was seeing when I experimented with what I needed. I am hacking xyz without understanding abc, in some sense.

@hanishkvc
Contributor Author

Or do you mean coding related models? I don't know if they have some fill-in-the-blank, or is it fill-in-the-middle, or some such phrase which I may have previously seen wrt them. I don't remember now, as I haven't looked at them. If it is something like that you are talking about, I will have to look at it.

Be it a general LLM or a coding related LLM, if you are talking about it filling some blanks in the middle of a statement the user has entered, then I assume the user will put some special tokens in the middle of their prompt, in which case the user message will have to be tokenized using parse_special. If that is what you are talking about, then maybe a cmdline argument can be added to inform the logic whether to treat the user message as normal text or as potentially including special token related tags.

@teleprint-me
Contributor

teleprint-me commented Apr 23, 2024

> Or are you meaning coding related models and I dont know, if they have some fill-in-the-blank or is it fill-in-the-middle

Yes, this is what I meant. One of the models (that I know of) that's capable of infill is the Refact model. Sorry if I caused confusion or made assumptions.

@hanishkvc
Contributor Author

Updated notes

Overview

Helps chat with a model, by allowing role based special token tagging, based on the specified chat-handshake-template-standard.
This is used by main, to build on existing interactive flow and its in-prefix, in-suffix and antiprompt/reverse-prompt

  1. Use a json file to configure the needed tags for each of the supported chat-handshake-template-standard

    a. system -> prefix & suffix,

    b. user -> prefix & suffix, assistant -> prefix

    • [main] these override the in-prefix and in-suffix

    c. reverse-prompt

    • [main] this adds to any reverse-prompt specified using cmdline

    d. global -> begin & end

    e. systemuser-1st-user-has-prefix

    • if a combination of system and user messages/prompts is passed,
      then for the 1st user message following the 1st system message,
      include user prefix only if this flag is set. [chaton-tmpl-apply]

    • [later] one or two models which I looked at seem to require not just BoS, but also the user-role-prefix-tag
      to be controlled wrt this case. So this is not differentiating between BoS and any user-role-prefix-tag.
      However if BoS and user-role-prefix-tag need to be decoupled, where only BoS needs this treatment,
      then maybe add begin and end keys (to specify the BoS) in addition to prefix and suffix keys (to specify user-role-prefix-tag),
      to the role blocks in the json, and in turn control only begin and not prefix, wrt whether to add or not.

  2. [main] currently the user specified system prompt (-p + -f) is tagged using system role tags,
    and in turn this tagged message is tokenized with the parse_special flag.
    So any special token related tags in the user specified system prompt will get parsed as special.

  3. chaton-tmpl-apply uses the json file, which was loaded, to decide on how to generate the tagged messages for tokenisation.

    a. input: [ { role, message }, { role, message}, ....]

    b. output: currently a single string is returned which contains the tagged message(s).

    [later] if it is needed to differentiate the special tags added by this logic from user specified prompts/messages,
    then return [ {flag, data}, { flag, data}, {flag, data}, ....],
    where the flag specifies whether parse_special should be used or not for the corresponding data, during tokenization.

Adding support for new model / chat-handshake-template-standard

  1. Add suitable entries in json for that model/standard

  2. Update the flow in chaton-tmpl-apply, as needed.

    Try to update and/or reuse the generic flow in chaton-tmpl-apply as much as possible,
    before trying to add custom logic.
    If you update the generic flow, cross check whether existing json files will need to be updated or not.

Notes

Currently main doesn't use chaton-tmpl-apply, but only

  • chaton-tmpl-apply-single (for system prompt) and
  • chaton-tmpl-role-part which maps the user prefix, suffix and reverse-prompt to
    in-prefix, in-suffix and antiprompt of main.
    These always add any role specific prefix and suffix around the passed message.

@hanishkvc
Contributor Author

hanishkvc commented Apr 23, 2024

Sample chaton_meta.json includes template info for

  • llama2
  • llama3
  • gemma
  • chatml
  • zephyr
  • deepseek

I noticed some difference between deepseek's actual tokenizer config and what is there in llama.cpp's chat-apply-template, so for my logic I have added two entries: deepseek-alt (which matches the existing llama.cpp template) and deepseek (which matches the role related tags and eos from tokenizer_config.json). However both will potentially work.

Later I need to cross check the tokenizer_config.json of the other models against what I have put in chaton_meta.json, to see if they are in sync or not. However, based on minimal testing of these models, the existing templates in chaton_meta.json do seem to work.

NOTE: Even if there is some difference in the EoS specified using reverse-prompt, chances are the default logic in main already looks for the EoS specified in the loaded model file as well, so things should still be fine even if the json doesn't match the one in the model.

@hanishkvc
Contributor Author

> > Or are you meaning coding related models and I dont know, if they have some fill-in-the-blank or is it fill-in-the-middle

> Yes, this is what I meant. One of the models (that I know of) that's capable of infill is the Refact model. Sorry if I caused confusion or made assumptions.

I am in the middle of some things, but later I will try to look into this, as well as add a cmdline option to control whether the user prompt is parsed wrt special tokens or not.

@hanishkvc
Contributor Author

Have added support for Begin and Prefix entries wrt the User role, and in turn one can configure each of them individually wrt whether it gets added to the 1st user message following the system message, from the chat-template-apply perspective, in common/chaton.hpp.

Look at the llama3, llama2 and monarch entries in examples/chaton_meta.json wrt how things can differ wrt begin and prefix, and in turn the 1st user msg following the system message.

@ngxson
Collaborator

ngxson commented Apr 23, 2024

At first glance, I'm not sure if it's a good idea to move the implementation completely into a separate JSON. While the good point is that it allows users to edit the list of templates easily, it brings some problems:

  • This API is now outside of llama.h and can't be used in other examples
  • It depends on json.hpp which, again, is not part of llama.h

Also, could you do a test implementation with the examples in tests/test-chat-template.cpp? Automated tests will make the PR easier to follow (and to verify if the idea works)

@hanishkvc
Contributor Author

hanishkvc commented Apr 23, 2024

> At first glance, I'm not sure if it's a good idea to move the implementation completely into a separate JSON. While the good point is that it allows users to edit the list of templates easily, it brings some problems:

>   • This API is now outside of llama.h and can't be used in other examples
>   • It depends on json.hpp which, again, is not part of llama.h

> Also, could you do a test implementation with the examples in tests/test-chat-template.cpp? Automated tests will make the PR easier to follow (and to verify if the idea works)

@ngxson for now I purposefully kept the new flow outside llama.h and within common/chaton.hpp for these reasons

  1. This is currently still an experiment to cross check that this mechanism can work in general across models, as well as wrt both the web-service/server flow and a normal cmdline application flow (main). So I didn't want to step on the current llama-chat-apply-template till this is validated.

  2. As I had mentioned, I have a fundamental issue with the current llama-chat-apply-template api, in that it merges the user prompt and the chat-handshake-template-standard tags/special-tokens into a single string, which in turn will be tokenized with the parse_special flag. This would allow a user to override/modify the system prompt or other aspects from under the normal handshake by inserting special tokens into the user prompt, etc. Ideally it should be configurable, ie in some cases you may want this flexibility and in some cases you won't.

In turn, providing flexibility wrt (2) would either way require adding a new api wrt chat template applying, while potentially retaining the old api as a wrapper over a more flexible newer api.

As a possible step towards that more flexible flow, while experimenting towards the same, I have added the initial skeleton of a ChatParts class in common/chaton.hpp. (Note: I have been away from C++ for 1.5 to 2+ decades now, and have jumped through too many languages at lower, similar or higher abstraction levels compared to C++ over the years, so my C++ memory is only so-so. I have depended more on the compiler not warning/erroring out, and have not strictly thought through the new classes from a memory mgmt perspective, so there could be some inefficiencies and/or gotchas in there.)

Also, if it makes sense to expose a more flexible api to differentiate between special-tokens++ parts and user provided parts in the formatted/tagged string, then there is the question of whether to expose it through the C only hop in llama.h by using

  • a single string + an array of start-end indexes which relate to the parts of the string that are either special-tokens parts or the other kind
  • or return an array of strings and an additional string with each char indicating how to treat each of the individual strings in the returned array
  • or ...

Also I remember reading somewhere earlier today about possibly deprecating antiprompt/reverse-prompt. Rather, I feel the EoS/EoG tracking by main's logic should be built on top of antiprompt, in that the antiprompt vector should maintain a bunch of possible antiprompts, which can be filled from the EoS info in the model file itself, from any commandline argument passed by the user, as well as potentially from a chat-template manager like the chat-template-apply driven logic. The reason is that, if I am not wrong, some of the models may allow more than 1 chat-handshake-template-standard, in which case the model file may not explicitly provide all of the possible EoS/EoG token(s) across all of their supported standards. So retaining the antiprompt vector provides flexibility for multiple levels of intervention like what I mentioned.
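A tiny sketch of that idea, in Python for brevity. The function names and the three sources passed in are hypothetical; this is not how main currently works, just one way the merged antiprompt vector could behave.

```python
def build_antiprompts(model_eos: list, cmdline: list, template: list) -> list:
    """Merge stop strings from all sources, keeping order and dropping repeats."""
    merged = []
    for s in [*model_eos, *cmdline, *template]:
        if s and s not in merged:
            merged.append(s)
    return merged


def hit_antiprompt(generated: str, antiprompts: list):
    """Return the antiprompt that the generated text ends with, else None."""
    for ap in antiprompts:
        if generated.endswith(ap):
            return ap
    return None


# Placeholder stop strings for illustration.
antiprompts = build_antiprompts(["</s>"], ["User:"], ["</s>", "<|eot|>"])
print(antiprompts)
print(hit_antiprompt("hello there<|eot|>", antiprompts))
```

Here a stop string that only the template config knows about still ends generation, even though the (hypothetical) model file lists a different EoS.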

This was originally a weekend project, to solve an immediate issue I had at my end, and later to see if there can be a generic flow, which can ideally be modified and/or extended in future for models/standards which follow a sensible convention, without needing to modify code. The skeleton which I have added in chaton.hpp seems to provide that for the 5 to 6 models which I tested at a minimal level (ie a few back and forth handshakes using main's interactive flow augmented with my logic/PR) and the corresponding entries added to chaton_meta.json in my PR.

I glanced through test-chat-template.cpp, but I feel it currently uses a vector of chat templates from models or ... without identifying the individual templates explicitly (like through a map instead of a vector or so), thus requiring one to manually map each template against the chat-apply-template code to see what model/standard it may be mapping to. I will see if I can create a duplicate file which uses this alternate chat-template-apply logic, after I have fleshed out ChatParts a bit more.

@hanishkvc
Contributor Author

Also, as the json library seems to be co-opted into llama.cpp/common, I used the same and built my concept on top of it.

If the logic in this PR works out in handling most of the sensible models/standards out there using a generic flow, and in turn if there is interest in this PR, then maybe we can avoid json and replace it with a simple 1-level hierarchy text file, something like below, and a simple parser for it.

Template-id1
\t key1: "value"
\t key2: true|false
\t user-prefix: "line 1 content \n line 2 ...\n line 3"
\t user-begin: "value ..."

Template-Id2
\t key1: value4k1
\t key2: value4K2

....
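A simple parser for such a file could be sketched as below, in Python for brevity. It assumes (my reading of the sample above) that a non-indented line starts a new template id, indented lines are `key: value` pairs, double quoted values may carry \n escapes, and true/false are booleans. All names are hypothetical.

```python
def parse_value(value: str):
    """Interpret one value: boolean, double quoted string, or bare string."""
    if value in ("true", "false"):
        return value == "true"
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1].replace("\\n", "\n")  # only \n escapes handled here
    return value


def parse_simple_meta(text: str) -> dict:
    """Parse the 1-level text format into {template-id: {key: value}}."""
    templates, current = {}, None
    for raw in text.splitlines():
        if not raw.strip():
            continue                      # skip blank lines
        if not raw[0].isspace():          # non-indented line: new template id
            current = templates.setdefault(raw.strip(), {})
            continue
        if current is None:
            raise ValueError("key line found before any template id")
        key, sep, value = raw.strip().partition(":")
        if not sep:
            raise ValueError(f"missing ':' in line: {raw!r}")
        current[key.strip()] = parse_value(value.strip())
    return templates


SAMPLE = "\n".join([
    "tmpl-a",
    '\tuser-prefix: "[USER] "',
    '\tuser-suffix: " [/USER]\\n"',
    "\thas-end: false",
    "",
    "tmpl-b",
    "\tkey1: value4k1",
])
parsed = parse_simple_meta(SAMPLE)
print(parsed)
```

Because quoted values are kept verbatim between the quotes, leading or trailing spaces in a tag survive the parse.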

@ngxson
Collaborator

ngxson commented Apr 24, 2024

I've just had a look at this PR in detail. The idea seems ok (i.e. using input_prefix/input_suffix/antiprompt), but I still find the implementation quite complicated IMO:

  • I still prefer not to rely on JSON, since it makes the compiled binary not very portable
  • Not sure how we can handle conditional prefix/postfix, for example llama2 with/without system message
  • The code has some level of abstraction, for example ChatParts, that does not fit very well with the code style of llama.cpp. It makes me feel a bit like the logic is designed for higher-level languages like python

Also I don't really understand the differences between this PR and #6822, as I'm also trying to implement a system of prefix/postfix for chat templates. Can you explain this a bit more?

> Also I remember earlier today somewhere reading about possible deprecating of antiprompt/reverse-prompt, but rather I feel the EoS/EoG tracking by main's logic should be built on top of antiprompt in that antiprompt vector should maintain a bunch of possible antiprompts which can be filled from the EoS info in the model file itself

The problem is that all the new chat templates have moved away from antiprompt. They're all using special tokens to stop generation. This will still be true for all future models, so I don't think antiprompt is something that is future-proof (but special tokens are)

@hanishkvc
Contributor Author

hanishkvc commented Apr 24, 2024

@ngxson hope the below gives some more background and/or info on the idea behind this PR

> I've just had a look at this PR in detail. The idea seems ok (i.e. using input_prefix/input_suffix/antiprompt), but I still find the implementation quite complicated IMO:

>   • I still prefer not to rely on JSON, since it makes the compiled binary not very portable

Based on further experimentation, if it is found that a good number of chat-handshake-template-standards can be driven using a config file (json/...), then as I had mentioned in a previous comment, we could look at a simple text file based config instead of json, so that the code can be portable, without depending on a separate json library.

>   • Not sure how we can handle conditional prefix/postfix, for example llama2 with/without system message

If you are talking about one kind of tagging being required for a system+user message and a different kind for a user only message, then using the systemuser-1st-user-has-begin/prefix flags in the json file, I have tried to handle the difference in tagging across many models/standards. However I agree that there may be a few more variations when looked at across multiple models/chats. I am looking at a more detailed (in terms of fields) json to see if more combinations can be covered, and that too without adding more custom flags. Maybe tomorrow I will give it a shot.

However do note that if we are looking at pure main program based chatting, yesterday's simple json and corresponding logic already allows chatting with the 5 to 6 models which I tested yesterday. However wrt the server/web-service related flow, I need to cross check with the more detailed json, because some more variations come into the picture.

>   • The code has some level of abstraction, for example ChatParts, that does not fit very well with the code style of llama.cpp. It makes me feel a bit like the logic is designed for higher-level languages like python

If you read my previous comments, as I had mentioned, ChatParts is more to help keep the different parts that make up a tagged message/chat separate, so that additional data can be extracted to tokenize in a more fine grained manner. However at the same time, to allow exposing the api over a standard C extern interface, instead of ChatParts itself, its helpers can be used to expose the additional info using an array of chars and an array of ints. Today's commit already has this mechanism implemented; do have a look to see what I mean.

You will see that people who want to follow the old api related flow of working with a single tagged string as is can do that, while at the same time additional info is exposed, if they want to tokenize the user prompt parts differently from the tag parts.

> Also I don't really understand the differences between this PR and #6822, as I'm also trying to implement a system of prefix/postfix for chat templates. Can you explain this a bit more?

If I am not wrong, you are looking at implementing the prefix/postfix using hardcoded tags in the code, while this PR tries to see if the needed functionality can be achieved by using a combination of a json/text based config file + code, with the idea being to allow end users to manipulate tagging to some extent without needing to recompile things, as well as to allow new models/standards to be supported using a generic flow where possible.

ALERT: This is still an experiment. I need to cross check this a bit more before I can categorically say whether this can handle most common combinations or not.

> > Also I remember earlier today somewhere reading about possible deprecating of antiprompt/reverse-prompt, but rather I feel the EoS/EoG tracking by main's logic should be built on top of antiprompt in that antiprompt vector should maintain a bunch of possible antiprompts which can be filled from the EoS info in the model file itself

> The problem is that all the new chat templates have moved away from antiprompt. They're all using special tokens to stop generation. This will still be true for all future models, so I don't think antiprompt is something that is future-proof (but special tokens are)

Rather, you seem to have looked at only a part of my comment. If you read the para fully, you will see the reason why I have suggested retaining the current flexible antiprompt mechanism and then adding the EoS from the model file into the antiprompt flow itself, ie by inserting the EoS info in the model into the antiprompt vector.

@hanishkvc
Contributor Author

hanishkvc commented Apr 24, 2024

By a more detailed json/text config file, to try to support more combinations in parallel without too many flags, what I am thinking of is (need to cross check)

  • global-> begin & end

  • system -> begin, prefix, suffix & end

  • user -> begin, prefix, suffix & end; assistant -> begin, prefix, suffix & end

    • [main] these override the in-prefix (begin+prefix) and in-suffix (suffix+end)
  • reverse-prompt

    • [main] this adds to any reverese-prompt specified using cmdline
  • systemuser-sys-has-suffix, systemuser-sys-has-end, systemuser-1st-user-has-begin and systemuser-1st-user-has-prefix

    • [chaton-tmpl-apply] if a combination of system and user messages/prompts is passed, then for system messages suffix and end, as well as for the 1st user message following the 1st system message, include system suffix and end and user begin and prefix only if corresponding flags is set.

    • begin should normally relate to BoS while prefix should relate to Role Identifier tag. If there is no need for seperate handling of BoS and RoleIdTag, then one could even set both BoS and RoleIdTag to one of these entries itself.

This is still just an initial idea in my mind from looking at a few jinja files; I still need to think through and try out this detailed-fields based flow. However the existing simpler json, and the corresponding support added to drive main's in-prefix/suffix, does work for main based chatting. It is the server/web-service kind of flow where this more detailed-fields based flow needs to be thought through a bit more and cross checked.

Also the idea is to try and see if a common generic logic can be used to drive templating for many models/standards, while still providing the flexibility to hardcode in code if required for specific models/standards.
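
To make the field list above a bit more concrete, here is a rough sketch of how such per-role fields and the system+1st-user flags could drive a generic tagging loop. This is only an illustration and not the code in this PR: the names RoleTags, ChatTemplate, ChatMsg, apply_template, in_prefix and in_suffix are all made up for this example, and for simplicity the system flags are applied to every system message rather than only when a system+user combination is passed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout: every role carries begin, prefix, suffix and end tags.
struct RoleTags {
    std::string begin, prefix, suffix, end;
};

// Hypothetical container for one chat-handshake-template-standard.
struct ChatTemplate {
    std::string global_begin, global_end;
    std::map<std::string, RoleTags> roles;   // "system", "user", "assistant"
    std::string reverse_prompt;
    // Flags for the system + 1st-user-message combination.
    bool sys_has_suffix = true;
    bool sys_has_end = true;
    bool first_user_has_begin = true;
    bool first_user_has_prefix = true;
};

struct ChatMsg {
    std::string role, content;
};

// Generic flow: global-begin, then begin+prefix+content+suffix+end per message,
// then global-end, with the flags deciding which tags get dropped.
std::string apply_template(const ChatTemplate & t, const std::vector<ChatMsg> & msgs) {
    std::string out = t.global_begin;
    bool seen_system = false;
    bool seen_user = false;
    for (const ChatMsg & m : msgs) {
        auto it = t.roles.find(m.role);
        assert(it != t.roles.end() && "unknown role");
        const RoleTags & r = it->second;
        bool begin = true, prefix = true, suffix = true, end = true;
        if (m.role == "system") {
            // Simplification: applied to every system message.
            suffix = t.sys_has_suffix;
            end = t.sys_has_end;
            seen_system = true;
        } else if (m.role == "user") {
            if (seen_system && !seen_user) {
                // 1st user message following a system message.
                begin = t.first_user_has_begin;
                prefix = t.first_user_has_prefix;
            }
            seen_user = true;
        }
        if (begin)  out += r.begin;
        if (prefix) out += r.prefix;
        out += m.content;
        if (suffix) out += r.suffix;
        if (end)    out += r.end;
    }
    out += t.global_end;
    return out;
}

// For main: in-prefix is user begin+prefix, in-suffix is user suffix+end.
std::string in_prefix(const ChatTemplate & t) {
    const RoleTags & u = t.roles.at("user");
    return u.begin + u.prefix;
}

std::string in_suffix(const ChatTemplate & t) {
    const RoleTags & u = t.roles.at("user");
    return u.suffix + u.end;
}
```

With something like this, adding a new standard would mostly mean filling in one more ChatTemplate worth of strings and flags, which is the part the config file is meant to carry.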

@hanishkvc hanishkvc force-pushed the hkvc_chaton_v3 branch 2 times, most recently from aa66db1 to 30efa0b Compare April 27, 2024 00:35
@hanishkvc
Contributor Author

hanishkvc commented Apr 27, 2024

@ngxson have a look at the latest commit here, using a simple generic logic (which you can checkout in chaton_tmpl_apply_ex function) and a json file containing the details of the chat-template in a simple and detailed way, this logic tries to allow tagging of the messages across different models/template standards.

For around 9 models/chat-handshake-template-standards I have included sample json config in examples/chaton_meta.json

  • llama2, llama3, gemma, chatml, zephyr, deepseek(normal and coder), monarch, mistral

The c-api which follows a similar semantic as the previous llama_chat_apply_template, is available in the common/chaton.hpp.

As the models for which I have added sample template config info, and in turn checked using the modified main, are a bit different from those in test-chat-templates, I have added a new test-chat-template-chaton.cpp to the tests folder. Running it shows the tagged messages for the 9 models mentioned above, so that you can check whether what it generates is similar to what you expect.

I feel this mechanism of a generic flow driven by a json is viable in general, based on the models which I have tested against. And either way, if a particular model requires a very different structure beyond what can be generated by the generic logic, one can always add custom code into chaton_tmpl_apply_ex. This should allow supporting new models in many cases by just adding to the json config file.

Also do go through the detailed Notes/Comments at the beginning of common/chaton.cpp to get a rough feel for this code.

@ngxson
Collaborator

ngxson commented Apr 28, 2024

I understand the high level idea but sorry I really don't have time to look at the detailed implementation. While it's a good idea, IMO the chat template infrastructure should be kept simple and support for customizable formats can be added later on. Maybe we can keep your PR as a demo and we will see if it can be merged in the future.

Also for context, there's already a discussion on chat templates in the beginning of server development. You can have a look here: #4216 (comment)

@hanishkvc
Contributor Author

hanishkvc commented Apr 28, 2024

Hi @ngxson, @ggerganov

generic code flow + config file based template handling

Please do have a look at the implementation. The generic flow is actually very simple yet flexible, and I feel this idea can accommodate many different models / handshake standards by just updating the config file, without touching the code flow. In the worst case there could be some small differences in white space, which potentially may not matter beyond a limit. Even that may be handleable by adding some generic flags like trim content, but that may make the control unnecessarily detailed.

I have tried to add support for 8(+1) different models/standards in examples/chaton_meta.json, all by using the generic flow itself, without requiring any customization in code to accommodate a specific model/standard. At an initial glance the tagged messages seem ok to me, but it would be useful for someone else to also cross check once to be sure.

To test wrt main one needs to use

  • bin/main -m path/to/llama3.gguf -i --chaton-meta-json ../examples/chaton_meta.json --chaton-template-id llama3 -p "You are a monster " -f ../prompts/chat-with-bob.txt

To test the possible server flow related multi message tagging at once

  • bin/test-chat-template-chaton ../examples/chaton_meta.json

This PR's specific code is in common/chaton.hpp, and the generic logic which uses the config file to do the tagging is in the function chaton_tmpl_apply_ex. You will notice that the generic flow basically just builds on the basic pattern used by most models/standards in a simple and straightforward way, without much complexity.

It also provides the basic plumbing for differentiating between the user provided parts and the handshake template provided parts in the tagged message, so that in future, if required, the tokenisation can be controlled in terms of using parse_special for the template provided parts and avoiding parse_special for the end user entered parts, i.e. their queries during chatting. This is currently not exposed in the c api.

Because this config file + associated generic flow tries to expose all parts of the generic pattern followed for all the 3 roles, anyone wanting to experiment with different templates for a given model will potentially be able to do that by just updating the config file. The exception is if one is doing some fancy conditional inserting of tokens etc., beyond the basic system+1st-user-message related one which I have seen in the 8(+1) models that I have checked to some extent. In that case they will have to add custom code, as they would have to even now in the existing flow. (This partly relates to a query/comment I noticed in PR #4216.)
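
As a rough sketch of what that plumbing could look like (an illustration only, not the representation used in chaton.hpp): keep the tagged message as a list of parts, each marked as coming from the template or from the end user, and tokenize each part with the matching parse_special setting. TaggedPart, tag_message, TokenizeFn and tokenize_parts are hypothetical names, and the tokenizer is passed in as a callback rather than assuming any particular llama.cpp signature.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One piece of a tagged message. from_template marks text supplied by the
// handshake template, as opposed to text typed by the end user.
struct TaggedPart {
    std::string text;
    bool from_template;
};

// Wrap user content with template supplied tags, remembering who supplied what.
std::vector<TaggedPart> tag_message(const std::string & prefix,
                                    const std::string & content,
                                    const std::string & suffix) {
    std::vector<TaggedPart> parts;
    if (!prefix.empty()) parts.push_back({prefix, true});
    parts.push_back({content, false});
    if (!suffix.empty()) parts.push_back({suffix, true});
    return parts;
}

// Hypothetical tokenizer callback: (text, parse_special) -> token ids.
using TokenizeFn = std::function<std::vector<int>(const std::string &, bool)>;

// Tokenize part by part: special-token parsing only for template parts, so a
// user typing a tag-like string does not get it treated as a special token.
std::vector<int> tokenize_parts(const std::vector<TaggedPart> & parts,
                                const TokenizeFn & tokenize) {
    std::vector<int> tokens;
    for (const TaggedPart & p : parts) {
        std::vector<int> t = tokenize(p.text, p.from_template);
        tokens.insert(tokens.end(), t.begin(), t.end());
    }
    return tokens;
}
```

The design choice here is simply to delay flattening to a single string until after tokenization, since once the parts are concatenated the information about who supplied which part is lost.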

simple text based config file

As you had noted a concern that using this config file based flow could force users of the core llama.cpp to bring in a json library, I have added a simple text based config file logic, to try and avoid the dependence on json while still giving sufficient flexibility for this use.

The code for the same is in common/simpcfg.hpp

and the sample simpcfg text based config file is in examples/chaton_meta.simpcfg

NOTE: currently I have not updated chaton.hpp to use these simpcfg based files instead of json files. If all of you find that the chaton generic flow does what is expected in a sufficiently proper way, and in turn that it is better to avoid a json dependency for 3rd party users of llama.cpp as a library, then I can look at replacing the json (picked from what was already in the common dir) with this simpcfg based flow.

Note

Do have a look at the note in chaton.hpp for the up-to-date overall flow and reasoning. For now the 1st note in this PR conversation is updated to match the note in chaton.hpp.

Also I agree, let's not look at this from a merging angle yet. Only after both of you, and anyone else knowledgeable whom you want to look at this flow, have gone through it and found it to be ok and flexible enough, can we look at merging.

NOTE: I am a casual (non-regular) user of LLMs as well as llama.cpp, so I don't have much experience with it beyond the basics. But if this idea works out, as it currently seems to, then in future many new models/chat-handshake-template-standards that follow a sane generic pattern, as many seem to, could be supported by the generic flow itself, by just updating the config file, without needing to modify the code and recompile it. However I need eyes from experienced users and developers of llama.cpp like you to cross check whether what I am seeing with my limited testing actually makes sense.

NOTE: If new models/standards follow a sane pattern, then other than updating the config file, the only change that may be required in code is in the tokenizer, for any new special tokens that they may have added or a different encoding for an existing special token tag, i.e. if there is no generic way to pick this info across models from their model file. This is a logical guess based on my limited knowledge of llama.cpp and LLMs in general.

@hanishkvc
Contributor Author

hanishkvc commented May 4, 2024

Updates wrt

SimpCfg

  • Provide logic for more proper trimming wrt more languages by converting to wstring, disabled by default
  • switch to variant and templates based logic, to avoid duplication, as well as to allow for easier new data type additions if required in future.
  • support for vectors/arrays of supported data types in a verbose/expanded out way
  • ensure true/false bool is case insensitive when loading for the text based simple config file
  • avoid maintaining enclosing double quotes from the config file wrt string values
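
A minimal sketch of the value handling described in the SimpCfg points above, assuming a std::variant over a few basic types. This is not the code in common/simpcfg.hpp; CfgValue, trim, to_lower and parse_value are names made up for this example, and the trimming here is plain ASCII only (the wstring based trimming mentioned above is not shown).

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <exception>
#include <string>
#include <variant>

// Hypothetical value type: one variant instead of a per-type code path.
using CfgValue = std::variant<std::string, bool, int64_t, double>;

// Strip leading and trailing ASCII whitespace.
static std::string trim(const std::string & s) {
    const char * ws = " \t\r\n";
    size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return (char) std::tolower(c); });
    return s;
}

// Convert the raw text of a config value into a typed CfgValue.
CfgValue parse_value(const std::string & raw) {
    std::string v = trim(raw);
    // Quoted string: drop the enclosing double quotes.
    if (v.size() >= 2 && v.front() == '"' && v.back() == '"') {
        return v.substr(1, v.size() - 2);
    }
    // Bool: case insensitive true/false.
    std::string lower = to_lower(v);
    if (lower == "true")  return true;
    if (lower == "false") return false;
    // Integer, then floating point; the whole text must be consumed.
    size_t pos = 0;
    try {
        long long i = std::stoll(v, &pos);
        if (pos == v.size()) return (int64_t) i;
    } catch (const std::exception &) {}
    try {
        double d = std::stod(v, &pos);
        if (pos == v.size()) return d;
    } catch (const std::exception &) {}
    // Anything else is kept as an unquoted string.
    return v;
}
```

Adding a new data type in a scheme like this means adding one alternative to the variant and one branch to the parser, which is the kind of duplication the variant+templates switch is meant to avoid.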

ChatOn

  • add support for Phi3 model to chaton_meta.json

hanishkvc added 4 commits May 14, 2024 01:19
Make it similar to user-begin+prefix control. ie only wrt 1st msg
of respective type.
Use same to bypass any msg count based tagging behaviour for the
single message tagging through its helper wrapper.
However still retain the wrappers, which work with a predefined
global instance of ChatTemplates.
@mofosyne mofosyne marked this pull request as draft May 14, 2024 00:55
hanishkvc added 4 commits May 14, 2024 18:45
GroupKV dump adds needed ":" separator on its own, so calling
functions can just pass the tag string they want in the log without
worrying about any demarcation.
Also add simple note wrt itself and its helper.
The initial version was rooted around a json object, while the new
version is rooted around a MapOfMapOfVariant (GroupKV), which could
be preloaded with chat templates info at compile time itself and
used as is. Or optionally one could allow the configurable template
data to be extended/updated at runtime from a text(/SimpCfg)/json
file.
@hanishkvc
Contributor Author

Hi @khimaros,

This patch auto sets example/main's in-prefix/suffix as well as antiprompt/reverse-prompt from the equivalent configuration data in the specified chaton_meta.json file, which is why they no longer need to be specified explicitly.

The extra "\n> " you are seeing is the only-visible-to-end-user prompt added by the existing main code, as I reuse/extend the existing main flow, you see the same.

answering my own question: it seems this hasn't been incorporated into the server yet. seems there's another branch for that.

i'm testing out the command-r chaton configuration using the following incantation:

./main --temp 0.7 --repeat_penalty 1.1 --model ./models/c4ai-command-r-v01-Q6_K.gguf --ctx_size 4096 --threads 8 --n_predict 2048 --color --interactive --file /tmp/llamacpp_prompt.enm73Qj.txt --reverse-prompt USER: --in-prefix ' ' --chaton-meta-json ./examples/chaton_meta.json --chaton-template-id command-r

with the following in the prompt --file:

This is a conversation between USER and COMMANDR, a friendly chatbot. COMMANDR is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision

i have the following observations:

  • previously, it was idiomatic to add USER: as the last line of the prompt string. this is no longer needed.
  • it is no longer necessary to provide a reverse prompt of USER: nor an input prefix of ' ' in order to control the flow
  • seeing all of the tokens in the chat log is a bit disorienting and less immersive than seeing USER: / ASSISTANT: as it is in master branch. i wonder what the tradeoffs would be of hiding the tokens from the user?
  • i'm seeing spurious whitespace and > just before the input prefix token
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>This is a conversation between USER and COMMANDR, a friendly chatbot. COMMANDR is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.<|END_OF_TURN_TOKEN|>

>  <|START_OF_TURN_TOKEN|><|USER_TOKEN|>

@hanishkvc
Contributor Author

hanishkvc commented May 14, 2024

Hi @ggerganov @ngxson @teleprint-me @khimaros @mofosyne

The initial/previous version was rooted around a json object, while the new version is rooted around a MapOfMapOfVariant (GroupKV), which could be preloaded with chat templates info at compile time itself and used as is. Or optionally one could allow the configurable template data to be extended/updated at runtime from a text(/SimpCfg)/json file.

Thus this new flow should allow for using the new chat templating logic without needing to load additional data at runtime, if one doesnt want to, thus also avoiding need to bring in common/json library.
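
To illustrate the idea, a map-of-map-of-variant store with compiled-in defaults that can optionally be extended at runtime could look roughly like the sketch below. This is not the actual GroupKV class from this PR; GkvValue, GroupKVMap, gkv_get, gkv_merge and builtin_templates, as well as the "example" template and its key names, are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <variant>

// group (template id) -> key -> typed value
using GkvValue   = std::variant<std::string, bool, int64_t, double>;
using GroupKVMap = std::map<std::string, std::map<std::string, GkvValue>>;

// Return group/key if present and of type T, otherwise the given default.
template <typename T>
T gkv_get(const GroupKVMap & gkv, const std::string & group,
          const std::string & key, const T & def) {
    auto g = gkv.find(group);
    if (g == gkv.end()) return def;
    auto k = g->second.find(key);
    if (k == g->second.end()) return def;
    const T * v = std::get_if<T>(&k->second);
    return v ? *v : def;
}

// Runtime data (e.g. loaded from a text or json file) extends or overrides
// whatever was compiled in.
void gkv_merge(GroupKVMap & base, const GroupKVMap & extra) {
    for (const auto & [group, kvs] : extra) {
        for (const auto & [key, value] : kvs) {
            base[group][key] = value;
        }
    }
}

// Stand-in for compiled-in template data (values made up for illustration).
GroupKVMap builtin_templates() {
    GroupKVMap gkv;
    // Explicit std::string keeps string literals away from the bool alternative.
    gkv["example"]["user-prefix"]  = std::string("<user>");
    gkv["example"]["user-suffix"]  = std::string("</user>");
    gkv["example"]["user-has-end"] = true;
    return gkv;
}
```

A program that never calls gkv_merge only ever sees the compiled-in data and needs no file loader at all, while one like main can load a file into a second map and merge it on top.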

At the same time for a use case like examples/main where it is useful to allow the user to either change the existing (pre/compiled-in) template info and or try adding support for new models/finetunes/template-standards, the same can be achieved by loading it from json file.

Optionally in some use-cases, if one wants the runtime augmenting capability but still doesn't want to bring in common/json, then one could switch ChatTemplates to use SimpCfg (which builds on GroupKV) and in turn use its load logic to load from a simple text file.

The Notes in common/chaton.hpp has been updated to capture the new mechanism.

Currently by default CHATON_JSON (which brings in json based loading) as well as GKV_DEBUGLOG_ON (which makes the logic more log verbose) is enabled, which needs to be disabled. Rather as I was writing this, come to think of it, I need to move the CHATON_JSON block into its own file, so that the library by default can be compiled without needing json and inturn only programs which use it like main can include the new file with this json based loading helper.

NOTE: The compile time pre/compiled-in configurable template data is picked from chaton_meta.hpp. There is a simple and stupid minded python helper added to scripts to convert from chaton_meta.json to chaton_meta.hpp.

NOTE: Currently I have not updated the code to follow some of the naming/coding convention mentioned.

hanishkvc added 9 commits May 15, 2024 02:11
Any program which wants to use json file to update/extend the
chaton's configurable template data, can include this new file
chaton_json.hpp, to get the reqd functionality.

Update chaton_meta_ok, _chaton_meta_validate_dump and
chaton_meta_load_json to either work with a passed ChatTemplates
instance, or fallback to the compiled-in global instance of same.
Merge upstream as of 20240515IST11XY
@hanishkvc
Contributor Author

hanishkvc commented May 15, 2024

Hi @ggerganov @ngxson @mofosyne

Just to give a rough context: for code using the existing chat template logic, like examples/server, a simple change like the one below will allow it to use the new chat template logic from this PR, once the code+setup is updated to live in the llama(.cpp) library rather than the common library, along with specifying/passing the template-id rather than the jinja template string/... as the template argument (I have hardcoded it to llama3 below).

I have done a crude transplanting of chaton into llama.cpp in the below repo, in case if anyone wants to test it. Note that this doesnt integrate chaton into llama library in a proper way, nor does it take cmdline argument wrt template-id etal.

https://github.com/hanishkvc/experiment-ai-tools-llama.cpp/tree/hkvc_chaton_v3_crude_server_v1

```diff
diff --git a/llama.cpp b/llama.cpp
index 7d26966e..a72da101 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -17983,6 +17983,8 @@ static int32_t llama_chat_apply_template_internal(
     return dest.size();
 }
 
+#include <chaton.hpp>
+
 LLAMA_API int32_t llama_chat_apply_template(
     const struct llama_model * model,
     const char * tmpl,
@@ -18014,7 +18016,8 @@ LLAMA_API int32_t llama_chat_apply_template(
     }
 
     std::string formatted_chat;
-    int32_t res = llama_chat_apply_template_internal(curr_tmpl, chat_vec, formatted_chat, add_ass);
+    //int32_t res = llama_chat_apply_template_internal(curr_tmpl, chat_vec, formatted_chat, add_ass);
+    int32_t res = chaton_tmpl_apply("llama3", chat_vec, add_ass, true, formatted_chat);
     if (res < 0) {
         return res;
     }
```

NOTE: Inserting code doesnt seem to work properly wrt the comment's include code mechanism or so. Or I dont understand its proper use. So the above may look bit odd.

hanishkvc added 4 commits May 16, 2024 12:22
Rename chaton-meta hpp to cpp and include this cpp file which brings
in the compile time built-in global chaton configurable template data
into the common library, and avoid the nop hpp file references.

Update chaton.hpp to not include the meta-cpp, instead just make a
reference to the global ChatTemplates instance, so that the hpp can
be used as a header file proper.

Avoid pragma once in the chaton-meta.cpp, including the script, which
helps create it.
C++17 provides a good enough variant as a standard feature, and
chaton uses the same at its core, instead of rolling out its own
struct of union based variant. And given that currently chaton
is part of common library and not the base llama library, so limit
the use of c++17 to common library. Initially while experimenting,
had set the flag for full llama, limiting it for now.

Also by now most embedded targets should be potentially having c++
compilers and libraries with support for c++17 features. So chances
are it is a ok enough path to take.
@hanishkvc
Contributor Author

Hi @ngxson,

I was trying to test how server would work if it is updated to use the current version of this PR with the most minimal of changes. So I basically changed llama_chat_apply_template in llama.cpp to call my chaton_tmpl_apply_ex rather than llama_chat_apply_template_internal.

With that change, what I am noticing is that examples/server doesn't appear to call llama_chat_apply_template beyond the initial generic test that it does before the actual user interaction. Once the user enters some chat content, I see that HandleCompletion calls into ServerContextTokenize, which in turn seems to directly tokenize the user visible text (their own as well as the model's responses) without any of the special tokens demarcating system/user/model messages.

Am I missing something fundamental, or is this what the server code is currently set up to do? Or do I need to pass some additional argument beyond -m THE_MODEL?

The code I used with the above mentioned patch is at

https://github.com/hanishkvc/experiment-ai-tools-llama.cpp/tree/tag-experiment-20240517IST0016-server-chaton

bin/server -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --chaton-template-id llama3

@hanishkvc
Contributor Author

Hi @ggerganov @ngxson

Based on a further quick glance, I feel there is a bug in that the server's web frontend is calling the /completions endpoint and not the /chat/completions endpoint, even when the chat option is selected.

Background

While looking into the strange behaviour I saw yesterday when testing examples/server with this PR's chaton-template-apply integrated, I noticed that the full chat transcript (without special tags) was being processed as a single prompt and in turn directly tokenized without chat-templating, instead of the expected behaviour of receiving an array of chat-role+message objects, running them through chat templating and then tokenizing.

I wanted to be sure I am not missing something basic, so I had a look at the http api reference from the openai site, which I assume is the convention followed by most llm web services. This is an assumption, given that I look at LLMs only once in a blue moon, and then mostly as an end user to see where things have reached.

However, from what I am seeing, this is what appears to be happening.

@jukofyork
Contributor

jukofyork commented Jul 3, 2024

Is this PR still active or any other PR to allow custom templates? It's the only thing I miss after ditching Ollama :/

Just discovered today that command-r models actually look to be able to use user --> system --> assistant -->... from this reddit post:

https://www.reddit.com/r/LocalLLaMA/comments/1du9ija/commandr_works_much_better_when_using_a_context/

But to test that it requires writing a custom clause in C++ that will have to be updated each pull :( This is probably a weird case and likely won't work with the "chat completion" API anyway, but there are definitely lots of other cases where small tweaks to the "official" template are very useful.

It doesn't look like 99% of the Jinja2 templates even use a fraction of the power of Jinja2, and even a tiny subset would work:

https://docs.apitemplate.io/reference/learn-jinja2.html#learn-the-templating-language-jinja-2-in-10-minutes

If we could scrape most of the available templates off huggingface then it would just be a case of improving your parser to handle more and more of the corner cases, and wouldn't be that hard IMO, nor require loads of C++ templates or regex code... It maybe wouldn't be all that robust compared to the full parser but it would be better than nothing.

I once had to parse millions of semi-broken "Portable Game Notation" and "Forsyth–Edwards Notation" chess data files and it really wasn't that bad to do - you just have to plod along getting the fraction of failures down until you get to the truly "WTF" files and call it a day.

@jukofyork
Contributor

jukofyork commented Jul 3, 2024

The other alternative is just to implement some super-simple subset of:

https://pkg.go.dev/text/template (what Ollama uses)

https://github.com/antlr/stringtemplate4/blob/master/doc/cheatsheet.md (most popular Java template library)

If you accept it doesn't need to be as robust about error detection and reporting, it's really easy to implement something like this with nothing but recursion and a couple of string matching helper functions.


There's literally 100s of open source projects that do this too, ranging from C++ template-heavy / regex-heavy:

https://github.com/lexxmark/string_template

to barebones C:

https://github.com/cozis/tinytemplate

and anything like this would likely be able to accommodate our use case for the subset of Jinja2 used to write the real templates, and as Ollama has shown, just a couple of added boolean variables (eg: is_first, is_last, etc) is enough to be able to write almost any compatible template.
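
As a rough illustration of how little code such a thing needs, here is a sketch using recursion and a few std::string::find calls. The syntax is made up for this example and is not Jinja2, Go text/template or StringTemplate: "{{name}}" substitutes a string variable and "{{#flag}}...{{/flag}}" keeps its body only when the boolean flag is true. TemplateVars and render are hypothetical names, and a section nested inside another section with the same flag name is not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Variables available to a template: plain strings plus boolean flags
// such as is_first / is_last.
struct TemplateVars {
    std::map<std::string, std::string> strings;
    std::map<std::string, bool> flags;
};

// Render tmpl with the made-up syntax described above. Substituted values are
// appended as-is and never re-scanned, so user content containing "{{" is safe.
std::string render(const std::string & tmpl, const TemplateVars & vars) {
    std::string out;
    size_t pos = 0;
    while (pos < tmpl.size()) {
        size_t open = tmpl.find("{{", pos);
        if (open == std::string::npos) { out += tmpl.substr(pos); break; }
        size_t close = tmpl.find("}}", open + 2);
        // Unterminated tag: emit the rest verbatim.
        if (close == std::string::npos) { out += tmpl.substr(pos); break; }
        out += tmpl.substr(pos, open - pos);
        std::string name = tmpl.substr(open + 2, close - open - 2);
        pos = close + 2;
        if (!name.empty() && name[0] == '#') {
            // Conditional section: find its end tag and recurse into the body.
            std::string flag = name.substr(1);
            std::string end_tag = "{{/" + flag + "}}";
            size_t end = tmpl.find(end_tag, pos);
            if (end == std::string::npos) end = tmpl.size();  // runs to the end
            auto f = vars.flags.find(flag);
            if (f != vars.flags.end() && f->second) {
                out += render(tmpl.substr(pos, end - pos), vars);
            }
            pos = std::min(end + end_tag.size(), tmpl.size());
        } else {
            // Plain variable: unknown names render as empty.
            auto s = vars.strings.find(name);
            if (s != vars.strings.end()) out += s->second;
        }
    }
    return out;
}
```

A per-message template such as "{{#is_first}}<bos>{{/is_first}}<|{{role}}|>{{content}}<|end|>" would then be rendered once per message, with the caller setting is_first and is_last as it walks the chat.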

Labels
enhancement New feature or request Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level

7 participants