
Add additional chat templates to dllama-api #73

Conversation

DifferentialityDevelopment (Contributor)

I've added a few of the most common chat templates, namely llama2, llama3, chatml and openchat.
This should make a lot more models compatible with distributed-llama's API.

I've also added an argument to AppArgs that lets you specify the chat template used by the model at startup, instead of on a per-request basis.
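For context, here is a rough sketch of how two of these templates lay out the same conversation (based on the publicly documented llama3 and chatml prompt formats; the helper functions are illustrative and not the actual dllama-api code):

# Illustrative only: roughly how the llama3 and chatml templates format
# the same conversation (not the actual dllama-api implementation).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short poem about roses."},
]

def apply_llama3(msgs):
    out = ""
    for m in msgs:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Leave the prompt open for the assistant's reply
    return out + "<|start_header_id|>assistant<|end_header_id|>\n\n"

def apply_chatml(msgs):
    out = ""
    for m in msgs:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return out + "<|im_start|>assistant\n"

print(apply_llama3(messages))
print(apply_chatml(messages))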

DifferentialityDevelopment commented May 28, 2024

Openchat has finetunes for both llama2 and llama3, so I've added openchat3, which should work with their llama 3 finetune, whereas openchat should work with their llama 2 finetune.

DifferentialityDevelopment commented May 28, 2024

I was able to successfully test openchat-3.6-8b, and it worked correctly with the openchat3 chat template:
-> https://huggingface.co/openchat/openchat-3.6-8b-20240522
I converted the model using convert-hf.py and the tokenizer with convert-tokenizer-llama3.py.

./dllama-api.exe --model D:\openchat-3.6-8b-20240522-distributed\dllama_model_openchat-3.6-8b-20240522_q40.m --tokenizer D:\openchat-3.6-8b-20240522-distributed\dllama_tokenizer_llama3.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --chat-template openchat3 --port 10111
💡 arch: llama
💡 hiddenAct: silu
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 8192
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 131072 kB
⏩ Loaded 1981264 kB
Listening on 0.0.0.0:10111...
Server URL: http://127.0.0.1:10111/v1/
🔷 POST /v1/chat/completions
🔸In gardens of beauty, roses stand tall,
Their vibrant hues, a sight to behold.
With petals of passion, they charm all,
And whispers of love, they🔶

(The symbols appeared as garbled characters in my Windows terminal, which can't render them correctly.)

I was not able to test openchat-3.5: although I could convert the model using convert-hf.py, I could not convert the tokenizer.
I don't think we have support for Mistral yet, but I will try to test Mixtral with the chatml template, or use a llama3 finetune that uses the chatml template.

https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B uses the chatml template, so I will test with that.

DifferentialityDevelopment commented May 28, 2024

I am having the weirdest issue. If I run Hermes-2-Theta-Llama-3-8B using the converted llama 3 tokenizer, it works fine, although it is missing some tokens since that tokenizer comes from a different model of the same architecture. Hermes-2-Theta-Llama-3-8B doesn't have a tokenizer.model file, so I was a bit in a jam as to what to do.

I put together a convert-tokenizer-hf.py script that's meant to do the same as convert-tokenizer-llama3.py, except that it uses the transformers AutoTokenizer to pull all the data needed to build the tokenizer.t file.

But I think I am doing something wrong, because when I run dllama with the generated tokenizer.t file, it crashes while encoding text.

convert-tokenizer-hf.py

import sys
import struct
from transformers import AutoTokenizer

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python convert-tokenizer-hf.py <tokenizer path or HF model id>')
        exit(1)

    tokenizer_path = sys.argv[1]
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    bos_token = tokenizer.bos_token
    eos_token = tokenizer.eos_token

    bosId = tokenizer.convert_tokens_to_ids(bos_token)
    eosId = tokenizer.convert_tokens_to_ids(eos_token)

    tokens = []
    scores = []

    # Collect the regular vocabulary; the token id doubles as the score
    for token_id in range(tokenizer.vocab_size):
        token = tokenizer.convert_ids_to_tokens(token_id)
        token_bytes = token.encode('utf-8')
        tokens.append(token_bytes)
        scores.append(float(token_id))

    # Append the special (added) tokens after the regular vocabulary
    special_tokens = tokenizer.added_tokens_decoder
    special_token_index = tokenizer.vocab_size
    for token_id in special_tokens:
        token = tokenizer.convert_ids_to_tokens(token_id)
        token_bytes = token.encode('utf-8')
        tokens.append(token_bytes)
        scores.append(float(special_token_index))
        special_token_index += 1

    vocab_size = len(tokens)
    max_token_length = max(len(t) for t in tokens)

    # Output file name is hardcoded for now
    with open('dllama_tokenizer_llama3.t', 'wb') as outputFile:
        # Header: magic | vocab_size | max_token_length | bos_id | eos_id | pad_id (-1 = none)
        outputFile.write(struct.pack('IIIiii',
                                     0x567123,
                                     vocab_size,
                                     max_token_length,
                                     bosId,
                                     eosId,
                                     -1))

        # Dictionary: per-token score and byte length, followed by the raw token bytes
        for i in range(vocab_size):
            outputFile.write(struct.pack('fI', scores[i], len(tokens[i])))
            outputFile.write(tokens[i])

        print(f'maxTokenLength={max_token_length}')
        print(f'bosId={bosId}')
        print(f'eosId={eosId}')
        print(f'vocabSize={vocab_size}')
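For reference, the script takes a local tokenizer path or a Hugging Face model id as its only argument, e.g. (illustrative invocation):

python convert-tokenizer-hf.py NousResearch/Hermes-2-Theta-Llama-3-8B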

b4rtaz commented May 28, 2024

I'm wondering if this is a good direction. The source code certainly should not include all possible templates. Maybe this is something that should be moved to the tokenizer file.

Basically, the tokenizer now contains: magic|n_words|max_token_length|bos_id|eos_id|pad_id|<dictionary>. But we could add new optional fields like:

  • chat_role_start llama3 = <|start_header_id|>
  • chat_role_end llama3 = <|end_header_id|>
  • chat_eos llama3 = <|eot_id|>

So this design assumes the chat format may differ between models.

In the end, the converter would be responsible for setting the correct values, so this would not be a responsibility of DL.
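A rough sketch of what the converter side might look like for appending such optional fields (the count/length-prefixed layout here is purely illustrative; the actual encoding would be decided in the PR):

import struct

# Hypothetical chat-template fields for a llama3-style tokenizer; the encoding
# below (a field count followed by length-prefixed UTF-8 strings) is only an
# illustration, not the actual distributed-llama format.
chat_fields = [
    '<|start_header_id|>',  # chat_role_start
    '<|end_header_id|>',    # chat_role_end
    '<|eot_id|>',           # chat_eos
]

with open('dllama_tokenizer_llama3.t', 'ab') as outputFile:
    outputFile.write(struct.pack('I', len(chat_fields)))
    for field in chat_fields:
        data = field.encode('utf-8')
        outputFile.write(struct.pack('I', len(data)))
        outputFile.write(data)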

WDYT?

DifferentialityDevelopment commented May 28, 2024

I've tried to type a reply twice but keep getting a blue screen just as I'm about to send :/

Converting the tokenizer is very quick, so in the long run it's probably good to use that route. I just wanted to add a few of the common chat templates, i.e. llama 2, llama 3 and chatml, as that already covers the majority of models.

The bigger issue I have is with the script I showed above. I cannot create tokenizers for some models, as they do not have a tokenizer.model file, so I tried creating something that converts using AutoTokenizer, but the converted tokenizer doesn't work for some reason; dllama errors out at tokenizer.cpp line 202, for instance with this model: https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B

b4rtaz commented May 30, 2024

@DifferentialityDevelopment please check this PR. This may solve the problem for different models.

b4rtaz closed this May 30, 2024
b4rtaz commented May 30, 2024

@DifferentialityDevelopment this would probably require updating the tokenizer in your repository on HuggingFace. Please don't do this until the PR is merged; later, I want to test a different model.

b4rtaz commented May 31, 2024

@DifferentialityDevelopment can you update the tokenizer file in your HF repository to the new format?
