
Add additional chat templates to dllama-api #73

Conversation

DifferentialityDevelopment (Contributor)

I've added a few of the most common chat templates, namely llama2, llama3, chatml and openchat.
This should make a lot more models compatible with distributed-llama's API.

I've also added an argument to AppArgs that lets you specify the chat template used by the model at startup, instead of on a per-request basis.
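For context, here is a rough sketch of how two of these templates lay out the same conversation (based on the publicly documented llama3 and chatml prompt formats; the helper functions are illustrative and not the actual dllama-api code):

# Illustrative only: roughly how the llama3 and chatml templates format
# the same conversation (not the actual dllama-api implementation).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short poem about roses."},
]

def apply_llama3(msgs):
    out = ""
    for m in msgs:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Leave the prompt open for the assistant's reply
    return out + "<|start_header_id|>assistant<|end_header_id|>\n\n"

def apply_chatml(msgs):
    out = ""
    for m in msgs:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return out + "<|im_start|>assistant\n"

print(apply_llama3(messages))
print(apply_chatml(messages))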

DifferentialityDevelopment commented May 28, 2024

Openchat has finetunes for both llama2 and llama3, so I've added openchat3, which should work with their llama 3 finetune, whereas openchat should work with their llama 2 finetune.

DifferentialityDevelopment commented May 28, 2024

I was able to successfully test openchat-3.6-8b, and it worked correctly with the openchat3 chat template:
-> https://huggingface.co/openchat/openchat-3.6-8b-20240522
I converted the model using convert-hf.py and the tokenizer with convert-tokenizer-llama3.py.

./dllama-api.exe --model D:\openchat-3.6-8b-20240522-distributed\dllama_model_openchat-3.6-8b-20240522_q40.m --tokenizer D:\openchat-3.6-8b-20240522-distributed\dllama_tokenizer_llama3.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --chat-template openchat3 --port 10111
💡 arch: llama
💡 hiddenAct: silu
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 8192
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 131072 kB
⏩ Loaded 1981264 kB
Listening on 0.0.0.0:10111...
Server URL: http://127.0.0.1:10111/v1/
🔷 POST /v1/chat/completions
🔸In gardens of beauty, roses stand tall,
Their vibrant hues, a sight to behold.
With petals of passion, they charm all,
And whispers of love, they🔶

(The symbols appeared as garbled characters in my Windows terminal, which can't render them correctly.)

I was not able to test openchat-3.5: although I could convert the model using convert-hf.py, I could not convert the tokenizer.
I don't think we have support for Mistral yet, but I will try to test Mixtral with the chatml template, or use a llama3 finetune that uses the chatml template.

https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B uses the chatml template, so I will test with that.

DifferentialityDevelopment commented May 28, 2024

I am having the weirdest issue. If I run Hermes-2-Theta-Llama-3-8B using the converted llama 3 tokenizer, it works fine, although it is missing some tokens since that tokenizer comes from a different model of the same architecture. Hermes-2-Theta-Llama-3-8B doesn't have a tokenizer.model file, so I was a bit in a jam as to what to do.

I put together a convert-tokenizer-hf.py script that's meant to do the same as convert-tokenizer-llama3.py, except that it uses the transformers AutoTokenizer to pull all the data needed to build the tokenizer.t file.

But I think I am doing something wrong, because when I run dllama with the generated tokenizer.t file, it crashes while encoding text.

convert-tokenizer-hf.py

import sys
import struct
from transformers import AutoTokenizer

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python convert-tokenizer-hf.py <tokenizer path or HF model id>')
        exit(1)

    tokenizer_path = sys.argv[1]
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    bos_token = tokenizer.bos_token
    eos_token = tokenizer.eos_token

    bosId = tokenizer.convert_tokens_to_ids(bos_token)
    eosId = tokenizer.convert_tokens_to_ids(eos_token)

    tokens = []
    scores = []

    # Collect the regular vocabulary; the token id doubles as the score
    for token_id in range(tokenizer.vocab_size):
        token = tokenizer.convert_ids_to_tokens(token_id)
        token_bytes = token.encode('utf-8')
        tokens.append(token_bytes)
        scores.append(float(token_id))

    # Append the special (added) tokens after the regular vocabulary
    special_tokens = tokenizer.added_tokens_decoder
    special_token_index = tokenizer.vocab_size
    for token_id in special_tokens:
        token = tokenizer.convert_ids_to_tokens(token_id)
        token_bytes = token.encode('utf-8')
        tokens.append(token_bytes)
        scores.append(float(special_token_index))
        special_token_index += 1

    vocab_size = len(tokens)
    max_token_length = max(len(t) for t in tokens)

    # Output file name is hardcoded for now
    with open('dllama_tokenizer_llama3.t', 'wb') as outputFile:
        # Header: magic | vocab_size | max_token_length | bos_id | eos_id | pad_id (-1 = none)
        outputFile.write(struct.pack('IIIiii',
                                     0x567123,
                                     vocab_size,
                                     max_token_length,
                                     bosId,
                                     eosId,
                                     -1))

        # Dictionary: per-token score and byte length, followed by the raw token bytes
        for i in range(vocab_size):
            outputFile.write(struct.pack('fI', scores[i], len(tokens[i])))
            outputFile.write(tokens[i])

        print(f'maxTokenLength={max_token_length}')
        print(f'bosId={bosId}')
        print(f'eosId={eosId}')
        print(f'vocabSize={vocab_size}')
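For reference, the script takes a local tokenizer path or a Hugging Face model id as its only argument, e.g. (illustrative invocation):

python convert-tokenizer-hf.py NousResearch/Hermes-2-Theta-Llama-3-8B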

b4rtaz commented May 28, 2024

I'm wondering if this is a good direction. The source code certainly should not include all possible templates. Maybe this is something that should be moved to the tokenizer file.

Basically, the tokenizer now contains: magic|n_words|max_token_length|bos_id|eos_id|pad_id|<dictionary>. But we could add new optional fields like:

  • chat_role_start llama3 = <|start_header_id|>
  • chat_role_end llama3 = <|end_header_id|>
  • chat_eos llama3 = <|eot_id|>

So this design assumes the chat format may differ between models.

In the end, the converter would be responsible for setting the correct values, so this would not be a responsibility of DL.
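A rough sketch of what the converter side might look like for appending such optional fields (the count/length-prefixed layout here is purely illustrative; the actual encoding would be decided in the PR):

import struct

# Hypothetical chat-template fields for a llama3-style tokenizer; the encoding
# below (a field count followed by length-prefixed UTF-8 strings) is only an
# illustration, not the actual distributed-llama format.
chat_fields = [
    '<|start_header_id|>',  # chat_role_start
    '<|end_header_id|>',    # chat_role_end
    '<|eot_id|>',           # chat_eos
]

with open('dllama_tokenizer_llama3.t', 'ab') as outputFile:
    outputFile.write(struct.pack('I', len(chat_fields)))
    for field in chat_fields:
        data = field.encode('utf-8')
        outputFile.write(struct.pack('I', len(data)))
        outputFile.write(data)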

WDYT?

DifferentialityDevelopment commented May 28, 2024

I've tried to type a reply twice but keep getting a blue screen just as I'm about to send :/

Converting the tokenizer is very quick, so in the long run it's probably good to use that route. I just wanted to add a few of the common chat templates, i.e. llama 2, llama 3 and chatml, as that already covers the majority of models.

The bigger issue I have is with the script I showed above. I cannot create tokenizers for some models, as they do not have a tokenizer.model file, so I tried creating something that converts using AutoTokenizer, but the converted tokenizer doesn't work for some reason; dllama errors out at tokenizer.cpp line 202, for instance with this model: https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B

b4rtaz commented May 30, 2024

@DifferentialityDevelopment please check this PR. This may solve the problem for different models.

b4rtaz closed this May 30, 2024
b4rtaz commented May 30, 2024

@DifferentialityDevelopment this would probably require updating the tokenizer in your repository on HuggingFace. Please don't do this until the PR is merged; later, I want to test a different model.

b4rtaz commented May 31, 2024

@DifferentialityDevelopment can you update the tokenizer file in your HF repository to the new format?
