feat: update README.md
sangjanai committed May 7, 2024 · 1 parent 19b9929 · commit 40410f4
Showing 1 changed file (README.md) with 64 additions and 1 deletion.

## Build library with server example
- **On Windows**
Install Chocolatey (`choco`), then use it to install make:
```
choco install make -y
```

Finally, let's start the server:
```
server.exe
```
# Quickstart
Step 1: Download a model
```
mkdir model && cd model
wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
```
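
As an optional sanity check (not part of the original quickstart), you can confirm the file downloaded before moving on; the Q5_K_M quantization of Llama-2-7B-Chat is roughly 4-5 GB:
```
# Return to the repository root and verify the downloaded file's size
cd ..
ls -lh model/llama-2-7b-model.gguf
```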

Step 2: Start the server example
```
server
```
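
If you want to keep the terminal free for the next steps, one option is to run the server in the background and capture its output to a log file. This is a generic shell sketch, assuming a Unix-like shell where the binary is invoked as `./server` and the default port 3928 (used in the steps below) is unchanged:
```
# Run the server in the background and write its output to server.log
./server > server.log 2>&1 &

# Follow the log to confirm the server started and is listening
tail -f server.log
```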

Step 3: Load the model
```
curl http://localhost:3928/loadmodel \
-H 'Content-Type: application/json' \
-d '{
    "llama_model_path": "./model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-chat.Q5_K_M",
    "ctx_len": 512,
    "ngl": 100
  }'
```
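
The `/loadmodel` request also accepts the optional tuning parameters listed in the table of parameters below. The values in this sketch are illustrative only, not recommended defaults:
```
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "./model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-chat.Q5_K_M",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 2,
    "cont_batching": true,
    "caching_enabled": true,
    "cpu_threads": 8
  }'
```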

Step 4: Make an inference
```
curl http://localhost:3928/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "llama-2-7b-chat.Q5_K_M",
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ]
  }'
```
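
Because the request body mirrors the OpenAI chat-completions format, a multi-turn conversation with a system message can be sent the same way; the messages below are only an illustration:
```
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat.Q5_K_M",
    "messages": [
      { "role": "system",    "content": "You are a concise assistant." },
      { "role": "user",      "content": "Who won the world series in 2020?" },
      { "role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020." },
      { "role": "user",      "content": "Where was it played?" }
    ]
  }'
```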

Table of parameters

| Parameter          | Type    | Description                                                                |
|--------------------|---------|----------------------------------------------------------------------------|
| `llama_model_path` | String  | The file path to the LLaMA model.                                          |
| `ngl`              | Integer | The number of GPU layers to use.                                           |
| `ctx_len`          | Integer | The context length for model operations.                                   |
| `embedding`        | Boolean | Whether to use embedding in the model.                                     |
| `n_parallel`       | Integer | The number of parallel operations.                                         |
| `cont_batching`    | Boolean | Whether to use continuous batching.                                        |
| `user_prompt`      | String  | The prompt to use for the user.                                            |
| `ai_prompt`        | String  | The prompt to use for the AI assistant.                                    |
| `system_prompt`    | String  | The prompt to use for system rules.                                        |
| `pre_prompt`       | String  | The prompt to use for internal configuration.                              |
| `cpu_threads`      | Integer | The number of threads to use for inferencing (CPU mode only).              |
| `n_batch`          | Integer | The batch size for the prompt evaluation step.                             |
| `caching_enabled`  | Boolean | Whether to enable prompt caching.                                          |
| `grp_attn_n`       | Integer | Group-attention factor in self-extend.                                     |
| `grp_attn_w`       | Integer | Group-attention width in self-extend.                                      |
| `mlock`            | Boolean | Prevents the system from swapping the model to disk (macOS).               |
| `grammar_file`     | String  | Path to a GBNF grammar file used to constrain sampling.                    |
| `model_type`       | String  | Model type to use: `llm` or `embedding`; defaults to `llm`.                |
| `model_alias`      | String  | Alias for the model; can be used as a model ID in the `loadmodel` request. |
| `model`            | String  | Model name; can be used as a model ID in the inference request.            |
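
As an illustration of the `model_type` and `embedding` parameters, a model could be loaded for embedding use with a request like the one below. The GGUF path and alias are placeholders (no embedding model is downloaded in this quickstart), and the exact behavior of embedding mode is not covered here:
```
# Hypothetical example: load a model for embedding use.
# "./model/your-embedding-model.gguf" and "my-embedding-model" are placeholders.
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "./model/your-embedding-model.gguf",
    "model_alias": "my-embedding-model",
    "embedding": true,
    "model_type": "embedding",
    "ctx_len": 512
  }'
```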
