feat: update README.md
sangjanai committed May 7, 2024 · 1 parent 19b9929 · commit 40410f4
Showing 1 changed file (README.md) with 64 additions and 1 deletion.

## Build library with server example
- **On Windows**
Install Chocolatey (`choco`), then use it to install make:
```
choco install make -y
```

Finally, let's start the server:
```
server.exe
```
# Quickstart
Step 1: Download a model
```
mkdir model && cd model
wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
```
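
As an optional sanity check (not part of the original quickstart), you can confirm the file downloaded before moving on; the Q5_K_M quantization of Llama-2-7B-Chat is roughly 4-5 GB:
```
# Return to the repository root and verify the downloaded file's size
cd ..
ls -lh model/llama-2-7b-model.gguf
```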

Step 2: Start the server example
```
server
```
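
If you want to keep the terminal free for the next steps, one option is to run the server in the background and capture its output to a log file. This is a generic shell sketch, assuming a Unix-like shell where the binary is invoked as `./server` and the default port 3928 (used in the steps below) is unchanged:
```
# Run the server in the background and write its output to server.log
./server > server.log 2>&1 &

# Follow the log to confirm the server started and is listening
tail -f server.log
```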

Step 3: Load the model
```
curl http://localhost:3928/loadmodel \
-H 'Content-Type: application/json' \
-d '{
    "llama_model_path": "./model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-chat.Q5_K_M",
    "ctx_len": 512,
    "ngl": 100
  }'
```
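
The `/loadmodel` request also accepts the optional tuning parameters listed in the table of parameters below. The values in this sketch are illustrative only, not recommended defaults:
```
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "./model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-chat.Q5_K_M",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 2,
    "cont_batching": true,
    "caching_enabled": true,
    "cpu_threads": 8
  }'
```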

Step 4: Make an inference
```
curl http://localhost:3928/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "llama-2-7b-chat.Q5_K_M",
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ]
  }'
```
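
Because the request body mirrors the OpenAI chat-completions format, a multi-turn conversation with a system message can be sent the same way; the messages below are only an illustration:
```
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat.Q5_K_M",
    "messages": [
      { "role": "system",    "content": "You are a concise assistant." },
      { "role": "user",      "content": "Who won the world series in 2020?" },
      { "role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020." },
      { "role": "user",      "content": "Where was it played?" }
    ]
  }'
```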

Table of parameters

| Parameter          | Type    | Description                                                                |
|--------------------|---------|----------------------------------------------------------------------------|
| `llama_model_path` | String  | The file path to the LLaMA model.                                          |
| `ngl`              | Integer | The number of GPU layers to use.                                           |
| `ctx_len`          | Integer | The context length for model operations.                                   |
| `embedding`        | Boolean | Whether to use embedding in the model.                                     |
| `n_parallel`       | Integer | The number of parallel operations.                                         |
| `cont_batching`    | Boolean | Whether to use continuous batching.                                        |
| `user_prompt`      | String  | The prompt to use for the user.                                            |
| `ai_prompt`        | String  | The prompt to use for the AI assistant.                                    |
| `system_prompt`    | String  | The prompt to use for system rules.                                        |
| `pre_prompt`       | String  | The prompt to use for internal configuration.                              |
| `cpu_threads`      | Integer | The number of threads to use for inferencing (CPU mode only).              |
| `n_batch`          | Integer | The batch size for the prompt evaluation step.                             |
| `caching_enabled`  | Boolean | Whether to enable prompt caching.                                          |
| `grp_attn_n`       | Integer | Group-attention factor in self-extend.                                     |
| `grp_attn_w`       | Integer | Group-attention width in self-extend.                                      |
| `mlock`            | Boolean | Prevents the system from swapping the model to disk (macOS).               |
| `grammar_file`     | String  | Path to a GBNF grammar file used to constrain sampling.                    |
| `model_type`       | String  | Model type to use: `llm` or `embedding`; defaults to `llm`.                |
| `model_alias`      | String  | Alias for the model; can be used as a model ID in the `loadmodel` request. |
| `model`            | String  | Model name; can be used as a model ID in the inference request.            |
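
As an illustration of the `model_type` and `embedding` parameters, a model could be loaded for embedding use with a request like the one below. The GGUF path and alias are placeholders (no embedding model is downloaded in this quickstart), and the exact behavior of embedding mode is not covered here:
```
# Hypothetical example: load a model for embedding use.
# "./model/your-embedding-model.gguf" and "my-embedding-model" are placeholders.
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "./model/your-embedding-model.gguf",
    "model_alias": "my-embedding-model",
    "embedding": true,
    "model_type": "embedding",
    "ctx_len": 512
  }'
```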
