
2024 OSPP: Lightweight deployment of AI models based on an AI gateway #1

Open · YJQ1101 wants to merge 7 commits into base: envoy-1.27

Conversation


@YJQ1101 commented Aug 28, 2024

Overview

This feature provides AI web-application capabilities on machines such as personal computers and ordinary cloud hosts, backed by locally hosted AI models. Specifically, it supports loading a model from a local file, parsing the incoming HTTP request, and invoking the model to generate the response. Project page: https://summer-ospp.ac.cn/org/prodetail/241f80032?lang=zh&list=pro

Implementation

The PR implements an HTTP filter that parses the inference request and hands it to an asynchronous inference thread, passing that thread a callback so the generated output can be streamed back through Envoy.
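
As a rough mental model, the flow looks like the minimal, Envoy-free sketch below. This is not the PR's code: InferenceWorker, StreamCallback, and the per-word "tokens" are illustrative placeholders. In the real filter each chunk would be posted back to the Envoy dispatcher and written out via the stream encoder callbacks.

// Minimal sketch of the async-inference-plus-callback pattern described above.
// All names (InferenceWorker, StreamCallback) are illustrative placeholders,
// not the PR's actual classes.
#include <chrono>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Invoked once per generated chunk; `done` is true for the final call.
using StreamCallback = std::function<void(const std::string& chunk, bool done)>;

class InferenceWorker {
public:
  // Generation runs on its own thread so the caller is never blocked.
  void infer(std::string prompt, StreamCallback on_chunk) {
    std::thread([prompt = std::move(prompt), on_chunk = std::move(on_chunk)]() {
      std::istringstream words(prompt);
      std::string word;
      while (words >> word) {
        on_chunk(word + " ", /*done=*/false);  // stand-in for one decoded token
      }
      on_chunk("", /*done=*/true);             // end of stream
    }).detach();
  }
};

int main() {
  InferenceWorker worker;
  worker.infer("echo each word back as a streamed chunk",
               [](const std::string& chunk, bool done) {
                 // The real filter would post this back to the Envoy dispatcher
                 // and emit it through the stream encoder callbacks.
                 if (!done) {
                   std::cout << chunk << std::flush;
                 } else {
                   std::cout << "\n[stream finished]\n";
                 }
               });
  std::this_thread::sleep_for(std::chrono::seconds(1));  // let the demo finish
  return 0;
}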

Build and run
bazel build //contrib/exe:envoy-static    # build
bazel-bin/contrib/exe/envoy-static -c "envoy.yaml" --concurrency 1    # run
Configuration

Filter-level configuration:

- name: envoy.filters.http.llm_inference
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.llm_inference.v3.modelParameter
    n_threads: 100
    n_parallel: 5
    modelpath: {
      "qwen2": "/home/yuanjq/model/qwen2-7b-instruct-q5_k_m.gguf",
      "llama3": "/home/yuanjq/model/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
      "bge": "/home/yuanjq/model/bge-small-zh-v1.5-f32.gguf"
    }

n_threads: the maximum number of threads the inference threads may use
n_parallel: the maximum number of requests the inference service handles in parallel
modelpath: the local file path of each model, keyed by model name
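
For orientation, these settings plausibly map onto a small config object like the sketch below. ModelParameter is a type referenced in the review snippets further down; FilterLevelConfig and the exact field types are assumptions made only for illustration.

// Hedged sketch of the parsed filter-level configuration. `ModelParameter`
// matches a type referenced in the review snippets below; `FilterLevelConfig`
// and the field types are illustrative assumptions.
#include <cstdint>
#include <map>
#include <string>

struct ModelParameter {
  uint32_t n_threads;   // maximum threads available to the inference threads
  uint32_t n_parallel;  // maximum number of requests served in parallel
};

struct FilterLevelConfig {
  ModelParameter parameter;
  // model name -> local GGUF path, mirroring the `modelpath` map above
  std::map<std::string, std::string> model_path;
};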

Route-level configuration:

route_config:
  name: route
  virtual_hosts:
  - name: llm_inference_service
    domains: ["api.openai.com"]
    routes:
    - match:
        prefix: "/v1/chat/completions"
      typed_per_filter_config:
        envoy.filters.http.llm_inference:
          "@type": type.googleapis.com/envoy.extensions.filters.http.llm_inference.v3.modelChosen
          usemodel: "qwen2"
          first_byte_timeout: 4
          inference_timeout: 90
          embedding : false
      direct_response:
        status: 504
        body:
          inline_string: "inference timeout"

usemodel: the model to use; the name must match a key configured in modelpath
first_byte_timeout: timeout for the first byte of the inference response
inference_timeout: timeout for the overall inference
embedding: whether the chosen model is an embedding model
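
These per-route settings correspond to the ModelChosen object referenced in the review comments below. Only model_name is confirmed there; the remaining members in this sketch simply mirror the YAML keys and are assumptions.

// Hedged sketch of the per-route configuration. `ModelChosen` and
// `model_name` appear in the review snippets below; the other members are
// illustrative assumptions that mirror the YAML keys above.
#include <cstdint>
#include <string>

struct ModelChosen {
  std::string model_name;       // `usemodel`; must be a key of `modelpath`
  uint32_t first_byte_timeout;  // `first_byte_timeout`
  uint32_t inference_timeout;   // `inference_timeout`
  bool embedding;               // `embedding`: true for embedding models
};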

Supported models
Demo

[Demo GIF: screen recording from 2024-08-29]

@CLAassistant commented Aug 28, 2024

CLA assistant check: all committers have signed the CLA.

Review comment:

LLMInferenceFilterConfigPerRouteSharedPtr config =
    std::make_shared<LLMInferenceFilterConfigPerRoute>(LLMInferenceFilterConfigPerRoute(proto_config));

model_Chosen_ = config->modelChosen();


Suggested change:
-  model_Chosen_ = config->modelChosen();
+  model_chosen_ = config->modelChosen();

Comment on lines 54 to 55:

if (modelpath.contains(model_Chosen_.model_name)) {
  ctx = inference->load(inference, config->modelParameter(), model_Chosen_, modelpath[model_Chosen_.model_name]);


Suggested change:
-  if (modelpath.contains(model_Chosen_.model_name)) {
-    ctx = inference->load(inference, config->modelParameter(), model_Chosen_, modelpath[model_Chosen_.model_name]);
+  if (modelpath.contains(model_chosen_.model_name)) {
+    ctx = inference->load(inference, config->modelParameter(), model_chosen_, modelpath[model_chosen_.model_name]);

Review comment:

    const envoy::extensions::filters::http::llm_inference::v3::modelChosen& proto_config,
    Server::Configuration::ServerFactoryContext&, ProtobufMessage::ValidationVisitor&) override;

ModelChosen model_Chosen_;


Suggested change:
-  ModelChosen model_Chosen_;
+  ModelChosen model_chosen_;


Review comment:

LLMInferenceFilterConfig::LLMInferenceFilterConfig(
    const envoy::extensions::filters::http::llm_inference::v3::modelParameter& proto_config)
    : modelParameter_{proto_config.n_threads(), proto_config.n_parallel()},


Please watch the variable naming throughout: member variables should use the underscore (snake_case) style, not camelCase.

Review comment:

std::shared_ptr<InferenceContext> load(std::shared_ptr<InferenceSingleton> singleton, const ModelParameter& model_parameter,
                                       const ModelChosen& model_chosen, const std::string& model_path) {
  std::shared_ptr<InferenceContext> ctx;
  absl::MutexLock lock(&mu_);


load should only happen during config parsing, right? That all runs on the main thread, so there is no need to take the lock.


Review comment:

InferenceContextSharedPtr ctx;
auto modelpath = config->modelPath();
if (modelpath.contains(model_Chosen_.model_name)) {


If multiple routes choose different models, there is a problem here: only the last model ends up being loaded. I think you could simply load all of the configured models here. That is not a big deal, since the user configured them anyway.
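
For illustration, the reviewer's suggestion amounts to iterating over the whole modelpath map instead of only the route's chosen entry. The sketch below assumes a load(name, path) helper and placeholder types; it is not the PR's actual code.

// Hedged sketch of the reviewer's suggestion: load every configured model at
// config-parse time instead of only the one chosen by the current route.
// `InferenceContext` and `load()` are placeholders for the PR's real types.
#include <map>
#include <memory>
#include <string>

struct InferenceContext {};  // placeholder for the PR's InferenceContext

std::shared_ptr<InferenceContext> load(const std::string& /*name*/,
                                       const std::string& /*path*/) {
  return std::make_shared<InferenceContext>();  // placeholder model load
}

std::map<std::string, std::shared_ptr<InferenceContext>>
loadAllModels(const std::map<std::string, std::string>& modelpath) {
  std::map<std::string, std::shared_ptr<InferenceContext>> contexts;
  for (const auto& [name, path] : modelpath) {
    contexts[name] = load(name, path);  // every model, not just the last route's choice
  }
  return contexts;
}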
