
2024 OSPP: Lightweight deployment of AI models based on an AI gateway #1

Open · YJQ1101 wants to merge 7 commits into base: envoy-1.27

Conversation


@YJQ1101 commented Aug 28, 2024

Overview

This feature provides AI web-application capabilities on machines such as personal computers and ordinary cloud hosts, backed by locally hosted AI models. Specifically, it supports loading a model from a local file, parsing the incoming HTTP request, and invoking the model to generate the response. Project page: https://summer-ospp.ac.cn/org/prodetail/241f80032?lang=zh&list=pro

Implementation

The PR implements an HTTP filter that parses the inference request and hands it to an asynchronous inference thread, passing that thread a callback so the generated output can be streamed back through Envoy.
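
As a rough mental model, the flow looks like the minimal, Envoy-free sketch below. This is not the PR's code: InferenceWorker, StreamCallback, and the per-word "tokens" are illustrative placeholders. In the real filter each chunk would be posted back to the Envoy dispatcher and written out via the stream encoder callbacks.

// Minimal sketch of the async-inference-plus-callback pattern described above.
// All names (InferenceWorker, StreamCallback) are illustrative placeholders,
// not the PR's actual classes.
#include <chrono>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Invoked once per generated chunk; `done` is true for the final call.
using StreamCallback = std::function<void(const std::string& chunk, bool done)>;

class InferenceWorker {
public:
  // Generation runs on its own thread so the caller is never blocked.
  void infer(std::string prompt, StreamCallback on_chunk) {
    std::thread([prompt = std::move(prompt), on_chunk = std::move(on_chunk)]() {
      std::istringstream words(prompt);
      std::string word;
      while (words >> word) {
        on_chunk(word + " ", /*done=*/false);  // stand-in for one decoded token
      }
      on_chunk("", /*done=*/true);             // end of stream
    }).detach();
  }
};

int main() {
  InferenceWorker worker;
  worker.infer("echo each word back as a streamed chunk",
               [](const std::string& chunk, bool done) {
                 // The real filter would post this back to the Envoy dispatcher
                 // and emit it through the stream encoder callbacks.
                 if (!done) {
                   std::cout << chunk << std::flush;
                 } else {
                   std::cout << "\n[stream finished]\n";
                 }
               });
  std::this_thread::sleep_for(std::chrono::seconds(1));  // let the demo finish
  return 0;
}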

Build and run
bazel build //contrib/exe:envoy-static    # build
bazel-bin/contrib/exe/envoy-static -c "envoy.yaml" --concurrency 1    # run
Configuration

Filter-level configuration:

- name: envoy.filters.http.llm_inference
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.llm_inference.v3.modelParameter
    n_threads: 100
    n_parallel: 5
    modelpath: {
      "qwen2": "/home/yuanjq/model/qwen2-7b-instruct-q5_k_m.gguf",
      "llama3": "/home/yuanjq/model/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
      "bge": "/home/yuanjq/model/bge-small-zh-v1.5-f32.gguf"
    }

n_threads: the maximum number of threads the inference threads may use
n_parallel: the maximum number of requests the inference service handles in parallel
modelpath: the local file path of each model, keyed by model name
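
For orientation, these settings plausibly map onto a small config object like the sketch below. ModelParameter is a type referenced in the review snippets further down; FilterLevelConfig and the exact field types are assumptions made only for illustration.

// Hedged sketch of the parsed filter-level configuration. `ModelParameter`
// matches a type referenced in the review snippets below; `FilterLevelConfig`
// and the field types are illustrative assumptions.
#include <cstdint>
#include <map>
#include <string>

struct ModelParameter {
  uint32_t n_threads;   // maximum threads available to the inference threads
  uint32_t n_parallel;  // maximum number of requests served in parallel
};

struct FilterLevelConfig {
  ModelParameter parameter;
  // model name -> local GGUF path, mirroring the `modelpath` map above
  std::map<std::string, std::string> model_path;
};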

Route-level configuration:

route_config:
  name: route
  virtual_hosts:
  - name: llm_inference_service
    domains: ["api.openai.com"]
    routes:
    - match:
        prefix: "/v1/chat/completions"
      typed_per_filter_config:
        envoy.filters.http.llm_inference:
          "@type": type.googleapis.com/envoy.extensions.filters.http.llm_inference.v3.modelChosen
          usemodel: "qwen2"
          first_byte_timeout: 4
          inference_timeout: 90
          embedding : false
      direct_response:
        status: 504
        body:
          inline_string: "inference timeout"

usemodel: the model to use; the name must match a key configured in modelpath
first_byte_timeout: timeout for the first byte of the inference response
inference_timeout: timeout for the overall inference
embedding: whether the chosen model is an embedding model
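
These per-route settings correspond to the ModelChosen object referenced in the review comments below. Only model_name is confirmed there; the remaining members in this sketch simply mirror the YAML keys and are assumptions.

// Hedged sketch of the per-route configuration. `ModelChosen` and
// `model_name` appear in the review snippets below; the other members are
// illustrative assumptions that mirror the YAML keys above.
#include <cstdint>
#include <string>

struct ModelChosen {
  std::string model_name;       // `usemodel`; must be a key of `modelpath`
  uint32_t first_byte_timeout;  // `first_byte_timeout`
  uint32_t inference_timeout;   // `inference_timeout`
  bool embedding;               // `embedding`: true for embedding models
};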

Supported models
Demo

[Demo GIF: screen recording from 2024-08-29]

@CLAassistant commented Aug 28, 2024

CLA assistant check: all committers have signed the CLA.

Review comment:

LLMInferenceFilterConfigPerRouteSharedPtr config =
    std::make_shared<LLMInferenceFilterConfigPerRoute>(LLMInferenceFilterConfigPerRoute(proto_config));

model_Chosen_ = config->modelChosen();


Suggested change:
-  model_Chosen_ = config->modelChosen();
+  model_chosen_ = config->modelChosen();

Comment on lines 54 to 55:

if (modelpath.contains(model_Chosen_.model_name)) {
  ctx = inference->load(inference, config->modelParameter(), model_Chosen_, modelpath[model_Chosen_.model_name]);


Suggested change:
-  if (modelpath.contains(model_Chosen_.model_name)) {
-    ctx = inference->load(inference, config->modelParameter(), model_Chosen_, modelpath[model_Chosen_.model_name]);
+  if (modelpath.contains(model_chosen_.model_name)) {
+    ctx = inference->load(inference, config->modelParameter(), model_chosen_, modelpath[model_chosen_.model_name]);

Review comment:

    const envoy::extensions::filters::http::llm_inference::v3::modelChosen& proto_config,
    Server::Configuration::ServerFactoryContext&, ProtobufMessage::ValidationVisitor&) override;

ModelChosen model_Chosen_;


Suggested change:
-  ModelChosen model_Chosen_;
+  ModelChosen model_chosen_;


Review comment:

LLMInferenceFilterConfig::LLMInferenceFilterConfig(
    const envoy::extensions::filters::http::llm_inference::v3::modelParameter& proto_config)
    : modelParameter_{proto_config.n_threads(), proto_config.n_parallel()},


Please watch the variable naming throughout: member variables should use the underscore (snake_case) style, not camelCase.

Review comment:

std::shared_ptr<InferenceContext> load(std::shared_ptr<InferenceSingleton> singleton, const ModelParameter& model_parameter,
                                       const ModelChosen& model_chosen, const std::string& model_path) {
  std::shared_ptr<InferenceContext> ctx;
  absl::MutexLock lock(&mu_);


load should only happen during config parsing, right? That all runs on the main thread, so there is no need to take the lock.


Review comment:

InferenceContextSharedPtr ctx;
auto modelpath = config->modelPath();
if (modelpath.contains(model_Chosen_.model_name)) {


If multiple routes choose different models, there is a problem here: only the last model ends up being loaded. I think you could simply load all of the configured models here. That is not a big deal, since the user configured them anyway.
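
For illustration, the reviewer's suggestion amounts to iterating over the whole modelpath map instead of only the route's chosen entry. The sketch below assumes a load(name, path) helper and placeholder types; it is not the PR's actual code.

// Hedged sketch of the reviewer's suggestion: load every configured model at
// config-parse time instead of only the one chosen by the current route.
// `InferenceContext` and `load()` are placeholders for the PR's real types.
#include <map>
#include <memory>
#include <string>

struct InferenceContext {};  // placeholder for the PR's InferenceContext

std::shared_ptr<InferenceContext> load(const std::string& /*name*/,
                                       const std::string& /*path*/) {
  return std::make_shared<InferenceContext>();  // placeholder model load
}

std::map<std::string, std::shared_ptr<InferenceContext>>
loadAllModels(const std::map<std::string, std::string>& modelpath) {
  std::map<std::string, std::shared_ptr<InferenceContext>> contexts;
  for (const auto& [name, path] : modelpath) {
    contexts[name] = load(name, path);  // every model, not just the last route's choice
  }
  return contexts;
}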
