Ask Poddy (named after "Poddy", the RunPod bot on Discord) is a user-friendly RAG (Retrieval-Augmented Generation) web application designed to showcase the ease of setting up OpenAI-compatible APIs using open-source models running serverless on RunPod. Built with Next.js, React, Tailwind, the Vercel AI SDK, and LangChain, it uses Meta-Llama-3-8B-Instruct as the LLM and multilingual-e5-large-instruct for text embeddings.
This tutorial will guide you through deploying Ask Poddy in your own environment, enabling it to answer RunPod-related questions by leveraging the open-source workers worker-vllm and worker-infinity-embedding.
Ask Poddy is designed to demonstrate the integration of serverless OpenAI-compatible APIs with open-source models. The application runs locally (though it could also be deployed to the cloud), while the computational heavy lifting is handled by serverless endpoints on RunPod. This architecture allows seamless use of existing OpenAI-compatible tools and frameworks without needing to develop custom APIs.
Here's how RAG works in Ask Poddy:
- User: Asks a question.
- Embedding: The question is sent to LangChain, which uses the worker-infinity-embedding endpoint to convert it into an embedding using the multilingual-e5-large-instruct model.
- Vector Store: Performs a similarity search to find relevant documents based on the question.
- AI SDK: The retrieved documents and the user's question are sent to the worker-vllm endpoint.
- worker-vllm: Generates an answer using the Meta-Llama-3-8B-Instruct model.
- User: Receives the answer.
Tip
You can choose any of the supported models that come with vLLM.
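To make this flow concrete, here is a minimal TypeScript sketch of the query-time path, written as a hypothetical standalone function rather than the actual Ask Poddy source. It assumes the LangChain JS and Vercel AI SDK packages, the environment variables you will create later in this guide, the OpenAI-compatible `/openai/v1` route that both workers expose on their serverless endpoints, and an already populated vector store in a `./data/vectorstore` directory:

```typescript
// Hypothetical sketch of Ask Poddy's query-time RAG flow (not the actual app code).
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";
import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";

// worker-vllm serves an OpenAI-compatible API on the serverless endpoint.
const vllm = createOpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_VLLM}/openai/v1`,
});

// worker-infinity-embedding serves an OpenAI-compatible embeddings API.
const embeddings = new OpenAIEmbeddings({
  model: "intfloat/multilingual-e5-large-instruct",
  apiKey: process.env.RUNPOD_API_KEY,
  configuration: {
    baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_EMBEDDING}/openai/v1`,
  },
});

export async function answer(question: string): Promise<string> {
  // 1. Load the local vector store and embed the question to find similar chunks.
  const store = await HNSWLib.load("./data/vectorstore", embeddings);
  const docs = await store.similaritySearch(question, 4);
  const context = docs.map((doc) => doc.pageContent).join("\n\n");

  // 2. Send the retrieved chunks plus the question to the worker-vllm endpoint.
  const { text } = await generateText({
    model: vllm("meta-llama/Meta-Llama-3-8B-Instruct"),
    system: `Answer the question using only this RunPod documentation:\n${context}`,
    prompt: question,
  });
  return text;
}
```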
- git installed
- Node.js and npm installed
- RunPod account
- Clone the Ask Poddy repository and go into the cloned directory:
git clone https://github.com/blib-la/ask-poddy.git
cd ask-poddy
- Clone the RunPod docs repository into `ask-poddy/data/runpod-docs`:
git clone https://github.com/runpod/docs.git ./data/runpod-docs
Note
The RunPod docs repository contains the RunPod documentation that Ask Poddy will use to answer questions.
- Copy the `img` folder from `./data/runpod-docs/static/img` to `./public`.
Note
This makes it possible for Ask Poddy to include images from the RunPod documentation.
Navigate to the `ask-poddy` directory and install the dependencies:
npm install
- Create two network volumes with 15 GB of storage each in the same data center as the serverless endpoints:
  - Volume for embeddings: `infinity_embeddings`
  - Volume for the LLM: `vllm_llama3`
Note
Using network volumes ensures that the models and embeddings are stored persistently, allowing for faster subsequent requests as the data does not need to be downloaded or recreated each time.
- Follow the guide for setting up the vLLM endpoint, but make sure to use the `meta-llama/Meta-Llama-3-8B-Instruct` model instead of the one mentioned in the guide, and select the network volume `vllm_llama3` when creating the endpoint.
Tip
This endpoint runs the worker-vllm worker.
- Create a new template.
- Use the Docker image `runpod/worker-infinity-embedding:stable-cuda12.1.0` from worker-infinity-embedding and set the environment variable `MODEL_NAMES` to `intfloat/multilingual-e5-large-instruct`.
- Create a serverless endpoint and make sure to select the network volume `infinity_embeddings`.
- Generate your RunPod API key
- Find the endpoint IDs listed under each deployed serverless endpoint.
- Create your `.env.local` based on `.env.local.example`, or create the file with the following variables:
RUNPOD_API_KEY=your_runpod_api_key
RUNPOD_ENDPOINT_ID_VLLM=your_vllm_endpoint_id
RUNPOD_ENDPOINT_ID_EMBEDDING=your_embedding_endpoint_id
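Optionally, you can verify that both endpoints respond before moving on. The snippet below is a hypothetical sanity check (not part of Ask Poddy) that uses the `openai` npm package against the OpenAI-compatible `/openai/v1` route assumed to be exposed by each endpoint; run it with the variables from `.env.local` available in your shell (for example with `npx tsx`):

```typescript
// Hypothetical sanity check for the two RunPod serverless endpoints.
import OpenAI from "openai";

const llm = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_VLLM}/openai/v1`,
});
const embedder = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_EMBEDDING}/openai/v1`,
});

// Chat completion via worker-vllm.
const chat = await llm.chat.completions.create({
  model: "meta-llama/Meta-Llama-3-8B-Instruct",
  messages: [{ role: "user", content: "What is RunPod in one sentence?" }],
});
console.log(chat.choices[0].message.content);

// Embedding via worker-infinity-embedding.
const embedding = await embedder.embeddings.create({
  model: "intfloat/multilingual-e5-large-instruct",
  input: "What is RunPod?",
});
console.log("Embedding dimensions:", embedding.data[0].embedding.length);
```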
To populate the vector store, run the following command:
npm run populate
Note
The first run will take some time as the worker downloads the embeddings model (multilingual-e5-large-instruct). Subsequent requests will use the downloaded model stored in the network volume.
This command reads all markdown documents from the `ask-poddy/data/runpod-docs/` folder, creates embeddings using the embedding endpoint running on RunPod, and stores these embeddings in the local vector store:
- Documents: The markdown documents from the `ask-poddy/data/runpod-docs/` folder are read by LangChain.
- Chunks: LangChain converts the documents into smaller chunks, which are then sent to the worker-infinity-embedding endpoint.
- worker-infinity-embedding: Receives the chunks, generates embeddings using the multilingual-e5-large-instruct model, and sends them back.
- Vector Store: LangChain saves these embeddings in the local vector store (HNSWlib).
Tip
A vector store is a database that stores embeddings (vector representations of text) to enable efficient similarity search. This is crucial for the RAG process as it allows the system to quickly retrieve relevant documents based on the user's question.
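The steps above map roughly onto the following LangChain JS pipeline. This is a simplified, hypothetical sketch of what such a populate step can look like, not the actual `npm run populate` script; the chunk sizes and the `./data/vectorstore` output path are illustrative assumptions:

```typescript
// Hypothetical sketch of the populate step (simplified; not the actual script).
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";

// 1. Read every markdown document from the cloned RunPod docs repository.
const loader = new DirectoryLoader("./data/runpod-docs", {
  ".md": (path) => new TextLoader(path),
  ".mdx": (path) => new TextLoader(path),
});
const docs = await loader.load();

// 2. Split the documents into smaller chunks (sizes here are illustrative).
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 100,
});
const chunks = await splitter.splitDocuments(docs);

// 3. Embed the chunks via the worker-infinity-embedding endpoint on RunPod.
const embeddings = new OpenAIEmbeddings({
  model: "intfloat/multilingual-e5-large-instruct",
  apiKey: process.env.RUNPOD_API_KEY,
  configuration: {
    baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_EMBEDDING}/openai/v1`,
  },
});

// 4. Store the embeddings in the local HNSWlib vector store.
const store = await HNSWLib.fromDocuments(chunks, embeddings);
await store.save("./data/vectorstore");
```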
- Start the local web server:
npm run dev
- Open http://localhost:3000 to access the UI.
Now that everything is running, you can ask RunPod-related questions, such as:
- What is RunPod?
- How do I create a serverless endpoint?
- What are the benefits of using a network volume?
- How can I become a host for the community cloud?
- Can RunPod help my startup get going?
Note
The first run will take some time as the worker downloads the LLM (Meta-Llama-3-8B-Instruct). Subsequent requests will use the downloaded model stored in the network volume.