This project shows how to run an open-source Large Language Model (LLM), specifically the Qwen3-Coder-30B-A3B-Instruct, on enterprise-grade infrastructure. In this case, we will use it for a coding assistant scenario. This setup provides a complete end-to-end pipeline: from provisioning GPU-backed Kubernetes clusters to serving the model via a secure API and connecting it to real-world developer tools like OpenCode and OpenWebUI.
- Model: Qwen3-Coder-30B-A3B-Instruct, a model optimized for coding tasks that balances strong capability with relatively lightweight resource requirements.
- Orchestration (KubeRay): A Kubernetes operator that manages Ray Clusters and Ray Services, providing the framework needed to schedule, deploy, and serve LLMs at scale.
- Infrastructure (Akamai Cloud): High-performance NVIDIA Blackwell or Ada GPU nodes on Linode Kubernetes Engine (LKE).
- Networking (Istio & Gateway API): Modern ingress management using Istio and the Kubernetes Gateway API to route and secure traffic to the LLM service.
- Clients: Integration with OpenCode for AI-powered IDE features and OpenWebUI for a familiar, browser-based chat interface.
Ray is an open-source unified compute framework that simplifies the process of scaling Python and AI workloads. Running LLMs often requires distributed computing across multiple GPUs; Ray handles this complexity (such as tensor parallelism) natively.
KubeRay brings this power to Kubernetes. By using the KubeRay operator, you can manage your AI infrastructure as code, allowing for:
- Simplified Deployment: Using custom resources like RayService to define the model's environment and scaling logic in a single YAML file (see the sketch after this list).
- Elastic Scaling: Automatically adjusting the number of worker replicas based on request traffic to optimize resource usage.
- Resiliency: Leveraging Kubernetes' self-healing capabilities to ensure your LLM endpoints remain highly available.
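For orientation, a RayService resource has roughly the following shape. This is a minimal, illustrative sketch only - the names, image tag, and Serve application config below are placeholders, not the manifest used later in this guide:

```bash
# Illustrative sketch of a RayService - do not apply as-is; the real manifest comes later.
cat <<'EOF' > rayservice-sketch.yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: my-llm                      # placeholder name
spec:
  serveConfigV2: |
    # Ray Serve application config: model id, route prefix,
    # GPU / tensor-parallelism settings, autoscaling, etc.
    applications: []
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.49.0      # placeholder image tag
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 1
        minReplicas: 1
        maxReplicas: 2                          # KubeRay scales workers within these bounds
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.49.0
                resources:
                  limits:
                    nvidia.com/gpu: "1"         # one GPU per worker pod
EOF
```

A single resource like this captures the model's runtime environment, its scaling bounds, and the GPU resources it needs.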
- A HuggingFace account, to download LLM weights.
- An Akamai Cloud (formerly Linode) account with access to GPUs. For NVIDIA RTX 4000 Ada GPUs, see here. For NVIDIA Blackwell, request access here.
- OpenCode installed on your device.
- Helm and kubectl installed on your device.
- Linode CLI installed on your device.
All code samples are available at https://github.com/akamai-developers/kuberay-gpu-llm-quickstart. Clone the repository:
```bash
git clone https://github.com/akamai-developers/kuberay-gpu-llm-quickstart
cd kuberay-gpu-llm-quickstart
```

In the Akamai Cloud Console, create a Linode API token with read/write permissions for Kubernetes and NodeBalancers, and read permission for Events.

Create an LKE cluster with two node pools: one with a Blackwell GPU Linode for the model, and one with standard Linodes to run the other workloads:
```bash
export LINODE_CLI_TOKEN=<token from previous step>
linode-cli lke cluster-create \
  --k8s_version 1.34 \
  --label myllm \
  --region us-sea \
  --tier standard \
  --control_plane.high_availability true \
  --node_pools.count 1 \
  --node_pools.type g3-gpu-rtxpro6000-blackwell-2 \
  --node_pools.count 3 \
  --node_pools.type g6-standard-4 \
  --json | jq -r '.[].id'
export CLUSTER_ID=<id from the previous command>
```

If you are using Ada GPUs, replace g3-gpu-rtxpro6000-blackwell-2 with g2-gpu-rtx4000a4-m in the command above.
Wait for the cluster’s kubeconfig to be ready, and save it.
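The kubeconfig is not available until provisioning finishes, so the command below may initially fail. One way to wait for it (a small sketch that only re-uses the same kubeconfig-view call):

```bash
# Poll until the LKE API starts returning the cluster's kubeconfig (usually a few minutes).
until linode-cli lke kubeconfig-view $CLUSTER_ID --json >/dev/null 2>&1; do
  echo "kubeconfig not ready yet, retrying in 15s..."
  sleep 15
done
```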
```bash
linode-cli lke kubeconfig-view $CLUSTER_ID --json | jq -r '.[].kubeconfig' | base64 -d > kubeconfig.yaml
export KUBECONFIG=kubeconfig.yaml
```

Exporting KUBECONFIG points the kubectl and Helm commands that follow at the new cluster. Then install the NVIDIA GPU Operator. It provisions and configures the GPU drivers and the NVIDIA device plugin, which lets us use the GPUs from within Kubernetes.
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.10.0
```

Next, install the KubeRay operator, which will manage the RayCluster and RayService resources:

```bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.5.0.
helm install --wait kuberay-operator kuberay/kuberay-operator --version 1.5.0
```

Verify that the operator is running:
```bash
kubectl get po
NAME                                READY   STATUS    RESTARTS   AGE
kuberay-operator-5c575cccb6-b99wj   1/1     Running   0          12m
```

Gateway API is the modern way of doing ingress to services in a Kubernetes cluster. Install its CRDs, which give us the routing primitives we need:
```bash
kubectl get crd gateways.gateway.networking.k8s.io &> /dev/null || \
  { kubectl kustomize "github.com/kubernetes-sigs/gateway-api/config/crd?ref=v1.4.0" | kubectl apply -f -; }
```

Install Istio using Helm. Istio acts as the Gateway controller: it watches the Gateway custom resources and provisions a Linode NodeBalancer to route traffic from outside the cluster to the LLM running inside it.
```bash
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update
helm install --wait istio-base istio/base -n istio-system --set defaultRevision=default --create-namespace
helm install --wait istiod istio/istiod -n istio-system
```
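For reference, the manifests applied later define a Gateway handled by Istio and an HTTPRoute that forwards traffic to the Ray Serve service. A minimal, illustrative sketch (the resource names and listener details here are assumptions; the manifests in the repository are authoritative):

```bash
# Illustrative sketch only - the actual Gateway/HTTPRoute ship with the repo's manifests.
cat <<'EOF' > gateway-sketch.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  gatewayClassName: istio              # Istio acts as the Gateway controller
  listeners:
    - name: http
      port: 80
      protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: llm-gateway                # attach the route to the Gateway above
  rules:
    - backendRefs:
        - name: ray-serve-llm-serve-svc   # the Serve service created by the RayService
          port: 8000
EOF
```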
Hugging Face lets you download models from many providers - it is similar to Docker Hub, but for machine learning models. Create an access token so you can download model weights.
Once you have the Hugging Face token, set it in your environment. Also choose a random API key - it will be used to secure traffic to the LLM we deploy - and export it as well.
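One way to generate a random key (a sketch; any hard-to-guess string will do):

```bash
# Produce a 64-character hex string to use as the gateway API key.
openssl rand -hex 32
```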
echo "export HF_TOKEN=<yourtokenhere>" > .envrc
echo "export OPEN_API_KEY=<your chosen key here>" > .envrcNow we’re all set to deploy things to the cluster! Deploy the ray-serve config - this deploys the model qwen-coder-30b. Qwen is a family of models developed by Alibaba. The 30b coder model is optimized for coding tasks, but isn’t super huge that you need a GPU farm to run. It strikes a balance between capabilities and being lightweight.
This step will take some time, so it's a great moment to grab a cup of your favorite beverage 😀
```bash
source .envrc
export KUBECONFIG=kubeconfig.yaml
kustomize build manifests | envsubst | kubectl apply -f -
```

Wait for it to become healthy by checking the status of the model:
```bash
kubectl describe rayservice ray-serve-llm
```

KubeRay also lets us view the status and inspect the state of the Ray cluster - the admin interface (the Ray dashboard) can be accessed using a port-forward:
```bash
kubectl port-forward svc/ray-serve-llm-head-svc 8265
```

Then open http://localhost:8265 in your browser to view the dashboard.

Test your model by sending it some test messages.
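As an optional first smoke test (an extra step not in the original walkthrough, assuming the gateway routes /v1/models the same way as the completions path), list the models the endpoint exposes:

```bash
# Look up the gateway's public IP and list the models behind the OpenAI-compatible API.
SERVICE_IP=$(kubectl get svc llm-gateway-istio -o yaml | yq -r '.status.loadBalancer.ingress[0].ip')
curl --header "Authorization: Bearer $OPEN_API_KEY" "http://$SERVICE_IP/v1/models"
```

Then send a full chat-completion request: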
```bash
SERVICE_IP=$(kubectl get svc llm-gateway-istio -o yaml | yq -r '.status.loadBalancer.ingress[0].ip')
curl --location "http://$SERVICE_IP/v1/chat/completions" --header "Authorization: Bearer $OPEN_API_KEY" --header "Content-Type: application/json" --data '{
  "model": "qwen3-coder-30b-a3b-instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Provide steps to configure Ray on LKE"
    }
  ]
}'
```

Next, connect OpenCode to your model. Look up the gateway's IP again and add a provider entry to your OpenCode configuration (for example, an opencode.json file in your project), replacing $SERVICE_IP with the actual IP:

```bash
SERVICE_IP=$(kubectl get svc llm-gateway-istio -o yaml | yq -r '.status.loadBalancer.ingress[0].ip')
```

```json
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"mymodel": {
"npm": "@ai-sdk/openai-compatible",
"name": "My awesome model",
"options": {
"baseURL": "http://$SERVICE_IP/v1"
},
"models": {
"qwen3-coder-30b-a3b-instruct": {
"name": "Qwen3 Coder"
}
}
}
}
}And then login using
```bash
opencode auth login
```

Scroll all the way to the bottom, choose Other, and enter your API key (the OPEN_API_KEY you generated earlier).
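Once authenticated, launch OpenCode from the project you want to work on (a minimal usage sketch; the path below is hypothetical and the exact model-picker flow depends on your OpenCode version):

```bash
# Start the OpenCode TUI in your project directory.
cd /path/to/your/project
opencode
```

Then select Qwen3 Coder from the provider you configured ("My awesome model") and start prompting.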
Sometimes it's nice to open up a browser and just chat with the LLM instead of firing up OpenCode. Let's see how to do that with OpenWebUI!
```bash
cat <<EOF >openwebui-values.yaml
ollama:
  enabled: false
openaiBaseApiUrls:
  - "http://ray-serve-llm-serve-svc.default.svc.cluster.local:8000/v1"
openaiApiKeys:
  - "$OPEN_API_KEY"
service:
  type: ClusterIP # change this to LoadBalancer to expose it publicly
  port: 8080
EOF
```

Add the Open WebUI Helm repository (https://helm.openwebui.com) if you haven't already, then install the chart:

```bash
helm repo add open-webui https://helm.openwebui.com/
helm repo update
helm install open-webui open-webui/open-webui \
--namespace open-webui \
--create-namespace \
-f openwebui-values.yaml
```
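The service is ClusterIP by default, so reach the UI with a port-forward (a sketch assuming the chart's default service name, open-webui, in the open-webui namespace):

```bash
# Forward the Open WebUI service to localhost:8080.
kubectl -n open-webui port-forward svc/open-webui 8080:8080
```

Open http://localhost:8080 in your browser and chat away!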
Traditional AI API providers often come with unpredictable costs and data privacy concerns. Deploying on Akamai Cloud (formerly Linode) solves both:
- Performance: Akamai's Blackwell and Ada GPUs provide the high memory bandwidth and throughput needed to run 30B+ parameter models on infrastructure you control.
- Predictable Economics: You pay only for the hardware you use, eliminating the "token-based" pricing models of black-box AI services.
- Intellectual Property Protection: Because you control the entire stack, you can rest assured that your proprietary code and data are never used for training third-party models.










