A Kubernetes-based ML platform on Azure for fine-tuning and serving GPT-2 models.
┌─────────────────────────────────────────────────────────────┐
│                   NGINX Ingress (:31622)                    │
│         / → Frontend         │  /api → Backend              │
│                              │  /jupyter → JupyterLab       │
└─────────────────────────────────────────────────────────────┘
                               │
        ┌──────────────────────┴──────────────────────┐
        │                                             │
┌───────┴───────┐                            ┌────────┴────────┐
│ Control Plane │                            │   GPU Worker    │
│  (Standard)   │                            │  (NC4as T4 v3)  │
├───────────────┤                            ├─────────────────┤
│ • Backend     │                            │ • JupyterLab    │
│ • Frontend    │                            │ • Training Jobs │
│ • K8s Control │                            │ • NVIDIA T4 GPU │
└───────┬───────┘                            └────────┬────────┘
        │                                             │
        └──────────────────────┬──────────────────────┘
                               │
                    ┌──────────┴──────────┐
                    │ Azure Blob Storage  │
                    │  (uploads, models)  │
                    └─────────────────────┘
| Component | Description |
|---|---|
| Frontend | Streamlit UI for chat and model management |
| Backend | FastAPI for inference and training job submission |
| JupyterLab | Interactive notebooks with GPU access |
| NGINX Ingress | Unified access with path-based routing |
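As a rough illustration of the path-based routing in the table, the sketch below models prefix matching on whole path segments, with the longest matching prefix winning. `ROUTES`, `resolve_service`, and the sample request paths are hypothetical names used only for illustration; the real routing is done by the NGINX Ingress rules in `k8s/`, which may differ in detail (for example, in path rewrites).

```python
# Illustrative only: a simplified model of path-based routing.
# ROUTES and resolve_service are hypothetical names, not part of the platform.
ROUTES = {
    "/": "frontend",
    "/api": "backend",
    "/jupyter": "jupyterlab",
}


def resolve_service(path: str) -> str:
    """Return the service whose prefix matches `path` on whole segments.

    The longest matching prefix wins, so "/api/generate" goes to the backend
    while "/apix" falls through to the frontend catch-all.
    """
    best_prefix = "/"
    for prefix in ROUTES:
        if prefix == "/":
            continue
        # Match the prefix itself or anything below it ("/api", "/api/...").
        if path == prefix or path.startswith(prefix + "/"):
            if len(prefix) > len(best_prefix):
                best_prefix = prefix
    return ROUTES[best_prefix]


if __name__ == "__main__":
    # The request paths here are made-up examples.
    for p in ("/", "/api/generate", "/jupyter/lab", "/apix"):
        print(p, "->", resolve_service(p))
```

Matching on whole segments keeps a path such as `/apix` from being mistaken for the backend prefix `/api`.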
- Azure subscription
- Terraform installed
- Docker installed
cd terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values
terraform init
terraform apply

# Login to ACR
az acr login --name <acr-name>
# Build images
docker build -t <acr>.azurecr.io/ml-backend:v1 docker/backend/
docker build -t <acr>.azurecr.io/ml-frontend:v1 docker/frontend/
docker build -t <acr>.azurecr.io/ml-jupyterlab:v1 docker/jupyterlab/
# Push images
docker push <acr>.azurecr.io/ml-backend:v1
docker push <acr>.azurecr.io/ml-frontend:v1
docker push <acr>.azurecr.io/ml-jupyterlab:v1

kubectl apply -f k8s/

Via NGINX Ingress (port 31622):
- Frontend: http://<control-plane-ip>:31622/
- Backend API: http://<control-plane-ip>:31622/api/
- JupyterLab: http://<control-plane-ip>:31622/jupyter/
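For scripting against a deployment, the three endpoints can be derived from the control plane address. This is a minimal sketch; `endpoint_urls` and `INGRESS_PORT` are hypothetical names, and the IP in the example is a placeholder rather than a real deployment.

```python
# Illustrative helper; endpoint_urls and INGRESS_PORT are hypothetical names.
INGRESS_PORT = 31622  # NodePort of the NGINX Ingress


def endpoint_urls(control_plane_ip: str, port: int = INGRESS_PORT) -> dict[str, str]:
    """Build the externally reachable URL for each service behind the Ingress."""
    base = f"http://{control_plane_ip}:{port}"
    return {
        "frontend": f"{base}/",
        "backend": f"{base}/api/",
        "jupyterlab": f"{base}/jupyter/",
    }


if __name__ == "__main__":
    # 10.0.0.4 is a placeholder address, not a real deployment.
    for name, url in endpoint_urls("10.0.0.4").items():
        print(f"{name}: {url}")
```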
├── terraform/              # Azure infrastructure (VMs, ACR, Storage)
├── cloud-init/             # VM initialization scripts
├── docker/                 # Container images
│   ├── backend/            # FastAPI backend
│   ├── frontend/           # Streamlit frontend
│   └── jupyterlab/         # JupyterLab with GPU support
├── k8s/                    # Kubernetes manifests
└── architecture-diagram/
- Blob Storage Cache Delay - Azure blobfuse has a 120s cache; fixed by forcing an `ls` refresh before file reads
- Wrong ACR URL - Hardcoded registry URL; fixed by using an environment variable
- Missing imagePullSecrets - Training jobs couldn't pull images; added ACR credentials to the job spec
- Wrong Namespace - Jobs were created in `default` instead of `ml-platform`; fixed the namespace reference
- kubeadm Version Mismatch - Worker/control plane version drift; ensure both nodes run the same K8s version
- Missing CNI Plugins - Pods failed with "loopback not found"; manually installed the CNI binaries
- NVIDIA Driver Order - GPU not visible in containers; install the driver before the container toolkit
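Three of the fixes above (registry URL from the environment, `imagePullSecrets`, and the `ml-platform` namespace) all land in the training job spec. The sketch below shows one way such a manifest could be assembled as a plain dict. It is not the platform's actual code: `build_training_job`, the `ACR_LOGIN_SERVER` variable, the `acr-secret` secret name, the `ml-training:v1` image, and the `trainer` container name are all assumptions for illustration, and the `nvidia.com/gpu` limit assumes the NVIDIA device plugin is running on the GPU worker.

```python
import json
import os

# All names below (ACR_LOGIN_SERVER, acr-secret, ml-training:v1, trainer,
# build_training_job) are assumptions for illustration only.
NAMESPACE = "ml-platform"


def build_training_job(job_name: str) -> dict:
    """Build a Kubernetes Job manifest (as a plain dict) for a GPU training run."""
    # Read the registry from the environment instead of hardcoding it.
    registry = os.environ["ACR_LOGIN_SERVER"]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # Set the namespace explicitly so the Job does not land in "default".
        "metadata": {"name": job_name, "namespace": NAMESPACE},
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Credentials used to pull from the private registry.
                    "imagePullSecrets": [{"name": "acr-secret"}],
                    "containers": [
                        {
                            "name": "trainer",
                            "image": f"{registry}/ml-training:v1",
                            # Request one GPU (assumes the NVIDIA device plugin).
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder registry so the example runs without a real ACR.
    os.environ.setdefault("ACR_LOGIN_SERVER", "example.azurecr.io")
    print(json.dumps(build_training_job("gpt2-finetune"), indent=2))
```

Failing fast with a `KeyError` when the variable is unset is a deliberate choice here: a missing registry setting surfaces at submission time rather than as an image pull error inside the cluster.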
- Cloud: Azure (VMs, ACR, Blob Storage)
- Orchestration: Kubernetes (kubeadm)
- ML: PyTorch, Transformers, GPT-2
- Backend: FastAPI
- Frontend: Streamlit
- GPU: NVIDIA T4 with CUDA 12.1