diff --git a/llama-benchmarking/readme.md b/llama-benchmarking/readme.md
new file mode 100644
index 00000000..b4db1b02
--- /dev/null
+++ b/llama-benchmarking/readme.md
@@ -0,0 +1,78 @@
+# Llama 3.1 8B Training Example for Bacalhau
+
+This repository contains a single-node training example using NVIDIA's Llama 3.1 8B model, adapted for running on Bacalhau. It is a simplified version that demonstrates basic LLM training using 8 GPUs on a single node.
+
+Based on https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama31-8b-dgxc-benchmarking-a
+
+## Overview
+
+- Single-node training of the Llama 3.1 8B model
+- Uses NVIDIA's NeMo framework
+- Supports 8 GPUs on a single node
+- Uses synthetic data by default
+- Supports both FP8 and BF16 data types
+
+
+## Structure
+
+```
+.
+├── Dockerfile              # Container definition using NeMo base image
+├── llama3.1_24.11.1/       # Configuration files
+│   └── llama3.1_8b.yaml    # 8B model configuration
+├── run_training.sh         # Main training script
+└── sample-job.yaml         # Bacalhau job definition
+```
+
+## Building and Pushing the Image
+
+1. Login to GitHub Container Registry:
+```bash
+echo $GITHUB_PAT | docker login ghcr.io -u YOUR_GITHUB_USERNAME --password-stdin
+```
+
+2. Build and push the image:
+```bash
+docker buildx create --use
+docker buildx build --platform linux/amd64,linux/arm64 \
+  -t ghcr.io/bacalhau-project/llama3-benchmark:24.12 \
+  -t ghcr.io/bacalhau-project/llama3-benchmark:latest \
+  --push .
+```
+
+## Running on Bacalhau
+
+Basic training job (10 steps with synthetic data):
+```bash
+bacalhau job run sample-job.yaml -V "steps=10"
+```
+
+Environment variables for customization:
+- `DTYPE`: Data type (fp8, bf16)
+- `MAX_STEPS`: Number of training steps
+- `USE_SYNTHETIC_DATA`: Whether to use synthetic data (default: true)
+
+
+## Output
+
+Training results and logs are saved to the `/results` directory, which is:
+1. Published to S3 (the `bacalhau-nvidia-job-results` bucket)
+2. Available in the job outputs
+
+The results include:
+- Training logs
+- Performance metrics
+- TensorBoard logs
+
+## Resources Required
+
+Fixed requirements:
+- 8x NVIDIA H100 GPUs (80GB each)
+- 32 CPU cores
+- 640GB system memory
+
+## Notes
+
+- Uses synthetic data by default - no data preparation needed
+- Training script is optimized for H100 GPUs
+- All settings are tuned for single-node performance
diff --git a/llama-benchmarking/sample-job.yaml b/llama-benchmarking/sample-job.yaml
new file mode 100644
index 00000000..bc0fb6e9
--- /dev/null
+++ b/llama-benchmarking/sample-job.yaml
@@ -0,0 +1,32 @@
+Name: llama3-training-{{.steps}}
+Type: ops
+Tasks:
+  - Name: main
+    Engine:
+      Type: docker
+      Params:
+        Image: ghcr.io/bacalhau-project/llama3-benchmark:latest
+        EnvironmentVariables:
+          - MAX_STEPS={{.steps}}
+        Entrypoint:
+          - /bin/bash
+        Parameters:
+          - -c
+          - |
+            ./run_training.sh
+    Publisher:
+      Type: s3
+      Params:
+        Bucket: bacalhau-nvidia-job-results
+        Key: "llama3-training/{date}/{jobID}"
+        Region: us-east-1
+    ResultPaths:
+      - Name: results
+        Path: /results
+    Resources:
+      CPU: "32"
+      Memory: "640G"
+      GPU: "8"
+    Timeouts:
+      ExecutionTimeout: 3600
+      QueueTimeout: 3600
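
Reviewer note: the README documents `DTYPE` and `USE_SYNTHETIC_DATA`, but `sample-job.yaml` only forwards `MAX_STEPS`. A sketch of how the `EnvironmentVariables` block could expose them, assuming `run_training.sh` reads these variables; the `{{.dtype}}` template variable is hypothetical and not part of this PR:

```yaml
# Hypothetical extension of the Engine.Params section in sample-job.yaml
EnvironmentVariables:
  - MAX_STEPS={{.steps}}
  - DTYPE={{.dtype}}          # fp8 or bf16; {{.dtype}} is a made-up template variable
  - USE_SYNTHETIC_DATA=true
```

If wired up this way, the invocation would presumably pass both variables, e.g. `bacalhau job run sample-job.yaml -V "steps=10" -V "dtype=bf16"` (assuming `-V` may be repeated, as with Bacalhau's `--template-vars`).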