This repository contains a single-node training example for the Llama 3.1 8B model, based on NVIDIA's NeMo recipes and adapted to run on Bacalhau. It is a simplified setup that demonstrates basic LLM training using 8 GPUs on a single node.
- Single-node training of Llama 3.1 8B model
- Uses NVIDIA's NeMo framework
- Supports 8 GPUs on a single node
- Uses synthetic data by default
- Supports both FP8 and BF16 data types
```
.
├── Dockerfile              # Container definition using NeMo base image
├── llama3.1_24.11.1/       # Configuration files
│   └── llama3.1_8b.yaml    # 8B model configuration
├── run_training.sh         # Main training script
└── sample-jobs.yaml        # Bacalhau job definition
```
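For orientation, a Bacalhau job definition along the lines of `sample-jobs.yaml` might look like the sketch below. This is not the actual file from the repository: the job name, resource values, and the templated `steps` variable are assumptions based on this README, and field names follow Bacalhau's v1.x job spec.

```yaml
# Hypothetical sketch of a Bacalhau batch job for this benchmark.
Name: llama3-8b-training
Type: batch
Count: 1
Tasks:
  - Name: train
    Engine:
      Type: docker
      Params:
        Image: ghcr.io/bacalhau-project/llama3-benchmark:24.12
        EnvironmentVariables:
          - MAX_STEPS={{.steps}}   # filled in via `-V "steps=..."` at submit time
    Resources:
      GPU: "8"
```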
- Log in to the GitHub Container Registry:

  ```shell
  echo $GITHUB_PAT | docker login ghcr.io -u YOUR_GITHUB_USERNAME --password-stdin
  ```
- Build and push the image:

  ```shell
  docker buildx create --use
  docker buildx build --platform linux/amd64,linux/arm64 \
    -t ghcr.io/bacalhau-project/llama3-benchmark:24.12 \
    -t ghcr.io/bacalhau-project/llama3-benchmark:latest \
    --push .
  ```
Basic training job (10 steps with synthetic data):

```shell
bacalhau job run sample-jobs.yaml -V "steps=10"
```
Environment variables for customization:

- `DTYPE`: Data type (`fp8` or `bf16`)
- `MAX_STEPS`: Number of training steps
- `USE_SYNTHETIC_DATA`: Whether to use synthetic data (default: `true`)
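A minimal sketch of how a training script such as `run_training.sh` can pick up these variables with shell defaults. Only the `USE_SYNTHETIC_DATA` default comes from this README; the defaults shown for `DTYPE` and `MAX_STEPS` are illustrative assumptions.

```shell
#!/bin/sh
# Read customization knobs, falling back to defaults when unset.
# DTYPE and MAX_STEPS defaults here are assumptions, not repo values.
DTYPE="${DTYPE:-fp8}"
MAX_STEPS="${MAX_STEPS:-10}"
USE_SYNTHETIC_DATA="${USE_SYNTHETIC_DATA:-true}"

echo "dtype=${DTYPE} steps=${MAX_STEPS} synthetic=${USE_SYNTHETIC_DATA}"
```

With this pattern, `DTYPE=bf16 MAX_STEPS=100 ./run_training.sh` overrides the first two values while leaving the synthetic-data default in place.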
Training results and logs are saved to the `/results` directory, which is:
- Published to S3
- Available in the job outputs
The results include:
- Training logs
- Performance metrics
- TensorBoard logs
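Once the job finishes, the published results can be inspected and downloaded with the Bacalhau CLI. This is a sketch, not output from an actual run; `<job-id>` is the ID printed by `bacalhau job run`.

```shell
bacalhau job describe <job-id>   # check job status and task state
bacalhau job get <job-id>        # download the published results locally
```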
Fixed requirements:
- 8x NVIDIA H100 GPUs (80GB each)
- 32 CPU cores
- 640GB system memory
- Uses synthetic data by default - no data preparation needed
- Training script is optimized for H100 GPUs
- All settings are tuned for single-node performance