add readme
wdbaruni committed Jan 16, 2025
1 parent c28bc05 commit 5dd0286
Showing 2 changed files with 110 additions and 0 deletions.
78 changes: 78 additions & 0 deletions llama-benchmarking/readme.md
@@ -0,0 +1,78 @@
# Llama 3.1 8B Training Example for Bacalhau

This repository contains a single-node training example for the Llama 3.1 8B model, adapted from NVIDIA's DGX Cloud benchmarking recipe to run on Bacalhau. It is a simplified setup that demonstrates basic LLM training on 8 GPUs on a single node.

Based on NVIDIA's Llama 3.1 8B DGX Cloud benchmarking resource: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama31-8b-dgxc-benchmarking-a

## Overview

- Single-node training of Llama 3.1 8B model
- Uses NVIDIA's NeMo framework
- Supports 8 GPUs on a single node
- Uses synthetic data by default
- Supports both FP8 and BF16 data types


## Structure

```
.
├── Dockerfile # Container definition using NeMo base image
├── llama3.1_24.11.1/ # Configuration files
│ └── llama3.1_8b.yaml # 8B model configuration
├── run_training.sh # Main training script
└── sample-job.yaml       # Bacalhau job definition
```

## Building and Pushing the Image

1. Log in to GitHub Container Registry:
```bash
echo $GITHUB_PAT | docker login ghcr.io -u YOUR_GITHUB_USERNAME --password-stdin
```

2. Build and push the image:
```bash
docker buildx create --use
docker buildx build --platform linux/amd64,linux/arm64 \
-t ghcr.io/bacalhau-project/llama3-benchmark:24.12 \
-t ghcr.io/bacalhau-project/llama3-benchmark:latest \
--push .
```
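After pushing, you can optionally confirm that both platforms were published by inspecting the manifest list (this is a standard `docker buildx` subcommand; it assumes the tags above were pushed successfully):

```bash
# Inspect the pushed manifest list; entries for both linux/amd64 and
# linux/arm64 should appear if the multi-platform build succeeded.
docker buildx imagetools inspect ghcr.io/bacalhau-project/llama3-benchmark:latest
```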

## Running on Bacalhau

Basic training job (10 steps with synthetic data):
```bash
bacalhau job run sample-job.yaml -V "steps=10"
```

Environment variables for customization (set via `EnvironmentVariables` in the job spec):
- `DTYPE`: Data type (`fp8` or `bf16`)
- `MAX_STEPS`: Number of training steps (wired to the `steps` template variable in `sample-job.yaml`)
- `USE_SYNTHETIC_DATA`: Whether to use synthetic data (default: `true`)
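For example, to run a longer job, pass a different `steps` value. Note that only `MAX_STEPS` is templated in the provided `sample-job.yaml`; changing `DTYPE` or `USE_SYNTHETIC_DATA` would require adding corresponding entries under `EnvironmentVariables` in the job spec:

```bash
# Run 100 training steps; the job spec maps -V "steps=..." to MAX_STEPS
bacalhau job run sample-job.yaml -V "steps=100"
```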


## Output

Training results and logs are saved to the `/results` directory, which is:
1. Published to S3 (the `bacalhau-nvidia-job-results` bucket)
2. Available in the job outputs

The results include:
- Training logs
- Performance metrics
- TensorBoard logs
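One way to browse published results, assuming AWS credentials with read access to the bucket (the key layout follows the `Publisher.Params.Key` template in `sample-job.yaml`, i.e. `llama3-training/{date}/{jobID}`):

```bash
# List result objects under the publisher's key prefix
aws s3 ls s3://bacalhau-nvidia-job-results/llama3-training/ \
  --recursive --region us-east-1
```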

## Resources Required

Fixed requirements:
- 8x NVIDIA H100 GPUs (80GB each)
- 32 CPU cores
- 640GB system memory

## Notes

- Uses synthetic data by default - no data preparation needed
- Training script is optimized for H100 GPUs
- All settings are tuned for single-node performance
32 changes: 32 additions & 0 deletions llama-benchmarking/sample-job.yaml
@@ -0,0 +1,32 @@
Name: llama3-training-{{.steps}}
Type: ops
Tasks:
- Name: main
Engine:
Type: docker
Params:
Image: ghcr.io/bacalhau-project/llama3-benchmark:latest
EnvironmentVariables:
- MAX_STEPS={{.steps}}
Entrypoint:
- /bin/bash
Parameters:
- -c
- |
./run_training.sh
Publisher:
Type: s3
Params:
Bucket: bacalhau-nvidia-job-results
Key: "llama3-training/{date}/{jobID}"
Region: us-east-1
ResultPaths:
- Name: results
Path: /results
Resources:
CPU: "32"
Memory: "640G"
GPU: "8"
Timeouts:
ExecutionTimeout: 3600
QueueTimeout: 3600
