# Llama 3.1 8B Training Example for Bacalhau

This repository contains a single-node training example using NVIDIA's Llama 3.1 8B model, adapted for running on Bacalhau. It is a simplified setup that demonstrates basic LLM training using 8 GPUs on a single node.

Based on NVIDIA's DGX Cloud benchmarking recipe for Llama 3.1 8B: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/llama31-8b-dgxc-benchmarking-a

## Overview

- Single-node training of the Llama 3.1 8B model
- Uses NVIDIA's NeMo framework
- Supports 8 GPUs on a single node
- Uses synthetic data by default
- Supports both FP8 and BF16 data types
## Structure

```
.
├── Dockerfile            # Container definition using NeMo base image
├── llama3.1_24.11.1/     # Configuration files
│   └── llama3.1_8b.yaml  # 8B model configuration
├── run_training.sh       # Main training script
└── sample-job.yaml       # Bacalhau job definition
```
## Building and Pushing the Image

1. Log in to GitHub Container Registry:
```bash
echo $GITHUB_PAT | docker login ghcr.io -u YOUR_GITHUB_USERNAME --password-stdin
```

2. Build and push the image:
```bash
docker buildx create --use
docker buildx build --platform linux/amd64,linux/arm64 \
  -t ghcr.io/bacalhau-project/llama3-benchmark:24.12 \
  -t ghcr.io/bacalhau-project/llama3-benchmark:latest \
  --push .
```
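
After pushing, you can optionally confirm that both architectures were published. This is a quick sanity check rather than part of the original workflow; it only assumes the image tags used above.

```bash
# Optional sanity check: list the platforms published under the "latest" tag
docker manifest inspect ghcr.io/bacalhau-project/llama3-benchmark:latest \
  | grep -A2 '"platform"'
```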

## Running on Bacalhau

Basic training job (10 steps with synthetic data):
```bash
bacalhau job run sample-job.yaml -V "steps=10"
```

Environment variables for customization:
- `DTYPE`: Data type (`fp8` or `bf16`)
- `MAX_STEPS`: Number of training steps
- `USE_SYNTHETIC_DATA`: Whether to use synthetic data (default: `true`)
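
The job spec below only templates `MAX_STEPS`; the other variables are presumably read by `run_training.sh` from the container environment. As a rough local sketch (not part of the Bacalhau flow, and assuming a host with 8 NVIDIA GPUs and the NVIDIA container toolkit), the same knobs could be exercised by running the image directly:

```bash
# Rough local smoke test: pass the same environment variables the job spec can set
# and invoke the training script the same way the job definition does.
docker run --rm --gpus all \
  -e DTYPE=bf16 \
  -e MAX_STEPS=20 \
  -e USE_SYNTHETIC_DATA=true \
  ghcr.io/bacalhau-project/llama3-benchmark:latest \
  /bin/bash -c "./run_training.sh"
```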

## Output

Training results and logs are saved to the `/results` directory, which is:
1. Published to S3 (the `bacalhau-nvidia-job-results` bucket)
2. Available in the job outputs

The results include:
- Training logs
- Performance metrics
- TensorBoard logs
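
For example, assuming AWS credentials with read access to the bucket, the published results can be pulled down with the AWS CLI. The key layout follows the `Key` template in the job spec (`llama3-training/{date}/{jobID}`), with the placeholders filled in per job:

```bash
# List published runs, then download one job's results locally
aws s3 ls s3://bacalhau-nvidia-job-results/llama3-training/ --recursive
aws s3 sync "s3://bacalhau-nvidia-job-results/llama3-training/<date>/<jobID>/" ./results/

# TensorBoard logs can then be inspected locally
tensorboard --logdir ./results
```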

## Resources Required

Fixed requirements:
- 8x NVIDIA H100 GPUs (80GB each)
- 32 CPU cores
- 640GB system memory

## Notes

- Uses synthetic data by default, so no data preparation is needed
- The training script is optimized for H100 GPUs
- All settings are tuned for single-node performance
The Bacalhau job definition (`sample-job.yaml`):

```yaml
Name: llama3-training-{{.steps}}
Type: ops
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ghcr.io/bacalhau-project/llama3-benchmark:latest
        EnvironmentVariables:
          - MAX_STEPS={{.steps}}
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - |
            ./run_training.sh
    Publisher:
      Type: s3
      Params:
        Bucket: bacalhau-nvidia-job-results
        Key: "llama3-training/{date}/{jobID}"
        Region: us-east-1
    ResultPaths:
      - Name: results
        Path: /results
    Resources:
      CPU: "32"
      Memory: "640G"
      GPU: "8"
    Timeouts:
      ExecutionTimeout: 3600
      QueueTimeout: 3600
```
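
A minimal end-to-end run might look like the following. This is a sketch that assumes a Bacalhau CLI version supporting `-V` template variables and the `job describe` / `job logs` subcommands; adjust to your installed version if the commands differ.

```bash
# Submit the templated job (10 training steps); the run command prints a job ID.
bacalhau job run sample-job.yaml -V "steps=10"

# Substitute the printed <jobID> to check status and stream logs (assumed subcommands).
bacalhau job describe <jobID>
bacalhau job logs <jobID>
```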