This repository contains tools for tracking GPU usage and generating dashboards. Main features:
- Collect GPU usage data from multiple companies and projects
- Generate daily, weekly, monthly, and all-time GPU usage reports
- Update dashboards using Weights & Biases (wandb)
- Detect and alert on abnormal GPU usage rates
```
.
├── Dockerfile.check_dashboard
├── Dockerfile.main
├── README.md
├── config.yaml
├── main.py
├── requirements.txt
├── src
│   ├── alart
│   │   └── check_dashboard.py
│   ├── calculator
│   │   ├── blank_table.py
│   │   ├── gpu_usage_calculator.py
│   │   └── remove_tags.py
│   ├── tracker
│   │   ├── common.py
│   │   ├── config_parser.py
│   │   ├── run_manager.py
│   │   └── set_gpucount.py
│   ├── uploader
│   │   ├── artifact_handler.py
│   │   ├── data_processor.py
│   │   └── run_uploader.py
│   └── utils
│       └── config.py
└── image
    └── gpu-dashboard.drawio.png
```
In the gpu-dashboard directory, run the following commands:
```sh
$ python3 -m venv .venv
$ . .venv/bin/activate
$ pip install -r requirements.txt
```
Request an AWS account from the administrator and assign access permissions to the following services in IAM:
- AWS Batch
- CloudWatch
- EC2
- ECS
- ECR
- EventBridge
- IAM
- VPC
Create a user for AWS CLI in IAM. Assign access permissions to the following service:
- ECR
Click on the created user and note down the following strings from the Access Keys tab:
- Access key ID
- Secret access key
Run the following command in your local Terminal to log in to AWS:

```sh
$ aws configure
AWS Access Key ID [None]:     # Enter your Access key ID
AWS Secret Access Key [None]: # Enter your Secret access key
Default region name [None]:   # Leave blank, press Enter
Default output format [None]: # Leave blank, press Enter
```

After configuration, check the connection with the following command. If successful, it will output the list of S3 files:

```sh
$ aws s3 ls
```

Reference: AWS CLI Setup Tutorial
- Navigate to `Amazon ECR > Private registry > Repositories`
- Click `Create repository`
- Enter a repository name (e.g., geniac-gpu)
- Click `Create repository`
- Click on the created repository name
- Click `View push commands`
- Execute the four displayed commands in order in your local Terminal
```sh
# Example commands
$ aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com
$ docker build -t geniac-gpu .
$ docker tag geniac-gpu:latest 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/geniac-gpu:latest
$ docker push 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/geniac-gpu:latest
```

As the commands are unique to each repository, you can easily deploy from the second time onwards by writing these commands in a shell script.
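One way to package the four push commands into a script, as suggested above. This is a sketch that reuses the example account ID, region, and repository name from this README; substitute your own values before use.

```shell
#!/bin/sh
# deploy.sh — hypothetical helper wrapping the four ECR push commands above.
# The region, account ID, and repository name are the README's example
# values; replace them with your own.
set -eu

REGION="ap-northeast-1"
ACCOUNT_ID="111122223333"
REPO="geniac-gpu"
REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

aws ecr get-login-password --region "${REGION}" \
  | docker login --username AWS --password-stdin "${REGISTRY}"
docker build -t "${REPO}" .
docker tag "${REPO}:latest" "${REGISTRY}/${REPO}:latest"
docker push "${REGISTRY}/${REPO}:latest"
```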
Create repositories for both gpu-dashboard and check-dashboard following the above steps
- Navigate to `Virtual Private Cloud > Your VPCs`
- Click `Create VPC`
- Select `VPC and more` from `Resources to create`
- Click `Create VPC`
- Navigate to `IAM > Roles`
- Click `Create role`
- Set up the `Use case`:
  - Select `Elastic Container Service` for `Service`
  - Select `Elastic Container Service Task` for `Use case`
- Select `AmazonEC2ContainerRegistryReadOnly` and `CloudWatchLogsFullAccess` for `Permission policies`
- Click `Next`
- Enter `ecsTaskExecutionRole` for `Role name`
- Click `Create role`
- Navigate to `Amazon Elastic Container Service > Clusters`
- Click `Create Cluster`
- Enter a cluster name
- Click `Create`
- Navigate to `Amazon Elastic Container Service > Task Definitions`
- Click `Create new Task Definition`, then click `Create new Task Definition`
- Enter a task definition family name
- Change `CPU` and `Memory` in `Task size` as needed
- Select `ecsTaskExecutionRole` for `Task role`
- Set up `Container - 1`:
  - Enter the repository name and image URI pushed to ECR in `Container details`
  - Set `Resource allocation limits` appropriately according to `Task size`
- Click `Add environment variable` in `Environment variables - optional` and add the following:
  - Key: WANDB_API_KEY
  - Value: {Your WANDB_API_KEY}
- Click `Create`
- Navigate to `Amazon Elastic Container Service > Clusters > {Cluster Name} > Scheduled Tasks`
- Click `Create`
- Enter a rule name for `Scheduled rule name`
- Select `cron expression` for `Scheduled rule type`
- Enter an appropriate expression in `cron expression`
  - Note that in this UI you need to enter UTC time, so `cron(15 15 * * ? *)` would be 0:15 AM Japan time
- Enter a target ID for `Target ID`
- Select the task definition from `Task Definition family`
- Select VPC and subnets in `Networking`
- If there's no existing security group in `Security group`, select `Create a new security group` and create one
- Click `Create`
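Because EventBridge interprets cron expressions in UTC, it is easy to get the nine-hour JST offset wrong. As a sanity check, a small helper (illustrative only, not part of this repository) can convert a JST schedule time into the expression this UI expects:

```python
# Illustrative helper (not part of this repository): convert a JST schedule
# time to the UTC-based daily cron expression the EventBridge UI expects.
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))

def jst_daily_cron(hour: int, minute: int) -> str:
    """Return a daily EventBridge cron expression for the given JST time."""
    # Any date works here; we only need the timezone arithmetic.
    jst = datetime(2024, 1, 2, hour, minute, tzinfo=JST)
    utc = jst.astimezone(timezone.utc)
    return f"cron({utc.minute} {utc.hour} * * ? *)"

print(jst_daily_cron(0, 15))  # -> cron(15 15 * * ? *), i.e. 0:15 AM JST
```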
Execute the following commands to set up a local Python environment for running the scheduled script.
You can edit config.yaml to minimize impact on the production environment.
```sh
$ cd gpu-dashboard
$ python3 -m venv .venv
$ . .venv/bin/activate
```

```sh
python main.py [--api WANDB_API_KEY] [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD]
```

- `--api`: wandb API key (optional, can be set as an environment variable)
- `--start-date`: Data retrieval start date (optional)
- `--end-date`: Data retrieval end date (optional)
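The option handling described above could be sketched with argparse as follows. This is an illustrative sketch, not the repository's actual main.py; the yesterday-as-default behavior mirrors the processing flow documented below.

```python
# Illustrative sketch of main.py's option handling (the actual implementation
# may differ; argparse and the helper name are assumptions).
import argparse
from datetime import date, timedelta

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Update the GPU usage dashboard")
    parser.add_argument("--api", help="wandb API key (or set the WANDB_API_KEY env var)")
    parser.add_argument("--start-date", help="data retrieval start date (YYYY-MM-DD)")
    parser.add_argument("--end-date", help="data retrieval end date (YYYY-MM-DD)")
    args = parser.parse_args(argv)
    # When unspecified, both dates default to yesterday.
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    args.start_date = args.start_date or yesterday
    args.end_date = args.end_date or yesterday
    return args
```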
```sh
python src/alart/check_dashboard.py
```

- src/tracker/: GPU usage data collection
- src/calculator/: GPU usage statistics calculation
- src/uploader/: Data upload to wandb
- src/alart/: Anomaly detection and alert functionality
- In AWS, navigate to `CloudWatch > Log groups`
- Click on `/ecs/{task definition name}`
- Click on the log stream to view logs
- Fetch latest data (src/tracker/)
  - Set start_date and end_date
    - If unspecified, both values default to yesterday's date
  - Create a list of companies
  - Fetch projects for each company [Public API]
  - Fetch runs for each project [Private API]
    - Filter by target_date, tags
    - Detect and alert runs that initialize wandb multiple times on the same instance
  - Fetch system metrics for each run [Public API]
  - Aggregate by run id x date
- Update data (src/uploader/)
  - Retrieve csv up to yesterday from Artifacts
  - Concatenate with the latest data and save to Artifacts
  - Filter run ids
- Aggregate and update data (src/calculator)
  - Remove latest tag
  - Aggregate retrieved data
    - Aggregate overall data
    - Aggregate monthly data
    - Aggregate weekly data
    - Aggregate daily data
    - Aggregate summary data
  - Update overall table
  - Update tables for each company
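The "aggregate by run id x date" step above can be illustrated with a minimal sketch. This is illustrative only; the repository's tracker and calculator modules are more involved.

```python
# Illustrative sketch (not the repository's actual code): collapsing raw
# metric samples into one total per (run id, date) pair, as in the flow above.
from collections import defaultdict

def aggregate_by_run_and_date(samples):
    """samples: iterable of (run_id, date_str, gpu_hours) tuples."""
    totals = defaultdict(float)
    for run_id, day, gpu_hours in samples:
        totals[(run_id, day)] += gpu_hours
    return dict(totals)

samples = [
    ("run-a", "2024-06-01", 1.5),
    ("run-a", "2024-06-01", 2.0),
    ("run-b", "2024-06-02", 4.0),
]
print(aggregate_by_run_and_date(samples))
# -> {('run-a', '2024-06-01'): 3.5, ('run-b', '2024-06-02'): 4.0}
```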
In the __set_gpucount method within src/tracker/run_manager.py, GPU counts for distributed processing are calculated based on different teams and configurations. Below are the calculation methods and specific examples.
- Retrieve the values of `num_nodes` and `num_gpus`, and multiply them to calculate the GPU count.
- These values are obtained from the `config` section in the configuration file.

```python
config = {"num_nodes": 2, "num_gpus": 8}
gpu_count = 2 * 8  # = 16
```

In this example, there are 2 nodes, each with 8 GPUs, resulting in a total GPU count of 16.
- Use the value of `world_size` to determine the GPU count.
- `world_size` is obtained from the `config` section in the configuration file.

```python
config = {"world_size": 16}
gpu_count = 16
```

In this example, the value of world_size is directly used as the GPU count.
- Use the value of `node.runInfo.gpuCount`.

```python
node.runInfo = {"gpuCount": 8}
gpu_count = 8
```

In this example, the GPU count is directly obtained from runInfo.
This method aims to calculate the GPU count as accurately as possible by accommodating various configuration formats. However, when encountering unexpected data formats, it sets the GPU count to 0 for safety and outputs a warning.
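Taken together, the three methods above form a fallback chain. The following is a hedged sketch of that logic; the function and variable names are illustrative, and the actual __set_gpucount in src/tracker/run_manager.py may differ in detail.

```python
# Illustrative sketch of the fallback chain described above; not the actual
# __set_gpucount implementation.
import warnings

def set_gpu_count(config, run_info):
    try:
        # 1) num_nodes x num_gpus from the run's config
        if "num_nodes" in config and "num_gpus" in config:
            return int(config["num_nodes"]) * int(config["num_gpus"])
        # 2) world_size from the run's config
        if "world_size" in config:
            return int(config["world_size"])
        # 3) gpuCount reported in node.runInfo
        if run_info and "gpuCount" in run_info:
            return int(run_info["gpuCount"])
        raise ValueError("no recognised GPU count field")
    except (ValueError, TypeError):
        # Unexpected data format: default to 0 and warn, as described above.
        warnings.warn("could not determine GPU count; defaulting to 0")
        return 0

print(set_gpu_count({"num_nodes": 2, "num_gpus": 8}, None))  # -> 16
print(set_gpu_count({"world_size": 16}, None))               # -> 16
print(set_gpu_count({}, {"gpuCount": 8}))                    # -> 8
```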
