Fix attention's _elapsed_time being repeatedly assigned to both attention_column and attention_row #27

Open · wants to merge 4 commits into base: master
Mode changed 100644 → 100755, contents unchanged:
.gitignore
Dockerfile
License
core/__init__.py
core/grouped_gemm_util.py
download/AICB_v1.0.deb
images/detail_log.png
images/readme_01.png
images/result_log.png
images/simai_dingtalk.jpg
images/simai_wechat.jpg
images/time_log.png
images/tutorial_1.png
images/tutorial_2.png
images/tutorial_3.png
images/tutorial_4.png
images/tutorial_5.png
images/tutorial_6.png
images/tutorial_7.png
log_analyzer/analyze_res_csv.py
log_analyzer/ds_comm_log_analyzer.py
log_analyzer/plot.py
log_analyzer/utils.py
results/visual_output/A100_example.html
scripts/coll_comm_check.sh
scripts/deepspeed_llama.sh
scripts/run_in_cluster.py
utils/timer.py
55 changes: 37 additions & 18 deletions utils/utils.py

@@ -244,8 +244,10 @@ def Comp_with_aiob(workload, compute_cache):
     for item in workload.workload:
         if item.comm_type == CommType.computation:
             for key in compute_cache:
-                key_temp = key.split("_")[0]
-                if key_temp in item.stage:
+                item._elapsed_time = 0
+                key_split = key.rsplit('_', 1)
+                stage_split = item.stage.rsplit('.', 2)
+                if (len(key_split) > 1 and len(stage_split) > 2) and (key_split[0] == stage_split[2]) and (key_split[1] == stage_split[0]):
                     item._elapsed_time = compute_cache[key]
                     break
     return workload
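The matching rule introduced above can be checked in isolation. Below is a minimal sketch, assuming compute_cache keys of the form `<section>_<pass>` and stage strings of the form `<pass>.<LayerClass>.<section>`; the concrete key and stage values are illustrative, not taken from a real run:

```python
def matches(key, stage):
    # compute_cache keys look like "<section>_<pass>", e.g. "attention_column_forward"
    key_split = key.rsplit('_', 1)      # -> ["attention_column", "forward"]
    # stage strings look like "<pass>.<LayerClass>.<section>"
    stage_split = stage.rsplit('.', 2)  # -> ["forward", "MegatronColumnLinear", "attention_column"]
    return (len(key_split) > 1 and len(stage_split) > 2
            and key_split[0] == stage_split[2]
            and key_split[1] == stage_split[0])

# The old prefix test ("atten..." in stage) matched every attention stage,
# so column and row timings overwrote each other; the exact match does not.
print(matches("attention_column_forward", "forward.MegatronColumnLinear.attention_column"))  # True
print(matches("attention_column_forward", "forward.MegatronRowLinear.attention_row"))        # False
```

Requiring both the section and the pass to match exactly is what stops a single cache entry from being assigned to every attention item in the workload.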
@@ -287,8 +289,10 @@ def get_comp_out(args):


 def extract_averages(file_path,args):
-    attention_avg_sum = 0.0
-    mlp_avg_sum = 0.0
+    attention_column_avg_sum = 0.0
+    attention_row_avg_sum = 0.0
+    mlp_column_avg_sum = 0.0
+    mlp_row_avg_sum = 0.0
     other_avgs = {}
     grad_forward = 0.0
     grad_backward = 0.0
@@ -314,31 +318,46 @@ def extract_averages(file_path,args):
                     grad_backward = float(avg_match.group(1)) * 1000
                 elif avg_match and current_section:
                     avg_value = float(avg_match.group(1)) * 1000
-                    if "atten" in current_section or current_section == "layernorm":
+
+                    if current_section in ["atten_qkv", "atten_core_qk", "atten_core_softmax", "atten_core_contex"]:
+                        if args.recompute_activations and 'flash' in current_section:
+                            attention_column_avg_sum += avg_value*2
+                        else:
+                            attention_column_avg_sum += avg_value
+                    elif current_section in ["atten_linear", "layernorm2"]:
                         if args.recompute_activations and 'flash' in current_section:
-                            attention_avg_sum += avg_value*2
+                            attention_row_avg_sum += avg_value*2
                         else:
-                            attention_avg_sum += avg_value
-                    elif "mlp" in current_section or current_section == "layernorm2":
-                        mlp_avg_sum += avg_value
+                            attention_row_avg_sum += avg_value
+                    elif current_section in ["mlp_linear_1", "mlp_gelu"]:
+                        mlp_column_avg_sum += avg_value
+                    elif current_section in ["mlp_linear_2"]:
+                        mlp_row_avg_sum += avg_value
                     else:
                         other_avgs[current_section] = avg_value

     # Round and convert to integers
-    attention_forward = round(attention_avg_sum)
-    attention_backward = attention_forward
-    mlp_forward = round(mlp_avg_sum)
-    mlp_backward = mlp_forward
+    attention_column_forward = round(attention_column_avg_sum)
+    attention_row_forward = round(attention_row_avg_sum)
+    attention_column_backward = attention_column_forward
+    attention_row_backward = attention_row_forward
+    mlp_column_forward = round(mlp_column_avg_sum)
+    mlp_row_forward = round(mlp_row_avg_sum)
+    mlp_column_backward = mlp_column_forward
+    mlp_row_backward = mlp_row_forward

     grad_backward = round(grad_backward)
     grad_forward = round(grad_forward)
     other_avgs_int = {k: round(v) for k, v in other_avgs.items() if k != "param_time"}

     a100_compute_cache = {
-        "attention_forward": attention_forward,
-        "attention_backward": attention_backward,
-        "mlp_forward": mlp_forward,
-        "mlp_backward": mlp_backward,
+        "attention_column_forward": attention_column_forward,
+        "attention_row_forward": attention_row_forward,
+        "attention_column_backward": attention_column_backward,
+        "attention_row_backward": attention_row_backward,
+        "mlp_column_forward": mlp_column_forward,
+        "mlp_row_forward": mlp_row_forward,
+        "mlp_column_backward": mlp_column_backward,
+        "mlp_row_backward": mlp_row_backward,
         "grad_forward": grad_forward,
         "grad_backward": grad_backward,
     }
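The effect of the new branching is to route each profiled section into one of four buckets (attention/MLP × column/row) instead of two. A minimal standalone sketch of that routing, using the section names from the diff but made-up millisecond values:

```python
# Section names are taken from the diff; the timing values are made-up examples.
ATTN_COLUMN = {"atten_qkv", "atten_core_qk", "atten_core_softmax", "atten_core_contex"}
ATTN_ROW = {"atten_linear", "layernorm2"}
MLP_COLUMN = {"mlp_linear_1", "mlp_gelu"}
MLP_ROW = {"mlp_linear_2"}

def route(section):
    """Map a profiled section name to its compute-cache bucket."""
    if section in ATTN_COLUMN:
        return "attention_column"
    if section in ATTN_ROW:
        return "attention_row"
    if section in MLP_COLUMN:
        return "mlp_column"
    if section in MLP_ROW:
        return "mlp_row"
    return "other"

sums = {}
for section, avg_ms in [("atten_qkv", 1.2), ("atten_linear", 0.7),
                        ("mlp_gelu", 0.4), ("mlp_linear_2", 0.9)]:
    bucket = route(section)
    sums[bucket] = sums.get(bucket, 0.0) + avg_ms
```

Under the old code, atten_qkv and atten_linear would both have accumulated into a single attention_avg_sum; keeping the column-parallel and row-parallel sums separate is exactly the distinction this PR introduces.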
Mode changed 100644 → 100755, contents unchanged:
visualize/example.html
visualize/inputs/A100_example.csv
workload/Workload_spec_v1.1.csv
workload/aiob_inputs/Example.txt
workload/physical/micro_test/all_gather_workload.csv
workload/physical/micro_test/all_reduce_workload.csv
workload/physical/micro_test/all_to_all_workload.csv
workload/physical/micro_test/multi_all_reduce_workload.csv
workload/physical/micro_test/reduce_scatter_workload.csv
workload/simAI/micro_test/all_gather.txt
workload/simAI/micro_test/all_reduce.txt
workload/simAI/micro_test/all_to_all.txt
workload/simAI/micro_test/muti_all_reduce.txt
workload_generator/__init__.py
workload_generator/analysis_pytorch_trace.py
workload_generator/generate_collective_test.py
workload_generator/generate_deepspeed_stage1_2_workload.py
workload_generator/generate_deepspeed_stage3_workload.py
workload_generator/generate_ds_trace_replay_workload.py
workload_generator/generate_megatron_workload.py
workload_generator/mocked_model/MockedDeepspeed.py
4 changes: 2 additions & 2 deletions workload_generator/mocked_model/MockedMegatron.py

@@ -109,7 +109,7 @@ def backward(self):
                 (self.seq_len, self.batch_size, self.output_size),
                 self.weight.shape,
             ),
-            stage="backward.MegatronRowLinear" + self.name,
+            stage="backward.MegatronRowLinear." + self.name,
         )
     )
     workloads.append(
@@ -119,7 +119,7 @@ def backward(self):
                 (self.output_size, self.seq_len * self.batch_size),
                 (self.seq_len * self.batch_size, self.input_size_per_partition),
             ),
-            stage="backward.MegatronRowLinear" + self.name,
+            stage="backward.MegatronRowLinear." + self.name,
         )
     )
     return workloads
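The added dot matters because Comp_with_aiob splits the stage string on '.' with rsplit('.', 2) and requires three parts. A small sketch of why the unseparated form can never satisfy that check (the name value here is illustrative; the real self.name values are not shown in this diff):

```python
name = "attention_row"  # illustrative layer name, not taken from the repo

old_stage = "backward.MegatronRowLinear" + name   # missing separator
new_stage = "backward.MegatronRowLinear." + name  # fixed in this PR

# With only one '.', rsplit('.', 2) yields two parts, so the
# len(stage_split) > 2 guard in Comp_with_aiob always fails.
print(old_stage.rsplit('.', 2))  # ['backward', 'MegatronRowLinearattention_row']
print(new_stage.rsplit('.', 2))  # ['backward', 'MegatronRowLinear', 'attention_row']
```

Without the dot, the class name and the section name fuse into one token, so row-linear compute items could never be matched against the compute cache at all.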
Mode changed 100644 → 100755, contents unchanged:
workload_generator/mocked_model/MockedModel.py
workload_generator/mocked_model/__init__.py
workload_generator/workload_generator.py