
cannot import name 'TEDotProductAttentionMLA' when running examples/deepseek_v2/run_mcore_deepseek.sh #359

dreasysnail opened this issue Oct 4, 2024 · 4 comments

@dreasysnail

Thank you for the great project! When I run examples/deepseek_v2/run_mcore_deepseek.sh, I get the error below:

Traceback (most recent call last):
  File "/mnt/task_runtime/examples/deepseek_v2/pretrain_deepseek.py", line 37, in <module>
    from megatron_patch.model.deepseek_v2.layer_specs import (
  File "/mnt/task_runtime/megatron_patch/model/deepseek_v2/layer_specs.py", line 19, in <module>
    from megatron.core.transformer.custom_layers.transformer_engine import (
ImportError: cannot import name 'TEDotProductAttentionMLA' from 'megatron.core.transformer.custom_layers.transformer_engine' (/mnt/task_runtime/PAI-Megatron-LM-240718/megatron/core/transformer/custom_layers/transformer_engine.py)

It appears that megatron_patch/model/deepseek_v2/layer_specs.py (shown in the traceback above) is attempting to import 'TEDotProductAttentionMLA', but when I checked the megatron.core.transformer.custom_layers.transformer_engine module, I did not find 'TEDotProductAttentionMLA' defined there.

Any help appreciated!

@dreasysnail (Author)

@Jiayi-Pan

@NiuMa-1234

Hi, have you solved the problem? I'm trying to use TEDotProductAttentionMLA too, and I found that the only difference between it and the original TEDotProductAttention is the definition of kv_channels. So I just manually changed kv_channels and kept using TEDotProductAttention. I'm not sure if this is correct.
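
For reference, a minimal sketch of that fallback, assuming (as described above) that the MLA variant differs from TEDotProductAttention only in how kv_channels is defined. The field names qk_nope_head_dim and qk_rope_head_dim are borrowed from DeepSeek-V2-style configs and are assumptions here, not code from this repo:

```python
# Hypothetical fallback sketched from the workaround above: reuse the stock
# TEDotProductAttention but point kv_channels at the MLA query head dim,
# which is typically what the MLA softmax scale is derived from.
# qk_nope_head_dim / qk_rope_head_dim are assumed config fields; adapt them
# to whatever your TransformerConfig actually exposes.
from megatron.core.transformer.custom_layers.transformer_engine import (
    TEDotProductAttention,
)


def build_mla_fallback_attention(config, layer_number, attn_mask_type, attention_type):
    config.kv_channels = config.qk_nope_head_dim + config.qk_rope_head_dim
    return TEDotProductAttention(
        config=config,
        layer_number=layer_number,
        attn_mask_type=attn_mask_type,
        attention_type=attention_type,
    )
```

Whether the resulting scale actually matches the official MLA implementation should be verified against the upstream TEDotProductAttentionMLA once it is available.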

@Jiayi-Pan

Hi, we've solved the issue. You can just update the git submodule to the latest version.
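
For anyone hitting the same ImportError, one common way to do that from the repo root (assuming the fix is already recorded in the superproject's latest commit) is:

```bash
git pull                                   # update the superproject checkout
git submodule sync                         # pick up any changed submodule URLs
git submodule update --init --recursive    # check out the recorded submodule commits
```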

@NiuMa-1234 commented Oct 29, 2024

> Hi, we've solved the issue. You can just update the git submodule to the latest version.

Hi, I've tested the latest TEDotProductAttentionMLA, but I found that the training speed has dropped a bit (from 5.9 tokens/s to 4.4 tokens/s on an 8*8B model). Would this be normal?

I used torch.profiler and found that the main difference in training time between the two versions comes from this function: void transformer_engine::scaled_aligned_causal_masked_softmax_warp_forward<__nv_bfloat16, __nv_bfloat16, float, 13>(__nv_bfloat16*, __nv_bfloat16 const*, float, int, int, int)
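
For anyone who wants to reproduce that comparison, a small torch.profiler helper along these lines can isolate a single attention forward pass; the attention module and its inputs are placeholders supplied by the caller, not anything defined in this repo:

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def profile_attention_forward(attention_module, *inputs):
    """Profile one forward call of `attention_module` and print per-kernel CUDA times."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        with record_function("attention_forward"):
            attention_module(*inputs)
        torch.cuda.synchronize()
    # Sorting by total CUDA time surfaces kernels such as the
    # scaled_aligned_causal_masked_softmax_warp_forward one mentioned above.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```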
