Description
Describe the bug
First, I installed TE with pip install --no-build-isolation transformer_engine[pytorch] and tried to train my model with megatron-swift. I got this error:
RuntimeError: /TransformerEngine/transformer_engine/common/transformer_engine.cpp:314 in function Allocate: Cannot allocate a new NVTETensor. Maximum number of tensors reached: 70849. There is probably a memory leak in your application.
Following https://github.com/NVIDIA/TransformerEngine/issues/2189, I changed the code in transformer_engine.cpp and installed TE from source. The installation succeeded, but during model training I got this error:
ImportError: cannot import name 'load_framework_extension' from 'transformer_engine.common' (unknown location)
I can't find a solution for this. I hope you can help me.
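The "(unknown location)" part of the ImportError typically means Python resolved transformer_engine.common to a package with no loadable __init__.py, for example a leftover directory from an earlier install or a source checkout shadowing the installed package. The following is a minimal diagnostic sketch, not part of Transformer Engine: the helper names locate_package and subpackage_inits are hypothetical, and it uses only the standard library to report where Python would load the package from without importing it.

```python
import importlib.util
import os


def locate_package(name):
    """Report where Python would load top-level package `name` from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "origin": None, "locations": [], "is_namespace": False}
    locations = list(spec.submodule_search_locations or [])
    return {
        "found": True,
        "origin": spec.origin,  # path to __init__.py for a regular package
        "locations": locations,  # directories searched for subpackages
        # A namespace package has search locations but no __init__.py origin.
        "is_namespace": bool(locations) and spec.origin in (None, "namespace"),
    }


def subpackage_inits(locations, sub):
    """For each package directory, check whether `<dir>/<sub>/__init__.py` exists."""
    return {loc: os.path.isfile(os.path.join(loc, sub, "__init__.py")) for loc in locations}


if __name__ == "__main__":
    info = locate_package("transformer_engine")
    print(info)
    # Assumption: the failing import targets the `common` subpackage, as in the traceback.
    print(subpackage_inits(info["locations"], "common"))
```

Running this from the same working directory and environment as the training job should show whether transformer_engine resolves to site-packages or to a local directory, and whether common/__init__.py is present there. If the reported location is the source checkout or a stale directory, that would be consistent with the error above, though it does not confirm the cause.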
Environment overview (please complete the following information)
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)]
- Method of Transformer Engine install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide the docker pull & docker run commands used
Environment details
If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version: Ubuntu 22.04
- PyTorch version: 2.8.0
- Python version: 3.12
- Transformer Engine version: 2.10 (built from source)
- CUDA version: 12.8
- CUDNN version
Device details
- GPU model: A100
Additional context
Add any other context about the problem here.