
cannot import name 'load_framework_extension' from 'transformer_engine.common' #2515


Description

@LiYufengzz

Describe the bug
First, I installed TE with pip install --no-build-isolation transformer_engine[pytorch] and tried to train my model with megatron-swift. I got this error:
RuntimeError: /TransformerEngine/transformer_engine/common/transformer_engine.cpp:314 in function Allocate: Cannot allocate a new NVTETensor. Maximum number of tensors reached: 70849. There is probably a memory leak in your application.
Following https://github.com/NVIDIA/TransformerEngine/issues/2189, I changed the code in transformer_engine.cpp and installed TE from source. The build succeeded, but during model training I got this error:
ImportError: cannot import name 'load_framework_extension' from 'transformer_engine.common' (unknown location)
I can't find a solution and would appreciate any help.
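
My guess (unconfirmed) is that the (unknown location) part of the ImportError means Python is picking up a stale transformer_engine.common left behind by the earlier pip install, or a transformer_engine directory that shadows the source build. Below is a minimal diagnostic sketch I can run to see which installation is actually being imported; load_framework_extension is only referenced because it appears in the error, and I'm not assuming anything about what it does:

```python
# Diagnostic sketch (assumption: a leftover pip install may be shadowing the
# source build). It only reports what Python imports; it does not fix anything.
import importlib

te = importlib.import_module("transformer_engine")
common = importlib.import_module("transformer_engine.common")

print("transformer_engine loaded from:",
      getattr(te, "__file__", None) or list(te.__path__))
print("transformer_engine.common loaded from:",
      getattr(common, "__file__", None) or list(common.__path__))
print("has load_framework_extension:",
      hasattr(common, "load_framework_extension"))
print("TE version:", getattr(te, "__version__", "unknown"))
```

If the printed paths point at the old pip-installed package rather than the source build, I assume the next step would be to uninstall every transformer_engine-related pip package and reinstall from source (the exact package names to remove may vary by TE version).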


Environment overview (please complete the following information)

  • Environment location: not specified
  • Method of Transformer Engine install: initially pip install --no-build-isolation transformer_engine[pytorch], then from source (after modifying transformer_engine.cpp as in issue #2189)

Environment details


  • OS version: Ubuntu 22.04
  • PyTorch version: 2.8.0
  • Python version: 3.12
  • Transformer Engine version: 2.10, built from source
  • CUDA version: 12.8
  • CUDNN version: not specified
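
I didn't record the cuDNN version; here is a small sketch (assuming the relevant cuDNN is the one PyTorch was built against) to print it together with the versions above:

```python
# Prints the versions reported above plus the cuDNN version, assuming the
# cuDNN that matters is the one PyTorch was built against.
import sys
import torch
import transformer_engine

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN (torch build):", torch.backends.cudnn.version())
print("Transformer Engine:", getattr(transformer_engine, "__version__", "unknown"))
```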

Device details

  • GPU model: A100


Labels: bug