[docs] use nn.module instead of tensor as model #3157

Open
wants to merge 4 commits into
base: main

Conversation

faaany
Contributor

@faaany faaany commented Oct 11, 2024

What does this PR do?

When running the self-contained example on CUDA or XPU, I get the following error:

initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.], device='cuda:0')
Traceback (most recent call last):
  File "/mnt/disk4/fanlilin/workspace/test.py", line 35, in <module>
    outputs = inputs @ model
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

The reason is that accelerator.prepare() does not move a plain tensor to the device, so the tensor used as the model stays on the CPU while the inputs are on cuda:0. The model should be a torch.nn.Module instead.

This PR updates the example code accordingly.
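
For illustration, here is a minimal sketch of the idea, not the exact code in the diff. It wraps the single weight in a torch.nn.Module so that the weight moves together with the module. The class name `SimpleLinearModel` is hypothetical, and the sketch calls `.to(device)` directly rather than going through accelerator.prepare(), so it runs with PyTorch alone:

```python
import torch


class SimpleLinearModel(torch.nn.Module):
    """Hypothetical one-weight model; the name is an assumption for illustration."""

    def __init__(self):
        super().__init__()
        # Registering the weight as a Parameter means it follows the module
        # whenever the module is moved to another device.
        self.weight = torch.nn.Parameter(torch.zeros((1, 1)))

    def forward(self, inputs):
        return inputs @ self.weight


# Fall back to CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = SimpleLinearModel().to(device)
inputs = torch.tensor([1.0, 2.0], device=device).view(-1, 1)  # shape (2, 1)
outputs = model(inputs)  # inputs and weight now live on the same device

print(f"initial model weight is {model.weight.mean().item():.5f}")
print(outputs.device == model.weight.device)
```

A bare `torch.zeros((1, 1), requires_grad=True)` has no module around it, which is why it stays wherever it was created unless it is moved explicitly.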

After the update, the output is:

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.], device='cuda:0')
1 tensor([3., 4.], device='cuda:0')
2 tensor([5., 6.], device='cuda:0')
3 tensor([7., 8.], device='cuda:0')
w/ accumulation, the final model weight is 2.04000
w/o accumulation, the final model weight is 2.04000
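
As a sanity check on why the two numbers agree, the following is a rough sketch in plain PyTorch, without Accelerate. The inputs 1 to 8 match the batches in the log; the labels `y = 2x`, the learning rate `0.02`, and the single SGD step are assumptions made for this sketch. Under those assumptions the full-batch MSE gradient at weight 0 is -102, so one step gives 0.02 * 102 = 2.04:

```python
import torch

# Inputs 1..8 as in the log above; labels y = 2x and lr = 0.02 are assumptions.
x = torch.arange(1.0, 9.0).view(-1, 1)  # shape (8, 1)
y = 2.0 * x
accumulation_steps = 4
lr = 0.02
criterion = torch.nn.MSELoss()

# With accumulation: four micro-batches of two samples, then one optimizer step.
w_acc = torch.nn.Parameter(torch.zeros((1, 1)))
opt_acc = torch.optim.SGD([w_acc], lr=lr)
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    # Scale each micro-batch loss so the summed gradients equal the full-batch mean.
    loss = criterion(xb @ w_acc, yb) / accumulation_steps
    loss.backward()  # gradients add up in w_acc.grad across iterations
opt_acc.step()

# Without accumulation: one full batch, one optimizer step.
w_full = torch.nn.Parameter(torch.zeros((1, 1)))
opt_full = torch.optim.SGD([w_full], lr=lr)
criterion(x @ w_full, y).backward()
opt_full.step()

print(f"w/ accumulation, the final model weight is {w_acc.item():.5f}")
print(f"w/o accumulation, the final model weight is {w_full.item():.5f}")
```

Dividing each micro-batch loss by the number of accumulation steps is what makes the accumulated gradient match the full-batch one.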

Collaborator

@muellerzr muellerzr left a comment

Thanks for fixing!

@faaany
Contributor Author

faaany commented Oct 15, 2024

I don't think the CI failure is caused by my change. Could you please retrigger the CI? Thanks a lot! @muellerzr

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

3 participants