
KJSL: If CPU_ARCH is not set, e.g. on Apple Silicon ARM, then use uname #180

Open

kevleyski wants to merge 10 commits into main

Conversation

kevleyski

Download script fix-up for macOS (Apple Silicon ARM): if CPU_ARCH is not set, fall back to uname.
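
For context, a minimal sketch of the fallback the title describes (the CPU_ARCH name comes from the PR title; the surrounding download.sh structure is assumed, not quoted from the actual diff):

    # If CPU_ARCH was not set by the caller (common on Apple Silicon),
    # derive it from the machine hardware name reported by the kernel.
    if [ -z "${CPU_ARCH:-}" ]; then
        CPU_ARCH="$(uname -m)"   # "arm64" on Apple Silicon, "x86_64" on Intel
    fi

On macOS, "uname -m" reports "arm64" for Apple Silicon, so the script can pick the right artifact even when CPU_ARCH was never exported beforehand.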

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on May 2, 2024.
@13230668653

Hello, I am using your code according to the llama3 requirements, but it fails to run with the error below. Could there be an issue with my environment dependencies? How should I fix it? I am using an i5-13500H CPU and running in a WSL2 environment.

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 208310) of binary: /home/qingyu/anaconda3/envs/llama3311/bin/python
Traceback (most recent call last):
  File "/home/qingyu/anaconda3/envs/llama3311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
example_chat_completion.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-08_10:09:24
  host      : chenqingyuICer.
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 208310)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 208310

@13230668653

It turns out the process is running out of memory. How much RAM does your machine have? I allocated 28GB to WSL2, but it still doesn't work.
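
For reference, exitcode -9 means the child process received SIGKILL, which on Linux/WSL2 almost always comes from the kernel OOM killer. A quick way to confirm (standard commands; the exact output varies by system):

    # Inside WSL2: look for OOM-killer messages after the failed run
    sudo dmesg | grep -iE "out of memory|killed process"

    # Check how much RAM the WSL2 VM actually sees
    free -h

If memory is the limit, the WSL2 cap can be raised with a "memory=" entry under "[wsl2]" in %UserProfile%\.wslconfig on the Windows side, followed by "wsl --shutdown" to apply it. As a rough ballpark, an 8B-parameter model in fp16 needs about 16 GB for the weights alone, before activations and runtime overhead.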
