
KJSL: If CPU_ARCH is not set, e.g. on Apple Silicon ARM, then use uname #180

Open

kevleyski wants to merge 10 commits into main

Conversation

kevleyski

Download script fix-up for macOS (Apple Silicon ARM): if CPU_ARCH is not set, fall back to uname.
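
For context, a minimal sketch of the fallback the title describes (the CPU_ARCH name comes from the PR title; the surrounding download.sh structure is assumed, not quoted from the actual diff):

    # If CPU_ARCH was not set by the caller (common on Apple Silicon),
    # derive it from the machine hardware name reported by the kernel.
    if [ -z "${CPU_ARCH:-}" ]; then
        CPU_ARCH="$(uname -m)"   # "arm64" on Apple Silicon, "x86_64" on Intel
    fi

On macOS, "uname -m" reports "arm64" for Apple Silicon, so the script can pick the right artifact even when CPU_ARCH was never exported beforehand.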

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on May 2, 2024.
@13230668653

Hello, I am using your code according to the llama3 requirements, but it fails to run with the error below. Could there be an issue with my environment dependencies? How should I fix it? I am using an i5-13500H CPU and running in a WSL2 environment.

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 208310) of binary: /home/qingyu/anaconda3/envs/llama3311/bin/python
Traceback (most recent call last):
  File "/home/qingyu/anaconda3/envs/llama3311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/qingyu/anaconda3/envs/llama3311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
example_chat_completion.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-08_10:09:24
  host      : chenqingyuICer.
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 208310)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 208310

@13230668653

It turns out the process is running out of memory. How much RAM does your machine have? I allocated 28GB to WSL2, but it still doesn't work.
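
For reference, exitcode -9 means the child process received SIGKILL, which on Linux/WSL2 almost always comes from the kernel OOM killer. A quick way to confirm (standard commands; the exact output varies by system):

    # Inside WSL2: look for OOM-killer messages after the failed run
    sudo dmesg | grep -iE "out of memory|killed process"

    # Check how much RAM the WSL2 VM actually sees
    free -h

If memory is the limit, the WSL2 cap can be raised with a "memory=" entry under "[wsl2]" in %UserProfile%\.wslconfig on the Windows side, followed by "wsl --shutdown" to apply it. As a rough ballpark, an 8B-parameter model in fp16 needs about 16 GB for the weights alone, before activations and runtime overhead.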
