
[BUG] Nanotron batch detection doesn't work #286

Open · Labels: bug (Something isn't working) · 0 comments

hynky1999 (Collaborator) commented on Sep 3, 2024

**Describe the bug**

Running the nanotron backend with `batch_size = 0` causes lighteval to crash during automatic batch-size detection.
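For context, `batch_size = 0` is supposed to trigger auto-detection, which probes decreasing batch sizes until a forward pass fits in memory. A minimal sketch of that pattern (function names and the halving strategy are assumptions, not lighteval's exact code, which wraps the probe in a decorator in `utils/parallelism.py`):

```python
# Minimal sketch of the probe loop implied by the log below ("Detecting
# largest batch size", "Testing batch size 512"); names and the halving
# strategy are assumptions, not lighteval's exact implementation.
import torch

def find_executable_batch_size(forward_batch, starting_batch_size: int = 512) -> int:
    """Return the largest batch size whose probe forward pass fits in memory."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            forward_batch(batch_size)  # one throwaway forward pass at this size
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # release the failed allocation
            batch_size //= 2           # halve and retry
    raise RuntimeError("no executable batch size found")
```

Note that only OOMs are retried; any other exception raised by the probe, such as the `TypeError` in the log below, propagates immediately.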

```
(lighteval-main) hynek_kydlicek@ip-26-0-162-233:/fsx/hynek_kydlicek/projects/lighteval-main-branch$ torchrun --standalone --nnodes=1 --nproc-per-node=1  src/lighteval/__main__.py nanotron --checkpoint_config_path ./nanotron/checkpoints/0/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
WARNING:lighteval.logging.hierarchical_logger:main: (0, './nanotron/checkpoints/0/config.yaml'), (1, 'examples/nanotron/lighteval_config_override_template.yaml'), (2, '/fsx/hynek_kydlicek/.cache/huggingface'),  {
WARNING:lighteval.logging.hierarchical_logger:  Load nanotron config {
skip_unused_config_keys set
Skip_null_keys set
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:00.013603]
WARNING:lighteval.logging.hierarchical_logger:  WARNING: --max_samples WAS SET. THESE NUMBERS ARE ONLY PARTIAL AND SHOULD NOT BE USED FOR COMPARISON UNLESS YOU KNOW WHAT YOU ARE DOING.
WARNING:lighteval.logging.hierarchical_logger:  Test all gather {
WARNING:lighteval.logging.hierarchical_logger:    Test gather tensor
WARNING:lighteval.logging.hierarchical_logger:[TEST] Running NCCL sync for ranks [0]
WARNING:lighteval.logging.hierarchical_logger:[TEST] NCCL sync for ranks [0]
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:00.661526]
WARNING:lighteval.logging.hierarchical_logger:  Model loading {
/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
WARNING:lighteval.models.nanotron_model:Building model
WARNING:lighteval.models.nanotron_model:Sanity checks on model
WARNING:lighteval.models.nanotron_model:Loading checkpoint from ./nanotron/checkpoints/0:
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 1288.92it/s]
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:00.361026]
WARNING:lighteval.logging.hierarchical_logger:  Tasks loading {
WARNING:lighteval.logging.hierarchical_logger:    If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`.
WARNING:lighteval.logging.hierarchical_logger:    gsm8k main
WARNING:lighteval.logging.hierarchical_logger:    Loading documents, and requests
Token indices sequence length is longer than the specified maximum sequence length for this model (985 > 256). Running this sequence through the model will result in indexing errors
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:01.286350]
WARNING:lighteval.logging.hierarchical_logger:  Setting seeds and waiting for all processes {
WARNING:lighteval.logging.hierarchical_logger:    setting seed to 1234 for random and numpy
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:00.000133]
WARNING:lighteval.logging.hierarchical_logger:  Evaluation {
WARNING:lighteval.logging.hierarchical_logger:    Evaluate on 1 tasks.
WARNING:lighteval.logging.hierarchical_logger:    Running RequestType.GREEDY_UNTIL requests
WARNING:lighteval.logging.hierarchical_logger:    You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring.
greedy -- Node 0:   0%|                                                                                                                                                                                                                                                                                    | 0/1 [00:00<?, ?it/s]WARNING:lighteval.models.nanotron_model:Detecting largest batch size
WARNING:lighteval.models.nanotron_model:Testing batch size 512
greedy -- Node 0:   0%|                                                                                                                                                                                                                                                                                    | 0/1 [00:00<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:00.164193]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:02.496358]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/__main__.py", line 93, in <module>
[rank0]:     cli_evaluate()
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/__main__.py", line 63, in cli_evaluate
[rank0]:     main_nanotron(args.checkpoint_config_path, args.lighteval_config_path, args.cache_dir)
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/main_nanotron.py", line 97, in main
[rank0]:     pipeline.evaluate()
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/pipeline.py", line 235, in evaluate
[rank0]:     sample_id_to_responses = self._run_model()
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/pipeline.py", line 264, in _run_model
[rank0]:     responses = run_model(requests, override_bs=self.pipeline_parameters.override_batch_size)
[rank0]:   File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 1149, in greedy_until
[rank0]:     batch_size = self._get_batch_size(
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 320, in _get_batch_size
[rank0]:     batch_size = forward_batch()
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/utils/parallelism.py", line 104, in decorator
[rank0]:     return function(batch_size, *args, **kwargs)
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 317, in forward_batch
[rank0]:     F.log_softmax(self._model_call(test_batch).float(), dim=-1).cpu()
[rank0]:   File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 342, in _model_call
[rank0]:     return self.model(inputs)
[rank0]:   File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]: TypeError: LlamaModel.forward() missing 1 required positional argument: 'input_mask'
E0903 13:22:41.743000 140200056006464 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1010958) of binary: /fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/bin/python
Traceback (most recent call last):
  File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/lighteval/__main__.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-03_13:22:41
  host      : ip-26-0-162-233.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1010958)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
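The failure is not an OOM: per the traceback, the probe forward pass reaches `_model_call` (`src/lighteval/models/nanotron_model.py:342`), which calls `self.model(inputs)`, while nanotron's `LlamaModel.forward` also requires an `input_mask` argument. A minimal sketch of a plausible fix, assuming an all-ones mask is acceptable for the probe batch (this is an assumption, not a confirmed patch):

```python
# Hypothetical patch for _model_call in src/lighteval/models/nanotron_model.py.
# LlamaModel.forward needs input_mask as well as the token ids; building a
# full-attention mask here is an assumption (the probe batch has no padding).
import torch

def _model_call(self, inputs: torch.Tensor) -> torch.Tensor:
    input_mask = torch.ones_like(inputs, dtype=torch.bool)  # assumed: no padding
    return self.model(input_ids=inputs, input_mask=input_mask)
```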

**To Reproduce**

```
torchrun --standalone --nnodes=1 --nproc-per-node=1 src/lighteval/__main__.py nanotron --checkpoint_config_path ./nanotron/checkpoints/0/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
```
where you set the batch size to `0` in the config.
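For concreteness, the override might look like this (a sketch only; the actual key layout of `examples/nanotron/lighteval_config_override_template.yaml` may differ):

```yaml
# Hypothetical excerpt of the lighteval config override; key name assumed.
batch_size: 0  # 0 should trigger automatic batch-size detection
```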

**Expected behavior**

The batch size is detected correctly and the run finishes.

**Version info**

```
git+ssh://[email protected]/huggingface/lighteval.git@80b460f496e729077850f379d40da88298489a8f#egg=lighteval
```
