Hello, I am using EleutherAI's gpt-neox implementation with megablocks, but I get two errors related to _LOAD_BALANCING_LOSS.
The tokens_per_expert check gives me this error at this line: ValueError: Expected 14 token_per_experts but found 7. Here's the stack trace.
File "/home/etnguyen/test/savanna/train.py", line 10, in <module> [53/1963]
pretrain(global_config=global_config)
File "/home/etnguyen/test/savanna/savanna/training.py", line 228, in pretrain
iteration = train(
^^^^^^
File "/home/etnguyen/test/savanna/savanna/training.py", line 1004, in train
loss_dict, skipped_iter = train_step(
^^^^^^^^^^^
File "/home/etnguyen/test/savanna/savanna/training.py", line 919, in train_step
loss = forward_step(
^^^^^^^^^^^^^
File "/home/etnguyen/test/savanna/savanna/training.py", line 515, in forward_step
moe_loss = mb_moe_loss_func(global_config, loss_mask, outputs)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/etnguyen/test/savanna/savanna/training.py", line 464, in mb_moe_loss_func
lbl = moe.batched_load_balancing_loss(megablocks_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/etnguyen/.local/lib/python3.11/site-packages/megablocks/layers/moe.py", line 43, in batched_load_balancing_loss
raise ValueError(
ValueError: Expected 14 token_per_experts but found 7.
num_layers = 14
pipeline_model_parallel_size = 1
num_layers_per_virtual_pipeline_stage = None
I get this error when expert_interval=2, i.e. when the default value is used, so the number of MegaBlocks layers is half the number of transformer layers (14 layers, 7 MegaBlocks layers). The error goes away when I set expert_interval=1 so that there are 14 MegaBlocks layers for the 14 layers. But I don't know the root cause of the discrepancy, especially since I ultimately want to keep expert_interval=2 and place a MegaBlocks layer on every other layer.
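For reference, this is my rough understanding of the bookkeeping that trips the check (just a diagnostic sketch, not code from moe.py; megablocks_args is the Arguments object built in mb_moe_loss_func, and the expected-count formula is my guess based on the error message):

```python
from megablocks.layers import moe

def check_lbl_bookkeeping(megablocks_args):
    """Diagnostic sketch: compare saved per-layer LBL entries vs. what the check expects."""
    # Each MegaBlocks MoE layer appends one entry per forward pass, so with
    # expert_interval=2 only 7 of the 14 transformer layers contribute.
    saved = moe.get_load_balancing_loss()
    expected = megablocks_args.num_layers // megablocks_args.pipeline_model_parallel_size
    print(f"saved {len(saved)} load-balancing entries, expected {expected}")
```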
The second issue: let's say I do use expert_interval=1 to get around the issue above, so every layer uses a MegaBlocks layer. The next error I get is that get_load_balancing_loss occasionally returns an empty list, i.e. _LOAD_BALANCING_LOSS is empty, which then errors out. Critically, this happens partway through training, roughly 30 seconds in, so on some batches it's fine and returns the expected losses.
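In case it's useful, this is roughly how I've been observing it; the wrapper below is a hypothetical helper I added while debugging, not part of megablocks or my real forward_step:

```python
from megablocks.layers import moe

def lbl_or_none(megablocks_args):
    """Hypothetical debugging wrapper: report when no per-layer losses were saved."""
    if len(moe.get_load_balancing_loss()) == 0:
        # Intermittently empty ~30s into training; skipping here only hides
        # the symptom, it doesn't explain why nothing was saved.
        return None
    return moe.batched_load_balancing_loss(megablocks_args)
```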
Does this sound familiar to anybody? I'd very much appreciate any insights, thank you!
Can you double-check that num_layers is appropriately passed in the megablocks Arguments to dmoe/moe when expert_interval=2? It should be equal to 7.
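i.e. something along these lines when constructing the Arguments object (just a sketch; every field other than num_layers is a placeholder for whatever your config actually uses):

```python
from megablocks.layers.arguments import Arguments

megablocks_args = Arguments(
    hidden_size=1024,               # placeholder model dims
    ffn_hidden_size=4096,
    moe_num_experts=8,              # placeholder MoE config
    moe_top_k=1,
    num_layers=7,                   # number of MegaBlocks (MoE) layers, not the 14 transformer layers
    pipeline_model_parallel_size=1,
)
```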
Hm... I'm a little less sure since I'm not as familiar with Eleuther's harness. You can check out an example with LLMFoundry here if helpful.

I'm not sure what's going on here... it should save every time you do a forward pass 🤔. Could you ensure the model is in train mode? If the model is in eval mode, it won't save the LBL loss, and then it will error when you try to backprop.
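A minimal sketch of the pattern I mean (your forward_step in savanna will look different; batched_load_balancing_loss and clear_load_balancing_loss are the helpers in megablocks.layers.moe):

```python
from megablocks.layers import moe

def forward_step_sketch(model, batch, megablocks_args):
    # MoE layers only save their load-balancing stats during training-mode
    # forward passes, so an eval-mode forward leaves _LOAD_BALANCING_LOSS empty.
    model.train()
    output = model(batch)

    lbl = moe.batched_load_balancing_loss(megablocks_args)

    # Clear the saved per-layer stats so they don't leak into the next step.
    moe.clear_load_balancing_loss()
    return output, lbl
```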