_LOAD_BALANCING_LOSS returns empty list sometimes #113

Open
exnx opened this issue May 22, 2024 · 1 comment

exnx commented May 22, 2024

Hello, I am using EleutherAI's gpt-neox implementation with megablocks, but I get two errors related to _LOAD_BALANCING_LOSS.

  1. The tokens_per_expert check gives me this error at this line: ValueError: Expected 14 token_per_experts but found 7. Here's the stack trace.
  File "/home/etnguyen/test/savanna/train.py", line 10, in <module>                                                                                                                             [53/1963]
    pretrain(global_config=global_config)                                                                                                                                                                
  File "/home/etnguyen/test/savanna/savanna/training.py", line 228, in pretrain                                                                                                                          
    iteration = train(                                                                                                                                                                                   
                ^^^^^^                                                                                                                                                                                   
  File "/home/etnguyen/test/savanna/savanna/training.py", line 1004, in train                                                                                                                            
    loss_dict, skipped_iter = train_step(                                                                                                                                                                
                              ^^^^^^^^^^^                                                                                                                                                                
  File "/home/etnguyen/test/savanna/savanna/training.py", line 919, in train_step                                                                                                                        
    loss = forward_step(                                                                                                                                                                                 
           ^^^^^^^^^^^^^                                                                                                                                                                                 
  File "/home/etnguyen/test/savanna/savanna/training.py", line 515, in forward_step                                                                                                                      
    moe_loss = mb_moe_loss_func(global_config, loss_mask, outputs)[0]                                                                                                                                    
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                       
  File "/home/etnguyen/test/savanna/savanna/training.py", line 464, in mb_moe_loss_func                                                                                                                  
    lbl = moe.batched_load_balancing_loss(megablocks_args)                                                                                                                                               
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                               
  File "/home/etnguyen/.local/lib/python3.11/site-packages/megablocks/layers/moe.py", line 43, in batched_load_balancing_loss                                                                            
    raise ValueError(                                                                                                                                                                                    
ValueError: Expected 14 token_per_experts but found 7.                                                                                                                                                   
num_layers = 14                                                                                                                                                                                          
pipeline_model_parallel_size = 1                                                                                                                                                                         
num_layers_per_virtual_pipeline_stage = None  

I get this error when expert_interval=2, i.e. when the default value is used, so the number of MoE layers is actually half the number of transformer layers (14 layers, 7 megablocks layers). The error goes away when I set expert_interval=1 so that there are 14 megablocks layers for 14 layers. But I don't know the root cause of the discrepancy, especially since I do want to change expert_interval and put megablocks on every other layer.
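For reference, here is a minimal sketch of what I suspect is going on, assuming the megablocks Arguments fields shown below (the hidden sizes and expert count are placeholder values, not my actual config): batched_load_balancing_loss seems to expect one saved tokens_per_expert entry for each layer it is told about via num_layers, so when only every other layer is an MoE layer, the num_layers passed to megablocks would need to be the MoE layer count (7) rather than the transformer layer count (14).

```python
# Hedged sketch: pass the number of MoE layers (not total transformer layers)
# as num_layers to megablocks, so the number of saved tokens_per_expert
# entries matches what batched_load_balancing_loss expects.
# The hidden sizes and expert count below are placeholders, not my config.
from megablocks.layers.arguments import Arguments

num_transformer_layers = 14
expert_interval = 2                                          # MoE on every other layer
num_moe_layers = num_transformer_layers // expert_interval   # 7

megablocks_args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=8,
    num_layers=num_moe_layers,          # 7 instead of 14
    pipeline_model_parallel_size=1,
)
```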

  2. The second issue: let's say I use expert_interval=1 to get around the issue above, so every layer uses a megablocks layer. Then the next error I get is that get_load_balancing_loss occasionally returns an empty list (i.e. _LOAD_BALANCING_LOSS is empty), which then errors out. Critically, this happens partway through training, around 30 seconds in, so for some batches it's fine and returns the expected losses.
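For what it's worth, here is a rough sketch of the failure mode in my wrapper (a simplified illustration, not the actual savanna code), assuming megablocks' module-level helpers get_load_balancing_loss() and clear_load_balancing_loss():

```python
# Simplified illustration (not the actual savanna code): the loss helper
# errors out whenever nothing was saved during the forward pass, so guarding
# on an empty list at least makes the intermittent failure visible.
from megablocks.layers import moe


def moe_aux_loss(megablocks_args):
    saved = moe.get_load_balancing_loss()   # list of (tokens_per_expert, expert_scores)
    if len(saved) == 0:
        # Nothing was recorded this step (e.g. no MoE forward ran, or the
        # model was in eval mode), so batched_load_balancing_loss would fail.
        return None
    loss = moe.batched_load_balancing_loss(megablocks_args)
    moe.clear_load_balancing_loss()         # reset the buffer for the next step
    return loss
```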

Does this sound familiar to anybody? I'd very much appreciate any insights, thank you!

@mvpatel2000
Contributor

  1. Can you double-check that num_layers is appropriately passed in Arguments to dmoe/moe when expert_interval=2? It should be equal to 7.
  2. Hm... I'm a little less sure, since I'm not as familiar with Eleuther's harness. You can check out an example with LLMFoundry here if helpful. I'm not sure what's going on... it should save every time you do a forward pass 🤔. Could you ensure the model is in train mode? If the model is in eval mode, it won't save the LBL loss, and then if you try to backprop it will error (see the sketch below).
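To illustrate the train-mode point in item 2 (just a sketch; model, batch, and megablocks_args are placeholders for whatever the harness builds): megablocks only saves load-balancing statistics during training-mode forward passes, so an eval-mode forward leaves _LOAD_BALANCING_LOSS empty and a subsequent loss call will fail.

```python
# Illustrative sketch of the train/eval behaviour described above; model,
# batch, and megablocks_args are placeholders, not LLM Foundry or gpt-neox code.
import torch
from megablocks.layers import moe


def forward_with_lbl(model: torch.nn.Module, batch, megablocks_args):
    model.train()                                     # eval mode would skip saving LBL stats
    outputs = model(batch)                            # forward appends per-layer stats
    lbl = moe.batched_load_balancing_loss(megablocks_args)
    moe.clear_load_balancing_loss()                   # clear before the next iteration
    return outputs, lbl
```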
