This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Cannot train Seeker with batch size > 1 #4531

@zhangmozhi

Description

Bug description
It seems that the Seeker training command does not support batch size > 1. I ran into an FSDP error when training Seeker-400M with -bs 2.

Reproduction steps

python -m parlai.scripts.multiprocessing_train \
--task projects.seeker.tasks.knowledge,projects.seeker.tasks.dialogue,projects.seeker.tasks.search_query \
--multitask-weights 2,2,1 -bs 2 -vstep 1000 -vmt ppl -vp 5 -vmm min -vme 100000 -lstep 50 \
--init-opt arch/r2c2_base_400M --init-model zoo:seeker/r2c2_base_400M/model \
--model projects.seeker.agents.seeker:ComboFidGoldDocumentAgent --n-docs 5 \
--text-truncate 1000 --label-truncate 128 --truncate 1000 \
--fp16 True -lr 1e-06 --lr-scheduler reduceonplateau --optimizer adamw --save-after-valid True \
--warmup-updates 100 --update-freq 1 --gradient-clip 1.0 --skip-generation True --dropout 0.1 \
--attention-dropout 0.0 --load-from-checkpoint true --ddp-backend zero2 \
--checkpoint-activations true --model-file /tmp/my_seeker_dialogue_model
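
If the immediate goal is just a larger effective batch, one possible workaround (untested here, and assuming the failure is specific to a per-worker batch size > 1) is to keep -bs 1 and accumulate gradients instead, swapping the flags above for:

-bs 1 --update-freq 2

The effective batch per optimizer update stays the same as with -bs 2 --update-freq 1.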

Expected behavior
I expected training to succeed.

Logs
Please paste the command line output:

Asserting FSDP instance is: FullyShardedDataParallel(
  world_size=8, flatten_parameters=True, mixed_precision=True, 
  (_fsdp_wrapped_module): FlattenParamsWrapper(
    (_fpw_module): TransformerEncoderLayer_Swappable(
      (attention): MultiHeadAttention(
        (attn_dropout): Dropout(p=0.0, inplace=False)
        (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
        (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
        (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
        (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
      )
      (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (ffn): TransformerFFN(
        (relu_dropout): Dropout(p=0, inplace=False)
        (lin1): Linear(in_features=1024, out_features=4096, bias=True)
        (lin2): Linear(in_features=4096, out_features=1024, bias=True)
      )
      (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
)
ERROR: expected to be in states [<TrainingState.BACKWARD_POST: 4>] but current state is TrainingState.BACKWARD_PRE
2022-04-30 23:01:08,795 CRITICAL | Traceback (most recent call last):
  File "/data/kai/ParlAI/parlai/scripts/multiprocessing_train.py", line 45, in multiprocess_train
    return single_train.TrainLoop(opt).train()
  File "/data/kai/ParlAI/parlai/scripts/train_model.py", line 1000, in train
    for _train_log in self.train_steps():
  File "/data/kai/ParlAI/parlai/scripts/train_model.py", line 907, in train_steps
    world.parley()
  File "/data/kai/ParlAI/parlai/core/worlds.py", line 880, in parley
    batch_act = self.batch_act(agent_idx, batch_observations[agent_idx])
  File "/data/kai/ParlAI/parlai/core/worlds.py", line 848, in batch_act
    batch_actions = a.batch_act(batch_observation)
  File "/data/kai/ParlAI/parlai/agents/fid/fid.py", line 389, in batch_act
    batch_reply = super().batch_act(observations)
  File "/data/kai/ParlAI/parlai/core/torch_agent.py", line 2238, in batch_act
    output = self.train_step(batch)
  File "/data/kai/ParlAI/parlai/core/torch_generator_agent.py", line 736, in train_step
    self.backward(loss)
  File "/data/kai/ParlAI/parlai/core/torch_agent.py", line 2324, in backward
    self.optimizer.backward(loss, update_main_grads=False)
  File "/data/kai/ParlAI/parlai/utils/fp16.py", line 194, in backward
    loss.backward()
  File "/data/kai/miniconda3/envs/parlai/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/data/kai/miniconda3/envs/parlai/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f0ce3a8e9e0> returned NULL without setting an error
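
For anyone triaging: the assertion text matches the training-state check in fairscale's FSDP backward hooks, and the command enables both --ddp-backend zero2 (FSDP) and --checkpoint-activations true, so their interaction seems a plausible suspect. The sketch below is a hypothetical, minimal single-process illustration of the same wrapping pattern (FSDP around an activation-checkpointed layer); the layer shapes are made up and it is not a verified reproduction of this crash:

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.checkpoint import checkpoint_wrapper

def main():
    # Single-process "distributed" setup so FSDP can initialize (world_size=1).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    # Toy FFN layer, activation-checkpointed then FSDP-wrapped, mirroring the
    # FullyShardedDataParallel(FlattenParamsWrapper(...)) nesting in the log.
    layer = checkpoint_wrapper(
        nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
    )
    model = FSDP(layer, flatten_parameters=True)

    x = torch.randn(2, 1024)  # batch size 2, as in the failing run
    loss = model(x).sum()
    loss.backward()  # in the real model, this is where the state assertion fired
    dist.destroy_process_group()

if __name__ == "__main__":
    main()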

Additional context
Not sure if this is a bug or a feature request.
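
If it helps to narrow things down, two diagnostics (suggested, not run here) would be to retry the exact command with --checkpoint-activations false, and separately with --ddp-backend ddp, to see whether batch size > 1 works once activation checkpointing or FSDP is taken out of the picture.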

Labels

donotreap: avoid automatically marking as stale.
