Cannot train Seeker with batch size > 1 #4531
Labels
donotreap (Avoid automatically marking as stale.)
Description
Bug description
It seems that the Seeker training command does not support a batch size > 1. I ran into an FSDP error when training Seeker-400M with -bs 2.
Reproduction steps
python -m parlai.scripts.multiprocessing_train \
--task projects.seeker.tasks.knowledge,projects.seeker.tasks.dialogue,projects.seeker.tasks.search_query \
--multitask-weights 2,2,1 -bs 2 -vstep 1000 -vmt ppl -vp 5 -vmm min -vme 100000 -lstep 50 \
--init-opt arch/r2c2_base_400M --init-model zoo:seeker/r2c2_base_400M/model \
--model projects.seeker.agents.seeker:ComboFidGoldDocumentAgent --n-docs 5 \
--text-truncate 1000 --label-truncate 128 --truncate 1000 \
--fp16 True -lr 1e-06 --lr-scheduler reduceonplateau --optimizer adamw --save-after-valid True \
--warmup-updates 100 --update-freq 1 --gradient-clip 1.0 --skip-generation True --dropout 0.1 \
--attention-dropout 0.0 --load-from-checkpoint true --ddp-backend zero2 \
--checkpoint-activations true --model-file /tmp/my_seeker_dialogue_model
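For reference, the same reproduction can also be launched from Python. This is only a sketch: it assumes ParlAI's MultiProcessTrain script class exposes the usual ParlaiScript.main(**kwargs) entry point and that the shell flags map to keyword arguments by the standard dash-to-underscore convention; only the options most relevant to the failure are shown.

# Hypothetical Python-API equivalent of the shell command above (sketch only).
from parlai.scripts.multiprocessing_train import MultiProcessTrain

MultiProcessTrain.main(
    task='projects.seeker.tasks.knowledge,projects.seeker.tasks.dialogue,projects.seeker.tasks.search_query',
    model='projects.seeker.agents.seeker:ComboFidGoldDocumentAgent',
    init_opt='arch/r2c2_base_400M',
    init_model='zoo:seeker/r2c2_base_400M/model',
    batchsize=2,                  # the setting that triggers the FSDP error
    ddp_backend='zero2',
    checkpoint_activations=True,
    fp16=True,
    model_file='/tmp/my_seeker_dialogue_model',
)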
Expected behavior
I expected training to succeed with a batch size of 2.
Logs
Command line output:
Asserting FSDP instance is: FullyShardedDataParallel(
  world_size=8, flatten_parameters=True, mixed_precision=True,
  (_fsdp_wrapped_module): FlattenParamsWrapper(
    (_fpw_module): TransformerEncoderLayer_Swappable(
      (attention): MultiHeadAttention(
        (attn_dropout): Dropout(p=0.0, inplace=False)
        (q_lin): Linear(in_features=1024, out_features=1024, bias=True)
        (k_lin): Linear(in_features=1024, out_features=1024, bias=True)
        (v_lin): Linear(in_features=1024, out_features=1024, bias=True)
        (out_lin): Linear(in_features=1024, out_features=1024, bias=True)
      )
      (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (ffn): TransformerFFN(
        (relu_dropout): Dropout(p=0, inplace=False)
        (lin1): Linear(in_features=1024, out_features=4096, bias=True)
        (lin2): Linear(in_features=4096, out_features=1024, bias=True)
      )
      (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
)
ERROR: expected to be in states [<TrainingState.BACKWARD_POST: 4>] but current state is TrainingState.BACKWARD_PRE
2022-04-30 23:01:08,795 CRITICAL | Traceback (most recent call last):
  File "/data/kai/ParlAI/parlai/scripts/multiprocessing_train.py", line 45, in multiprocess_train
    return single_train.TrainLoop(opt).train()
  File "/data/kai/ParlAI/parlai/scripts/train_model.py", line 1000, in train
    for _train_log in self.train_steps():
  File "/data/kai/ParlAI/parlai/scripts/train_model.py", line 907, in train_steps
    world.parley()
  File "/data/kai/ParlAI/parlai/core/worlds.py", line 880, in parley
    batch_act = self.batch_act(agent_idx, batch_observations[agent_idx])
  File "/data/kai/ParlAI/parlai/core/worlds.py", line 848, in batch_act
    batch_actions = a.batch_act(batch_observation)
  File "/data/kai/ParlAI/parlai/agents/fid/fid.py", line 389, in batch_act
    batch_reply = super().batch_act(observations)
  File "/data/kai/ParlAI/parlai/core/torch_agent.py", line 2238, in batch_act
    output = self.train_step(batch)
  File "/data/kai/ParlAI/parlai/core/torch_generator_agent.py", line 736, in train_step
    self.backward(loss)
  File "/data/kai/ParlAI/parlai/core/torch_agent.py", line 2324, in backward
    self.optimizer.backward(loss, update_main_grads=False)
  File "/data/kai/ParlAI/parlai/utils/fp16.py", line 194, in backward
    loss.backward()
  File "/data/kai/miniconda3/envs/parlai/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/data/kai/miniconda3/envs/parlai/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f0ce3a8e9e0> returned NULL without setting an error
Additional context
Not sure if this is a bug or a feature request.
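For what it's worth, the "expected to be in states ..." message in the log above appears to come from fairscale's FullyShardedDataParallel state check, which fires inside an autograd hook during the backward pass (which would explain why it surfaces as an opaque SystemError from run_backward rather than a normal Python exception). A paraphrased sketch of what that check does, not the exact upstream source:

# Paraphrased sketch of fairscale's FSDP state assertion (an assumption, not
# the exact library code). The instance repr and the "ERROR: expected to be
# in states ..." line in the log above match this pattern.
def assert_state(self, states):
    if not isinstance(states, list):
        states = [states]
    if self.training_state not in states:
        msg = f"expected to be in states {states} but current state is {self.training_state}"
        # Printed as well as raised, since the error occurs inside a backward hook.
        print(f"Asserting FSDP instance is: {self}")
        print(f"ERROR: {msg}")
        raise ValueError(msg)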