Thanks for your awesome work! VisionLLM opens a way towards a generalist vision-and-language model.
However, the single-task vs. multi-task ablation suggests that multi-task training hurts performance. What do you think causes this? Is the training data not large enough? OFA also introduces coordinate tokens and finds that multi-task learning improves performance. Thanks in advance :)
Hi, thanks for this question, and apologies for the delayed response. Regarding the performance degradation observed in multi-task training, several factors could contribute to this result. First, we only used COCO data, which may not be enough. Second, multi-task training may require a longer training schedule to reach comparable performance. Third, sharing parameters across tasks introduces the task-interference issue.
Compared to specialized models with specific parameters for each task, generalist models with shared parameters would suffer from the task-interference issue — different tasks with shared parameters may conflict with each other [88]. The same issue is also observed in multilingual NLP models [4, 81, 83]. We argue that the task-interference issue is mainly caused by the inconsistent optimization in multi-task learning. As shown in Tab. 1, during the training phase of generalist models, the gradient directions of different tasks would be inconsistent or even opposite. Thus, if multiple tasks share parameters, the optimal update direction of the shared parameters will be uncertain, resulting in sub-optimal performance.
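To make the gradient-conflict argument concrete, here is a minimal, self-contained sketch (not from the VisionLLM codebase) of how one could measure the agreement between two tasks' gradients on shared parameters. The backbone, heads, losses, and data below are toy placeholders chosen only for illustration; the diagnostic is the cosine similarity between per-task gradients, which goes negative when the tasks pull the shared parameters in near-opposite directions.

```python
# Toy sketch of diagnosing task interference via per-task gradient directions.
# All modules and losses here are hypothetical stand-ins, not VisionLLM code.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Shared backbone with two task-specific heads.
backbone = nn.Linear(16, 32)
head_a = nn.Linear(32, 4)     # e.g. a regression-style head
head_b = nn.Linear(32, 10)    # e.g. a classification-style head

x = torch.randn(8, 16)
feat = backbone(x)

loss_a = head_a(feat).pow(2).mean()                              # toy task-A loss
loss_b = F.cross_entropy(head_b(feat), torch.randint(0, 10, (8,)))  # toy task-B loss

# Per-task gradients w.r.t. the *shared* backbone parameters only.
grads_a = torch.autograd.grad(loss_a, backbone.parameters(), retain_graph=True)
grads_b = torch.autograd.grad(loss_b, backbone.parameters())

g_a = torch.cat([g.flatten() for g in grads_a])
g_b = torch.cat([g.flatten() for g in grads_b])

# Cosine similarity < 0 indicates conflicting update directions for the
# shared parameters, i.e. the task-interference situation described above.
cos = F.cosine_similarity(g_a, g_b, dim=0)
print(f"gradient cosine similarity on shared params: {cos.item():.3f}")
```

Tracking this statistic over training batches is one simple way to check whether two tasks are actually interfering, rather than inferring it only from the final accuracy gap.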