Is LLaMA3.2-Vision-90B/11B result on mmmu_val reproducible? #2377
Comments
I added one argument, `--apply_chat_template`, and the accuracy increased to 54.78%, but that is still short of Meta's claim on the Hugging Face repo (60.3%). The command I ran this time was along the lines of the sketch below.
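A minimal sketch of such an invocation, assuming the `llama_vision` backend in lmms-eval and the 90B Instruct checkpoint on Hugging Face; everything beyond `--apply_chat_template` and `--tasks mmmu_val` is an assumption, not the commenter's exact command:

```bash
# Hypothetical reconstruction; adjust model path and flags to your setup.
python3 -m lmms_eval \
    --model llama_vision \
    --model_args pretrained=meta-llama/Llama-3.2-90B-Vision-Instruct,device_map=auto \
    --tasks mmmu_val \
    --batch_size 1 \
    --apply_chat_template \
    --log_samples \
    --output_path ./logs/
```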
For the 11B Vision base model, I got some numbers that were pretty off too.
Meta presumably uses some secret prompting to boost the accuracy. Adding `--apply_chat_template` is part of it, but not enough.
For the above, I was using the base model. With `--apply_chat_template`, I got 0.4722 for Llama 11B Instruct with `max_images=5` (for GPU memory reasons). Let me know if you get something similar!
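In case it helps others reproduce this: the image cap can be passed through `--model_args`. A sketch, again assuming the `llama_vision` backend (the `max_images` name is taken from the comment above; the remaining flags are illustrative):

```bash
# Illustrative only; max_images caps the number of images fed per sample.
python3 -m lmms_eval \
    --model llama_vision \
    --model_args pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct,max_images=5 \
    --tasks mmmu_val \
    --batch_size 1 \
    --apply_chat_template
```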
0.4722 is a reasonable number; you can check https://github.com/jybbjybb/llama_quant/blob/main/LLaMA3.2.md for detailed results.
I have tested LLaMA3.2-Vision-90B-Instruct on the mmmu_val task. The accuracy is 0.43, but Meta's Hugging Face model card reports 60.3. I think Meta's result uses CoT, while this mmmu_val setup may not. Can CoT alone explain the 60 - 43 = 17 point difference?
It takes 11 hours on 4xA100 GPUs to finish these 900 questions, i.e. roughly 44 seconds per question. Is that a reasonable time?
The command I use to run the test is roughly as follows.
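A sketch of that baseline run, assuming the `llama_vision` backend sharded across the 4 GPUs via `device_map=auto`; note the absence of `--apply_chat_template`, which the comments above suggest accounts for part of the gap:

```bash
# Hypothetical baseline command, without --apply_chat_template.
python3 -m lmms_eval \
    --model llama_vision \
    --model_args pretrained=meta-llama/Llama-3.2-90B-Vision-Instruct,device_map=auto \
    --tasks mmmu_val \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```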