Is LLaMA3.2-Vision-90B/11B result on mmmu_val reproducible? #2377

Open · jybbjybb opened this issue Oct 2, 2024 · 5 comments
Labels: validation (For validation of task implementations.)

jybbjybb commented Oct 2, 2024:

I have tested Llama-3.2-90B-Vision-Instruct on the task mmmu_val; the result is as follows. The accuracy is 0.43, but Meta's Hugging Face model card reports 60.3. I think Meta's result uses CoT, while this mmmu_val task may not. Can CoT explain this 60.3 − 43 ≈ 17-point difference?

It takes 11 hours on 4xA100 GPUs to finish these 900 questions. Is that a reasonable time?

The command I use to run the test is:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m lm_eval --model hf-multimodal --model_args pretrained=/mnt/LLM_checkpoints/Llama-3.2-90B-Vision-Instruct/,parallelize=True --tasks mmmu_val  --batch_size 1
|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|------|------:|------|-----:|------|----:|-----:|
|mmmu_val|0|none| |acc|0.4300|± 0.0163|
| - Art and Design|0|none| |acc|0.5333|± 0.0454|
| - Business|0|none| |acc|0.3467|± 0.0393|
| - Health and Medicine|0|none| |acc|0.5067|± 0.0412|
| - Humanities and Social Science|0|none| |acc|0.5750|± 0.0451|
| - Science|0|none| |acc|0.3467|± 0.0392|
| - Tech and Engineering|0|none| |acc|0.3524|± 0.0329|
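
As a rough sanity check on the runtime (my own back-of-the-envelope arithmetic, not output from the harness):

```python
# Back-of-the-envelope check: seconds per question for the reported 11-hour run.
total_seconds = 11 * 3600   # 11 hours
num_questions = 900         # MMMU validation split
print(total_seconds / num_questions)  # => 44.0 seconds per question
```

At batch_size 1 with a 90B model sharded across 4 GPUs, ~44 s per question does not seem wildly off, though a larger batch size (if memory allows) should cut the wall-clock time down substantially.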
jybbjybb changed the title from "Is this result on mmmu_val reasonable?" to "Is this result (acc=0.43) on mmmu_val reasonable?" on Oct 2, 2024
jybbjybb changed the title from "Is this result (acc=0.43) on mmmu_val reasonable?" to "Is LLaMA3.2-Vision-90B/11B results on mmmu_val reproducible?" on Oct 3, 2024
jybbjybb (Author) commented Oct 3, 2024:

I added one argument, --apply_chat_template, and the accuracy increased to 54.78%, but it still falls short of Meta's claim in the Hugging Face repo (60.3%). The command for this run is:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m lm_eval --model hf-multimodal --model_args pretrained=/mnt/LLM_checkpoints/Llama-3.2-90B-Vision-Instruct/,parallelize=True --tasks mmmu_val  --batch_size 1 --apply_chat_template
|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|------|------:|------|-----:|------|----:|-----:|
|mmmu_val|0|none| |acc|0.5478|± 0.0158|
| - Art and Design|0|none| |acc|0.6833|± 0.0358|
| - Business|0|none| |acc|0.5267|± 0.0411|
| - Health and Medicine|0|none| |acc|0.5933|± 0.0403|
| - Humanities and Social Science|0|none| |acc|0.7417|± 0.0400|
| - Science|0|none| |acc|0.4667|± 0.0409|
| - Tech and Engineering|0|none| |acc|0.4000|± 0.0331|
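
If it helps to see what --apply_chat_template actually changes, here is a rough sketch (my own, not taken from the harness) that renders a made-up MMMU-style question through the model's own chat template via transformers; the question text is only an illustration:

```python
# Sketch: print what the Llama 3.2 Vision chat template wraps around a question.
# Assumes transformers is installed and you have access to the gated repo;
# the question text below is invented for illustration.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-90B-Vision-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which structure is labeled X in the figure?\n"
                                     "A. ...\nB. ...\nC. ...\nD. ...\n"
                                     "Answer with the option's letter."},
        ],
    }
]

# add_generation_prompt=True appends the assistant header the instruct model expects.
print(processor.apply_chat_template(messages, add_generation_prompt=True))
```

Without the template, the instruct model sees only the raw concatenated prompt, which presumably accounts for a good part of the 43% vs. 54.78% gap.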

jybbjybb changed the title from "Is LLaMA3.2-Vision-90B/11B results on mmmu_val reproducible?" to "Is LLaMA3.2-Vision-90B/11B result on mmmu_val reproducible?" on Oct 3, 2024
baberabb added the validation (For validation of task implementations.) label on Oct 4, 2024
BabyChouSr commented:

For the 11B Vision base model, I got numbers that were pretty far off too:
Command: lm_eval --model hf-multimodal --model_args pretrained=meta-llama/Llama-3.2-11B-Vision,dtype=bfloat16,max_images=2,parallelize=True --tasks mmmu_val --batch_size 32

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val                               |      0|none  |      |acc   |↑  |0.2667|±  |0.0147|
| - Art and Design                      |      0|none  |      |acc   |↑  |0.3250|±  |0.0427|
|  - Art                                |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
|  - Art Theory                         |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Design                             |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Music                              |      0|none  |     0|acc   |↑  |0.4667|±  |0.0926|
| - Business                            |      0|none  |      |acc   |↑  |0.2733|±  |0.0369|
|  - Accounting                         |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Economics                          |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Finance                            |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Manage                             |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Marketing                          |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
| - Health and Medicine                 |      0|none  |      |acc   |↑  |0.3200|±  |0.0385|
|  - Basic Medical Science              |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Clinical Medicine                  |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Diagnostics and Laboratory Medicine|      0|none  |     0|acc   |↑  |0.3667|±  |0.0895|
|  - Pharmacy                           |      0|none  |     0|acc   |↑  |0.3667|±  |0.0895|
|  - Public Health                      |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
| - Humanities and Social Science       |      0|none  |      |acc   |↑  |0.2917|±  |0.0407|
|  - History                            |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Literature                         |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
|  - Psychology                         |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Sociology                          |      0|none  |     0|acc   |↑  |0.4667|±  |0.0926|
| - Science                             |      0|none  |      |acc   |↑  |0.1867|±  |0.0319|
|  - Biology                            |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Chemistry                          |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Geography                          |      0|none  |     0|acc   |↑  |0.1667|±  |0.0692|
|  - Math                               |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Physics                            |      0|none  |     0|acc   |↑  |0.2000|±  |0.0743|
| - Tech and Engineering                |      0|none  |      |acc   |↑  |0.2333|±  |0.0295|
|  - Agriculture                        |      0|none  |     0|acc   |↑  |0.1667|±  |0.0692|
|  - Architecture and Engineering       |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Computer Science                   |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|

jybbjybb (Author) commented:

Meta likely uses some additional, undisclosed prompting to reach the reported accuracy. Adding --apply_chat_template is part of it, but not enough.

BabyChouSr commented:

For the above, I was using the base model. With the chat template applied, I got 0.4722 for Llama-3.2-11B-Vision-Instruct with max_images=5 (for GPU memory reasons). Let me know if you get something similar!
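
For reference, here is a rough sketch of that 11B-Instruct run through the Python API instead of the CLI. This is my own reconstruction rather than the exact command I ran: it assumes a recent lm-eval release where simple_evaluate exposes apply_chat_template, and the batch size is a guess.

```python
# Sketch: reproduce the 11B-Instruct MMMU run via the Python API.
# Mirrors the CLI flags discussed above; batch_size is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args=(
        "pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct,"
        "dtype=bfloat16,max_images=5,parallelize=True"
    ),
    tasks=["mmmu_val"],
    batch_size=8,
    apply_chat_template=True,  # same effect as the --apply_chat_template CLI flag
)

print(results["results"].get("mmmu_val"))
```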

jybbjybb (Author) commented:

> For the above, I was using the base model. With the chat template applied, I got 0.4722 for Llama-3.2-11B-Vision-Instruct with max_images=5 (for GPU memory reasons). Let me know if you get something similar!

0.4722 is a reasonable number; you can check https://github.com/jybbjybb/llama_quant/blob/main/LLaMA3.2.md for detailed results.
