Is LLaMA3.2-Vision-90B/11B result on mmmu_val reproducible? #2377

Open · jybbjybb opened this issue Oct 2, 2024 · 5 comments
Labels: validation (For validation of task implementations.)

jybbjybb commented Oct 2, 2024:

I have tested Llama-3.2-90B-Vision-Instruct on the task mmmu_val; the result is as follows. The accuracy is 0.43, but Meta's Hugging Face model card reports 60.3. I think Meta's result uses CoT, while this mmmu_val task may not. Can CoT explain this 60.3 − 43 ≈ 17-point difference?

It takes 11 hours on 4xA100 GPUs to finish these 900 questions. Is that a reasonable time?

The command I use to run the test is:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m lm_eval --model hf-multimodal --model_args pretrained=/mnt/LLM_checkpoints/Llama-3.2-90B-Vision-Instruct/,parallelize=True --tasks mmmu_val  --batch_size 1
|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|------|------:|------|-----:|------|----:|-----:|
|mmmu_val|0|none| |acc|0.4300|± 0.0163|
| - Art and Design|0|none| |acc|0.5333|± 0.0454|
| - Business|0|none| |acc|0.3467|± 0.0393|
| - Health and Medicine|0|none| |acc|0.5067|± 0.0412|
| - Humanities and Social Science|0|none| |acc|0.5750|± 0.0451|
| - Science|0|none| |acc|0.3467|± 0.0392|
| - Tech and Engineering|0|none| |acc|0.3524|± 0.0329|
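
As a rough sanity check on the runtime (my own back-of-the-envelope arithmetic, not output from the harness):

```python
# Back-of-the-envelope check: seconds per question for the reported 11-hour run.
total_seconds = 11 * 3600   # 11 hours
num_questions = 900         # MMMU validation split
print(total_seconds / num_questions)  # => 44.0 seconds per question
```

At batch_size 1 with a 90B model sharded across 4 GPUs, ~44 s per question does not seem wildly off, though a larger batch size (if memory allows) should cut the wall-clock time down substantially.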
jybbjybb changed the title from "Is this result on mmmu_val reasonable?" to "Is this result (acc=0.43) on mmmu_val reasonable?" on Oct 2, 2024
jybbjybb changed the title from "Is this result (acc=0.43) on mmmu_val reasonable?" to "Is LLaMA3.2-Vision-90B/11B results on mmmu_val reproducible?" on Oct 3, 2024
jybbjybb (Author) commented Oct 3, 2024:

I added one argument, --apply_chat_template, and the accuracy increased to 54.78%, but it still falls short of Meta's claim in the Hugging Face repo (60.3%). The command for this run is:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m lm_eval --model hf-multimodal --model_args pretrained=/mnt/LLM_checkpoints/Llama-3.2-90B-Vision-Instruct/,parallelize=True --tasks mmmu_val  --batch_size 1 --apply_chat_template
|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|------|------:|------|-----:|------|----:|-----:|
|mmmu_val|0|none| |acc|0.5478|± 0.0158|
| - Art and Design|0|none| |acc|0.6833|± 0.0358|
| - Business|0|none| |acc|0.5267|± 0.0411|
| - Health and Medicine|0|none| |acc|0.5933|± 0.0403|
| - Humanities and Social Science|0|none| |acc|0.7417|± 0.0400|
| - Science|0|none| |acc|0.4667|± 0.0409|
| - Tech and Engineering|0|none| |acc|0.4000|± 0.0331|
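
If it helps to see what --apply_chat_template actually changes, here is a rough sketch (my own, not taken from the harness) that renders a made-up MMMU-style question through the model's own chat template via transformers; the question text is only an illustration:

```python
# Sketch: print what the Llama 3.2 Vision chat template wraps around a question.
# Assumes transformers is installed and you have access to the gated repo;
# the question text below is invented for illustration.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-90B-Vision-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which structure is labeled X in the figure?\n"
                                     "A. ...\nB. ...\nC. ...\nD. ...\n"
                                     "Answer with the option's letter."},
        ],
    }
]

# add_generation_prompt=True appends the assistant header the instruct model expects.
print(processor.apply_chat_template(messages, add_generation_prompt=True))
```

Without the template, the instruct model sees only the raw concatenated prompt, which presumably accounts for a good part of the 43% vs. 54.78% gap.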

jybbjybb changed the title from "Is LLaMA3.2-Vision-90B/11B results on mmmu_val reproducible?" to "Is LLaMA3.2-Vision-90B/11B result on mmmu_val reproducible?" on Oct 3, 2024
baberabb added the validation (For validation of task implementations.) label on Oct 4, 2024
BabyChouSr commented:

For the 11B Vision base model, I got numbers that were pretty far off too:
Command: lm_eval --model hf-multimodal --model_args pretrained=meta-llama/Llama-3.2-11B-Vision,dtype=bfloat16,max_images=2,parallelize=True --tasks mmmu_val --batch_size 32

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmmu_val                               |      0|none  |      |acc   |↑  |0.2667|±  |0.0147|
| - Art and Design                      |      0|none  |      |acc   |↑  |0.3250|±  |0.0427|
|  - Art                                |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
|  - Art Theory                         |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Design                             |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Music                              |      0|none  |     0|acc   |↑  |0.4667|±  |0.0926|
| - Business                            |      0|none  |      |acc   |↑  |0.2733|±  |0.0369|
|  - Accounting                         |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Economics                          |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Finance                            |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Manage                             |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Marketing                          |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
| - Health and Medicine                 |      0|none  |      |acc   |↑  |0.3200|±  |0.0385|
|  - Basic Medical Science              |      0|none  |     0|acc   |↑  |0.3333|±  |0.0875|
|  - Clinical Medicine                  |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Diagnostics and Laboratory Medicine|      0|none  |     0|acc   |↑  |0.3667|±  |0.0895|
|  - Pharmacy                           |      0|none  |     0|acc   |↑  |0.3667|±  |0.0895|
|  - Public Health                      |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
| - Humanities and Social Science       |      0|none  |      |acc   |↑  |0.2917|±  |0.0407|
|  - History                            |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Literature                         |      0|none  |     0|acc   |↑  |0.2667|±  |0.0821|
|  - Psychology                         |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Sociology                          |      0|none  |     0|acc   |↑  |0.4667|±  |0.0926|
| - Science                             |      0|none  |      |acc   |↑  |0.1867|±  |0.0319|
|  - Biology                            |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|
|  - Chemistry                          |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Geography                          |      0|none  |     0|acc   |↑  |0.1667|±  |0.0692|
|  - Math                               |      0|none  |     0|acc   |↑  |0.1333|±  |0.0631|
|  - Physics                            |      0|none  |     0|acc   |↑  |0.2000|±  |0.0743|
| - Tech and Engineering                |      0|none  |      |acc   |↑  |0.2333|±  |0.0295|
|  - Agriculture                        |      0|none  |     0|acc   |↑  |0.1667|±  |0.0692|
|  - Architecture and Engineering       |      0|none  |     0|acc   |↑  |0.2333|±  |0.0785|
|  - Computer Science                   |      0|none  |     0|acc   |↑  |0.3000|±  |0.0851|

jybbjybb (Author) commented:

Meta likely uses some additional, undisclosed prompting to reach the reported accuracy. Adding --apply_chat_template is part of it, but not enough.

BabyChouSr commented:

For the above, I was using the base model. With the chat template applied, I got 0.4722 for Llama-3.2-11B-Vision-Instruct with max_images=5 (for GPU memory reasons). Let me know if you get something similar!
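
For reference, here is a rough sketch of that 11B-Instruct run through the Python API instead of the CLI. This is my own reconstruction rather than the exact command I ran: it assumes a recent lm-eval release where simple_evaluate exposes apply_chat_template, and the batch size is a guess.

```python
# Sketch: reproduce the 11B-Instruct MMMU run via the Python API.
# Mirrors the CLI flags discussed above; batch_size is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args=(
        "pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct,"
        "dtype=bfloat16,max_images=5,parallelize=True"
    ),
    tasks=["mmmu_val"],
    batch_size=8,
    apply_chat_template=True,  # same effect as the --apply_chat_template CLI flag
)

print(results["results"].get("mmmu_val"))
```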

jybbjybb (Author) commented:

> For the above, I was using the base model. With the chat template applied, I got 0.4722 for Llama-3.2-11B-Vision-Instruct with max_images=5 (for GPU memory reasons). Let me know if you get something similar!

0.4722 is a reasonable number; you can check https://github.com/jybbjybb/llama_quant/blob/main/LLaMA3.2.md for detailed results.
