Mismatched Results on ScreenSpot #1

Open

boyugou opened this issue Jan 21, 2025 · 4 comments

Comments

boyugou commented Jan 21, 2025

Hi authors,

Congratulations on the impressive work.

I just have a minor question about the results on ScreenSpot:

I double-checked with the author of SeeClick a while ago: originally, the results were reported as MACRO averages (SeeClick, CogAgent, UGround). However, I have noticed that people have recently been mixing MACRO Avg and MICRO Avg results in their reports, and that seems to be the case in this paper as well.
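
To make the distinction concrete, here is a minimal sketch with hypothetical per-subset (correct, total) counts for ScreenSpot's six splits (mobile/desktop/web, text and icon each); the numbers are made up purely for illustration:

```python
# Hypothetical (correct, total) counts per ScreenSpot subset;
# the numbers below are made up for illustration only.
subsets = {
    "mobile-text":  (230, 272),
    "mobile-icon":  (180, 229),
    "desktop-text": (170, 194),
    "desktop-icon": (110, 140),
    "web-text":     (200, 230),
    "web-icon":     (150, 206),
}

# MACRO Avg: mean of the per-subset accuracies (every subset weighted equally).
macro = sum(c / t for c, t in subsets.values()) / len(subsets)

# MICRO Avg: pool all samples into one overall accuracy (larger subsets weigh more).
micro = sum(c for c, _ in subsets.values()) / sum(t for _, t in subsets.values())

print(f"MACRO Avg: {macro:.1%}, MICRO Avg: {micro:.1%}")
```

The two numbers diverge whenever the subset sizes differ, which is why mixing them across papers makes results hard to compare.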

BTW, we currently suggest using UGround-V1-7B to refer to the Qwen-based model, and UGround to refer to the initial model from the first arXiv version (they use the same training data but were trained with different architectures).

I can share more evaluation results of UGround (both versions) on the benchmarks if they can be useful.

pooruss (Collaborator) commented Jan 22, 2025

Hi Boyu,

Thanks for pointing that out! It's true that we reported the MICRO Avg results, as some other work has as well. Below are some updated MACRO Avg results on ScreenSpot:

Aguvis-72B: 88.4
UI-TARS-2B: 81.2
UI-TARS-7B: 89.1
UI-TARS-70B: 88.2

boyugou (Author) commented Jan 22, 2025

Hi Shihao,

Thanks for the prompt response! I do think it's fine to share MICRO scores, since people can easily calculate the MACRO scores themselves.

I was not trying to criticize anyone. I just wanted to take this opportunity to propose that we converge on a standard metric; otherwise, the results people have been reporting lately will only get messier.

pooruss commented Jan 22, 2025

Can't agree more. We will update these results in the next version of the report. Thanks!

boyugou commented Jan 22, 2025

👍👍
