Mismatched Results on ScreenSpot #1

Open

boyugou opened this issue Jan 21, 2025 · 4 comments

Comments

boyugou commented Jan 21, 2025

Hi authors,

Congratulations on the impressive work.

I just have a minor question about the results on ScreenSpot:

I double-checked with the author of SeeClick a while ago: originally, the results were reported as MACRO averages (SeeClick, CogAgent, UGround). However, I have noticed that people have recently been mixing MACRO Avg and MICRO Avg results in their reports, and that seems to be the case in this paper as well.
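
To make the distinction concrete, here is a minimal sketch with hypothetical per-subset (correct, total) counts for ScreenSpot's six splits (mobile/desktop/web, text and icon each); the numbers are made up purely for illustration:

```python
# Hypothetical (correct, total) counts per ScreenSpot subset;
# the numbers below are made up for illustration only.
subsets = {
    "mobile-text":  (230, 272),
    "mobile-icon":  (180, 229),
    "desktop-text": (170, 194),
    "desktop-icon": (110, 140),
    "web-text":     (200, 230),
    "web-icon":     (150, 206),
}

# MACRO Avg: mean of the per-subset accuracies (every subset weighted equally).
macro = sum(c / t for c, t in subsets.values()) / len(subsets)

# MICRO Avg: pool all samples into one overall accuracy (larger subsets weigh more).
micro = sum(c for c, _ in subsets.values()) / sum(t for _, t in subsets.values())

print(f"MACRO Avg: {macro:.1%}, MICRO Avg: {micro:.1%}")
```

The two numbers diverge whenever the subset sizes differ, which is why mixing them across papers makes results hard to compare.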

BTW, we currently suggest using UGround-V1-7B to refer to the Qwen-based model, and UGround to refer to the initial model from the first arXiv version (they use the same training data but were trained with different architectures).

I can share more evaluation results of UGround (both versions) on the benchmarks if they can be useful.

pooruss (Collaborator) commented Jan 22, 2025

Hi Boyu,

Thanks for pointing that out! It's true that we reported the MICRO Avg results, as some other work has as well. Below are some updated MACRO Avg results on ScreenSpot:

Aguvis-72B: 88.4
UI-TARS-2B: 81.2
UI-TARS-7B: 89.1
UI-TARS-70B: 88.2

boyugou (Author) commented Jan 22, 2025

Hi Shihao,

Thanks for the prompt response! I do think it's fine to share MICRO scores, since people can easily calculate the MACRO scores themselves.

I was not trying to criticize anyone. I just wanted to take this opportunity to propose that we converge on a standard metric; otherwise, the results people have been reporting lately will only get messier.

pooruss commented Jan 22, 2025

Can't agree more. We will update these results in the next version of the report. Thanks!

boyugou commented Jan 22, 2025

👍👍
