Mismatched Results on ScreenSpot #1
Hi authors,
Congratulations on the impressive work.
I just have a minor question about the results on ScreenSpot:
I once double-checked with the author of SeeClick that the results were originally reported as MACRO Avg (SeeClick, CogAgent, UGround). However, I have noticed that people have recently been mixing MACRO Avg and MICRO Avg results in their reports, and this seems to be the case in this paper as well.
BTW, we would currently suggest using UGround-V1-7B to refer to the Qwen-based model, and UGround to refer to the initial model from the first arXiv version (they use the same training data but are trained with different architectures).
I can share more evaluation results of UGround (both versions) on these benchmarks if they would be useful.
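For reference, here is a minimal sketch of the difference between the two averaging schemes discussed in this thread: the MICRO Avg pools all samples across the ScreenSpot splits before dividing, while the MACRO Avg takes the unweighted mean of the per-split accuracies. The split names follow ScreenSpot's six subsets, but the sample counts and correct counts below are made-up placeholders, not results from any model.

```python
# Minimal sketch of MICRO vs. MACRO averaging over benchmark splits.
# All numbers below are placeholders for illustration only.

def micro_average(correct_per_split, total_per_split):
    """Pool all samples, then divide: larger splits weigh more."""
    return sum(correct_per_split) / sum(total_per_split)

def macro_average(correct_per_split, total_per_split):
    """Average the per-split accuracies: every split weighs the same."""
    accs = [c / t for c, t in zip(correct_per_split, total_per_split)]
    return sum(accs) / len(accs)

# Hypothetical example with ScreenSpot's six subsets and unequal sizes.
splits  = ["mobile-text", "mobile-icon", "desktop-text",
           "desktop-icon", "web-text", "web-icon"]
totals  = [300, 200, 200, 150, 250, 170]   # placeholder sample counts
correct = [270, 150, 175, 110, 220, 120]   # placeholder correct counts

print(f"MICRO Avg: {micro_average(correct, totals):.1%}")
print(f"MACRO Avg: {macro_average(correct, totals):.1%}")
# The two numbers differ whenever split sizes are unequal, which is why
# mixing the two schemes in one comparison table is misleading.
```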
Comments
Hi Boyu, thanks for pointing that out! It's true that we reported the MICRO Avg results, as some other works do as well. Below are some updated MACRO Avg results on ScreenSpot: Aguvis-72B: 88.4
Hi Shihao, thanks for the prompt response! I do feel it's fine to share MICRO scores, since people can easily calculate MACRO scores themselves. I was not trying to criticize anyone; I just wanted to take this opportunity to propose that we consider converging on a standard metric. Otherwise, the results people have been reporting lately are getting increasingly messy.
Can't agree more. We will update these results in the updated version of the report. Thanks!
👍👍