I’m reproducing the results of the ReMe paper on the AppWorld benchmark and have questions about how the reported metrics are computed. In particular, I noticed that the evaluation traces include fields like before_score, after_score, and uplift_score, and that the runner seems to have early-stopping logic (it stops once a task is completed). These details affect how to compute metrics such as avg@4.
The evaluation code appears to support early stopping. For computing the paper's avg@4 metric, do we need to run exactly 4 rollouts/trajectories per task even if the task succeeds earlier?
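For reference, this is how I am currently aggregating, under the assumption that avg@4 means averaging exactly 4 independent rollout scores per task and then averaging across tasks. The record layout and field names (task_id, score) are placeholders from my own script, not ReMe's actual trace schema:

```python
from collections import defaultdict

def avg_at_k(records, k=4):
    """Average over exactly k rollout scores per task, then average across tasks.

    `records` is a list of dicts like {"task_id": ..., "score": ...}, one per
    rollout. Field names are placeholders, not ReMe's trace schema.
    """
    per_task = defaultdict(list)
    for r in records:
        per_task[r["task_id"]].append(r["score"])

    task_means = []
    for task_id, scores in per_task.items():
        if len(scores) != k:
            raise ValueError(f"task {task_id} has {len(scores)} rollouts, expected {k}")
        task_means.append(sum(scores) / k)

    return sum(task_means) / len(task_means)
```

If early stopping can terminate a task with fewer than 4 rollouts, the len(scores) != k check above is wrong, and I would need to know what denominator the paper uses in that case.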
When computing the final metric reported in the paper, should we use after_score or uplift_score?
after_score seems to reflect the post-run outcome (0–1 float).
uplift_score seems to be after_score - before_score.
Which one corresponds to the paper’s main score? If both are used in different places, could you clarify which metric maps to which field?
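To make sure I am asking the right question, here are the two aggregations I have in mind, as a minimal sketch; the traces argument is assumed to be a list of per-task dicts carrying the before_score/after_score fields mentioned above, and the function name is mine:

```python
def final_metric(traces, use_uplift=False):
    """Mean after_score across traces, or mean (after_score - before_score)
    when use_uplift is True. Assumes each trace dict exposes the
    before_score/after_score fields seen in the evaluation output."""
    if use_uplift:
        values = [t["after_score"] - t["before_score"] for t in traces]
    else:
        values = [t["after_score"] for t in traces]
    return sum(values) / len(values)
```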
The score values in the traces are floats in [0, 1]. For computing success/accuracy-style metrics (both options are sketched below), should we treat scores as:
(a) binary success: count only score == 1.0 as a success and everything else as a failure, or
(b) fractional credit: directly average the float scores?
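Concretely, the two readings I am weighing (hypothetical helpers operating on a plain list of float scores; the small epsilon only guards against float representation of 1.0):

```python
def success_rate_binary(scores, eps=1e-9):
    """Option (a): a rollout counts as a success only if its score is 1.0
    (compared with a tiny tolerance for float representation)."""
    return sum(1 for s in scores if s >= 1.0 - eps) / len(scores)

def success_rate_fractional(scores):
    """Option (b): average the raw float scores as partial credit."""
    return sum(scores) / len(scores)
```

On partially completed tasks these two can diverge noticeably, which is why I want to match the paper's convention exactly.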