
Clarification on AppWorld metrics computation in ReMe paper: early stopping, avg@4, score definition (after_score vs uplift_score), and binary vs fractional scoring #71

@nzj1120

Description


I'm reproducing the results of the ReMe paper on the AppWorld benchmark and have some questions about how the reported metrics are computed. In particular, the evaluation traces include fields like `before_score`, `after_score`, and `uplift_score`, and the runner appears to have early-stopping logic (it stops once the task is completed). These details affect how metrics such as avg@4 should be computed.

The evaluation code appears to support early stopping. When computing the paper's avg@4 metric, do we need to run exactly 4 rollouts/trajectories per task, even if a task succeeds earlier?

When computing the final metric reported in the paper, should we use `after_score` or `uplift_score`?
- `after_score` seems to reflect the post-run outcome (a float in [0, 1]).
- `uplift_score` seems to be `after_score - before_score`.

Which one corresponds to the paper's main score? If both are used in different places, could you clarify which reported metric maps to which field?

The score values in the traces are floats in [0, 1]. For success/accuracy-style metrics, should we treat scores as:
- binary success: count only `score == 1.0` as success and everything else as failure, or
- fractional credit: average the float scores directly?
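
For reference, here is a minimal sketch of how I am currently aggregating the traces, so you can point out exactly which assumption is wrong. Only the field names (`after_score`, `before_score`) come from the trace files; everything else (4 rollouts per task, the binary/fractional switch, the uplift option) is my own guess rather than the repo's actual evaluation code.

```python
from statistics import mean

def score_of(trace, use_uplift=False):
    """Per-rollout score; 'uplift' = after_score - before_score (my assumption)."""
    if use_uplift:
        return trace["after_score"] - trace["before_score"]
    return trace["after_score"]

def avg_at_k(traces_by_task, k=4, binary=True, use_uplift=False):
    """avg@k: mean over tasks of the mean score across k rollouts per task.

    Assumes every task has at least k rollouts, i.e. no early stopping
    across rollouts -- which is exactly what the first question asks about.
    """
    per_task = []
    for traces in traces_by_task.values():
        scores = [score_of(t, use_uplift) for t in traces[:k]]
        if binary:
            # Binary interpretation: only a perfect score counts as success.
            scores = [1.0 if s == 1.0 else 0.0 for s in scores]
        per_task.append(mean(scores))
    return mean(per_task)
```

Depending on the answers to the three questions above, I would flip `binary`, `use_uplift`, and the per-task rollout count; confirming the intended combination would let me match the paper's numbers.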
