I’m reproducing the results of the ReMe paper on the AppWorld benchmark and have questions about how the reported metrics are computed. In particular, I noticed that the evaluation traces include fields like before_score, after_score, and uplift_score, and that the runner seems to have early-stopping logic (it stops once a task is completed). These details affect how to compute metrics such as avg@4.
The evaluation code appears to support early stopping. For computing the paper's avg@4 metric, do we need to run exactly 4 rollouts/trajectories per task even if the task succeeds earlier?
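For reference, this is how I am currently aggregating, under the assumption that avg@4 means averaging exactly 4 independent rollout scores per task and then averaging across tasks. The record layout and field names (task_id, score) are placeholders from my own script, not ReMe's actual trace schema:

```python
from collections import defaultdict

def avg_at_k(records, k=4):
    """Average over exactly k rollout scores per task, then average across tasks.

    `records` is a list of dicts like {"task_id": ..., "score": ...}, one per
    rollout. Field names are placeholders, not ReMe's trace schema.
    """
    per_task = defaultdict(list)
    for r in records:
        per_task[r["task_id"]].append(r["score"])

    task_means = []
    for task_id, scores in per_task.items():
        if len(scores) != k:
            raise ValueError(f"task {task_id} has {len(scores)} rollouts, expected {k}")
        task_means.append(sum(scores) / k)

    return sum(task_means) / len(task_means)
```

If early stopping can terminate a task with fewer than 4 rollouts, the len(scores) != k check above is wrong, and I would need to know what denominator the paper uses in that case.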
When computing the final metric reported in the paper, should we use after_score or uplift_score?
after_score seems to reflect the post-run outcome (0–1 float).
uplift_score seems to be after_score - before_score.
Which one corresponds to the paper’s main score? If both are used in different places, could you clarify which metric maps to which field?
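To make sure I am asking the right question, here are the two aggregations I have in mind, as a minimal sketch; the traces argument is assumed to be a list of per-task dicts carrying the before_score/after_score fields mentioned above, and the function name is mine:

```python
def final_metric(traces, use_uplift=False):
    """Mean after_score across traces, or mean (after_score - before_score)
    when use_uplift is True. Assumes each trace dict exposes the
    before_score/after_score fields seen in the evaluation output."""
    if use_uplift:
        values = [t["after_score"] - t["before_score"] for t in traces]
    else:
        values = [t["after_score"] for t in traces]
    return sum(values) / len(values)
```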
The score values in the traces are floats in [0, 1]. For computing success/accuracy-style metrics (both options are sketched below), should we treat scores as:
(a) binary success: count only score == 1.0 as a success and everything else as a failure, or
(b) fractional credit: directly average the float scores?
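Concretely, the two readings I am weighing (hypothetical helpers operating on a plain list of float scores; the small epsilon only guards against float representation of 1.0):

```python
def success_rate_binary(scores, eps=1e-9):
    """Option (a): a rollout counts as a success only if its score is 1.0
    (compared with a tiny tolerance for float representation)."""
    return sum(1 for s in scores if s >= 1.0 - eps) / len(scores)

def success_rate_fractional(scores):
    """Option (b): average the raw float scores as partial credit."""
    return sum(scores) / len(scores)
```

On partially completed tasks these two can diverge noticeably, which is why I want to match the paper's convention exactly.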