Add Iterative Predictor for Improved SWE-bench Issue Resolution #1397

@Jerryguan777

Description

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request?

Medium

Please provide a clear description of the problem this feature solves

The current full predictor in the swe_bench evaluation example uses a one-shot generation approach, which lacks the robustness needed for complex SWE-bench tasks. In my evaluation of 8 instances across major SWE-bench projects (including SymPy, Astropy, Django, and Matplotlib), the success rate was 0%.

The core limitations identified are:

  • Generates fixes without running tests to validate them
  • Lacks feedback loops to refine solutions based on execution errors
  • Cannot recover from failures or adjust strategies
  • Relies on static code analysis without dynamic execution feedback
Reproduction command and evaluation summary:

nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_full.yml
=== EVALUATION SUMMARY ===
Workflow Status: COMPLETED (workflow_output.json)
Total Runtime: 132.62s

Per evaluator results:
| Evaluator   |   Avg Score | Output File           |
|-------------|-------------|-----------------------|
| swe_bench   |           0 | swe_bench_output.json |

Describe your ideal solution

I propose implementing an Iterative Predictor that introduces a dynamic feedback loop into the SWE-bench resolution process. This feature moves the agent from a "one-shot" model to a "reason-action-observation" model.

Key Components of the Solution:

  • Step-by-step execution: Executes commands incrementally and observes results
  • Test-driven validation: Runs tests after each fix attempt and uses failure signals to guide refinement
  • Error recovery: Handles failures gracefully with retry mechanisms and strategy adjustments
  • Dynamic feedback: Uses runtime errors, test outputs, and execution results instead of static analysis
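
The loop behind these components can be sketched as follows. This is a minimal illustration, not the proposed implementation; `generate_patch` and `run_tests` are hypothetical callables standing in for the model call and the SWE-bench test harness.

```python
def iterative_fix(generate_patch, run_tests, max_attempts=3):
    """Reason-action-observation loop: propose a patch, run the tests,
    and feed failure output back into the next proposal."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        patch = generate_patch(feedback)      # action: propose a fix
        passed, output = run_tests(patch)     # observation: execute the tests
        if passed:
            return patch, attempt             # test-driven validation succeeded
        feedback = output                     # use failure signals to refine
    return None, max_attempts                 # retry budget exhausted


# Toy demonstration: the first patch fails, the refined one passes.
def fake_generate(feedback):
    return "patch-v2" if feedback else "patch-v1"

def fake_run(patch):
    return patch == "patch-v2", f"tests failed for {patch}"
```

The key difference from the one-shot predictor is that `feedback` carries real execution output into the next generation step, rather than relying on static analysis alone.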

Additional context

I plan to implement this iterative predictor. I will extend the SweBenchPredictorBase class and reuse the existing environment-interaction logic to ensure consistency with the current framework. Once the implementation is verified, I will submit a PR for review.
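As a rough sketch of that plan, the subclass could take the shape below. The `SweBenchPredictorBase` stub and the helper names (`_propose_patch`, `_run_tests`) are illustrative assumptions; the real base-class interface in the toolkit may differ.

```python
class SweBenchPredictorBase:
    """Stand-in stub; the real base class ships with the swe_bench example."""

    def predict(self, instance: dict) -> str:
        raise NotImplementedError


class IterativePredictor(SweBenchPredictorBase):
    """Retries patch generation, using test output as feedback each round."""

    def __init__(self, propose_patch, run_tests, max_attempts: int = 3):
        # propose_patch(instance, feedback) -> patch string   (illustrative)
        # run_tests(instance, patch) -> (passed, output)      (illustrative)
        self._propose_patch = propose_patch
        self._run_tests = run_tests
        self._max_attempts = max_attempts

    def predict(self, instance: dict) -> str:
        feedback = None
        patch = ""
        for _ in range(self._max_attempts):
            patch = self._propose_patch(instance, feedback)
            passed, feedback = self._run_tests(instance, patch)
            if passed:
                break  # validated fix; otherwise keep the last attempt
        return patch
```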

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request

Labels: Needs Triage, feature request