Description
Is this a new feature, an improvement, or a change to existing functionality?
Improvement
How would you describe the priority of this feature request
Medium
Please provide a clear description of problem this feature solves
The current full predictor in the swe_bench evaluation example uses one-shot generation, which is not robust enough for complex SWE-bench tasks. In my evaluation of 8 instances from major SWE-bench projects (including SymPy, Astropy, Django, and Matplotlib), it resolved none of them, a success rate of 0%.
The core limitations identified are:
- Generates fixes without running tests to validate them
- Lacks feedback loops to refine solutions based on execution errors
- Cannot recover from failures or adjust strategies
- Relies on static code analysis without dynamic execution feedback
nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_full.yml
=== EVALUATION SUMMARY ===
Workflow Status: COMPLETED (workflow_output.json)
Total Runtime: 132.62s
Per evaluator results:
| Evaluator | Avg Score | Output File |
|-------------|-------------|-----------------------|
| swe_bench | 0 | swe_bench_output.json |
Describe your ideal solution
I propose implementing an Iterative Predictor that introduces a dynamic feedback loop into the SWE-bench resolution process. This feature will move the agent from a "one-shot" model to a "reason-action-observation" model.
Key Components of the Solution:
- Step-by-step execution: Executes commands incrementally and observes results
- Test-driven validation: Runs tests after each fix attempt and uses failure signals to guide refinement
- Error recovery: Handles failures gracefully with retry mechanisms and strategy adjustments
- Dynamic feedback: Uses runtime errors, test outputs, and execution results instead of static analysis
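The components above could be sketched roughly as follows. This is only a minimal illustration of the reason-action-observation loop, not the proposed implementation: `IterativeLoop`, `Observation`, and `propose_action` are hypothetical names, `propose_action` stands in for the LLM call, and commands run through a local shell rather than the toolkit's existing environment interaction logic.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Observation:
    """Result of one executed command, fed back to the model."""
    command: str
    returncode: int
    output: str


@dataclass
class IterativeLoop:
    # Hypothetical stand-in for the LLM: given the task and all previous
    # observations, return the next shell command, or None to give up.
    propose_action: Callable[[str, List[Observation]], Optional[str]]
    test_command: str          # command whose exit code 0 means "tests pass"
    max_steps: int = 10        # hard cap so the loop always terminates
    timeout_s: int = 60        # per-command timeout
    history: List[Observation] = field(default_factory=list)

    def run_command(self, command: str) -> Observation:
        """Execute a command and record stdout/stderr as an observation."""
        try:
            proc = subprocess.run(
                command, shell=True, capture_output=True,
                text=True, timeout=self.timeout_s,
            )
            obs = Observation(command, proc.returncode, proc.stdout + proc.stderr)
        except subprocess.TimeoutExpired:
            # Error recovery: a hung command becomes feedback, not a crash.
            obs = Observation(command, -1, f"timed out after {self.timeout_s}s")
        self.history.append(obs)
        return obs

    def solve(self, task: str) -> bool:
        """Alternate between acting and testing until tests pass or steps run out."""
        for _ in range(self.max_steps):
            command = self.propose_action(task, self.history)
            if command is None:
                break
            self.run_command(command)                      # action + observation
            if self.run_command(self.test_command).returncode == 0:
                return True                                # test-driven validation
        return False
```

Because every observation, including test failures and timeouts, is appended to `history`, the next call to `propose_action` can adjust its strategy based on runtime output instead of static analysis alone.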
Additional context
I plan to implement this iterative predictor myself. I will extend the SweBenchPredictorBase class and reuse the existing environment interaction logic to stay consistent with the current framework. Once the implementation is verified, I will submit a PR for review.
Code of Conduct
- I agree to follow this project's Code of Conduct
- I have searched the open feature requests and have found no duplicates for this feature request