Implement new agent using AutoCodeRover's approach #942
Comments
It now supports running on GitHub and local issues!
I don't think implementing AutoCodeRover is high-priority given that we have better performance! https://xwang.dev/blog/2024/opendevin-codeact-1.0-swebench/
The AutoCodeRover authors actually claim to resolve ~22% of SWE-bench lite issues. Why does the blog post https://xwang.dev/blog/2024/opendevin-codeact-1.0-swebench/ say AutoCodeRover achieves just 16%?
@dsemba see here: #1693 (comment)
Sorry for commenting on this closed issue, and thank you for your interest in AutoCodeRover! I would like to give an update on the pass@1 and pass@3 scores in the original AutoCodeRover paper. It turns out that the SWE-bench evaluation environment used in our original experiments gave underestimated scores due to missing system-level dependencies. Some correct patches were deemed wrong after running the SWE-bench acceptance tests in that environment. Thanks to the SWE-bench-docker project, our original patches were re-evaluated, and the actual pass@1 score is 19% (instead of 16%), while the pass@3 score is 26% (instead of 22%). More details can be found here. The 19% pass@1 score is also reflected on the SWE-bench leaderboard.
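As a side note on the metrics being discussed: pass@1 counts an issue as resolved only if the first generated patch passes the acceptance tests, while pass@3 counts it as resolved if any of up to three patches passes. A minimal sketch of how such scores could be computed from per-issue attempt outcomes (the function name `pass_at_k` and the data layout are my own illustration, not from the paper):

```python
def pass_at_k(results, k):
    """Fraction of issues resolved within the first k attempts.

    results: one list of boolean attempt outcomes per issue,
    e.g. [[False, True], [True], ...]. Illustrative only; the
    actual scores come from running SWE-bench acceptance tests.
    """
    resolved = sum(1 for attempts in results if any(attempts[:k]))
    return resolved / len(results)

# e.g. three issues, up to three attempts each
outcomes = [[True], [False, False, True], [False, False, False]]
pass1 = pass_at_k(outcomes, 1)  # 1/3: only the first issue passes on attempt 1
pass3 = pass_at_k(outcomes, 3)  # 2/3: the second issue passes on attempt 3
```

This also shows why pass@3 can exceed pass@1 by a wide margin: later attempts get extra chances on issues the first patch missed.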
@neubig I wouldn't necessarily claim that having a higher overall score means OpenDevin couldn't benefit even more from techniques used in AutoCodeRover (or some other tool). IMO, to 'properly' make that assessment you would need to isolate and test how well their methods (e.g. AST construction/search) compare against OpenDevin's equivalent methods. It may be that OpenDevin currently does better because of other components, but could still benefit from the new technique used here. Though perhaps you have already looked deeper than the above comment suggests, and so have a more evidenced view of why you don't expect improvements to be gained.
In the space of AST parsing / better code 'repomaps', see also:
Also, looking at the repo, it seems AutoCodeRover is now much higher than OpenDevin on SWE-bench lite (at least based on the 22% reported in the linked blog post):
Edit: Maybe these results are more relevant/up to date than that OpenDevin blog post though? Which seems like OpenDevin CodeActAgent (v1.3) +
AutoCodeRover from NUS claims 22% on SWE-bench lite.
Their approach constructs an AST from the repo codebase to identify where in the code a patch needs to be applied.
Implement an agent based on ACR's approach.
https://arxiv.org/abs/2404.05427
https://github.com/nus-apr/auto-code-rover
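To make the AST-based localization idea concrete: AutoCodeRover exposes code-search operations (e.g. searching for classes and methods by name) backed by a parse of the repository, which the agent uses to narrow down where a patch belongs. A minimal sketch of that indexing step using Python's stdlib `ast` module (the function names `index_repo` and `search_symbol` are my own; the real tool's search API is richer):

```python
import ast
from pathlib import Path

def index_repo(repo_root):
    """Parse every .py file and map class/function names to (file, line).

    Illustrative sketch of ACR-style AST indexing, not the actual
    implementation from nus-apr/auto-code-rover.
    """
    index = {}
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse cleanly
        for node in ast.walk(tree):
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                index.setdefault(node.name, []).append((str(path), node.lineno))
    return index

def search_symbol(index, name):
    """Return candidate patch locations for a symbol mentioned in an issue."""
    return index.get(name, [])
```

An agent could then take symbol names extracted from the issue text, call `search_symbol`, and feed the matching file/line contexts back to the LLM when drafting a patch.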