This is the replication package for our paper Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study,
containing the code for PatchDiff, the dataset we used for our study, and the results.
PatchDiff is an LLM-based repository-level differential testing tool, designed for differentiating two patches applied in the same repository.
- Language: Python (Python 3.10)
- Key Python Packages:
An environment.yml is also provided for easy replication.
- data/: dataset and results
  - dataset/swebench_verified: the SWE-bench Verified dataset collected from Hugging Face (princeton-nlp/SWE-bench_Verified)
  - tool_results/: the generated patches of the evaluated tools (i.e., OpenHands, LearnByInteract, CodeStory) from the SWE-bench leaderboard repository
- src/: code for PatchDiff
  - benchmark.py: code for loading the dataset
  - config.py: configuration file
  - framework.py: the framework for PatchDiff and the main entry point for execution
  - extractor.py: code for extracting target functions and context functions and constructing contextual code
  - llm.py: code for constructing prompts and invoking LLMs
  - parser.py: code for parsing Python code files in the target repositories
  - dokcer_helper.py: helper functions for operating the docker instances, where the test execution takes place
  - instance_logger.py: implementation of a concurrent logger
  - utils.py: extra helper functions
  - swebench.py: helper functions from the SWE-bench validation framework
  - swebench_constants.py: constant definitions from the SWE-bench validation framework
  - swebench_log_parsers.py: log parsers from the SWE-bench validation framework, used to parse the testing outputs
You can find our results here: https://doi.org/10.5281/zenodo.17074796
Edit config.py and fill in the following fields:
- LOGGING: absolute path to a logging directory. Logs and results will be written to this directory.
- TEMP_PATH: absolute path to a directory to hold temporary files. Please ensure that your disk has at least 200GB of available space.
- RUNALL_CACHE_PATH: absolute path to a directory to hold the cache of executing all developer tests.
- SWEBENCH_REPO: absolute path to the swebench repository. If you don't have this repository, please first clone it to your environment.
- OPENAI_API_KEY: an API key for OpenAI LLMs
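Before starting a long run, it can help to sanity-check the paths you filled in. The following is a minimal sketch, not part of PatchDiff: the helper names `check_paths` and `free_gb` and the constant `MIN_FREE_GB` are hypothetical, and the 200GB threshold is taken from the TEMP_PATH note above.

```python
import os
import shutil

# Free space (in GB) that the TEMP_PATH note above asks for.
MIN_FREE_GB = 200


def check_paths(paths: dict) -> list:
    """Return a list of human-readable problems found in the given paths.

    `paths` maps a config field name (e.g. "LOGGING") to its value.
    """
    problems = []
    for name, value in paths.items():
        if not os.path.isabs(value):
            problems.append(f"{name} is not an absolute path: {value!r}")
        elif not os.path.isdir(value):
            problems.append(f"{name} is not an existing directory: {value!r}")
    return problems


def free_gb(path: str) -> float:
    """Return the free disk space (in GB, 10**9 bytes) on the volume holding `path`."""
    return shutil.disk_usage(path).free / 1e9
```

For example, assuming the fields are importable as `src.config`, you could call `check_paths({"LOGGING": config.LOGGING, "TEMP_PATH": config.TEMP_PATH})` and compare `free_gb(config.TEMP_PATH)` against `MIN_FREE_GB` before launching.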
Run the following command to run all available tests:
$ python -m src.framework -o -t {tool_name}

- tool_name: the tool to be evaluated. Should be one of OpenHands, LearnByInteract, and CodeStory.
Extra options:
- -m: number of concurrent tasks, default to 1. e.g., -m 48
- -i: the instance to run. Without this option, PatchDiff will run all developer tests under all plausible patches. e.g., -i django__django-16527
- --remove_image: whether to remove docker images after running PatchDiff, default to False. Holding all docker images for SWE-bench tasks could take a lot of disk space (~3GB per task). If you want to save space, please append this option to your command.
- --no_cache_runall: whether to disable caching of runall results after running all developer tests, default to False. Running all developer tests could take a long time; caching the results saves you from rerunning them the next time you need them (e.g., in RQ2). WARNING: If caching is enabled, it could take ~420GB of disk space for the three evaluated tools all together. If you have limited disk space, please consider disabling caching by setting this option.
- --timeout: timeout (in seconds) for running a test, default to 3600 (an hour).
Run python -m src.framework --help for more information.
Run the following command to generate differentiating tests:
$ python -m src.framework -t {tool_name}

Apart from the extra options described before, there are two more options:

- -n: number of choices for LLMs to generate for each request, default to 10. e.g., -n 5
- -r: number of attempts to repair a generated test, default to 2. e.g., -r 1
After running the above command, run the following command to filter out flaky tests (this step is separate because it could take a long time and produces no intermediate results):
$ python -m src.framework --filter_flaky -t {tool_name}

The options -m, --timeout, and --remove_image are also accepted.
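The three commands above are meant to be run in order: all developer tests, then test generation, then flaky-test filtering. As a rough sketch of how they could be scripted, the snippet below only assembles the argument lists; it is not part of PatchDiff. The function names and the stage labels `runall`, `generate`, and `filter_flaky` are hypothetical, while the flags themselves are the ones documented above.

```python
from typing import List, Optional

# Tool names accepted by -t, as listed in this README.
TOOLS = ("OpenHands", "LearnByInteract", "CodeStory")

# Hypothetical stage labels mapped to the flag that selects each step.
STAGE_FLAGS = {
    "runall": ["-o"],                    # run all developer tests
    "generate": [],                      # generate differentiating tests
    "filter_flaky": ["--filter_flaky"],  # filter out flaky tests
}


def patchdiff_command(tool_name: str, stage: str,
                      extra: Optional[List[str]] = None) -> List[str]:
    """Build the argument list for one PatchDiff step."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name!r}")
    if stage not in STAGE_FLAGS:
        raise ValueError(f"unknown stage: {stage!r}")
    return ["python", "-m", "src.framework", *STAGE_FLAGS[stage],
            "-t", tool_name, *(extra or [])]


def workflow(tool_name: str, extra: Optional[List[str]] = None) -> List[List[str]]:
    """Return the three commands in the order this README runs them.

    `extra` is appended to every step, so it should only hold options that
    all steps accept (-m, --timeout, --remove_image).
    """
    return [patchdiff_command(tool_name, stage, extra)
            for stage in ("runall", "generate", "filter_flaky")]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, so that a failing step stops the sequence before the next one starts.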