PatchDiff

This is the replication package for our paper Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study, containing the code for PatchDiff, the dataset we used for our study, and the results.

PatchDiff is an LLM-based, repository-level differential testing tool, designed to differentiate two patches applied to the same repository.

Environments

Language: Python (Python 3.10)
Key Python Packages:

An environment.yml is also provided for easy replication.

File Organization

data/: dataset and results
- dataset/swebench_verified: the SWE-bench Verified dataset collected from Hugging Face princeton-nlp/SWE-bench_Verified
- tool_results/: the generated patches of the evaluated tools (i.e., OpenHands, LearnByInteract, CodeStory) from the SWE-bench leaderboard repository
src/: code for PatchDiff
- benchmark.py: code for loading dataset
- config.py: configuration file
- framework.py: the framework for PatchDiff and the main entry point for execution
- extractor.py: code for extracting target functions and context functions and constructing contextual code
- llm.py: code for constructing prompts and invoking LLMs
- parser.py: code for parsing Python source files in the target repositories
- dokcer_helper.py: helper functions for operating the docker instances, where the test execution takes place
- instance_logger.py: implementation of a concurrent logger
- utils.py: extra helper functions
- swebench.py: helper functions from the SWE-bench validation framework
- swebench_constants.py: constants definition from the SWE-bench validation framework
- swebench_log_parsers.py: log parsers from the SWE-bench validation framework, used to parse the testing outputs

You can find our results here: https://doi.org/10.5281/zenodo.17074796

Evaluation

Configuration

Edit config.py and fill in the following fields:

LOGGING: absolute path to a logging directory. Logs and results are written to this directory.
TEMP_PATH: absolute path to a directory to hold temporary files. Please ensure that your disk has at least 200GB of available space.
RUNALL_CACHE_PATH: absolute path to a directory to hold the cache produced by executing all developer tests.
SWEBENCH_REPO: absolute path to the swebench repository. If you don't have this repository, please first clone it to your environment.
OPENAI_API_KEY: an API key for OpenAI LLMs
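As a minimal sketch, a filled-in config.py might look like the following. Only the five field names come from this README; the directory paths are placeholders, reading the key from an environment variable is an assumption, and check_config is a hypothetical helper that is not part of PatchDiff.

```python
import os
import shutil

# Placeholder values: replace with absolute paths on your machine.
LOGGING = "/data/patchdiff/logs"
TEMP_PATH = "/data/patchdiff/tmp"
RUNALL_CACHE_PATH = "/data/patchdiff/runall_cache"
SWEBENCH_REPO = "/data/SWE-bench"
# Assumption: the key is read from the environment instead of being hard-coded.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")

# This README asks for at least 200GB of free space for temporary files.
MIN_FREE_BYTES = 200 * 10**9


def check_config(paths, temp_path, min_free_bytes=MIN_FREE_BYTES):
    """Hypothetical helper: return a list of problems found in the settings."""
    problems = []
    for name, value in paths.items():
        if not os.path.isabs(value):
            problems.append(f"{name} must be an absolute path: {value!r}")
    # The free-space check only runs if the directory already exists.
    if os.path.isdir(temp_path):
        free = shutil.disk_usage(temp_path).free
        if free < min_free_bytes:
            problems.append(f"TEMP_PATH has only {free // 10**9} GB free")
    return problems
```

Calling check_config with a dictionary of the four path fields and TEMP_PATH before a long run can catch a relative path or a nearly full disk early.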

RQ1

Run the following command to run all available tests:

$ python -m src.framework -o -t {tool_name}

tool_name: the tool to be evaluated. Should be one of OpenHands, LearnByInteract, and CodeStory.

Extra options:

-m: number of concurrent tasks, defaults to 1. e.g., -m 48
-i: the instance to run. Without this option, PatchDiff will run all developer tests under all plausible patches. e.g., -i django__django-16527
--remove_image: whether to remove docker images after running PatchDiff, defaults to False. Holding all docker images for SWE-bench tasks could take a lot of disk space (~3GB per task). If you want to save space, please append this option to your command.
--no_cache_runall: whether to disable caching of runall results after running all developer tests, defaults to False. Running all developer tests could take a long time; caching the results saves you from rerunning them the next time you need them (e.g., in RQ2). WARNING: If caching is enabled, it could take ~420GB of disk space for the three evaluated tools altogether. If you have limited disk space, please consider disabling caching by setting this option.
--timeout: timeout (in seconds) for running a test, defaults to 3600 (an hour).

Run python -m src.framework --help for more information.
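The options above can be combined in a single invocation. The following sketch assembles an RQ1 command line from them; build_rq1_command is a hypothetical helper written for illustration, not part of PatchDiff, and it uses only the flags listed in this README.

```python
import shlex

# Tool names accepted by -t, as listed in this README.
TOOLS = ("OpenHands", "LearnByInteract", "CodeStory")


def build_rq1_command(tool_name, instance=None, workers=1, timeout=3600,
                      remove_image=False, no_cache_runall=False):
    """Hypothetical helper: assemble the RQ1 command as an argument list."""
    if tool_name not in TOOLS:
        raise ValueError(f"tool_name must be one of {TOOLS}, got {tool_name!r}")
    cmd = ["python", "-m", "src.framework", "-o", "-t", tool_name,
           "-m", str(workers), "--timeout", str(timeout)]
    if instance is not None:
        cmd += ["-i", instance]          # restrict the run to one instance
    if remove_image:
        cmd.append("--remove_image")     # free disk space after the run
    if no_cache_runall:
        cmd.append("--no_cache_runall")  # skip the large runall cache
    return cmd


if __name__ == "__main__":
    cmd = build_rq1_command("OpenHands", instance="django__django-16527",
                            workers=48, remove_image=True)
    print(shlex.join(cmd))
    # prints: python -m src.framework -o -t OpenHands -m 48 --timeout 3600 -i django__django-16527 --remove_image
```

The resulting list can be passed to subprocess.run from the repository root, or the printed line can be pasted into a shell.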

RQ2

Run the following command to generate differentiating tests:

$ python -m src.framework -t {tool_name}

Apart from the extra options described before, there are two more options:

-n: number of choices for LLMs to generate for each request, defaults to 10. e.g., -n 5
-r: number of attempts to repair a generated test, defaults to 2. e.g., -r 1

After running the above command, run the following command to filter out flaky tests. This step is separate because it could take a long time and produces no intermediate results while it runs:

$ python -m src.framework --filter_flaky -t {tool_name}

The options -m, --timeout, and --remove_image are also accepted.
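The two RQ2 steps have to run in order: test generation first, flaky-test filtering second. A minimal sketch of that sequence is shown here; rq2_commands and run_rq2 are hypothetical helpers, not part of PatchDiff, and they use only the flags described in this README.

```python
import subprocess


def rq2_commands(tool_name, choices=10, repairs=2, workers=1):
    """Hypothetical helper: return the two RQ2 commands in execution order."""
    base = ["python", "-m", "src.framework"]
    generate = base + ["-t", tool_name, "-n", str(choices),
                       "-r", str(repairs), "-m", str(workers)]
    filter_flaky = base + ["--filter_flaky", "-t", tool_name,
                           "-m", str(workers)]
    return [generate, filter_flaky]


def run_rq2(tool_name, **kwargs):
    """Run generation, then filtering; stop if the first step fails."""
    for cmd in rq2_commands(tool_name, **kwargs):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so filtering never starts after a failed generation run.
        subprocess.run(cmd, check=True)
```

For example, run_rq2("CodeStory", choices=5, repairs=1) would issue both commands from the current working directory, which is assumed to be the repository root.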
