ZJU-CTAG/PatchDiff

PatchDiff

This is the replication package for our paper Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study, containing the code for PatchDiff, the dataset we used for our study, and the results.

PatchDiff is an LLM-based repository-level differential testing tool, designed for differentiating two patches applied in the same repository.

Environments

  1. Language: Python (Python 3.10)
  2. Key Python Packages:

An environment.yml is also provided for easy replication.

File Organization

  • data/: dataset and results

  • src/: code for PatchDiff

    • benchmark.py: code for loading dataset
    • config.py: configuration file
    • framework.py: the PatchDiff framework and the main entry point for execution
    • extractor.py: code for extracting target functions and context functions and constructing contextual code
    • llm.py: code for constructing prompts and invoking LLMs
    • parser.py: code for parsing Python source files in the target repositories
    • dokcer_helper.py: helper functions for operating the Docker instances in which test execution takes place
    • instance_logger.py: implementation of a concurrent logger
    • utils.py: extra helper functions
    • swebench.py: helper functions from the SWE-bench validation framework
    • swebench_constants.py: constants definition from the SWE-bench validation framework
    • swebench_log_parsers.py: log parsers from the SWE-bench validation framework, used to parse the testing outputs

You can find our results here: https://doi.org/10.5281/zenodo.17074796

Evaluation

Configuration

Edit config.py and fill in the following fields:

  • LOGGING: absolute path to a logging directory. Logs and results are written to this directory.
  • TEMP_PATH: absolute path to a directory that holds temporary files. Please ensure that the disk has at least 200GB of available space.
  • RUNALL_CACHE_PATH: absolute path to a directory that holds the cached results of running all developer tests.
  • SWEBENCH_REPO: absolute path to the swebench repository. If you don't have this repository, clone it to your environment first.
  • OPENAI_API_KEY: an API key for OpenAI LLMs
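
As a rough illustration of the fields above, the sketch below shows one way to sanity-check the values before a long run. The field names come from this README; the example paths, the check_config helper, and the idea of reading the API key from an environment variable are assumptions made for illustration and are not part of PatchDiff.

```python
import os
import shutil

# Field names follow the README; every value here is a placeholder assumption.
EXAMPLE_CONFIG = {
    "LOGGING": "/data/patchdiff/logs",
    "TEMP_PATH": "/data/patchdiff/tmp",
    "RUNALL_CACHE_PATH": "/data/patchdiff/runall_cache",
    "SWEBENCH_REPO": "/data/SWE-bench",
    "OPENAI_API_KEY": os.environ.get("OPENAI_API_KEY", ""),
}

PATH_FIELDS = ("LOGGING", "TEMP_PATH", "RUNALL_CACHE_PATH", "SWEBENCH_REPO")
MIN_TEMP_FREE_BYTES = 200 * 10**9  # the README asks for at least 200GB


def check_config(cfg, check_disk=False):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for field in PATH_FIELDS:
        value = cfg.get(field, "")
        if not value:
            problems.append(f"{field} is not set")
        elif not os.path.isabs(value):
            problems.append(f"{field} must be an absolute path: {value!r}")
    if not cfg.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # Optionally compare free space under TEMP_PATH with the 200GB guideline.
    temp = cfg.get("TEMP_PATH", "")
    if check_disk and os.path.isdir(temp):
        free = shutil.disk_usage(temp).free
        if free < MIN_TEMP_FREE_BYTES:
            problems.append(f"TEMP_PATH has only {free // 10**9}GB free")
    return problems
```

Calling check_config(EXAMPLE_CONFIG, check_disk=True) before starting an evaluation would surface a relative path or a nearly full disk early, rather than hours into a run.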

RQ1

Run the following command to run all available tests:

$ python -m src.framework -o -t {tool_name} 
  • tool_name: the tool to be evaluated. Must be one of OpenHands, LearnByInteract, or CodeStory.

Extra options:

  • -m: number of concurrent tasks, defaults to 1. e.g., -m 48
  • -i: the instance to run. Without this option, PatchDiff runs all developer tests under all plausible patches. e.g., -i django__django-16527
  • --remove_image: remove Docker images after running PatchDiff (off by default). Keeping all Docker images for SWE-bench tasks can take a lot of disk space (~3GB per task). To save space, append this option to your command.
  • --no_cache_runall: disable caching of the results of running all developer tests (caching is on by default). Running all developer tests can take a long time; caching the results spares you from rerunning them the next time you need them (e.g., in RQ2). WARNING: with caching enabled, the results can take ~420GB of disk space for the three evaluated tools combined. If disk space is limited, consider disabling caching with this option.
  • --timeout: timeout (in seconds) for running a test, defaults to 3600 (one hour).

Run python -m src.framework --help for more information.
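
To run RQ1 for all three tools in one go, the command line can be assembled programmatically. The following is a minimal sketch that uses only the flags documented above; the build_rq1_command helper and its parameter names are hypothetical, and it assumes the command is launched from the repository root.

```python
import shlex

# Tool names accepted by -t, as listed in this README.
TOOLS = ("OpenHands", "LearnByInteract", "CodeStory")


def build_rq1_command(tool, workers=1, instance=None, remove_image=False,
                      no_cache_runall=False, timeout=3600):
    """Hypothetical helper: build the RQ1 argument list for one tool."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    cmd = ["python", "-m", "src.framework", "-o", "-t", tool,
           "-m", str(workers), "--timeout", str(timeout)]
    if instance:
        cmd += ["-i", instance]              # restrict the run to one instance
    if remove_image:
        cmd.append("--remove_image")         # drop Docker images afterwards
    if no_cache_runall:
        cmd.append("--no_cache_runall")      # skip the large runall cache
    return cmd


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
    for tool in TOOLS:
        print(shlex.join(build_rq1_command(tool, workers=48, remove_image=True)))
```

Building the command as a list rather than a single string avoids shell quoting issues when it is handed to subprocess.run.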

RQ2

Run the following command to generate differentiating tests:

$ python -m src.framework -t {tool_name} 

In addition to the extra options described above, two more options are available:

  • -n: number of choices the LLM generates for each request, defaults to 10. e.g., -n 5
  • -r: number of attempts to repair a generated test, defaults to 2. e.g., -r 1

After the above command finishes, run the following command to filter out flaky tests. This is a separate step because it can take a long time and produces no intermediate results:

$ python -m src.framework --filter_flaky -t {tool_name} 

The options -m, --timeout, and --remove_image are also accepted.
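
The two RQ2 steps must run in order: test generation first, then flaky-test filtering. A small driver can enforce that ordering, as in the sketch below. The helper names and the injectable runner parameter are assumptions made for illustration; only the flags documented above are used, and the script is assumed to run from the repository root.

```python
import subprocess


def build_rq2_commands(tool, choices=10, repairs=2, workers=1):
    """Hypothetical helper: return the generation and filtering commands, in order."""
    base = ["python", "-m", "src.framework"]
    generate = base + ["-t", tool, "-n", str(choices), "-r", str(repairs),
                       "-m", str(workers)]
    # The filtering step accepts -m, --timeout, and --remove_image, but not -n or -r.
    filter_flaky = base + ["--filter_flaky", "-t", tool, "-m", str(workers)]
    return [generate, filter_flaky]


def run_rq2(tool, runner=subprocess.run, **kwargs):
    """Run both steps; check=True stops before filtering if generation fails."""
    for cmd in build_rq2_commands(tool, **kwargs):
        runner(cmd, check=True)
```

For example, run_rq2("CodeStory", choices=5, repairs=1, workers=8) would generate tests and then filter flaky ones, raising subprocess.CalledProcessError if either step exits with a non-zero status.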
