Pull Request (PR) Title & Description Generation using LLM

Environment Setup

$ conda create -n pr-llm python=3.10
$ conda activate pr-llm
$ pip install -r requirements.txt

Pull Request Crawling

To collect up to 1000 of the most recently merged PRs from each of the top 100 starred projects:

$ cd crawling
$ python fetch_github_pr_data.py --repos 100 --prs-per-repo 1000 --output-dir github_pr_dataset_v3

You can use your GitHub Personal Access Token by passing it to the --token argument to get a higher rate limit.
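As a rough illustration of why the token matters, here is a minimal sketch of an authenticated GitHub REST API request. It does not show how fetch_github_pr_data.py is implemented; the function name build_pulls_request and its parameters are hypothetical. GitHub's documented limits are considerably higher for authenticated requests than for anonymous ones.

```python
import urllib.request
from typing import Optional


def build_pulls_request(
    owner: str,
    repo: str,
    token: Optional[str] = None,
    per_page: int = 100,
) -> urllib.request.Request:
    """Build (but do not send) a request listing closed PRs of a repository.

    Hypothetical helper for illustration; not taken from the crawling script.
    """
    url = (
        f"https://api.github.com/repos/{owner}/{repo}/pulls"
        f"?state=closed&sort=updated&direction=desc&per_page={per_page}"
    )
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # A Personal Access Token is sent as a Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


# Anonymous request: subject to the lower, unauthenticated rate limit.
anonymous = build_pulls_request("octocat", "hello-world")

# Authenticated request: the placeholder token stands in for a real one.
authenticated = build_pulls_request("octocat", "hello-world", token="<your-token>")
```

The list endpoint returns closed PRs whether or not they were merged, so a crawler built this way would still need to keep only entries whose merged_at field is set.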

PR Data Preprocessing

The following command performs preprocessing using the techniques implemented in preprocess_pr_data.py, excluding entries with non-ASCII characters and entries with empty PR titles or descriptions:

$ cd pr_summary
$ python preprocess_pr_data.py --data_file <jsonl-dataset-filepath> --output_dir <output-folder-path> --exclude_non_ascii --exclude_missing_critical
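The two exclusion flags can be pictured with the sketch below. It is an assumption about the general idea, not the actual logic in preprocess_pr_data.py. The field names "title" and "body" and the helpers keep_record and filter_jsonl are hypothetical and may not match the dataset schema.

```python
import json
from typing import Iterable, Iterator


def keep_record(
    record: dict,
    exclude_non_ascii: bool = True,
    exclude_missing_critical: bool = True,
) -> bool:
    """Return True if a PR record passes both filters.

    Assumes hypothetical "title" and "body" fields; adjust to the real schema.
    """
    title = (record.get("title") or "").strip()
    body = (record.get("body") or "").strip()
    # Drop entries with an empty title or description.
    if exclude_missing_critical and (not title or not body):
        return False
    # Drop entries containing any non-ASCII character.
    if exclude_non_ascii and not (title.isascii() and body.isascii()):
        return False
    return True


def filter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines and yield only the records that pass keep_record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keep_record(record):
            yield record


sample = [
    '{"title": "Fix login bug", "body": "Handles a missing session."}',
    '{"title": "", "body": "No title here."}',
    '{"title": "Caf\\u00e9 menu", "body": "Non-ASCII title."}',
]
kept = list(filter_jsonl(sample))
```

In this sample only the first record survives: the second has an empty title and the third contains a non-ASCII character.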

Our curated dataset: https://drive.google.com/file/d/1JPYccvLV3C_Jl5OUNawZRcZI6maYS0S1/view?usp=sharing

PR Summary Generation via LLM

Example command:

$ cd pr_summary
$ python gen_summary.py --model_name unsloth/codegemma-7b-it-bnb-4bit --data_file <jsonl-dataset-filepath> --output_dir <output-filepath> --max_seq_length 65536

Fine Tuning

Example command:

$ cd pr_summary
$ python fine_tune.py --model_name unsloth/Qwen2.5-Coder-0.5B-Instruct-bnb-4bit --data_file <jsonl-dataset-filepath> --output_dir <finetuned-model-output-folder> --max_seq_length 65536 --num_train_epochs 5

After fine-tuning, you can use the lora_model folder path inside the model output folder as the model_name.

Metric Evaluation

After generating the PR summaries, evaluate the responses with BLEU and ROUGE:

$ cd pr_summary
$ python -m metrics.bleu --output-dir <pr-summary-folder>
$ python -m metrics.rouge --output-dir <pr-summary-folder>

A JSON file containing the scores will be saved in the corresponding folder.
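For intuition about what these scores measure, the following is a simplified sketch of ROUGE-1 F1 (unigram overlap between a reference and a generated text). It is not the implementation used by metrics.rouge, which may rely on a library and apply different tokenization or stemming. The function name rouge1_f1 is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: whitespace tokens, lowercased, no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of three tokens match in each direction, so precision = recall = 2/3.
score = rouge1_f1("fix login bug", "fix the bug")
```

BLEU follows a related idea but is precision-oriented: it combines clipped n-gram precisions over several n-gram sizes and applies a brevity penalty to short candidates.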
