- We find that although current static safety shaping methods deliver notable safety gains, their atomic view of each training example creates exploitable blind spots.
- We propose Safety Trajectory Assessment of Response (STAR), a safety signal that enables fine-grained assessment of each training sample, addressing key limitations of static safety shaping.
- We introduce STAR-DSS (⭐DSS), a new training loss that mitigates diverse LLM finetuning risks and achieves significant safety improvements backed by theoretical analysis.
The ⭐ score enables fine-grained safety assessment within training samples, addressing key limitations of static safety shaping. We plot average ⭐ scores as a function of response progression and show that the score reliably captures evolving safety risks across different datasets.
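As a rough illustration of what a safety curve over response progression involves, the sketch below scores growing prefixes of each response and averages the scores at each progression point. It is a minimal sketch under stated assumptions: `toy_score`, `prefix_scores`, `average_curve`, and the evenly spaced cut points are hypothetical names and choices made for illustration, not the scoring model or procedure used for the ⭐ score.

```python
from typing import Callable, List, Sequence


def prefix_scores(
    tokens: Sequence[str],
    score_fn: Callable[[str], float],
    num_bins: int = 5,
) -> List[float]:
    """Score growing prefixes of one response at evenly spaced cut points."""
    scores = []
    for b in range(1, num_bins + 1):
        # Each cut point keeps at least one token.
        end = max(1, round(len(tokens) * b / num_bins))
        scores.append(score_fn(" ".join(tokens[:end])))
    return scores


def average_curve(
    responses: Sequence[str],
    score_fn: Callable[[str], float],
    num_bins: int = 5,
) -> List[float]:
    """Average the per-prefix scores across responses, one value per cut point."""
    curves = [
        prefix_scores(r.split(), score_fn, num_bins)
        for r in responses
        if r.split()  # skip empty responses
    ]
    if not curves:
        raise ValueError("need at least one non-empty response")
    return [sum(c[i] for c in curves) / len(curves) for i in range(num_bins)]


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer (assumption): 1.0 is safe, 0.0 is unsafe."""
    return 0.0 if "UNSAFE" in text else 1.0


if __name__ == "__main__":
    demo = ["a b c d", "a b UNSAFE d"]
    print(average_curve(demo, toy_score, num_bins=4))
```

In this toy setup the second response only turns unsafe partway through, so the averaged curve drops at the later cut points. A single whole-example label would miss that within-response change.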
Qualitative comparisons of model responses to broader threats in finetuning-as-a-service. We present how different finetuned LLMs behave under (a) response adaptation, (b) prompt poisoning, and (c) harmful prefilling attacks, demonstrating that ⭐DSS consistently produces safer generations across all cases.
Shape it Up! Restoring LLM Safety during Finetuning
ShengYun Peng1,
Pin-Yu Chen2,
Jianfeng Chi3,
Seongmin Lee1,
Duen Horng Chau1
1Georgia Tech,
2IBM Research,
3Meta Superintelligence Labs
In NeurIPS 2025.
@article{peng2025shape,
title={Shape it Up! Restoring LLM Safety during Finetuning},
author={Peng, ShengYun and Chen, Pin-Yu and Chi, Jianfeng and Lee, Seongmin and Chau, Duen Horng},
journal={arXiv preprint arXiv:2505.17196},
year={2025}
}

If you have any questions, feel free to open an issue or contact Anthony Peng.


