poloclub/star-dss

Shape it Up! Restoring LLM Safety during Finetuning [NeurIPS'25]

  • We find that although current static safety shaping methods deliver notable safety gains, they treat each training example as an atomic unit, which creates exploitable blind spots.
  • We propose Safety Trajectory Assessment of Response (STAR), a safety signal that enables fine-grained assessment of each training sample, addressing key limitations of static safety shaping.
  • We introduce STAR-DSS (⭐DSS), a new training loss that mitigates diverse LLM finetuning risks and achieves significant safety improvements backed by theoretical analysis.
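To make the idea of a dynamically shaped loss concrete, here is a minimal, dependency-free sketch. It is not the repository's implementation: the function names, the per-token interpolation rule, and the direction of the KL term are all assumptions made for illustration. The sketch assumes each response token carries a safety score in [0, 1], and uses that score to blend an imitation term (cross-entropy on the training token) with a term that keeps the model close to a safety-aligned reference model.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under `probs`."""
    return -math.log(probs[target])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dynamic_shaping_loss(policy_probs, ref_probs, targets, star_scores):
    """Hypothetical token-level shaped loss (illustrative assumption only).

    policy_probs / ref_probs: one next-token distribution per position,
        from the model being finetuned and from a frozen reference model.
    targets: the training token index at each position.
    star_scores: assumed per-token safety scores in [0, 1].

    A score near 1 weights the cross-entropy term (learn from the token);
    a score near 0 weights the KL term (stay close to the reference).
    """
    total = 0.0
    for p, q, y, s in zip(policy_probs, ref_probs, targets, star_scores):
        total += s * cross_entropy(p, y) + (1.0 - s) * kl_divergence(p, q)
    return total / len(targets)

# Toy two-token response over a three-word vocabulary.
policy = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
reference = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
targets = [0, 1]

all_safe = dynamic_shaping_loss(policy, reference, targets, [1.0, 1.0])
all_unsafe = dynamic_shaping_loss(policy, reference, targets, [0.0, 0.0])
```

With every score set to 1 the sketch reduces to ordinary cross-entropy finetuning, and with every score set to 0 it only penalizes drift from the reference model. Intermediate scores let a single training example contribute both signals at different positions, which is the property a per-example (static) filter cannot express.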

Demo

The ⭐ score enables fine-grained safety assessment within training samples, addressing key limitations of static safety shaping. We plot average ⭐ scores as a function of response progression and show that they reliably capture evolving safety risks across different datasets.
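The notion of a score that evolves with response progression can be sketched as follows. This is an illustrative assumption, not the repository's code: `star_trajectory`, `toy_safety_scorer`, and the `chunk_size` parameter are hypothetical names, and the keyword-based scorer is only a stand-in for whatever guardrail model produces the real safety signal.

```python
def star_trajectory(prompt, response_tokens, safety_scorer, chunk_size=2):
    """Score growing prefixes of a response (hypothetical helper).

    Returns one score per prefix, where each prefix extends the previous
    one by up to `chunk_size` tokens.
    """
    scores = []
    for end in range(chunk_size, len(response_tokens) + chunk_size, chunk_size):
        prefix = response_tokens[:end]
        scores.append(safety_scorer(prompt, prefix))
    return scores

def toy_safety_scorer(prompt, prefix_tokens):
    """Stand-in scorer: 0.0 once a placeholder unsafe token appears, else 1.0.

    A real setup would query a guardrail model here; this rule exists only
    so the example runs without external dependencies.
    """
    return 0.0 if "<unsafe>" in prefix_tokens else 1.0

tokens = ["I", "can", "help", "<unsafe>", "<unsafe>"]
trajectory = star_trajectory("example prompt", tokens, toy_safety_scorer)
print(trajectory)  # [1.0, 0.0, 0.0]
```

The response starts out safe and turns unsafe partway through. A single label for the whole example would hide that transition, while the prefix-level trajectory exposes where it happens.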

Qualitative comparisons of model responses to broader threats in finetuning-as-a-service. We show how different finetuned LLMs behave under (a) response adaptation, (b) prompt poisoning, and (c) harmful prefilling attacks, demonstrating that ⭐DSS consistently produces safer generations in all three cases.

Research Paper

Shape it Up! Restoring LLM Safety during Finetuning

ShengYun Peng¹, Pin-Yu Chen², Jianfeng Chi³, Seongmin Lee¹, Duen Horng Chau¹
¹Georgia Tech, ²IBM Research, ³Meta Superintelligence Labs

In NeurIPS 2025.

Citation

@article{peng2025shape,
  title={Shape it Up! Restoring LLM Safety during Finetuning},
  author={Peng, ShengYun and Chen, Pin-Yu and Chi, Jianfeng and Lee, Seongmin and Chau, Duen Horng},
  journal={arXiv preprint arXiv:2505.17196},
  year={2025}
}

Contact

If you have any questions, feel free to open an issue or contact Anthony Peng.
