Shinnosuke Ono | CS Master's Student, University of Tokyo
RLHF
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
We address reward hacking in RLHF by identifying flipped advantage signs as a key cause and proposing Sign-Certified Policy Optimization (SignCert-PO), a lightweight method that down-weights non-robust completions during policy optimization.
Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama