Shinnosuke Ono | CS Master's Student, University of Tokyo
Reward Hacking
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
We address reward hacking in RLHF by identifying flipped advantage signs as a key cause and proposing Sign-Certified Policy Optimization (SignCert-PO), a lightweight method that down-weights non-robust completions during policy optimization.
Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama