
Rethinking the Role of PPO in RLHF – The Berkeley Artificial Intelligence Research Blog
TL;DR: In RLHF, there is a tension between the reward learning phase, which uses human preferences in the form of comparisons, and the RL fine-tuning phase, which optimizes a single, non-comparative reward. What if we performed RL in a comparative way?

Figure 1: This diagram illustrates the…