Classic RLHF Papers

Only much later do you realize just how valuable the classics that survive the sifting of time really are.

SAC

Model-free deep RL methods are notoriously expensive in terms of their sample complexity.

Off-policy algorithms aim to reuse past experience. This is not directly feasible with conventional policy gradient formulations, but is relatively straightforward for Q-learning based methods (Mnih et al., 2015).
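As a toy illustration of the reuse point (mine, not anything from the SAC paper): a one-step Q-learning update only needs $(s, a, r, s')$ tuples, so transitions collected by any behavior policy can be replayed, whereas a vanilla policy gradient needs fresh samples from the current policy. The MDP and hyperparameters below are made up.

```python
import random
import numpy as np

# Toy, hypothetical MDP: 5 states, 2 actions, random deterministic transitions and rewards.
n_states, n_actions, gamma, lr = 5, 2, 0.99, 0.1
rng = np.random.default_rng(0)
P = rng.integers(0, n_states, size=(n_states, n_actions))  # next-state table
R = rng.normal(size=(n_states, n_actions))                  # reward table

Q = np.zeros((n_states, n_actions))
replay = []  # past experience, collected by an arbitrary behavior policy

# Collect transitions with a purely random (off-policy) behavior policy.
s = 0
for _ in range(1000):
    a = rng.integers(n_actions)
    s_next = P[s, a]
    replay.append((s, a, R[s, a], s_next))
    s = s_next

# Q-learning can reuse this stale data: the TD target max_a' Q(s', a')
# does not depend on which policy generated the transition.
for _ in range(5000):
    s, a, r, s_next = random.choice(replay)
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (td_target - Q[s, a])

print(Q)
```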

On-policy training tends to improve stability but results in poor sample complexity.

the interplay between the deterministic actor network and the Q-function typically makes DDPG extremely difficult to stabilize and brittle to hyperparameter settings (Duan et al., 2016; Henderson et al., 2017). As a consequence, it is difficult to extend DDPG to complex, high-dimensional tasks, and on-policy policy gradient methods still tend to produce the best results in such settings (Gu et al., 2016).

HH-RLHF (Anthropic)

Context distillation: fine-tune the base model to reproduce, without any prompt, the behavior it exhibits when conditioned on the HHH prompt, so the prompt is effectively baked into the weights and can serve as a starting point for later training.
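A minimal sketch of how I understand the context distillation loss (not the paper's code); the tensors below are random stand-ins for the logits of the prompted and unprompted model:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of continuation tokens, vocab size V.
# teacher_logits: the base LM conditioned on (HHH prompt + x) -- frozen targets.
# student_logits: the same LM being fine-tuned, conditioned on x alone.
B, T, V = 4, 16, 32000
teacher_logits = torch.randn(B, T, V)                        # placeholder for prompted-model outputs
student_logits = torch.randn(B, T, V, requires_grad=True)    # placeholder for unprompted-model outputs

# Context distillation loss: per-token KL(teacher || student),
# so the unprompted model learns to imitate its prompted self.
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss.backward()
print(loss.item())
```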

Larger PMs are more robust than smaller PMs; overfitting increases over the course of RLHF training.

Large models enjoy an alignment bonus, while small models may pay an alignment tax.

$\sqrt{D_{\text{KL}}(\pi \,\|\, \pi_0)}$ and reward are approximately linearly related. [Section 4.3]

One explanation: expand $D_{\text{KL}}(\pi+\delta \pi \,\|\, \pi)$ in $\delta \pi$. Within a small region the first-order term vanishes, so the KL is quadratic in $\delta \pi$ and $\sqrt{D_{\text{KL}}}$ is linear in the size of the perturbation, just like the change in reward. [Hence the desire to keep RLHF confined to a small neighborhood.] Stiennon et al., 2020 give some explanation in terms of rejection sampling, but I still don't find it fully convincing.
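Spelling the expansion out (my own derivation of the standard second-order result, not taken from the paper):

$$
D_{\text{KL}}(\pi+\delta\pi \,\|\, \pi)=\sum_x \big(\pi(x)+\delta\pi(x)\big)\log\frac{\pi(x)+\delta\pi(x)}{\pi(x)}
=\tfrac{1}{2}\sum_x \frac{\delta\pi(x)^2}{\pi(x)}+O(\delta\pi^3),
$$

using $\sum_x \delta\pi(x)=0$. So $\sqrt{D_{\text{KL}}}$ is first order in $\delta\pi$, and so is the reward change $\Delta R=\sum_x r(x)\,\delta\pi(x)$, which is at least locally consistent with the observed linear relation.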

Basic scaling results for the preference model: we observe log-linear trends in both dataset and model size.
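A toy example of what such a log-linear fit looks like; the dataset sizes and accuracies below are invented, not the paper's numbers:

```python
import numpy as np

# Hypothetical (made-up) preference-model accuracies at several dataset sizes.
dataset_sizes = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
accuracies    = np.array([0.61, 0.64, 0.66, 0.69, 0.71])

# "Log-linear" trend: accuracy ~ a * log10(N) + b.
a, b = np.polyfit(np.log10(dataset_sizes), accuracies, deg=1)
print(f"accuracy ≈ {a:.3f} * log10(N) + {b:.3f}")
print("extrapolated at N=1e6:", a * 6 + b)
```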

How well the PM aligns with humans (i.e., whether the PM is reasonable as a reward): We observe that PMs trained only on helpfulness data are very well calibrated, but PMs trained on a mixture of helpful and harmless data are slightly under-confident. We can trust that they faithfully encode the probabilities that humans will prefer specific model samples.
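A small sketch of the calibration check being described: bin the PM's predicted preference probabilities and compare each bin's mean prediction with the empirical human preference rate. The data below is synthetic, generated from a PM that is perfectly calibrated by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: PM-predicted probability that humans prefer sample A over B,
# and simulated human labels drawn from exactly those probabilities.
pm_prob = rng.uniform(0.5, 1.0, size=10_000)
human_prefers_a = rng.uniform(size=pm_prob.size) < pm_prob

# Calibration: within each predicted-probability bin, the empirical preference
# rate should match the bin's mean predicted probability.
bins = np.linspace(0.5, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (pm_prob >= lo) & (pm_prob < hi)
    print(f"predicted {lo:.2f}-{hi:.2f}: "
          f"mean prediction {pm_prob[mask].mean():.3f}, "
          f"empirical rate {human_prefers_a[mask].mean():.3f}")
```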

Things get harder to learn the further along you go. A model trained on the combined HH data has a harder time matching human preferences than a helpful-only model. I suspect this is because...

What a great line: "we expect that with more experience and study we could do better."


So what exactly is the relationship between reward and KL?

Scaling Laws for Reward Model Overoptimization

A gold RM is trained on some data, and the gold RM's labels are then used to train a proxy RM. In what follows, $d=\sqrt{D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})}$.

1. The best-of-$n$ (BoN) approach: we use the unbiased estimator from Nakano et al. [2021, Appendix I]. $\mathrm{KL}_{\text{bon}} \approx \log n - \frac{n-1}{n}$
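A quick sketch of BoN sampling against a proxy RM, together with the closed-form KL above; the sampler and reward model here are trivial stand-ins of my own:

```python
import numpy as np

def bon_kl(n: int) -> float:
    """KL(best-of-n policy || base policy) for rank-based selection: log n - (n-1)/n."""
    return np.log(n) - (n - 1) / n

def best_of_n(prompt, sample_fn, proxy_rm, n: int):
    """Draw n samples from the base policy and keep the one the proxy RM scores highest."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: proxy_rm(prompt, y))

# Stand-in sampler and reward model (both hypothetical).
rng = np.random.default_rng(0)
sample_fn = lambda prompt: rng.normal()   # pretend completions are just scalars
proxy_rm  = lambda prompt, y: y           # pretend the RM scores the scalar itself

for n in (1, 4, 16, 64, 256):
    best = best_of_n("hi", sample_fn, proxy_rm, n)
    print(f"n={n:4d}  KL_bon={bon_kl(n):.3f}  best sample={best:.3f}")
```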

Holding the amount of data and the policy network fixed, and varying the reward model size: the paper fits $R_{\text{bon}}(d) = d(\alpha_{\text{bon}} - \beta_{\text{bon}}\, d)$ and $R_{\text{RL}}(d) = d(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d)$, where $\alpha_{\text{bon}}$ grows roughly log-linearly with the RM parameter count, $\beta_{\text{bon}}$ and $\beta_{\text{RL}}$ shrink as the RM gets larger (less overoptimization), and $\alpha_{\text{RL}}$ stays roughly constant.
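For concreteness, a small sketch of those two fitted forms; the coefficient values are invented purely to show the rise-and-fall shape of the gold reward and where it peaks:

```python
import numpy as np

def gold_reward_bon(d, alpha, beta):
    """BoN form: R(d) = d * (alpha - beta * d); peaks at d = alpha / (2 * beta)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d, alpha, beta):
    """RL form: R(d) = d * (alpha - beta * log d)."""
    return d * (alpha - beta * np.log(d))

# Invented coefficients, just to visualize the shape.
alpha_bon, beta_bon = 1.0, 0.05
alpha_rl,  beta_rl  = 1.0, 0.30

d = np.linspace(0.1, 30, 300)
print("BoN peak at d ≈", d[np.argmax(gold_reward_bon(d, alpha_bon, beta_bon))])
print("RL  peak at d ≈", d[np.argmax(gold_reward_rl(d, alpha_rl, beta_rl))])
```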

Increasing the amount of RM training data also clearly helps, though no explicit formula is given (the gold reward shows the same pattern of first rising and then falling with KL).

With a larger policy network, RL does not move the KL at which the gold reward peaks by much.

Effect of KL penalty

It does not change the true gold reward; the effect is equivalent to early stopping (even when the proxy reward has already grown very large).
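For reference, the KL penalty enters the RL objective in the standard RLHF way (the usual setup, not notation specific to these notes):

$$
\max_\pi \;\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi(\cdot\mid x)}\big[r_{\text{proxy}}(x,y)\big]\;-\;\beta_{\text{KL}}\,D_{\text{KL}}\big(\pi \,\|\, \pi_{\text{ref}}\big)
$$

A larger $\beta_{\text{KL}}$ mainly caps how far $d$ can grow, so along the KL axis its effect looks like early stopping rather than a change in the gold-reward-vs-KL curve.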

Iterative RLHF: iterate $k$ times. I didn't look at this closely, and there don't seem to be many follow-up papers.


scaling law for DPO

It doesn't feel very useful, or rather I can't see anything genuinely substantial in it; what is clear is that DPO overfits very easily.

scaling law