Classic RLHF Papers

Only much later do you realize just how valuable the classics that survive the sifting of time really are.

SAC

Model-free deep RL methods are notoriously expensive in terms of their sample complexity.

Off-policy algorithms aim to reuse past experience. This is not directly feasible with conventional policy gradient formulations, but is relatively straightforward for Q-learning based methods (Mnih et al., 2015).
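As a toy illustration of the reuse point (mine, not anything from the SAC paper): a one-step Q-learning update only needs $(s, a, r, s')$ tuples, so transitions collected by any behavior policy can be replayed, whereas a vanilla policy gradient needs fresh samples from the current policy. The MDP and hyperparameters below are made up.

```python
import random
import numpy as np

# Toy, hypothetical MDP: 5 states, 2 actions, random deterministic transitions and rewards.
n_states, n_actions, gamma, lr = 5, 2, 0.99, 0.1
rng = np.random.default_rng(0)
P = rng.integers(0, n_states, size=(n_states, n_actions))  # next-state table
R = rng.normal(size=(n_states, n_actions))                  # reward table

Q = np.zeros((n_states, n_actions))
replay = []  # past experience, collected by an arbitrary behavior policy

# Collect transitions with a purely random (off-policy) behavior policy.
s = 0
for _ in range(1000):
    a = rng.integers(n_actions)
    s_next = P[s, a]
    replay.append((s, a, R[s, a], s_next))
    s = s_next

# Q-learning can reuse this stale data: the TD target max_a' Q(s', a')
# does not depend on which policy generated the transition.
for _ in range(5000):
    s, a, r, s_next = random.choice(replay)
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (td_target - Q[s, a])

print(Q)
```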

On-policy training tends to improve stability but results in poor sample complexity.

the interplay between the deterministic actor network and the Q-function typically makes DDPG extremely difficult to stabilize and brittle to hyperparameter settings (Duan et al., 2016; Henderson et al., 2017). As a consequence, it is difficult to extend DDPG to complex, high-dimensional tasks, and on-policy policy gradient methods still tend to produce the best results in such settings (Gu et al., 2016).

HH-RLHF (Anthropic)

Context distillation: fine-tune the base model to reproduce, without any prompt, the behavior it exhibits when conditioned on the HHH prompt, so the prompt is effectively baked into the weights and can serve as a starting point for later training.
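A minimal sketch of how I understand the context distillation loss (not the paper's code); the tensors below are random stand-ins for the logits of the prompted and unprompted model:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of continuation tokens, vocab size V.
# teacher_logits: the base LM conditioned on (HHH prompt + x) -- frozen targets.
# student_logits: the same LM being fine-tuned, conditioned on x alone.
B, T, V = 4, 16, 32000
teacher_logits = torch.randn(B, T, V)                        # placeholder for prompted-model outputs
student_logits = torch.randn(B, T, V, requires_grad=True)    # placeholder for unprompted-model outputs

# Context distillation loss: per-token KL(teacher || student),
# so the unprompted model learns to imitate its prompted self.
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss.backward()
print(loss.item())
```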

Larger PMs are more robust than smaller PMs; overfitting increases over the course of RLHF training.

Large models enjoy an alignment bonus, while small models may pay an alignment tax.

$\sqrt{D_{\text{KL}}(\pi \,\|\, \pi_0)}$ and reward are approximately linearly related. [Section 4.3]

One explanation: expand $D_{\text{KL}}(\pi+\delta \pi \,\|\, \pi)$ in $\delta \pi$. Within a small region the first-order term vanishes, so the KL is quadratic in $\delta \pi$ and $\sqrt{D_{\text{KL}}}$ is linear in the size of the perturbation, just like the change in reward. [Hence the desire to keep RLHF confined to a small neighborhood.] Stiennon et al., 2020 give some explanation in terms of rejection sampling, but I still don't find it fully convincing.
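Spelling the expansion out (my own derivation of the standard second-order result, not taken from the paper):

$$
D_{\text{KL}}(\pi+\delta\pi \,\|\, \pi)=\sum_x \big(\pi(x)+\delta\pi(x)\big)\log\frac{\pi(x)+\delta\pi(x)}{\pi(x)}
=\tfrac{1}{2}\sum_x \frac{\delta\pi(x)^2}{\pi(x)}+O(\delta\pi^3),
$$

using $\sum_x \delta\pi(x)=0$. So $\sqrt{D_{\text{KL}}}$ is first order in $\delta\pi$, and so is the reward change $\Delta R=\sum_x r(x)\,\delta\pi(x)$, which is at least locally consistent with the observed linear relation.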

Basic scaling results for the preference model: we observe log-linear trends in both dataset and model size.
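A toy example of what such a log-linear fit looks like; the dataset sizes and accuracies below are invented, not the paper's numbers:

```python
import numpy as np

# Hypothetical (made-up) preference-model accuracies at several dataset sizes.
dataset_sizes = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
accuracies    = np.array([0.61, 0.64, 0.66, 0.69, 0.71])

# "Log-linear" trend: accuracy ~ a * log10(N) + b.
a, b = np.polyfit(np.log10(dataset_sizes), accuracies, deg=1)
print(f"accuracy ≈ {a:.3f} * log10(N) + {b:.3f}")
print("extrapolated at N=1e6:", a * 6 + b)
```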

How well the PM aligns with humans (i.e., whether the PM is reasonable as a reward): We observe that PMs trained only on helpfulness data are very well calibrated, but PMs trained on a mixture of helpful and harmless data are slightly under-confident. We can trust that they faithfully encode the probabilities that humans will prefer specific model samples.
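A small sketch of the calibration check being described: bin the PM's predicted preference probabilities and compare each bin's mean prediction with the empirical human preference rate. The data below is synthetic, generated from a PM that is perfectly calibrated by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: PM-predicted probability that humans prefer sample A over B,
# and simulated human labels drawn from exactly those probabilities.
pm_prob = rng.uniform(0.5, 1.0, size=10_000)
human_prefers_a = rng.uniform(size=pm_prob.size) < pm_prob

# Calibration: within each predicted-probability bin, the empirical preference
# rate should match the bin's mean predicted probability.
bins = np.linspace(0.5, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (pm_prob >= lo) & (pm_prob < hi)
    print(f"predicted {lo:.2f}-{hi:.2f}: "
          f"mean prediction {pm_prob[mask].mean():.3f}, "
          f"empirical rate {human_prefers_a[mask].mean():.3f}")
```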

Things get harder to learn the further along you go. A model trained on the combined HH data has a harder time matching human preferences than a helpful-only model. I suspect this is because...

What a great line: "we expect that with more experience and study we could do better."


So what exactly is the relationship between reward and KL?

Scaling Laws for Reward Model Overoptimization

A gold RM is trained on some data, and the gold RM's labels are then used to train a proxy RM. In what follows, $d=\sqrt{D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})}$.

1. The best-of-$n$ (BoN) approach: we use the unbiased estimator from Nakano et al. [2021, Appendix I]. $\mathrm{KL}_{\text{bon}} \approx \log n - \frac{n-1}{n}$
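A quick sketch of BoN sampling against a proxy RM, together with the closed-form KL above; the sampler and reward model here are trivial stand-ins of my own:

```python
import numpy as np

def bon_kl(n: int) -> float:
    """KL(best-of-n policy || base policy) for rank-based selection: log n - (n-1)/n."""
    return np.log(n) - (n - 1) / n

def best_of_n(prompt, sample_fn, proxy_rm, n: int):
    """Draw n samples from the base policy and keep the one the proxy RM scores highest."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: proxy_rm(prompt, y))

# Stand-in sampler and reward model (both hypothetical).
rng = np.random.default_rng(0)
sample_fn = lambda prompt: rng.normal()   # pretend completions are just scalars
proxy_rm  = lambda prompt, y: y           # pretend the RM scores the scalar itself

for n in (1, 4, 16, 64, 256):
    best = best_of_n("hi", sample_fn, proxy_rm, n)
    print(f"n={n:4d}  KL_bon={bon_kl(n):.3f}  best sample={best:.3f}")
```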

Holding the amount of data and the policy network fixed, and varying the reward model size: the paper fits $R_{\text{bon}}(d) = d(\alpha_{\text{bon}} - \beta_{\text{bon}}\, d)$ and $R_{\text{RL}}(d) = d(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d)$, where $\alpha_{\text{bon}}$ grows roughly log-linearly with the RM parameter count, $\beta_{\text{bon}}$ and $\beta_{\text{RL}}$ shrink as the RM gets larger (less overoptimization), and $\alpha_{\text{RL}}$ stays roughly constant.
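For concreteness, a small sketch of those two fitted forms; the coefficient values are invented purely to show the rise-and-fall shape of the gold reward and where it peaks:

```python
import numpy as np

def gold_reward_bon(d, alpha, beta):
    """BoN form: R(d) = d * (alpha - beta * d); peaks at d = alpha / (2 * beta)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d, alpha, beta):
    """RL form: R(d) = d * (alpha - beta * log d)."""
    return d * (alpha - beta * np.log(d))

# Invented coefficients, just to visualize the shape.
alpha_bon, beta_bon = 1.0, 0.05
alpha_rl,  beta_rl  = 1.0, 0.30

d = np.linspace(0.1, 30, 300)
print("BoN peak at d ≈", d[np.argmax(gold_reward_bon(d, alpha_bon, beta_bon))])
print("RL  peak at d ≈", d[np.argmax(gold_reward_rl(d, alpha_rl, beta_rl))])
```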

Increasing the amount of RM training data also clearly helps, though no explicit formula is given (the gold reward shows the same pattern of first rising and then falling with KL).

With a larger policy network, RL does not move the KL at which the gold reward peaks by much.

Effect of KL penalty

It does not change the true gold reward; the effect is equivalent to early stopping (even when the proxy reward has already grown very large).
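For reference, the KL penalty enters the RL objective in the standard RLHF way (the usual setup, not notation specific to these notes):

$$
\max_\pi \;\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi(\cdot\mid x)}\big[r_{\text{proxy}}(x,y)\big]\;-\;\beta_{\text{KL}}\,D_{\text{KL}}\big(\pi \,\|\, \pi_{\text{ref}}\big)
$$

A larger $\beta_{\text{KL}}$ mainly caps how far $d$ can grow, so along the KL axis its effect looks like early stopping rather than a change in the gold-reward-vs-KL curve.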

Iterative RLHF: iterate $k$ times. I didn't look at this closely, and there don't seem to be many follow-up papers.


scaling law for DPO

It doesn't feel very useful, or rather I can't see anything genuinely substantial in it; what is clear is that DPO overfits very easily.

scaling law