Only much later do you realize just how valuable the classics that survive the test of time really are.
SAC
model-free deep RL methods are notoriously expensive in terms of their sample complexity.
Off-policy algorithms aim to reuse past experience. This is not directly feasible with conventional policy gradient formulations, but is relatively straightforward for Q-learning based methods (Mnih et al., 2015).
On-policy training tends to improve stability but results in poor sample complexity.
the interplay between the deterministic actor network and the Q-function typically makes DDPG extremely difficult to stabilize and brittle to hyperparameter settings (Duan et al., 2016; Henderson et al., 2017). As a consequence, it is difficult to extend DDPG to complex, high-dimensional tasks, and on-policy policy gradient methods still tend to produce the best results in such settings (Gu et al., 2016).
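SAC's answer is an off-policy actor-critic with a stochastic, entropy-regularized actor. A minimal sketch of the soft Bellman target the critics are regressed onto (this is the standard twin-critic SAC target, not code from the paper; the callables and tensor names are placeholders):

```python
import torch

def soft_q_target(reward, next_obs, done, policy, q1_target, q2_target,
                  gamma=0.99, alpha=0.2):
    """Soft Bellman backup used to train SAC's Q-functions.

    policy(next_obs) -> (next_action, next_logprob), sampled with reparameterization.
    q1_target, q2_target: target critics (clipped double-Q).
    alpha: entropy temperature; the -alpha * logprob term is the "soft" part.
    """
    with torch.no_grad():
        next_action, next_logprob = policy(next_obs)
        next_q = torch.min(q1_target(next_obs, next_action),
                           q2_target(next_obs, next_action))
        return reward + gamma * (1.0 - done) * (next_q - alpha * next_logprob)
```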
HH-RLHF (Anthropic)
context distillation: used to initialize the RLHF policy (the HHH prompt is distilled into the pretrained LM).
Larger PMs are more robust than smaller PMs; overfitting increases during RLHF training.
Large models enjoy an alignment bonus, while small models may pay an alignment tax.
$\sqrt{D_{KL}(\pi||\pi_0)}$ and reward are approximately linearly related. [Section 4.3]
One explanation: expand $D_{KL}(\pi+\delta \pi \,\|\, \pi)$ in $\delta \pi$; within a small region the KL is quadratic in $\delta\pi$, so $\sqrt{D_{KL}}$ is linear in the size of the step, as is the first-order change in reward. [Which is an argument for keeping RLHF inside a small neighborhood of the initial policy.] Stiennon et al., 2020 gives some explanation in the context of rejection sampling [but it still doesn't feel like a full explanation].
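Sketch of that expansion (my own derivation, not from the paper), for a discrete distribution and a perturbation satisfying $\sum_i \delta\pi_i = 0$:

$$
D_{KL}(\pi+\delta\pi\,\|\,\pi)=\sum_i(\pi_i+\delta\pi_i)\log\frac{\pi_i+\delta\pi_i}{\pi_i}
=\sum_i\frac{\delta\pi_i^{2}}{2\pi_i}+O(\|\delta\pi\|^{3}),
\qquad
\Delta R=\sum_i r_i\,\delta\pi_i .
$$

So $\sqrt{D_{KL}}$ and $\Delta R$ are both first order in $\|\delta\pi\|$, which is consistent with the observed linear relation between $\sqrt{D_{KL}}$ and reward for small steps.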
Basic Scaling Results for preference models: we observe log-linear trends in both dataset and model size.
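As a rough illustration of what "log-linear" means here (the data points below are made up, not from the paper): PM accuracy is well fit by a line in the log of dataset size (or parameter count).

```python
import numpy as np

# Hypothetical (made-up) PM accuracies at increasing dataset sizes.
dataset_sizes = np.array([1e3, 1e4, 1e5, 1e6])
pm_accuracy   = np.array([0.62, 0.66, 0.70, 0.74])

# Log-linear trend: accuracy ~ a + b * log10(dataset size).
b, a = np.polyfit(np.log10(dataset_sizes), pm_accuracy, deg=1)
print(f"accuracy ≈ {a:.3f} + {b:.3f} * log10(N)")  # b is the per-decade gain
```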
How well the PM aligns with humans (i.e., how reasonable the PM is as a reward): we observe that PMs trained only on helpfulness data are very well calibrated, but PMs trained on a mixture of helpful and harmless data are slightly under-confident. So we can trust that they faithfully encode the probabilities that humans will prefer specific model samples.
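A minimal sketch of what "calibrated" means operationally (my own helper, not the paper's code; the Bradley-Terry sigmoid on the reward gap is an assumption): bucket comparisons by the PM's predicted preference probability and check the empirical human agreement rate in each bucket.

```python
import numpy as np

def calibration_table(reward_a, reward_b, human_prefers_a, n_bins=5):
    """Compare PM-predicted preference probabilities with human agreement rates.

    reward_a, reward_b: PM scalar rewards for the two responses in each comparison.
    human_prefers_a:    1 if the human picked response A, else 0.
    """
    reward_a = np.asarray(reward_a, dtype=float)
    reward_b = np.asarray(reward_b, dtype=float)
    human_prefers_a = np.asarray(human_prefers_a, dtype=float)

    # Bradley-Terry style predicted probability that A is preferred.
    pred_p_a = 1.0 / (1.0 + np.exp(-(reward_a - reward_b)))
    # Fold onto the PM-preferred side so probabilities live in [0.5, 1].
    p = np.where(pred_p_a >= 0.5, pred_p_a, 1.0 - pred_p_a)
    agree = np.where(pred_p_a >= 0.5, human_prefers_a, 1.0 - human_prefers_a)

    rows = []
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < 1.0:
            mask = (p >= lo) & (p < hi)
        else:
            mask = (p >= lo)  # include p == 1.0 in the last bin
        if mask.any():
            rows.append((lo, hi, float(p[mask].mean()),
                         float(agree[mask].mean()), int(mask.sum())))
    # Well calibrated: mean predicted probability ≈ empirical agreement per bin.
    return rows  # (bin_lo, bin_hi, mean predicted prob, empirical agreement, count)
```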
Learning gets harder the further along you go; a model learning HH (helpful and harmless) is harder to align with human preferences than a helpful-only model. I suspect this is because ...
What a great line: "we expect that with more experience and study we could do better".
So what exactly is the relationship between reward and KL?
Scaling Laws for Reward Model Overoptimization
A gold RM is trained from some data, and the gold RM is then used (to label data) to train a proxy RM. In what follows, $d=\sqrt{D_{KL}(\pi \,\|\, \pi_{\text{ref}})}$.
1. Best-of-$n$ (BoN): we use the unbiased estimator from Nakano et al. [2021, Appendix I]. $\mathrm{KL}_{\mathrm{bo}n}\approx \log n-\frac{n-1}{n}$
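A minimal best-of-$n$ sketch (function names and the proxy scorer are placeholders, not the paper's code): draw $n$ samples from the base policy, keep the one the proxy RM scores highest, and note that the KL from the base policy is known in closed form, so nothing needs to be estimated.

```python
import math

def best_of_n(prompt, sample_fn, proxy_rm, n=16):
    """Return the proxy-RM-preferred completion out of n policy samples.

    sample_fn(prompt) -> one completion sampled from the base policy
    proxy_rm(prompt, completion) -> scalar proxy reward
    """
    candidates = [sample_fn(prompt) for _ in range(n)]
    best = max(candidates, key=lambda c: proxy_rm(prompt, c))
    # Analytic KL of the best-of-n policy from the base policy (Nakano et al. 2021):
    kl_bon = math.log(n) - (n - 1) / n
    return best, kl_bon
```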
Holding the data size and the policy network fixed and varying the reward model size: $\alpha_{\text{bo}n}$ grows roughly linearly as the RM parameter count increases, $\beta_{\text{bo}n}$ and $\beta_{\text{RL}}$ also change roughly linearly with RM parameter count, while $\alpha_{\text{RL}}$ stays essentially constant. (These are the coefficients of the paper's fits for the gold reward as a function of $d$: $R_{\text{bo}n}(d)=d(\alpha_{\text{bo}n}-\beta_{\text{bo}n}d)$ and $R_{\text{RL}}(d)=d(\alpha_{\text{RL}}-\beta_{\text{RL}}\log d)$.)
Increasing the data size also clearly improves results, but no explicit formula is given (the shape is similar: the gold reward first rises and then falls as KL grows).
With a larger policy network, the RL approach does not shift the KL at which the gold reward peaks by much.
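A small sketch of how the coefficients place the overoptimization peak (only the functional forms are from the paper; the coefficient values below are invented for illustration):

```python
import numpy as np

def gold_reward_bon(d, alpha, beta):
    # Paper's fit for best-of-n: R(d) = d * (alpha - beta * d); peaks at d = alpha / (2 * beta).
    return d * (alpha - beta * d)

def gold_reward_rl(d, alpha, beta):
    # Paper's fit for RL, shown for reference: R(d) = d * (alpha - beta * log(d)).
    return d * (alpha - beta * np.log(d))

d = np.linspace(0.1, 30, 300)
for alpha, beta in [(0.8, 0.05), (1.0, 0.03)]:   # made-up coefficient pairs
    curve = gold_reward_bon(d, alpha, beta)
    print(f"alpha={alpha}, beta={beta}: peak gold reward at d≈{d[curve.argmax()]:.1f}")
```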
Effect of KL penalty
Adding the KL penalty does not change the true gold reward; its effect is equivalent to early stopping (even when the proxy reward is already very large).
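For reference, the usual way the penalty enters RLHF training (a generic sketch of the common PPO-RLHF convention, not this paper's code; the per-token log-prob arrays are assumed inputs):

```python
import numpy as np

def penalized_rewards(proxy_reward, logprobs_policy, logprobs_ref, kl_coef=0.1):
    """Per-token rewards with a KL penalty against the reference policy.

    proxy_reward:    scalar proxy-RM score for the full completion.
    logprobs_policy: per-token log-probs of the sampled tokens under the current policy.
    logprobs_ref:    per-token log-probs of the same tokens under the reference policy.
    Returns -kl_coef * (logpi - logref) at every token, with the proxy reward added
    on the final token.
    """
    per_token_kl = np.asarray(logprobs_policy) - np.asarray(logprobs_ref)
    rewards = -kl_coef * per_token_kl
    rewards[-1] += proxy_reward
    return rewards
```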
iterative RLHF: iterate $k$ times; I didn't look at this closely, and there don't seem to be many follow-up papers on it.
scaling law for DPO
Doesn't seem very useful; or rather, I can't see anything truly substantive in it, though it certainly does overfit easily.