本章深入探讨基于人类反馈的强化学习(RLHF)及其变体在大语言模型后训练中的应用。我们将从奖励模型的构建开始,详细分析PPO、DPO等主流算法的实现细节,探讨Constitutional AI等自我改进方法,并讨论在线与离线强化学习的权衡。通过本章学习,您将掌握设计和实施RLHF系统的完整方法论,理解不同算法的适用场景,以及避免常见的实验陷阱。
监督微调(SFT)虽然能让模型学会遵循指令的基本格式,但存在几个根本性限制:
行为模仿的局限性:SFT本质上是让模型模仿训练数据中的行为模式。即使有高质量的示范数据,模型也只能学到"如何说",而非真正理解"为什么这样说更好"。
偏好的隐式性:人类偏好往往是隐式的、多维的,很难通过示例完全表达。比如"有帮助"这个概念,包含准确性、完整性、清晰度等多个维度,且在不同上下文中权重不同。
分布偏移问题:SFT模型在生成时会累积误差,逐渐偏离训练分布。而RLHF通过在模型自己的生成分布上训练,能更好地处理这种偏移。
Human Preferences
↓
┌──────────────┐
│ Reward Model │ ← 偏好数据训练
└──────────────┘
↓
奖励信号
↓
┌──────────────┐
│ RL Training │ ← PPO/DPO等算法
└──────────────┘
↓
Aligned Model
RLHF系统包含三个核心组件:从人类偏好数据中学习打分的奖励模型、待对齐的策略模型(通常从SFT模型初始化),以及利用奖励信号更新策略的强化学习算法(如PPO或DPO)。
挑战1:奖励过拟合(Reward Hacking)
模型可能找到获得高奖励但实际质量差的捷径。例如重复奖励模型偏好的关键词,或生成冗长但内容空洞的回复(见后文的对抗验证示例)。
挑战2:训练不稳定性
RLHF训练过程容易出现奖励曲线突然坍塌、KL散度失控导致策略偏离参考模型、生成质量剧烈震荡等现象。
挑战3:评估困难。奖励模型的打分并不等同于真实质量,人工评估成本高且主观,自动指标又难以同时覆盖有帮助性与安全性等多维目标。
奖励模型的理论基础是Bradley-Terry模型,它假设人类选择回复A优于回复B的概率为:
\[P(A \succ B) = \frac{\exp(r(A))}{\exp(r(A)) + \exp(r(B))} = \sigma(r(A) - r(B))\]其中$r(\cdot)$是奖励函数,$\sigma$是sigmoid函数。
训练目标是最大化对数似然:
\[\mathcal{L}_{RM} = -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\sigma(r_\theta(x,y_w) - r_\theta(x,y_l))\right]\]其中$y_w$是被偏好的回复,$y_l$是较差的回复。
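该损失的一个最小化PyTorch实现草图如下(假设 reward_model(prompts, responses) 返回形状为 [batch] 的标量奖励,仅作示意):

import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry 成对损失的最小实现草图。"""
    r_chosen = reward_model(prompts, chosen)      # r(x, y_w)
    r_rejected = reward_model(prompts, rejected)  # r(x, y_l)
    # L_RM = -E[log sigma(r_w - r_l)]
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    return loss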
典型的奖励模型架构:
输入: [prompt] + [response]
↓
Transformer Encoder (预训练LM)
↓
最后一个token的隐状态
↓
Linear Head → 标量奖励值
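该架构的一个最小PyTorch草图如下(假设 base_model 是返回 last_hidden_state 的Hugging Face风格预训练主干,hidden_size 为其隐层维度,仅作示意):

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """在预训练LM主干上接一个线性头,输出标量奖励(示意)。"""
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # 对 [prompt + response] 拼接序列做一次前向
        hidden = self.base_model(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                   # [batch, seq_len, hidden]
        # 用 attention_mask 找到每个样本最后一个有效 token
        last_idx = attention_mask.sum(dim=1).long() - 1       # [batch]
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]              # [batch, hidden]
        return self.reward_head(last_hidden).squeeze(-1)       # 标量奖励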
关键设计选择:主干通常从SFT模型初始化,序列表示取最后一个token的隐状态(见上图);此外还需要对奖励做标准化,否则不同批次、不同提示下的奖励尺度差异会让后续RL训练不稳定:
# 每批次标准化
rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# 或使用运行均值/方差(指数滑动平均,跨批次更平滑)
self.running_mean = 0.99 * self.running_mean + 0.01 * batch_mean
self.running_var = 0.99 * self.running_var + 0.01 * batch_var
rewards = (rewards - self.running_mean) / (self.running_var ** 0.5 + 1e-8)
技巧1:数据增强
import random

def augment_preference_data(prompt, chosen, rejected):
    """对偏好数据做简单增强;similarity / add_noise 为假设存在的辅助函数。"""
    # 1. 顺序随机化:交换 chosen/rejected 并翻转标签,防止模型学到位置偏置
    if random.random() < 0.5:
        return prompt, rejected, chosen, -1  # 标签 -1 表示第二个回复更优
    # 2. 边际案例处理:对高度相似的回复对,给 rejected 加噪以拉开差距
    if similarity(chosen, rejected) > 0.9:
        rejected = add_noise(rejected)
    return prompt, chosen, rejected, 1  # 标签 1 表示第一个回复更优
技巧2:集成与不确定性估计
训练多个奖励模型并使用集成:
import numpy as np

class EnsembleRewardModel:
    def __init__(self, models, uncertainty_threshold=0.5):
        self.models = models
        self.uncertainty_threshold = uncertainty_threshold

    def predict(self, prompt, response):
        # 假设每个子模型返回 Python 标量奖励
        rewards = [m(prompt, response) for m in self.models]
        mean_reward = np.mean(rewards)
        uncertainty = np.std(rewards)
        # 各模型分歧较大时,对奖励打折,降低其对训练的影响
        if uncertainty > self.uncertainty_threshold:
            mean_reward *= 0.8
        return mean_reward, uncertainty
技巧3:对抗验证
定期用对抗样本测试奖励模型:
import logging

logger = logging.getLogger(__name__)

def generate_adversarial_samples(reward_model, base_model, threshold=0.8):
    """构造"高奖励但低质量"的样本,检验奖励模型是否会被欺骗。
    generate_verbose_but_empty 为假设存在的辅助函数。"""
    prompt = "解释量子力学"
    # 策略1:重复关键词
    bad_response_1 = "量子力学量子力学..." * 100
    # 策略2:空洞的长回复
    bad_response_2 = generate_verbose_but_empty(prompt)
    # 检查奖励模型是否被欺骗
    if reward_model(prompt, bad_response_1) > threshold:
        logger.warning("奖励模型对重复内容给出高分")
    if reward_model(prompt, bad_response_2) > threshold:
        logger.warning("奖励模型对空洞的长回复给出高分")
温度缩放(Temperature Scaling)
import torch
import torch.nn as nn

class CalibratedRewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, prompt, response):
        logits = self.base_model(prompt, response)
        return logits / self.temperature

    def calibrate(self, val_data):
        # 在验证集上优化温度参数(LBFGS 要求在 closure 中完成前向与反向)
        optimizer = torch.optim.LBFGS([self.temperature])

        def closure():
            optimizer.zero_grad()
            loss = 0.0
            for prompt, chosen, rejected in val_data:
                prob = torch.sigmoid(self(prompt, chosen) - self(prompt, rejected))
                loss = loss - torch.log(prob)
            loss.backward()
            return loss

        optimizer.step(closure)
期望校准误差(ECE)监控
import numpy as np

def compute_ece(predictions, labels, n_bins=10):
    """期望校准误差:按预测置信度分桶,衡量每桶内平均置信度与实际准确率的差距。"""
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        mask = (predictions >= bin_boundaries[i]) & \
               (predictions < bin_boundaries[i + 1])
        if i == n_bins - 1:
            # 最后一个桶把置信度恰为 1.0 的样本也包含进来
            mask = (predictions >= bin_boundaries[i]) & (predictions <= 1.0)
        if mask.sum() > 0:
            bin_acc = labels[mask].mean()
            bin_conf = predictions[mask].mean()
            bin_weight = mask.sum() / len(predictions)
            ece += bin_weight * abs(bin_acc - bin_conf)
    return ece
PPO(Proximal Policy Optimization)通过限制每次更新的幅度来保证训练稳定性:
\[\mathcal{L}_{PPO} = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]\]其中:
$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ 是重要性采样比率;$\hat{A}_t$ 是优势函数的估计;$\epsilon$ 是裁剪阈值(通常取 0.1~0.2)。
挑战1:序列生成的信用分配
在LLM中,一个”动作”是生成一个token,”轨迹”是完整的回复。奖励通常只在序列末尾给出,需要合理的信用分配:
import torch

def compute_advantages(rewards, values, gamma=1.0, lam=0.95):
    """
    计算广义优势估计(GAE)
    rewards: [batch_size, seq_len] 通常只有最后一个位置非零
    values:  [batch_size, seq_len] 价值函数预测
    """
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0
    seq_len = rewards.shape[1]
    for t in reversed(range(seq_len)):
        if t == seq_len - 1:
            next_values = 0  # 终止状态之后没有价值
        else:
            next_values = values[:, t + 1]
        # TD 残差:delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[:, t] + gamma * next_values - values[:, t]
        advantages[:, t] = lastgaelam = delta + gamma * lam * lastgaelam
    return advantages
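在LLM场景中,序列级奖励通常只放在最后一个生成token的位置,并在每个token上叠加KL惩罚。下面是一个构造per-token奖励张量的假设性草图(build_token_rewards 为示意函数,假设各张量只覆盖回复部分且左对齐、padding在右侧):

import torch

def build_token_rewards(seq_reward, response_mask, kl_per_token, kl_coef=0.1):
    """构造 per-token 奖励(示意)。
    seq_reward:    [batch]           奖励模型对整条回复的打分
    response_mask: [batch, seq_len]  有效回复 token 的 0/1 掩码(左对齐)
    kl_per_token:  [batch, seq_len]  每个 token 的 KL 估计
    """
    # 每个 token 先扣除 KL 惩罚
    rewards = -kl_coef * kl_per_token * response_mask
    # 序列级奖励只加在最后一个有效 token 上
    last_idx = response_mask.sum(dim=1).long() - 1
    batch_idx = torch.arange(rewards.size(0), device=rewards.device)
    rewards[batch_idx, last_idx] += seq_reward
    return rewards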
挑战2:KL散度约束
防止策略偏离太远:
def compute_kl_penalty(logprobs_new, logprobs_ref, kl_coef=0.1, target_kl=0.05):
    """
    计算KL散度惩罚
    logprobs_new: 当前策略的对数概率
    logprobs_ref: 参考策略(通常是SFT模型)的对数概率
    """
    # 在当前策略的样本上,E[log pi_new - log pi_ref] 是 KL(pi_new || pi_ref) 的估计
    kl = (logprobs_new - logprobs_ref).sum(dim=-1)
    # 自适应KL系数:偏离目标过多则加大惩罚,反之减小
    # (此处只在本次调用内调整,实际训练中应把更新后的系数写回如 self.kl_coef)
    if kl.mean() > target_kl * 1.5:
        kl_coef *= 1.5  # 增加惩罚
    elif kl.mean() < target_kl * 0.5:
        kl_coef *= 0.5  # 减少惩罚
    return kl * kl_coef
import torch
import torch.nn.functional as F
from torch.optim import AdamW

class PPOTrainer:
    def __init__(self, policy_model, ref_model, reward_model,
                 lr=1e-6, eps=0.2, kl_coef=0.1):
self.policy = policy_model
self.ref = ref_model
self.reward = reward_model
self.optimizer = AdamW(policy_model.parameters(), lr=lr)
self.eps = eps
self.kl_coef = kl_coef
def train_step(self, prompts, max_length=512):
# 1. 生成回复
with torch.no_grad():
responses, old_logprobs = self.generate_responses(
prompts, max_length
)
# 2. 计算奖励
rewards = self.reward(prompts, responses)
# 3. 计算参考模型的对数概率
ref_logprobs = self.ref.compute_logprobs(prompts, responses)
# 4. 多轮PPO更新
for _ in range(4): # PPO epochs
# 计算当前策略的对数概率
new_logprobs, values = self.policy.forward_with_value(
prompts, responses
)
                # 计算优势(假设 rewards 已展开为 [batch, seq_len] 的 per-token 形式;
                # 优势视为常数,不向价值网络回传策略梯度)
                advantages = compute_advantages(rewards, values).detach()
# PPO损失
ratio = torch.exp(new_logprobs - old_logprobs)
clipped_ratio = torch.clamp(ratio, 1 - self.eps, 1 + self.eps)
policy_loss = -torch.min(
ratio * advantages,
clipped_ratio * advantages
).mean()
# KL惩罚
kl_loss = compute_kl_penalty(
new_logprobs, ref_logprobs, self.kl_coef
)
                # 价值函数损失:回归目标为回报 = 优势 + 旧价值估计
                returns = advantages + values.detach()
                value_loss = F.mse_loss(values, returns)
                # 总损失
                loss = policy_loss + kl_loss.mean() + 0.5 * value_loss
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
self.optimizer.step()
技巧1:监控关键指标
def log_ppo_metrics(info):
# 必须监控的指标
metrics = {
'kl_divergence': info['kl'].mean(),
        'clip_fraction': ((info['ratio'] - 1.0).abs() > 0.2).float().mean(),  # 0.2 对应裁剪阈值 eps
'approx_kl': (info['ratio'] - 1).pow(2).mean() / 2,
'reward_mean': info['rewards'].mean(),
'reward_std': info['rewards'].std(),
'value_loss': info['value_loss'],
'policy_loss': info['policy_loss'],
'entropy': info['entropy'], # 监控探索程度
}
# 异常检测
if metrics['kl_divergence'] > 0.1:
logger.warning("KL散度过大,可能导致训练不稳定")
if metrics['clip_fraction'] > 0.3:
logger.warning("裁剪比例过高,考虑减小学习率")
return metrics
技巧2:渐进式训练
def progressive_ppo_training(trainer, stages, stage_steps=1000):
    """
    分阶段逐步增加训练难度
    (stage_steps 为每阶段的训练步数;get_*_prompts 为假设存在的数据获取函数)
    """
for stage in stages:
# 阶段1:简单任务,大KL容忍度
if stage == 1:
trainer.kl_coef = 0.05
prompts = get_simple_prompts()
# 阶段2:中等难度,标准KL
elif stage == 2:
trainer.kl_coef = 0.1
prompts = get_medium_prompts()
# 阶段3:困难任务,严格KL
else:
trainer.kl_coef = 0.2
prompts = get_hard_prompts()
for step in range(stage_steps):
trainer.train_step(prompts)
DPO(Direct Preference Optimization)通过重新参数化,将RLHF问题转换为监督学习问题,避免了显式训练奖励模型:
关键洞察:最优策略可以用封闭形式表达:
\[\pi^*(y|x) = \frac{1}{Z(x)}\pi_{ref}(y|x)\exp\left(\frac{r(x,y)}{\beta}\right)\]反推奖励函数:
\[r(x,y) = \beta\log\frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta\log Z(x)\]代入Bradley-Terry模型,得到DPO损失:
\[\mathcal{L}_{DPO} = -\mathbb{E}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]\]
import torch
import torch.nn.functional as F
from torch.optim import AdamW

class DPOTrainer:
def __init__(self, model, ref_model, beta=0.1):
self.model = model
self.ref_model = ref_model
self.beta = beta
self.optimizer = AdamW(model.parameters(), lr=5e-7)
def compute_loss(self, prompts, chosen, rejected):
# 计算策略模型的对数概率
chosen_logps = self.model.compute_logprobs(prompts, chosen)
rejected_logps = self.model.compute_logprobs(prompts, rejected)
# 计算参考模型的对数概率
with torch.no_grad():
ref_chosen_logps = self.ref_model.compute_logprobs(
prompts, chosen
)
ref_rejected_logps = self.ref_model.compute_logprobs(
prompts, rejected
)
# 计算对数概率比
chosen_rewards = self.beta * (chosen_logps - ref_chosen_logps)
rejected_rewards = self.beta * (rejected_logps - ref_rejected_logps)
# DPO损失
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
# 添加隐式奖励的监控
with torch.no_grad():
reward_accuracy = (chosen_rewards > rejected_rewards).float().mean()
reward_margin = (chosen_rewards - rejected_rewards).mean()
return loss, {
'reward_accuracy': reward_accuracy,
'reward_margin': reward_margin,
'chosen_rewards': chosen_rewards.mean(),
'rejected_rewards': rejected_rewards.mean()
}
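一个假设性的训练循环示意(preference_dataloader 为假设的数据加载器,每个 batch 含 prompts/chosen/rejected 三个字段):

trainer = DPOTrainer(model, ref_model, beta=0.1)
for batch in preference_dataloader:
    loss, metrics = trainer.compute_loss(
        batch['prompts'], batch['chosen'], batch['rejected']
    )
    trainer.optimizer.zero_grad()
    loss.backward()
    trainer.optimizer.step()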
IPO(Identity Preference Optimization)解决了DPO的一些问题:当偏好标注接近确定时DPO容易过拟合、对β参数较为敏感、对标注噪声不够鲁棒(见下文对比表)。
IPO的损失函数:
\[\mathcal{L}_{IPO} = \mathbb{E}\left[\left(\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} - \frac{1}{2\beta}\right)^2\right]\]
class IPOTrainer(DPOTrainer):
def compute_loss(self, prompts, chosen, rejected):
# 与DPO相同的对数概率计算
chosen_logps = self.model.compute_logprobs(prompts, chosen)
rejected_logps = self.model.compute_logprobs(prompts, rejected)
with torch.no_grad():
ref_chosen_logps = self.ref_model.compute_logprobs(
prompts, chosen
)
ref_rejected_logps = self.ref_model.compute_logprobs(
prompts, rejected
)
# IPO使用平方损失而非logistic损失
log_ratio_chosen = chosen_logps - ref_chosen_logps
log_ratio_rejected = rejected_logps - ref_rejected_logps
# IPO损失:鼓励差值为1/(2β)
losses = (log_ratio_chosen - log_ratio_rejected - 1/(2*self.beta))**2
return losses.mean(), {
'log_ratio_diff': (log_ratio_chosen - log_ratio_rejected).mean()
}
实验设置对比表:
| 维度 | DPO | IPO |
|---|---|---|
| 损失函数 | Logistic | MSE |
| β参数敏感度 | 高 | 中 |
| 训练稳定性 | 中 | 高 |
| 收敛速度 | 快 | 慢 |
| 过拟合风险 | 高 | 低 |
| 噪声鲁棒性 | 低 | 高 |
选择指南:
def choose_optimization_method(dataset_properties):
"""
根据数据集特性选择DPO或IPO
"""
if dataset_properties['annotation_agreement'] < 0.7:
# 标注一致性低,使用IPO
return 'IPO', '标注噪声大,IPO更鲁棒'
elif dataset_properties['size'] < 10000:
# 数据量小,使用IPO避免过拟合
return 'IPO', '数据量小,IPO泛化更好'
elif dataset_properties['preference_strength'] > 0.9:
# 偏好非常明确,使用DPO
return 'DPO', '偏好明确,DPO收敛快'
else:
# 默认使用DPO
return 'DPO', '标准场景,DPO效率高'
class HybridDPO_IPO:
"""
结合DPO和IPO优点的混合方法
"""
def __init__(self, model, ref_model, beta=0.1, alpha=0.5):
self.model = model
self.ref_model = ref_model
self.beta = beta
self.alpha = alpha # DPO和IPO的混合权重
def compute_loss(self, prompts, chosen, rejected):
# 计算对数概率
chosen_logps = self.model.compute_logprobs(prompts, chosen)
rejected_logps = self.model.compute_logprobs(prompts, rejected)
with torch.no_grad():
ref_chosen_logps = self.ref_model.compute_logprobs(prompts, chosen)
ref_rejected_logps = self.ref_model.compute_logprobs(prompts, rejected)
# 对数比
log_ratio_diff = (chosen_logps - ref_chosen_logps) - \
(rejected_logps - ref_rejected_logps)
# DPO损失
dpo_loss = -F.logsigmoid(self.beta * log_ratio_diff).mean()
# IPO损失
        ipo_loss = ((log_ratio_diff - 1 / (2 * self.beta)) ** 2).mean()
# 混合损失
loss = self.alpha * dpo_loss + (1 - self.alpha) * ipo_loss
return loss
Constitutional AI(CAI)使用一组原则来指导模型的自我改进,减少对人类标注的依赖:
原始回复 → AI自我批判 → 修订回复 → AI偏好判断 → 训练
核心组件包括原则库、自我批判、基于批判的修订,以及自我标注偏好数据的生成,参考实现如下:
class ConstitutionalAI:
def __init__(self, model, principles):
self.model = model
self.principles = principles
def critique_response(self, prompt, response):
"""
使用宪法原则批判回复
"""
critiques = []
for principle in self.principles:
critique_prompt = f"""
原则:{principle}
用户问题:{prompt}
助手回复:{response}
这个回复是否违反了上述原则?如果是,请说明如何改进。
"""
critique = self.model.generate(critique_prompt)
critiques.append(critique)
return critiques
def revise_response(self, prompt, response, critiques):
"""
基于批判修订回复
"""
revision_prompt = f"""
原始问题:{prompt}
原始回复:{response}
批判意见:
{' '.join(critiques)}
请根据批判意见修订回复,使其更好地遵循原则。
"""
revised = self.model.generate(revision_prompt)
return revised
def generate_preference_data(self, prompts):
"""
生成自我标注的偏好数据
"""
preference_data = []
for prompt in prompts:
# 生成初始回复
response = self.model.generate(prompt)
# 自我批判
critiques = self.critique_response(prompt, response)
# 如果有批判,生成修订版本
if any(critiques):
revised = self.revise_response(prompt, response, critiques)
# 创建偏好对(修订版本 > 原始版本)
preference_data.append({
'prompt': prompt,
'chosen': revised,
'rejected': response
})
return preference_data
RLAIF(RL from AI Feedback)完整流程:
class RLAIFTrainer:
def __init__(self, model, critic_model, principles):
self.model = model
self.critic = critic_model # 可以是同一个模型
self.principles = principles
        # 参考模型取训练前的冻结快照(这里假设 model.copy() 返回这样的副本,
        # 实际中可用 copy.deepcopy(model).eval() 并关闭其梯度)
        self.dpo_trainer = DPOTrainer(model, model.copy())
def train_iteration(self, prompts, n_iterations=5):
for iteration in range(n_iterations):
print(f"RLAIF迭代 {iteration + 1}")
# 1. 生成回复
responses = []
for prompt in prompts:
response = self.model.generate(prompt)
responses.append(response)
# 2. AI评分和排序
scored_responses = self.score_responses(prompts, responses)
# 3. 构建偏好数据
preference_data = self.create_preferences(scored_responses)
            # 4. DPO训练(复用前文 DPOTrainer 的 compute_loss 与优化器)
            for batch in preference_data:
                loss, _ = self.dpo_trainer.compute_loss(
                    batch['prompt'], batch['chosen'], batch['rejected']
                )
                self.dpo_trainer.optimizer.zero_grad()
                loss.backward()
                self.dpo_trainer.optimizer.step()
# 5. 评估改进
improvement = self.evaluate_improvement(prompts)
print(f"改进幅度: {improvement:.2%}")
if improvement < 0.01: # 收敛
break
def score_responses(self, prompts, responses):
"""
使用AI评分器给回复打分
"""
scores = []
for prompt, response in zip(prompts, responses):
score_prompt = f"""
根据以下原则评分(1-10分):
{self.principles}
问题:{prompt}
回复:{response}
评分(只返回数字):
"""
            # 评分解析可能失败(模型输出非数字),这里做一个稳健的兜底
            try:
                score = float(self.critic.generate(score_prompt).strip())
            except ValueError:
                score = 5.0  # 解析失败时退回到中间分
scores.append(score)
return list(zip(prompts, responses, scores))
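上面 train_iteration 中引用的 create_preferences 文中未给出,下面是一个假设性的最小实现:按 prompt 分组,把得分最高与最低的回复配成偏好对(假设每个 prompt 采样了多个候选回复,min_margin 为分差阈值):

from collections import defaultdict

def create_preferences(scored_responses, min_margin=1.0):
    """把 (prompt, response, score) 三元组转成偏好对(示意)。"""
    by_prompt = defaultdict(list)
    for prompt, response, score in scored_responses:
        by_prompt[prompt].append((response, score))

    preference_data = []
    for prompt, candidates in by_prompt.items():
        if len(candidates) < 2:
            continue  # 至少需要两个候选才能构成偏好对
        candidates.sort(key=lambda x: x[1])
        worst, best = candidates[0], candidates[-1]
        if best[1] - worst[1] >= min_margin:
            preference_data.append({
                'prompt': prompt,
                'chosen': best[0],
                'rejected': worst[0],
            })
    return preference_data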
原则层次结构:
CONSTITUTIONAL_PRINCIPLES = {
# 第一层:安全性原则(最高优先级)
'safety': [
"不提供可能造成伤害的信息",
"避免生成歧视性内容",
"保护用户隐私"
],
# 第二层:有用性原则
'helpfulness': [
"提供准确的信息",
"回答要切中要点",
"承认不确定性"
],
# 第三层:风格原则
'style': [
"保持专业语气",
"避免过度自信",
"适当使用例子"
]
}
def apply_principles_hierarchically(response, principles):
    """
    分层应用原则,高优先级原则可覆盖低优先级。
    violates_principle / revise_for_safety / soft_revise 为假设存在的辅助函数。
    """
for level in ['safety', 'helpfulness', 'style']:
for principle in principles[level]:
if violates_principle(response, principle):
if level == 'safety':
# 安全问题必须修正
return revise_for_safety(response, principle)
else:
# 其他问题尝试修正但不强制
response = soft_revise(response, principle)
return response
在线RL:模型在训练过程中不断与环境交互,生成新数据并立即从中学习。
离线RL:仅使用预先收集的固定数据集进行训练,不与环境实时交互。
在线RL流程:
策略 → 生成 → 奖励 → 更新 → 策略(循环)
离线RL流程:
固定数据集 → 训练 → 策略(一次性)
优势:训练数据始终来自当前策略的分布,能主动探索并发现新的高奖励行为,改进上限更高。
挑战:每一步都要在线生成,计算成本高;训练更不稳定,也更容易出现奖励过拟合(reward hacking)。
在线PPO实现:
class OnlinePPO:
def __init__(self, policy, reward_model, buffer_size=1000):
self.policy = policy
self.reward_model = reward_model
self.buffer = []
self.buffer_size = buffer_size
def collect_trajectories(self, prompts, n_samples=4):
"""
实时收集轨迹数据
"""
trajectories = []
for prompt in prompts:
for _ in range(n_samples):
# 在线生成
response = self.policy.generate(prompt)
# 实时计算奖励
reward = self.reward_model(prompt, response)
# 计算优势(需要价值函数)
value = self.policy.value_head(prompt, response)
trajectories.append({
'prompt': prompt,
'response': response,
'reward': reward,
'value': value,
'logprobs': self.policy.get_logprobs(prompt, response)
})
return trajectories
def train_step(self, prompts):
# 收集新数据
new_data = self.collect_trajectories(prompts)
# 更新缓冲区(FIFO)
self.buffer.extend(new_data)
if len(self.buffer) > self.buffer_size:
self.buffer = self.buffer[-self.buffer_size:]
# 在缓冲区数据上训练
for epoch in range(4):
for batch in self.get_batches(self.buffer):
loss = self.ppo_loss(batch)
self.optimize(loss)
优势:训练稳定、可复现,计算成本低,且不会在训练过程中生成未经审查的内容,安全性更可控。
挑战:受限于固定数据集的质量与覆盖范围,存在分布偏移问题,无法通过探索发现数据之外的更优行为。
离线DPO实现:
class OfflineDPO:
def __init__(self, model, ref_model, dataset):
self.model = model
self.ref_model = ref_model
self.dataset = dataset # 预收集的偏好数据
def train(self, n_epochs=3):
"""
纯离线训练,不生成新数据
"""
for epoch in range(n_epochs):
for batch in self.dataset:
# 使用固定数据集
loss = self.compute_dpo_loss(
batch['prompts'],
batch['chosen'],
batch['rejected']
)
self.optimize(loss)
# 离线评估
val_loss = self.evaluate_offline()
print(f"Epoch {epoch}: Val Loss = {val_loss:.4f}")
def compute_importance_weights(self, batch):
"""
计算重要性权重以缓解分布偏移
"""
with torch.no_grad():
# 当前策略的概率
current_probs = self.model.get_probs(batch['prompts'], batch['responses'])
# 数据收集时的概率(如果有)
old_probs = batch.get('old_probs', torch.ones_like(current_probs))
# 重要性权重
weights = current_probs / (old_probs + 1e-8)
# 裁剪防止权重爆炸
weights = torch.clamp(weights, 0.1, 10.0)
return weights
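文中没有展示这些重要性权重如何接入损失,下面是一个假设性的用法草图:把逐样本的DPO损失乘以对应权重后再取平均(compute_dpo_loss_per_example 为假设存在的、返回逐样本损失而不做平均的变体):

    def weighted_dpo_step(self, batch):
        """用重要性权重对逐样本 DPO 损失加权(示意)。"""
        weights = self.compute_importance_weights(batch)
        per_example_loss = self.compute_dpo_loss_per_example(
            batch['prompts'], batch['chosen'], batch['rejected']
        )
        # 偏离当前策略分布越远的样本,其梯度贡献被相应缩放
        return (weights * per_example_loss).mean()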
结合两者优势的实用方案:
class SemiOnlineRL:
def __init__(self, model, offline_data, online_ratio=0.2):
self.model = model
self.offline_data = offline_data
self.online_ratio = online_ratio
self.online_buffer = []
def train_step(self, prompts):
batch_size = len(prompts)
# 1. 离线数据采样
offline_size = int(batch_size * (1 - self.online_ratio))
offline_batch = self.sample_offline(offline_size)
# 2. 在线数据生成(少量)
online_size = batch_size - offline_size
online_batch = self.generate_online(
prompts[:online_size]
)
# 3. 混合训练
combined_batch = self.combine_batches(
offline_batch,
online_batch
)
# 4. 加权更新
loss = self.weighted_loss(combined_batch)
self.optimize(loss)
def weighted_loss(self, batch):
"""
对在线和离线数据使用不同权重
"""
losses = []
for item in batch:
if item['source'] == 'online':
# 在线数据权重更高(更可信)
weight = 1.5
else:
# 离线数据权重较低
weight = 1.0
loss = self.compute_loss(item) * weight
losses.append(loss)
return torch.stack(losses).mean()
def choose_rl_strategy(constraints):
"""
根据实际约束选择RL策略
"""
if constraints['safety_critical']:
# 安全要求高,使用纯离线
return 'offline', "安全第一,避免未知风险"
elif constraints['compute_budget'] < 100: # GPU小时
# 计算预算有限,使用离线
return 'offline', "计算资源受限"
elif constraints['data_quality'] < 0.7:
# 数据质量差,需要在线探索
return 'online', "数据质量不足,需要主动改进"
elif constraints['deployment_type'] == 'production':
# 生产环境,使用半在线
return 'semi_online', "平衡安全性和性能"
else:
# 研究环境,使用在线
return 'online', "追求最佳性能"
技术1:保守正则化
def conservative_regularization(policy, responses, alpha=0.1):
    """
    CQL风格的保守正则化草图:压低分布外(OOD)回复的Q值、抬高数据集内回复的Q值,
    防止离线RL对未见过的回复过度乐观。policy.sample / policy.q_function 为假设接口。
    """
    # 计算OOD(out-of-distribution)回复的Q值:高温采样得到偏离数据分布的回复
    ood_responses = policy.sample(temperature=1.5)
    ood_q_values = policy.q_function(ood_responses)
    # 数据集内(分布内)回复的Q值
    data_q_values = policy.q_function(responses)
    # 惩罚OOD回复的高Q值,同时鼓励分布内回复的Q值
    conservative_loss = alpha * (ood_q_values.mean() - data_q_values.mean())
    return conservative_loss
技术2:分布感知采样
class DistributionAwareSampler:
def __init__(self, offline_data, model):
self.offline_data = offline_data
self.model = model
# 预计算数据分布特征
self.data_embeddings = self.compute_embeddings(offline_data)
self.distribution_stats = self.compute_stats(self.data_embeddings)
def sample_in_distribution(self, n_samples):
"""
优先采样分布内的数据
"""
candidates = []
for _ in range(n_samples * 5): # 过采样
response = self.model.generate()
embedding = self.get_embedding(response)
# 计算与训练分布的距离
distance = self.compute_distance(
embedding,
self.distribution_stats
)
candidates.append((response, distance))
# 选择最接近训练分布的样本
candidates.sort(key=lambda x: x[1])
return [c[0] for c in candidates[:n_samples]]
本章深入探讨了基于人类反馈的强化学习(RLHF)及其变体在大语言模型后训练中的应用。我们学习了:奖励模型的构建、校准与反奖励欺骗(reward hacking)技巧;PPO在序列生成中的信用分配与训练稳定性实践;DPO、IPO等直接偏好优化方法及其选择依据;Constitutional AI与RLAIF的自我改进流程;以及在线、离线与半在线强化学习的权衡。关键公式回顾:
Bradley-Terry偏好模型: \(P(A \succ B) = \sigma(r(A) - r(B))\)
PPO目标函数: \(\mathcal{L}_{PPO} = \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)\)
DPO损失函数: \(\mathcal{L}_{DPO} = -\log\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\)
练习6.1:奖励模型设计 设计一个奖励模型架构,用于评估代码生成任务的质量。考虑如何处理语法正确性、功能完整性和代码风格等多个维度。
练习6.2:KL散度计算 给定参考分布 $p_{ref} = [0.1, 0.2, 0.3, 0.4]$ 和当前分布 $p_{\theta} = [0.15, 0.25, 0.35, 0.25]$,计算KL散度 $D_{KL}(p_{\theta} || p_{ref})$。
练习6.3:DPO vs PPO场景选择 列举三个适合使用DPO而非PPO的具体场景,并说明原因。
练习6.4:奖励过拟合检测 设计一个方法来自动检测RLHF训练过程中的奖励过拟合(reward hacking)现象。
练习6.5:Constitutional AI原则设计 为一个医疗咨询AI助手设计一套宪法原则层次结构,确保安全性、准确性和有用性的平衡。
练习6.6:在线离线RL混合策略 设计一个自适应的在线/离线RL混合训练策略,能够根据训练过程中的表现动态调整在线数据的比例。
练习6.7:多目标RLHF优化 设计一个方法来同时优化多个可能冲突的目标(如有用性、安全性、创造性),并处理它们之间的权衡。
错误表现:训练奖励持续上升,但人工检查发现回复质量没有提升甚至下降;典型信号包括回复长度异常增长、内容大量重复、词汇多样性明显降低。
调试方法:
# 检测奖励过拟合
import numpy as np

def detect_reward_overfitting(model, reward_model, test_prompts, expected_length=500):
    """reward hacking 的简单自动检测。
    假设 model.generate / reward_model 逐条处理;compute_vocab_diversity 见下方草图。"""
    responses = [model.generate(p) for p in test_prompts]
    rewards = np.array([reward_model(p, r) for p, r in zip(test_prompts, responses)])
    # 检查1:奖励分布是否异常集中
    if rewards.std() < 0.1:
        print("警告:奖励分布过于集中")
    # 检查2:响应长度是否异常(expected_length 为经验参考值)
    lengths = [len(r) for r in responses]
    if np.mean(lengths) > 2 * expected_length:
        print("警告:响应长度异常")
    # 检查3:词汇多样性
    vocab_diversity = compute_vocab_diversity(responses)
    if vocab_diversity < 0.3:
        print("警告:词汇多样性过低")
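上面用到的 compute_vocab_diversity 文中未给出,下面是一个基于"不同token数 / 总token数"的假设性实现(按空白切分,中文场景实际应换成分词器或 distinct-n 统计):

def compute_vocab_diversity(responses):
    """粗略衡量一批回复的词汇多样性(示意)。"""
    tokens = []
    for response in responses:
        tokens.extend(response.split())
    if not tokens:
        return 0.0
    # 不同 token 占比越低,说明重复越严重
    return len(set(tokens)) / len(tokens)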
错误表现:
预防措施:
错误表现:
解决方案:
错误表现:
处理方法:
错误表现:
优化技巧:
💡 最佳实践建议: