探索用于强化微调的模型评分器
本指南面向已熟悉 OpenAI API、对强化微调 (RFT) 有基本了解,并希望将其微调模型用于研究或其他用途的开发人员和机器学习从业者。OpenAI 的服务并非用于任何医疗状况的个性化治疗或诊断,并受我们适用条款的约束。
强化微调 (RFT) 旨在通过探索解决方案空间并强化能带来更高奖励的策略来提高推理模型的推理性能。RFT 有助于模型做出更敏锐的决策并更有效地解释上下文。
在本指南中,我们将介绍如何将 RFT 应用于 OpenAI o4-mini
推理模型,使用生命科学研究领域的一个任务:根据医患对话和描述预测结果,这是许多健康研究研究中的必要评估。我们将使用医学-o1-可验证问题 数据集 的一个子集。您将学习成功运行 RFT 作业以满足您用例的关键步骤。
我们将涵盖以下内容:
1. 设置
即使是强大的推理模型,在专家级行为方面也可能达不到目标——尤其是在医学等细微差别和精确性至关重要的领域。想象一个模型试图从对话中提取 ICD-10 代码:即使它理解了要点,它也可能不会使用医学专业人员期望的确切术语。
RFT 的其他绝佳应用场景包括分类账标准化或分级欺诈风险等主题——在这些场景中,您需要精确、可靠且可重复的推理。请查看我们的 RFT 用例指南 以获取精彩示例。
在我们的案例中,我们将专注于教导 o4-mini
更好地预测临床对话和描述的结果。具体来说,我们想看看 RFT 是否能提高预测的准确性。
在此过程中,我们将讨论如何编写有效的评分器、它们如何指导模型的学习以及如何警惕经典的奖励黑客行为陷阱。
2. 收集数据集
让我们从 Hugging Face 加载数据集开始。我们对那些被构成为患者病例描述及其相关问题,然后是正确答案的样本感兴趣。这些样本代表了医生总结病例并分配结果的真实对话记录。对于任何用例,验证黄金答案的准确性至关重要,需要仔细考虑。在这里,我们将信任数据集的质量。
import re
from datasets import load_dataset
ds = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem")
def is_age_question(sample):
question = sample.get('Open-ended Verifiable Question', '')
# Match "A 88-year-old", "An 8-year-old", "A 23-year-old", etc. at the start
return re.match(r"^(A|An) \d{1,2}-year-old", question) is not None
filtered_samples = [s for s in ds["train"] if is_age_question(s)]
print(f"Filtered samples: {len(filtered_samples)}")
/Users/theophile/Documents/repos/jupyter-env/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
Filtered samples: 9169
RFT 的优势之一是它不需要数千个样本就能开始产生差异。得益于轨迹采样和训练期间的反馈循环,模型不仅能学习正确的行为,还能避免不良模式。这意味着即使使用小型数据集,我们也能看到稳健的收益。
对于本次运行,我们将随机抽取 100 个训练样本和 100 个测试样本,并对其进行轻微标准化。
import random
# Set a random seed for reproducibility
random.seed(42)
# Randomly select 100 training samples from filtered_samples
train_samples = random.sample(filtered_samples, min(100, len(filtered_samples)))
# Remove training samples from filtered_samples to avoid overlap
remaining_samples = [s for s in filtered_samples if s not in train_samples]
# Randomly select 100 test samples from the remaining samples (no overlap)
test_samples = random.sample(remaining_samples, min(100, len(remaining_samples)))
print(f"Number of training samples: {len(train_samples)}")
print(f"Number of test samples: {len(test_samples)}")
Number of training samples: 100
Number of test samples: 100
# Standardize the 'Ground-True Answer' fields to all lowercase in train and test samples
for sample in train_samples:
if 'Ground-True Answer' in sample and isinstance(sample['Ground-True Answer'], str):
sample['Ground-True Answer'] = sample['Ground-True Answer'].lower()
for sample in test_samples:
if 'Ground-True Answer' in sample and isinstance(sample['Ground-True Answer'], str):
sample['Ground-True Answer'] = sample['Ground-True Answer'].lower()
我们将这些样本转换为 jsonl
格式,这是 强化微调 API 所期望的。
import json
def convert_to_jsonl_format(samples, filename):
with open(filename, "w") as f:
for sample in samples:
user_content = sample.get("Open-ended Verifiable Question", "")
reference_answer = sample.get("Ground-True Answer", "")
json_obj = {
"messages": [
{"role": "user", "content": user_content}
],
"reference_answer": reference_answer
}
f.write(json.dumps(json_obj) + "\n")
def load_jsonl(filename):
samples = []
with open(filename, "r") as f:
for line in f:
samples.append(json.loads(line))
return samples
# Save the datasets to jsonl files
convert_to_jsonl_format(train_samples, "data/medical_01_verifiable_problem_train.jsonl")
convert_to_jsonl_format(test_samples, "data/medical_01_verifiable_problem_val.jsonl")
# Load the datasets back from jsonl files
train_samples_loaded = load_jsonl("data/medical_01_verifiable_problem_train.jsonl")
test_samples_loaded = load_jsonl("data/medical_01_verifiable_problem_val.jsonl")
接下来,我们将了解基础模型开箱即用的表现——以及它还有多大的成长空间。
3. 对基础模型进行基准测试
在微调任何内容之前,我们需要了解我们的起点。基准测试让我们清楚地了解模型最初的优势和劣势——以便我们以后衡量它的进步程度。
我们将首先依赖两个简单而强大的评估器:
clinical_phrase_binary_grader
- 一个精确匹配检查器。clinical_phrase_grader
- 一个更宽松的、基于令牌的相似度评分器。
from rapidfuzz import fuzz, utils
def clinical_phrase_grader(sample: dict, item: dict) -> float:
from rapidfuzz import fuzz, utils
score = fuzz.token_set_ratio(sample["output_text"], item["reference_answer"], processor=utils.default_process)
return score / 100.0
def clinical_phrase_binary_grader(sample: dict, item: dict) -> float:
return 1.0 if sample["output_text"] == item["reference_answer"] else 0.0
def combined_grader(sample: dict, item: dict, weights: list[float] = [0.85, 0.15]) -> float:
clinical_phrase_score = clinical_phrase_grader(sample, item)
binary_score = clinical_phrase_binary_grader(sample, item)
return weights[0] * clinical_phrase_score + weights[1] * binary_score
这种组合使我们能够跟踪严格的正确性和部分词汇重叠。二进制评分器给出清晰的 0 或 1:模型是否产生了精确匹配?更宽松的评分器提供了更多细微差别——输出与黄金答案有多接近?我们同时使用两者,因为结果通常以多种有效方式表达。例如,模型可能响应“痛风性关节炎”而不是“痛风”。虽然人工评估者可能认为这部分可接受,但严格的字符串匹配则不行。结合精确和模糊评分可确保对模型输出进行更准确和公平的评估。
我们构建一个辅助函数来在示例前添加系统提示。
def prepend_system_prompt_to_first_user_message(samples, system_prompt, path=None):
new_samples = []
for sample in samples:
# Deep copy to avoid mutating the original
sample_copy = json.loads(json.dumps(sample))
messages = sample_copy.get("messages", [])
if messages and messages[0].get("role") == "user" and isinstance(messages[0].get("content"), str):
if not messages[0]["content"].startswith(system_prompt):
messages[0]["content"] = f"{system_prompt}\n\n{messages[0]['content']}"
new_samples.append(sample_copy)
if path is not None:
with open(path, "w", encoding="utf-8") as f:
for item in new_samples:
f.write(json.dumps(item, ensure_ascii=False) + "\n")
return new_samples
simple_prompt = """You are an expert clinician. For each clinical vignette, respond with exactly one phrase: the single most likely outcome or phenomenon, all in lowercase.
- Do not add punctuation, articles, explanations, or commentary - output only the term itself.
- Sometimes, the expected answer can be a synonym of what you think.
- Use the standard clinical name (e.g. “thought withdrawal”, “Toxoplasma encephalitis”)."""
train_samples_loaded_simple_sys_prompt = prepend_system_prompt_to_first_user_message(
train_samples_loaded, simple_prompt, path="data/medical_01_verifiable_problem_train_simple_prompt.jsonl"
)
test_samples_loaded_simple_sys_prompt = prepend_system_prompt_to_first_user_message(
test_samples_loaded, simple_prompt, path="data/medical_01_verifiable_problem_val_simple_prompt.jsonl"
)
然后构建一个辅助函数来生成和存储模型的预测。
from openai import OpenAI
import concurrent.futures
from tqdm import tqdm
import os
client = OpenAI()
def generate_model_predictions(
subset,
prompt_type,
model_name="o4-mini-2025-04-16",
reasoning_effort="medium",
n_runs=1,
verbose=False,
):
if isinstance(subset, str):
samples_path = f"data/medical_01_verifiable_problem_{subset}_{prompt_type}_prompt.jsonl"
with open(samples_path, "r", encoding="utf-8") as f:
test_samples = [json.loads(line) for line in f if line.strip()]
else:
test_samples = [subset]
def run_inference(item):
resp = client.responses.create(
model=model_name,
input=item["messages"],
reasoning={"effort": reasoning_effort, "summary": "detailed"},
)
model_prediction = {'output_text': resp.output_text}
reasoning_tokens_used = resp.usage.output_tokens_details.reasoning_tokens
summaries = [seg.text for item in resp.output if item.type == "reasoning" for seg in item.summary]
summaries_string = "\n".join(summaries)
if verbose:
print("Prompt: {}".format(item["messages"][0]["content"]))
print(f"Model Sample: {model_prediction}\nSolution: {item['reference_answer']}\n")
return {
"model_prediction": model_prediction["output_text"],
"input": item,
"reasoning_tokens_used": reasoning_tokens_used,
"reference_answer": item["reference_answer"],
"summaries": summaries_string
}
# Ensure the predictions directory exists before any file operations
predictions_dir = os.path.join("data", "rft", "predictions")
os.makedirs(predictions_dir, exist_ok=True)
# Check if results already exist for all runs
results_per_run = []
for run_idx in range(n_runs):
run_save_path = os.path.join(
predictions_dir,
f"{subset}_{prompt_type}_{model_name}_{reasoning_effort}_predictions_run{run_idx+1}.json"
)
if os.path.exists(run_save_path):
print(f"Results for run {run_idx+1} already exist at {run_save_path}. Loading results.")
with open(run_save_path, "r", encoding="utf-8") as f:
run_results = json.load(f)
results_per_run.append(run_results)
else:
if len(test_samples) == 1:
run_results = [run_inference(test_samples[0])]
else:
run_results = []
with concurrent.futures.ThreadPoolExecutor() as executor:
futures = [executor.submit(run_inference, item) for item in test_samples]
for future in tqdm(futures, total=len(futures), desc=f"Generating predictions (run {run_idx+1})"):
result = future.result()
run_results.append(result)
with open(run_save_path, "w", encoding="utf-8") as f:
json.dump(run_results, f, ensure_ascii=False, indent=2)
results_per_run.append(run_results)
# Return a flat list for backward compatibility
if n_runs == 1:
return results_per_run[0]
else:
return results_per_run
要生成预测,请先确保您的 API 密钥已设置:
export OPENAI_API_KEY=...
# OpenAI o4-mini model
results_simple_o4mini = generate_model_predictions(
subset="train",
prompt_type="simple",
model_name="o4-mini",
reasoning_effort="medium",
n_runs=3
)
# OpenAI o3 model
results_simple_o3 = generate_model_predictions(
subset="train",
prompt_type="simple",
model_name="o3",
reasoning_effort="medium",
n_runs=3
)
现在我们有了可以评估的预测。
我们将构建一个辅助函数,使我们能够轻松地切换不同的评分方法,
import functools
def evaluate_predictions_with_grader(
predictions,
grader_func=combined_grader,
):
results = []
if isinstance(predictions, dict):
predictions = [predictions]
def run_grading(pred):
model_prediction = {"output_text": pred["model_prediction"]}
item = pred["input"]
score = grader_func(model_prediction, item)
result = pred.copy()
result["score"] = score
return result
if len(predictions) == 1:
result = run_grading(predictions[0])
results.append(result)
else:
with concurrent.futures.ThreadPoolExecutor() as executor:
futures = [executor.submit(run_grading, pred) for pred in predictions]
for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Grading predictions"):
results.append(future.result())
total = len(results)
correct = sum(r["score"] for r in results)
accuracy = correct / total if total else 0.0
metrics = {
"total_samples": total,
"accuracy": accuracy,
}
print(metrics)
return metrics, results
def run_prediction_evaluation(
model_name="o4-mini",
reasoning_effort="medium",
prompt_type="simple",
subset="train",
grader_func=combined_grader,
num_runs=3,
):
if isinstance(grader_func, functools.partial):
name = grader_func.func.__name__
mg = grader_func.keywords["model_grader"]
mg_name = mg["name"]
name = f"{name}_{mg_name}"
else:
name = getattr(grader_func, "__name__", getattr(grader_func, "__class__", type(grader_func)).__name__)
grader_func_name = name.replace(" ", "_").replace(":", "_").replace("/", "_").replace(",", "_")
for i in range(num_runs):
preds_path = f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_predictions_run{i+1}.json"
with open(preds_path, "r") as f:
preds = json.load(f)
metrics, results_with_scores = evaluate_predictions_with_grader(preds, grader_func=grader_func)
# Save the scored results
with open(f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_scored.json", "w") as f:
json.dump(results_with_scores, f, indent=2)
# Save the metrics
with open(f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_metrics.json", "w") as f:
json.dump(metrics, f, indent=2)
# Save the scores (if present in results_with_scores)
scores = [item.get("score") for item in results_with_scores if "score" in item]
with open(f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_scores.json", "w") as f:
json.dump(scores, f, indent=2)
return metrics, results_with_scores # Return last metrics and results for convenience
def load_predictions(
model_name="o4-mini",
reasoning_effort="medium",
prompt_type="simple",
subset="train",
grader_func_name="clinical_phrase_grader",
num_runs=3
):
all_predictions = []
all_metrics = []
for run in range(1, num_runs + 1):
pred_path = f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{run}_scored.json"
metrics_path = f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{run}_metrics.json"
try:
with open(pred_path, "r") as f:
predictions = json.load(f)
except FileNotFoundError:
predictions = None
try:
with open(metrics_path, "r") as f:
metrics = json.load(f)
except FileNotFoundError:
metrics = None
all_predictions.append(predictions)
all_metrics.append(metrics)
return all_predictions, all_metrics
然后运行评估。
model_name = "o4-mini"
reasoning_effort = "medium"
prompt_type = "simple"
subset = "train"
grader_func = combined_grader
grader_func_name = "combined_grader"
num_runs = 3
run_prediction_evaluation(
model_name=model_name,
reasoning_effort=reasoning_effort,
prompt_type=prompt_type,
subset=subset,
grader_func=grader_func,
num_runs=num_runs
)
predictions_o4mini_medium_simple_prompt, metrics_o4mini_medium_simple_prompt = load_predictions(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func_name=grader_func_name, num_runs=num_runs)
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 610524.60it/s]
{'total_samples': 100, 'accuracy': 0.590985993228499}
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 311612.48it/s]
{'total_samples': 100, 'accuracy': 0.5750433490539723}
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 769597.06it/s]
{'total_samples': 100, 'accuracy': 0.5943742483874717}
可视化结果使我们能够发现趋势和失败模式。
# Print mistakes where the model did not get the correct answer (score < 1.0)
mistakes = [
{"index": i, **res}
for i, res in enumerate(predictions_o4mini_medium_simple_prompt[0])
if res["score"] < 1.0
]
print(f"\nTotal mistakes: {len(mistakes)}")
for m in mistakes[15:20]:
print(f"\n[Sample {m['index']}]")
print(f" Model prediction: {m['model_prediction']}")
print(f" Reference answer: {m['reference_answer']}")
print(f" Score: {m['score']}")
Total mistakes: 86
[Sample 18]
Model prediction: acute anterior uveitis
Reference answer: recurring eye redness and pain
Score: 0.3596153846153846
[Sample 19]
Model prediction: 390 meq
Reference answer: 150 meq
Score: 0.6071428571428571
[Sample 20]
Model prediction: adamts13 deficiency
Reference answer: decreased adamts13 activity in serum
Score: 0.5037037037037037
[Sample 22]
Model prediction: todd paralysis
Reference answer: seizure
Score: 0.16190476190476194
[Sample 23]
Model prediction: hypokalemia
Reference answer: hypomagnesemia
Score: 0.612
如上所示,典型的失败模式分为三类:
- 微小差异和格式问题,得分 >=0.8。
- 部分词汇匹配,0.3 < 得分 < 0.8。
- 词汇上不相关,得分 < 0.3。
我们可以可视化训练集上的完整得分分布。
注意:在实践中,大规模分析模型错误通常涉及手动审查和自动化方法的结合——例如标记失败类型或按得分和内容对预测进行聚类。该工作流程超出了本指南的范围,但一旦您确定了广泛的模式,这是一个有价值的后续步骤。
import matplotlib.pyplot as plt
scores_distribution = [m['score'] for m in predictions_o4mini_medium_simple_prompt[0]]
plt.hist(scores_distribution, alpha=0.6, label='o4-mini medium simple prompt')
plt.legend()
<matplotlib.legend.Legend at 0x125f6b7a0>
让我们与其他模型和提示进行比较,并可视化得分。
# OpenAI o3 model
model_name = "o3"
reasoning_effort = "medium"
prompt_type = "simple"
subset = "train"
grader_func = combined_grader
grader_func_name = "combined_grader"
num_runs = 3
run_prediction_evaluation(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func=grader_func, num_runs=num_runs)
predictions_o3_medium_simple_prompt, metrics_o3_medium_simple_prompt = load_predictions(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func_name=grader_func_name, num_runs=num_runs)
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 820803.13it/s]
{'total_samples': 100, 'accuracy': 0.6186850707880021}
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 523633.46it/s]
{'total_samples': 100, 'accuracy': 0.6149897683385446}
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 515270.76it/s]
{'total_samples': 100, 'accuracy': 0.6254662232084496}
import numpy as np
import pandas as pd
import seaborn as sns
def average_and_std_metrics(metrics_list):
"""Returns dicts of mean and std for a list of metrics dicts."""
if not metrics_list: return {}, {}
keys = metrics_list[0].keys()
arr = {k: np.array([m[k] for m in metrics_list]) for k in keys}
mean = {k: float(np.mean(arr[k])) for k in keys}
std = {k: float(np.std(arr[k])) for k in keys}
return mean, std
def plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title="Combined Grader Accuracy", sharey: bool = True) -> None:
"""Plots model accuracies with standard deviation error bars."""
# Convert the nested dicts into tidy DataFrames
df_avg = pd.DataFrame(model_metrics_avg).T.reset_index().rename(columns={"index": "Model"})
df_std = pd.DataFrame(model_metrics_std).T.reset_index().rename(columns={"index": "Model"})
# Long-form for Seaborn
long_df_avg = df_avg.melt(id_vars="Model", value_vars=["accuracy"], var_name="Metric", value_name="Accuracy")
long_df_std = df_std.melt(id_vars="Model", value_vars=["accuracy"], var_name="Metric", value_name="Std")
# Merge avg and std for error bars
long_df = pd.merge(long_df_avg, long_df_std, on=["Model", "Metric"])
pretty_names = {"accuracy": grader_title}
# Create a separate figure for each metric
for metric_key in ["accuracy"]:
metric_df = long_df[long_df["Metric"] == metric_key].copy()
plt.figure(figsize=(8, 5))
# Plot bars with error bars
ax = sns.barplot(data=metric_df, x="Model", y="Accuracy", hue="Model", palette="tab10", legend=False, errorbar=None)
bars = ax.patches
# Add error bars manually
for i, row in enumerate(metric_df.itertuples()):
bar = bars[i]
x = bar.get_x() + bar.get_width() / 2
y = row.Accuracy
yerr = row.Std
ax.errorbar(x=x, y=y, yerr=yerr, fmt='none', ecolor='black', capsize=5, elinewidth=2, capthick=2, zorder=10)
plt.title(pretty_names[metric_key])
plt.ylabel("Accuracy")
plt.xlabel("")
if sharey: plt.ylim(0, 1)
# Annotate bars with exact values
for bar in bars:
height = bar.get_height()
ax.annotate(f"{height:.2f}", xy=(bar.get_x() + bar.get_width() / 2, height), xytext=(0, 6), textcoords="offset points", ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.xticks(rotation=15, ha="right")
plt.tight_layout()
plt.show()
avg_metrics_o4mini_medium_simple_prompt, std_metrics_o4mini_medium_simple_prompt = average_and_std_metrics(metrics_o4mini_medium_simple_prompt)
avg_metrics_o3_medium_simple_prompt, std_metrics_o3_medium_simple_prompt = average_and_std_metrics(metrics_o3_medium_simple_prompt)
model_metrics_avg = {
"o4-mini-medium-simple-prompt": avg_metrics_o4mini_medium_simple_prompt,
"o3-medium-simple-prompt": avg_metrics_o3_medium_simple_prompt,
}
model_metrics_std = {
"o4-mini-medium-simple-prompt": std_metrics_o4mini_medium_simple_prompt,
"o3-medium-simple-prompt": std_metrics_o3_medium_simple_prompt,
}
plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title="Combined Grader Accuracy")
我们可以看到模型的性能有明显的局限性。在实践中,迭代提示通常有助于提高基线结果并从基础模型中获得更多收益。然而,在这种情况下,我们的提示工程并未带来有意义的改进——因此我们排除了这些运行的分析。
RFT 起作用的一个关键要求是基础模型能够成功完成任务的至少一些示例。初始准确率约为 0.6 是一个强烈信号,表明 RFT 可以提高性能。如果模型从未在您的任务上成功过,那么就没有可以攀登的学习信号。
此评估过程为我们准备了下一步:使用结构化、高质量的评分器反馈来指导模型。
4. 定义评分器
评分器定义了塑造 RFT 期间模型行为的奖励函数。它提供所需输出的示例——并惩罚不良输出。设计有效的评分器需要原则性的结构和深思熟虑的领域见解,这也许是成功 RFT 最重要的任务。
在本节中,我们将介绍 3 个评分器,展示如何设置它们以适应 API,并讨论它们产生的结果。然后,我们将展示如何实际启动 RFT 任务。
基于字符串的评分器
我们从一个使用我们早期评估函数的双评分器开始,因为它提供了一个得分分布,该分布将与预测与参考答案的词汇接近度保持一致。它提供了一个起点,但信号不够丰富,无法让 o4-mini
真正学习和改进,第一次实验表明在 RFT 运行期间奖励停滞不前。对于 API 调用,您应该构建如下所示的 Python 评分函数。
import inspect
# --- Utility functions ---
def build_python_grader_payload(grader_fn) :
"""Build a payload for a python grader."""
grader_source = inspect.getsource(grader_fn)
# Enforce function name to be `grade`
grader_source = grader_source.replace(grader_fn.__name__, "grade", 1)
return {
"type": "python",
"source": grader_source,
}
multi_python_grader_tool_call = {
"type": "multi",
"graders": {
"clinical_phrase": {
"name": "clinical_phrase_grader",
"image_tag": "2025-05-08",
**build_python_grader_payload(clinical_phrase_grader),
},
"clinical_phrase_binary": {
"name": "clinical_phrase_binary_grader",
"image_tag": "2025-05-08",
**build_python_grader_payload(clinical_phrase_binary_grader),
},
},
"calculate_output": "0.85 * clinical_phrase + 0.15 * clinical_phrase_binary",
}
这是其训练曲线的快照,其中绿色曲线是训练集奖励,蓝色曲线是测试集奖励:
模型评分器 1
为了解决这个限制,我们引入了一种更高级的方法:模型评分器。基于模型的评分器使我们能够将语义理解和细微差别嵌入到反馈中。当涉及特定领域的同义词或模糊推理时,这尤其强大。
我们使用 gpt-4.1 作为评分模型,并以强调语义保真度的规则为指导:临床同义词、正确的疾病分类和概念对齐。评分器旨在回答“这是否反映了正确的结果或现象?”,而不是关注肤浅的措辞——例如,“这是否是相同的字符串?”
为确保评分器与专家期望保持一致,我们在基础模型预测的子集上对其进行了评估。对于任何生产用例,领域专家评审员应确认模型分配的分数是否反映了首选答案顺序并与领域判断一致。这通常涉及确认模型评分器是否正确地对预测进行排名。在此食谱的范围内,我们通过使用 OpenAI o3
来检查较高质量的预测是否相对于其替代方案得到了一致的奖励来近似此评估。
通过对 o3
的这些讨论,我们迭代更新模型评分器,直到结果一致。
GRADER_PROMPT_1 = """
System:
You are an expert medical grader. Compare the **Reference Answer** to the **Model's Answer** and produce **only** a JSON object with:
• **result**: a float between 0.0 and 1.0
• **steps**: a list of reasoning steps (each with a `"description"` and a `"conclusion"`)
Scoring rubric (start at 0.0, then add or subtract):
1. Exact lexical match: **+0.15**
2. Clinical synonym (e.g. “withdrawal of thought” ↔ “thought withdrawal”): **+0.35**
3. Same disease family (e.g. two viral encephalitides): **+0.35**
4. Partial term overlap (e.g. “ulcer” in both phrases): **+0.15**
5. Completely unrelated: **-0.10**
• If multiple criteria apply, sum their weights (max 1.0).
• Cap the final score to the [0.0, 1.0] range.
• In your **steps**, show which rule you applied and the running subtotal.
"""
为了通过 API 提交,字典的构建方式如下。
model_grader_1 = {
"type": "score_model",
"name": "gpt41_score_model_1",
"input": [
{
"role": "system",
"content": GRADER_PROMPT_1
},
{
"role": "user",
"content": "Reference Answer: {{item.reference_answer}}. Model's Answer: {{sample.output_text}}"
}
],
"pass_threshold": 0.75,
"model": "gpt-4.1-2025-04-14",
"range": [0, 1],
"sampling_params": {
"seed": 42,
"temperature": 0,
},
}
因此,我们本地设置了模型评分器来检查我们接下来要微调的模型的结果。
response_format = {
"name": "float_score_classification",
"strict": True,
"schema": {
"type": "object",
"properties": {
"steps": {
"type": "array",
"description": "A sequence of steps outlining the reasoning process.",
"items": {
"type": "object",
"properties": {
"description": {
"type": "string",
"description": "Detailed description of the reasoning in this step."
},
"conclusion": {
"type": "string",
"description": "The conclusion of the reasoning in this step."
}
},
"required": ["description", "conclusion"],
"additionalProperties": False
}
},
"result": {
"type": "number",
"description": "The float score assigned to the response. This should be in inclusive range RANGE_MIN to RANGE_MAX."
}
},
"required": ["steps", "result"],
"additionalProperties": False
}
}
# for completions
response_format = {
"type": "json_schema",
"json_schema": response_format
}
# Adapted python_model_grader to match the other graders' interface
def python_model_grader(sample, item, model_grader=model_grader_1):
"""
Calls an OpenAI model to grade the model output against the reference answer.
Expects sample to have "output_text", item to have "reference_answer".
Returns a float score (parsed from the model's JSON response).
"""
# Prepare the prompt as the grader expects
system_prompt = model_grader["input"][0]["content"]
user_prompt = model_grader["input"][1]["content"]
user_prompt_filled = user_prompt.replace("{{item.reference_answer}}", item["reference_answer"]).replace("{{sample.output_text}}", sample["output_text"])
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt_filled}
]
# Call the OpenAI API with the grader's model
response = client.chat.completions.create(
model=model_grader["model"],
messages=messages,
seed=model_grader.get("sampling_params", {}).get("seed", None),
temperature=model_grader.get("sampling_params", {}).get("temperature", 0),
response_format=response_format,
)
# Parse the float score from the model's JSON response
parsed = json.loads(response.choices[0].message.content)
return float(parsed["result"])
虽然规则最初提供了合理的反馈,但模型很快发现了一个漏洞并开始奖励黑客行为。分数飙升——有时提高 20-30 个百分点——不是因为临床准确性提高了,而是因为模型用同义词、剂量和完整的管理计划填充了它的“一个短语”答案。您可能会看到 begin warfarin therapy **and** continue unfractionated heparin for ≥5 days, overlapping until the INR is in the therapeutic range (2–3)
或 chewable aspirin 325 mg stat plus nitroglycerin…
而不是要求的 continue unfractionated heparin
或 aspirin
。尽管系统提示很明确——“仅回复一个短语:最可能的结果或现象”——但这些冗长的输出会人为地提高词汇相似度得分,而没有精确地增加预测价值。这种经验凸显了持续检查模型输出的必要性,并对可能悄悄扭曲评估指标的奖励黑客行为保持警惕。
这是其训练曲线的快照(绿色是训练奖励,蓝色是测试奖励):
模型评分器 2
为了缓解这种奖励黑客行为,我们通过澄清期望、强制执行更严格的输出约束并提供正确与不正确行为的对比示例来改进评分器提示。同样,我们使用 o3
进行了迭代,利用了基础 o4-mini
和先前模型黑客行为示例的预测,来设计和验证我们的评分器。此更新评分器的另一个重要点是降低词汇相似度的权重,以确保临床相似度占主导地位。
GRADER_PROMPT_2 = """You are an expert medical grader.
Compare the reference_answer (gold standard) with the model_prediction
and return **exactly** this JSON object:
{
"steps": [ // each: {"description": "...", "conclusion": "..."}
…
],
"result": <float 0-1 rounded to 3 decimals>
}
──────────────── Input placeholders ───────────────
reference_answer:
model_prediction:
──────────── Normalisation steps ────────────
• lowercase, strip punctuation / excess whitespace
• expand common abbreviations (e.g. cll → chronic lymphocytic leukemia)
• map both strings to ICD-10 / SNOMED concepts when possible
──────────── Clinical layer rubric ───────────
L1 exact concept or universally accepted synonym
L2 same concept but benign modifier differs (e.g. “acute”, “left”)
L3 same disease / drug family but wrong subtype or variant
L4 same organ system but entirely different disease / intervention
L5 only partial mechanistic overlap (e.g. both vasodilators)
L6 unrelated or nonsensical
──────────── Scoring parameters ─────────────
clinical_weight = 0.90
lexical_weight = 0.10
clinical_similarity = {1:1.00, 2:0.85, 3:0.45, 4:0.30, 5:0.10, 6:0.00}
lexical_similarity = normalized_levenshtein(reference_answer,
model_prediction)
# Optional penalty if a clinically critical adjective is missing
critical_modifiers = [
"wide", "narrow", "acute", "chronic", "posteromedial",
"oxidized", "oxidised", "left", "right"
]
modifier_pen = -0.05 if any(
w in reference_answer and w not in model_prediction
for w in critical_modifiers
) else 0.0
# Determine layer L (1-6) per rubric above using ontology + judgment.
if L == 6:
score = 0.0
else:
score = (clinical_weight * clinical_similarity[L] +
lexical_weight * lexical_similarity) + modifier_pen
Clamp to [0,1] and round to 3 decimals.
Output **only** the JSON.
──────────────── Worked examples ─────────────
reference_answer: beta-thalassemia major
model_prediction: beta-thalassemia minor
reasoning: Both involve β-globin chain synthesis, but “major” causes
transfusion-dependent anemia while “minor” is largely benign;
same family, wrong subtype → **L3**. Lexical ≈ 0.83.
score = 0.90·0.45 + 0.10·0.83 = 0.488 → **0.488**
reference_answer: ACE inhibitor
model_prediction: angiotensin-receptor blocker
reasoning: Both act on the renin–angiotensin axis yet on different
targets; only partial mechanistic overlap → **L5**.
Lexical ≈ 0.31.
score = 0.90·0.10 + 0.10·0.31 = 0.121 → **0.121**
reference_answer: acute pancreatitis
model_prediction: pancreatitis
reasoning: Same disorder but missing timing adjective “acute”;
benign modifier difference → **L2**. Lexical ≈ 0.78.
score = 0.90·0.85 + 0.10·0.78 = 0.843 → **0.843**
reference_answer: valproate
model_prediction: valproic acid
reasoning: Valproic acid is the active moiety of valproate; mechanisms
and indications are identical → **L1**. Lexical ≈ 0.82.
score = 0.90·1.00 + 0.10·0.82 = 0.982 → **0.982**
reference_answer: riboflavin
model_prediction: riboflavin deficiency
reasoning: Adds “deficiency” but refers to the same vitamin (B₂);
benign modifier difference → **L2**. Lexical ≈ 0.60.
score = 0.90·0.85 + 0.10·0.60 = 0.825 → **0.825**
reference_answer: splenectomy
model_prediction: acetaminophen overdose
reasoning: Surgical removal of the spleen has no mechanistic or anatomic
relationship to toxic drug ingestion → **L6**.
score = **0.000**
reference_answer: ulcerative colitis
model_prediction: Crohn disease
reasoning: Both are inflammatory-bowel diseases but differ in location,
histology and management; same organ system, different disease
→ **L4**. Lexical ≈ 0.38.
score = 0.90·0.30 + 0.10·0.38 = 0.308 → **0.308**"""
model_grader_2 = {
"type": "score_model",
"name": "gpt41_score_model_2",
"input": [
{
"role": "system",
"content": GRADER_PROMPT_2
},
{
"role": "user",
"content": "Reference Answer: {{item.reference_answer}}. Model's Answer: {{sample.output_text}}"
}
],
"pass_threshold": 0.75,
"model": "gpt-4.1-2025-04-14",
"range": [0, 1],
"sampling_params": {
"seed": 42,
"temperature": 0,
},
}
最终结果是一个高信号、领域敏感的评分器,它指导模型进行更合适、更简洁的预测。
关于成本的注意事项:LLM 评分器除了训练计算外,还会产生令牌使用费。为了有效管理成本,我们建议:
- 在本地对基础模型完成项(以及可选的合成完成项)进行测试,以确保评分器符合您的规则或人类偏好。在可用时,使用 flex processing 以获得更有效的评估。
- 从小规模 RFT 运行开始,以验证评分器的一致性并在扩大规模之前检测潜在的奖励黑客行为。
让我们在下一步中看看如何启动训练!
5. 训练
在您的提示和评分器最终确定后,您可以继续进行训练。本节展示了如何使用您最终的评分器启动 RFT——但自然地,您在尝试早期评分器以评估其性能时已经运行了类似的命令。
我们确保评分器通过了 API 测试,
import requests
API_KEY = os.environ["OPENAI_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
# Validate a grader configuration for fine-tuning
payload = {"grader": model_grader_2}
try:
response = requests.post(
"https://api.openai.com/v1/fine_tuning/alpha/graders/validate",
json=payload,
headers=HEADERS,
)
response.raise_for_status()
print("Grader validated")
except requests.exceptions.RequestException as e:
print(f"Error validating grader: {e}")
if 'response' in locals():
print(f"Response: {response.text}")
Grader validated
并将训练集和测试集上传到 OpenAI 文件系统。
# Set your training and test file paths
train_file = "data/medical_01_verifiable_problem_train_simple_prompt.jsonl"
test_file = "data/medical_01_verifiable_problem_val_simple_prompt.jsonl"
def upload_file(file_path: str) -> str:
"""Upload a file to the OpenAI platform for fine-tuning."""
print(f"Uploading file: {file_path}")
with open(file_path, 'rb') as f:
response = requests.post(
"https://api.openai.com/v1/files",
headers=HEADERS,
files={"file": f},
data={"purpose": "fine-tune"}
)
response.raise_for_status()
file_id = response.json()["id"]
print(f"File uploaded successfully. File ID: {file_id}")
return file_id
train_file_id = train_file
if train_file.endswith("jsonl"):
print(f"Training file detected: {train_file}")
train_file_id = upload_file(train_file)
test_file_id = test_file
if test_file and test_file.endswith("jsonl"):
print(f"test file detected: {test_file}")
test_file_id = upload_file(test_file)
Training file detected: data/medical_01_verifiable_problem_train_simple_prompt.jsonl
Uploading file: data/medical_01_verifiable_problem_train_simple_prompt.jsonl
File uploaded successfully. File ID: file-19L9jKsJXNJ17DtjvPwN3M
test file detected: data/medical_01_verifiable_problem_val_simple_prompt.jsonl
Uploading file: data/medical_01_verifiable_problem_val_simple_prompt.jsonl
File uploaded successfully. File ID: file-78q2N1QAMKhLiRK3zVB6MC
现在让我们定义运行的超参数。我们将微调 o4-mini
,使用 medium
推理强度。此参数将通过限制模型用于推理的令牌数量来影响持续时间。我们使用中等的计算乘数和合理的 epoch 数进行调整,优先考虑效率和快速迭代。此外,我们将 eval_samples
参数设置为 3,以使验证曲线在 o4-mini
输出的随机性方面更加稳健。对多个样本进行平均可以减少噪声,并有助于揭示一致的学习模式。
您需要根据您的预算、期望的泛化能力和数据集难度来定制这些参数。
# Set the model and other parameters
model = "o4-mini-2025-04-16"
suffix = "medical_01_verifiable_problem_gpt41_grader"
reasoning_effort = "medium"
n_epochs = 5
seed = 42
grader = model_grader_2
response_format_predictions = None
compute_multiplier = 1.0
eval_samples = 3
eval_interval = 5
我们现在准备启动运行!
# Launch the RFT job
payload = dict(
training_file=train_file_id,
validation_file=test_file_id,
model=model,
suffix=suffix,
method=dict(
type="reinforcement",
reinforcement=dict(
grader=grader,
response_format=response_format_predictions,
hyperparameters=dict(
compute_multiplier=compute_multiplier,
eval_samples=eval_samples,
eval_interval=eval_interval,
n_epochs=n_epochs,
reasoning_effort=reasoning_effort,
)
)
),
seed=seed
)
try:
response = requests.post(
"https://api.openai.com/v1/fine_tuning/jobs",
json=payload,
headers=HEADERS,
)
response.raise_for_status()
job_id = response.json().get("id")
if job_id:
print("Training job created with ID:", job_id)
print(
f"View the job details at: https://platform.openai.com/finetune/{job_id}")
else:
print("Failed to retrieve job ID from response.")
except requests.exceptions.RequestException as e:
print(f"An error occurred while creating the training job: {e}")
if 'response' in locals():
print(f"Response: {response.text}")
Training job created with ID: ftjob-tt3B7l45hLUoaXGJRfoL1lLT
View the job details at: https://platform.openai.com/finetune/ftjob-tt3B7l45hLUoaXGJRfoL1lLT
在 仪表板 上,您可以观察奖励图——它们让您能够跟踪整体性能在各个步骤中的改进情况,而每个评分器的图表则分解了多评分器情况下的特定组件。推理令牌使用趋势(通常随着模型变得更加自信而减少)和步骤持续时间指标可以洞察效率。评分器延迟和错误计数图有助于确保您的评分器在运行期间保持高性能且无错误。
这是我们训练曲线的快照,其中绿色和橙色曲线代表训练集,而蓝色和红色曲线代表测试子集:
在训练期间,测试集上的评估运行会直接记录到 评估 API。您可以前往那里跟踪样本的表现,并了解预测如何随时间演变。
6. 使用微调后的模型
训练完成后,您可以通过其 model_id
调用新模型并对其改进进行基准测试。期待更敏锐的预测!
# To retrieve information about a fine-tuning job (including the fine-tuned model id), use the job_id:
response = requests.get(
f"https://api.openai.com/v1/fine_tuning/jobs/{job_id}",
headers=HEADERS,
)
if response.ok:
data = response.json()
if data.get("status") == "succeeded":
fine_tuned_model_id = data.get("fine_tuned_model")
else:
fine_tuned_model_id = None
else:
raise Exception(f"Request failed: {response.status_code} - {response.text}")
print("Fine-tuned model id:", fine_tuned_model_id)
模型的预测得分
让我们计算基础模型和微调模型的得分进行比较。
from functools import partial
model_name = fine_tuned_model_id
reasoning_effort = "medium"
prompt_type = "simple"
subset = "val"
grader_func = partial(python_model_grader, model_grader=model_grader_2)
grader_func_name = "python_model_grader_gpt41_score_model_2"
num_runs = 3
results_ft_model_grader_2 = generate_model_predictions(
subset=subset,
prompt_type=prompt_type,
model_name=model_name,
reasoning_effort=reasoning_effort,
n_runs=num_runs
)
run_prediction_evaluation(
model_name=model_name,
reasoning_effort=reasoning_effort,
prompt_type=prompt_type,
subset=subset,
grader_func=grader_func,
num_runs=num_runs
)
predictions_ftmodel_medium_simple_prompt_model_grader_2, metrics_ftmodel_medium_simple_prompt_model_grader_2 = load_predictions(
model_name=model_name,
reasoning_effort=reasoning_effort,
prompt_type=prompt_type,
subset=subset,
grader_func_name=grader_func_name,
num_runs=num_runs
)
Generating predictions (run 1): 100%|██████████| 100/100 [01:16<00:00, 1.30it/s]
Generating predictions (run 2): 100%|██████████| 100/100 [01:25<00:00, 1.17it/s]
Generating predictions (run 3): 100%|██████████| 100/100 [01:07<00:00, 1.49it/s]
Grading predictions: 100%|██████████| 100/100 [00:22<00:00, 4.51it/s]
{'total_samples': 100, 'accuracy': 0.7730899999999999}
Grading predictions: 100%|██████████| 100/100 [00:17<00:00, 5.57it/s]
{'total_samples': 100, 'accuracy': 0.7697499999999999}
Grading predictions: 100%|██████████| 100/100 [00:19<00:00, 5.01it/s]
{'total_samples': 100, 'accuracy': 0.78996}
model_name = "o4-mini"
reasoning_effort = "medium"
prompt_type = "simple"
subset = "val"
grader_func = partial(python_model_grader, model_grader=model_grader_2)
grader_func_name = "python_model_grader_gpt41_score_model_2"
num_runs = 3
results_o4mini_model_grader_2 = generate_model_predictions(
subset=subset,
prompt_type=prompt_type,
model_name=model_name,
reasoning_effort=reasoning_effort,
n_runs=num_runs
)
run_prediction_evaluation(
model_name=model_name,
reasoning_effort=reasoning_effort,
prompt_type=prompt_type,
subset=subset,
grader_func=grader_func,
num_runs=num_runs
)
predictions_o4mini_medium_simple_prompt_model_grader_2, metrics_o4mini_medium_simple_prompt_model_grader_2 = load_predictions(
model_name=model_name,
reasoning_effort=reasoning_effort,
prompt_type=prompt_type,
subset=subset,
grader_func_name=grader_func_name,
num_runs=num_runs
)
Generating predictions (run 1): 0%| | 0/100 [00:00<?, ?it/s]
Generating predictions (run 1): 100%|██████████| 100/100 [01:11<00:00, 1.39it/s]
Generating predictions (run 2): 100%|██████████| 100/100 [00:42<00:00, 2.34it/s]
Generating predictions (run 3): 100%|██████████| 100/100 [00:41<00:00, 2.40it/s]
Grading predictions: 100%|██████████| 100/100 [00:19<00:00, 5.20it/s]
{'total_samples': 100, 'accuracy': 0.72282}
Grading predictions: 100%|██████████| 100/100 [00:19<00:00, 5.14it/s]
{'total_samples': 100, 'accuracy': 0.72807}
Grading predictions: 100%|██████████| 100/100 [00:17<00:00, 5.65it/s]
{'total_samples': 100, 'accuracy': 0.74812}
model_name = "o3"
reasoning_effort = "medium"
prompt_type = "simple"
subset = "val"
grader_func = partial(python_model_grader, model_grader=model_grader_2)
grader_func_name = "python_model_grader_gpt41_score_model_2"
num_runs = 3
results_o3_model_grader_2 = generate_model_predictions(
subset=subset,
prompt_type=prompt_type,
model_name=model_name,
reasoning_effort=reasoning_effort,
n_runs=num_runs
)
run_prediction_evaluation(
model_name=model_name,
reasoning_effort=reasoning_effort,
prompt_type=prompt_type,
subset=subset,
grader_func=grader_func,
num_runs=num_runs
)
predictions_o3_medium_simple_prompt_model_grader_2, metrics_o3_medium_simple_prompt_model_grader_2 = load_predictions(
model_name=model_name,
reasoning_effort=reasoning_effort,
prompt_type=prompt_type,
subset=subset,
grader_func_name=grader_func_name,
num_runs=num_runs
)
Generating predictions (run 1): 0%| | 0/100 [00:00<?, ?it/s]
Generating predictions (run 1): 100%|██████████| 100/100 [01:01<00:00, 1.62it/s]
Generating predictions (run 2): 100%|██████████| 100/100 [00:52<00:00, 1.90it/s]
Generating predictions (run 3): 100%|██████████| 100/100 [01:13<00:00, 1.37it/s]
Grading predictions: 100%|██████████| 100/100 [00:21<00:00, 4.55it/s]
{'total_samples': 100, 'accuracy': 0.74015}
Grading predictions: 100%|██████████| 100/100 [00:16<00:00, 6.08it/s]
{'total_samples': 100, 'accuracy': 0.7515900000000001}
Grading predictions: 100%|██████████| 100/100 [00:16<00:00, 6.13it/s]
{'total_samples': 100, 'accuracy': 0.74235}
avg_metrics_o4mini_medium_simple_prompt_model_grader_2, std_metrics_o4mini_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_o4mini_medium_simple_prompt_model_grader_2)
avg_metrics_o3_medium_simple_prompt_model_grader_2, std_metrics_o3_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_o3_medium_simple_prompt_model_grader_2)
avg_metrics_ftmodel_medium_simple_prompt_model_grader_2, std_metrics_ftmodel_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_ftmodel_medium_simple_prompt_model_grader_2)
model_metrics_avg = {
"o4-mini-medium-simple-prompt": avg_metrics_o4mini_medium_simple_prompt_model_grader_2,
"o3-medium-simple-prompt": avg_metrics_o3_medium_simple_prompt_model_grader_2,
"ftmodel-medium-simple-prompt": avg_metrics_ftmodel_medium_simple_prompt_model_grader_2
}
model_metrics_std = {
"o4-mini-medium-simple-prompt": std_metrics_o4mini_medium_simple_prompt_model_grader_2,
"o3-medium-simple-prompt": std_metrics_o3_medium_simple_prompt_model_grader_2,
"ftmodel-medium-simple-prompt": std_metrics_ftmodel_medium_simple_prompt_model_grader_2
}
plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title="Model Grader 2 Accuracy")
# Print mistakes where the model did not get the correct answer (score < 1.0)
mistakes = [
{"index": i, **res}
for i, res in enumerate(predictions_ftmodel_medium_simple_prompt_model_grader_2[0])
if res["score"] < 1.0
]
print(f"\nTotal mistakes: {len(mistakes)}")
for m in mistakes[5:10]:
print(f"\n[Sample {m['index']}]")
print(f" Model prediction: {m['model_prediction']}")
print(f" Reference answer: {m['reference_answer']}")
print(f" Score: {m['score']}")
Total mistakes: 84
[Sample 9]
Model prediction: ventilation-perfusion scan
Reference answer: lung ventilation-perfusion scan
Score: 0.989
[Sample 11]
Model prediction: autoimmune destruction of melanocytes (vitiligo)
Reference answer: autoimmune melanocyte destruction
Score: 0.991
[Sample 12]
Model prediction: contrast enhanced computed tomography of the abdomen
Reference answer: ct abdomen
Score: 0.812
[Sample 13]
Model prediction: unfractionated heparin
Reference answer: enoxaparin
Score: 0.428
[Sample 15]
Model prediction: t cell–mediated delayed (type iv) hypersensitivity
Reference answer: th1-mediated cytotoxicity
Score: 0.932
我们看到微调后准确率提高了约 5 个点。从前几个错误来看,模型倾向于严厉惩罚那些接近但临床上不完全相同的答案——例如未分级肝素与依诺肝素。它还会因为答案过长而扣分,即使答案是正确的,例如腹部增强 CT。
scores_o4 = [p['score'] for p in predictions_o4mini_medium_simple_prompt_model_grader_2[0]]
scores_ft = [p['score'] for p in predictions_ftmodel_medium_simple_prompt_model_grader_2[0]]
# Determine common bins for both histograms
all_scores = scores_o4 + scores_ft
bins = plt.hist(all_scores, bins=5, alpha=0)[1]
# Plot histograms and capture the counts
counts_o4, _, _ = plt.hist(
scores_o4,
bins=bins,
alpha=0.6,
label='o4-mini-medium-simple-prompt'
)
counts_ft, _, _ = plt.hist(
scores_ft,
bins=bins,
alpha=0.6,
label='ftmodel-medium-simple-prompt'
)
plt.title("Model Grader 2 Score Distribution by Model")
plt.xlabel("Score")
plt.ylabel("Count")
plt.ylim(top=75)
plt.legend()
# Print the bin counts
print("o4-mini-medium-simple-prompt bin counts:", counts_o4)
print("ftmodel-medium-simple-prompt bin counts:", counts_ft)
print("Max bin count (y-axis):", max(max(counts_o4), max(counts_ft)))
o4-mini-medium-simple-prompt bin counts: [ 2. 20. 13. 5. 60.]
ftmodel-medium-simple-prompt bin counts: [ 3. 12. 9. 6. 70.]
Max bin count (y-axis): 70.0
从得分分布来看,我们观察到 RFT 帮助将模型的预测从中间到低得分区域(0.2-0.6)转移到高得分区域(0.8-1.0)。由于评分器强调临床相似性而非词汇匹配,因此这种转变反映了更强的医学推理能力——而不仅仅是更好的措辞——根据我们的专家评分器。如(0.0-0.1)范围所示,一些本已较弱的预测得分更低,这暗示了残留的知识差距。
请注意,由于早期的 combined_grader
被设计为奖励词汇正确性,因此其准确性没有太大提高——这是符合预期的。这种差距再次强调了验证模型评分器的重要性,以及为什么您应该监控奖励黑客行为。在我们的案例中,我们使用 o3
来抽查评分行为,但领域专家评审至关重要。
模型的推理
微调模型的分析中的另一个重要点是推理摘要。模型可能在这些摘要中提供关键信息,探索它们以了解模型失败的地方可以驱动模型和评分器的系统提示更新。下面,我们展示了模型为展示其回答问题的方式而生成的推理摘要示例。
# Flatten the list of lists into a single list of dicts
predictions = {
"o4-mini": predictions_o4mini_medium_simple_prompt_model_grader_2,
"o3": predictions_o3_medium_simple_prompt_model_grader_2,
"ftmodel": predictions_ftmodel_medium_simple_prompt_model_grader_2,
}
for model_name, predictions in predictions.items():
all_preds = [item for sublist in predictions for item in sublist]
reasoning_tokens = [p['reasoning_tokens_used'] for p in all_preds if 'reasoning_tokens_used' in p]
mean_reasoning_tokens = np.mean(reasoning_tokens)
print(f"Mean reasoning_tokens_used {model_name}: {mean_reasoning_tokens:.0f}")
Mean reasoning_tokens_used o4-mini: 404
Mean reasoning_tokens_used o3: 384
Mean reasoning_tokens_used ftmodel: 925
微调模型花费更多推理令牌来思考问题。让我们通过推理摘要来可视化一个示例。
from IPython.display import Markdown, display
markdown_text = results_o4mini_model_grader_2[0][30]["summaries"]
display(Markdown(markdown_text))
**Choosing imaging study**
The user is looking for a single phrase regarding the imaging study for a 49-year-old male with chronic alcohol consumption and related symptoms. I'm considering whether to suggest a CT scan or MRI; however, a CT scan is often the initial choice for chronic pancreatitis. I’ll go with "abdominal ct scan" since it's standardized. I need to ensure I format it in lowercase without punctuation, following the user’s request. So the output is "abdominal ct scan."
markdown_text = results_ft_model_grader_2[0][30]["summaries"]
display(Markdown(markdown_text))
**Considering imaging options**
I'm analyzing the user's question about a 49-year-old male with symptoms suggesting steatorrhea, possibly indicating exocrine pancreatic insufficiency from chronic alcohol use. It raises concerns about chronic pancreatitis or pancreatic cancer. I think the best imaging choice is a contrast-enhanced CT scan of the abdomen because it effectively examines structural abnormalities. Alternatively, an endoscopic ultrasound could be more sensitive, but CT is generally preferred. So, my recommendation is to start with a contrast-enhanced CT scan.
**Determining the appropriate imaging study**
I'm analyzing the question about the most suitable imaging study for a patient with symptoms suggesting chronic pancreatitis. The standard approach for suspected chronic pancreatitis is a contrast-enhanced CT scan of the abdomen, as it effectively identifies pancreatic calcifications and structural changes. While MRCP and endoscopic ultrasound provide additional details, CT is often preferred as the initial test. Therefore, my answer should focus on recommending a "contrast-enhanced abdominal CT" as the next step in evaluation.
基础 o4-mini
的推理直接指向“腹部 CT 扫描”,主要关注小写格式,并且只给出了一个简短的“通常是首选”的理由。而微调模型则首先将患者的脂肪泻和饮酒史与慢性胰腺炎或癌症联系起来,权衡 CT 与 MRCP 和 EUS,并解释为什么增强腹部 CT 扫描能更好地显示钙化和结构改变。后者似乎更仔细,并且似乎学会了更详细地分解病例描述。
进一步提高得分
基础 o3
和我们微调的 o4-mini
在相同的样本上得分有时为零——这是一个危险信号,表明参考标签可能错误。在增加计算量之前,请投资于数据质量:让领域专家重新标记有噪声的切片,分析模型的推理,然后收紧评分器提示。干净、可信的数据和有条理的更新几乎总是比额外的 epoch 能带来更高的准确性。
结论
我们已经研究了如何设计评分器,为 o4-mini
在 RFT 期间提供所需的详细反馈。这种信号正是帮助模型在基线之上真正学习和改进的关键。模型评分器在这方面功能强大——但前提是它们设计得当。粗心的评分器或粗心的数据可能会发送错误的信号,并将模型引向错误的方向。
您现在已经准备好使用 OpenAI API 在您自己的模型上应用强化微调。我们期待看到您如何通过自定义评分器和更智能的模型行为来突破推理和工具使用的界限!
有关故障排除或后续步骤,请参阅 OpenAI 微调文档。