Evals example: push notifications summarizer regression
Evals are task-oriented and iterative: they're the best way to check how your LLM integration is doing and to improve it.
In the following eval, we'll focus on the task of detecting whether a prompt change causes a regression.
Our use case is:
- We have an LLM integration that takes a list of push notifications and summarizes them into a single, condensed statement.
- We want to detect whether a prompt change regresses that behavior.
Eval structure
Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many Runs, each of which is evaluated by your testing criteria.
import openai
from openai.types.chat import ChatCompletion
import pydantic
import os
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")
Use case
We're testing the following integration, a push notifications summarizer, which takes multiple push notifications and collapses them into a single one. It's implemented as a chat completions call.
class PushNotifications(pydantic.BaseModel):
    notifications: str

print(PushNotifications.model_json_schema())
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you should collapse them into a single one.
Output only the final summary, nothing else.
"""
def summarize_push_notification(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result
example_push_notifications_list = PushNotifications(notifications="""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)
Setting up your eval
An Eval holds the configuration that is shared across multiple Runs. It has two components:
- A data source config, data_source_config: the schema (columns) that your future Runs conform to. The data_source_config uses JSON Schema to define what variables are available in the Eval.
- Testing criteria, testing_criteria: how you determine whether your integration is working for each row of your data source.
For this use case, we want to test whether the push notification summary completions are good, so that's what we'll set our eval up to do.
# We want our input data to be available in our variables, so we set the item_schema
# to PushNotifications.model_json_schema()
data_source_config = {
    "type": "custom",
    "item_schema": PushNotifications.model_json_schema(),
    # We're going to be uploading completions from the API, so we tell the Eval to expect this
    "include_sample_schema": True,
}
This data_source_config defines what variables are available throughout the eval.
This item schema:
{
    "properties": {
        "notifications": {
            "title": "Notifications",
            "type": "string"
        }
    },
    "required": ["notifications"],
    "title": "PushNotifications",
    "type": "object"
}
means that we'll have the variable {{item.notifications}} available in our eval. "include_sample_schema": True means that we'll have the variable {{sample.output_text}} available in our eval. The sketch below shows how these variables map onto a row of run data.
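For intuition, here is a minimal, hypothetical sketch of one row of the run data we'll upload later. {{item.notifications}} resolves from the "item" object, and {{sample.output_text}} is derived from the sampled completion (here, the first choice's assistant message, matching how we upload completions below):

# A hypothetical row of run data, for illustration only.
example_row = {
    # {{item.notifications}} -> example_row["item"]["notifications"]
    "item": {"notifications": "- Alert: Unauthorized login attempt detected."},
    # {{sample.output_text}} -> the assistant message content of the completion
    "sample": {
        "model": "gpt-4o-mini",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "Security alert: unauthorized login attempt."},
                "finish_reason": "stop",
            }
        ],
    },
}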
Now, let's use those variables to set up our testing criteria.
GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct; if not, then incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.notifications}}
Summary: {{sample.output_text}}
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["correct"],
    "labels": ["correct", "incorrect"],
}
push_notification_grader is a model grader (LLM-as-a-judge) that looks at the input {{item.notifications}} and the generated summary {{sample.output_text}} and labels it as "correct" or "incorrect". We then use passing_labels to indicate what constitutes a passing answer.
Note: under the hood, this uses structured outputs so that the labels are always valid.
Now we'll create our eval and start adding data to it!
eval_create_result = openai.evals.create(
    name="Push Notification Summary Workflow",
    metadata={
        "description": "This eval checks if the push notification summary is correct.",
    },
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)
eval_id = eval_create_result.id
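As an optional sanity check (a minimal sketch, not part of the walkthrough above), you can fetch the eval back by id and confirm the stored configuration:

# Optional: fetch the eval back by id to confirm its configuration.
fetched_eval = openai.evals.retrieve(eval_id)
print(fetched_eval.name)
print(fetched_eval.testing_criteria)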
Creating runs
Now that we have our eval set up with its testing_criteria, we can start adding runs! We'll start with some push notification data.
push_notification_data = [
"""
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
"""
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
"""
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
"""
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
"""
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
"""
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
"""
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
"""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
"""
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
"""
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]
For our first run, we'll use the baseline summarize_push_notification function from above. We'll loop through our dataset, make the completions calls, and then submit the results for grading.
run_data = []
for push_notifications in push_notification_data:
    result = summarize_push_notification(push_notifications)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": result.model_dump()
    })

eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="baseline-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(eval_run_result)
# Inspect the results in the UI
print(eval_run_result.report_url)
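Runs are graded asynchronously, so the report may take a moment to fill in. If you'd rather watch from code than from the UI, here is a minimal polling sketch (the status values and the result_counts field are assumptions worth verifying against your installed SDK version):

import time

# Poll the run until grading finishes, then print the aggregate results.
run = openai.evals.runs.retrieve(eval_run_result.id, eval_id=eval_id)
while run.status in ("queued", "in_progress"):
    time.sleep(5)
    run = openai.evals.runs.retrieve(eval_run_result.id, eval_id=eval_id)
print(run.status)
print(run.result_counts)  # assumption: passed / failed / errored / total counts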
Now let's simulate a regression. Below is our original prompt, followed by a version in which a developer has broken it.
# The original prompt
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you should collapse them into a single one.
Output only the final summary, nothing else.
"""

# The "broken" prompt
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you should collapse them into a single one.
You should make the summary longer than it needs to be and include more information than is necessary.
"""
def summarize_push_notification_bad(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

run_data = []
for push_notifications in push_notification_data:
    result = summarize_push_notification_bad(push_notifications)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": result.model_dump()
    })

eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="regression-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(eval_run_result.report_url)
If you check out that report, you'll see that it scores much lower than the baseline-run.
Congrats, you just stopped a bug from reaching your users!
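You can also catch this kind of regression in CI by comparing run scores programmatically. A rough sketch, assuming both runs have finished grading and that result_counts is shaped as noted above:

# List this eval's runs and compare their pass rates.
for run in openai.evals.runs.list(eval_id=eval_id):
    counts = run.result_counts
    print(f"{run.name}: {counts.passed}/{counts.total} passed")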
A quick note: Evals don't yet support the Responses API; however, you can use the following code to transform a response into the completions format.
def summarize_push_notification_responses(push_notifications: str):
    result = openai.responses.create(
        model="gpt-4o",
        input=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

def transform_response_to_completion(response):
    # Map the response onto the completions shape that Evals expects.
    completion = {
        "model": response.model,
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": response.output_text
            },
            "finish_reason": "stop",
        }]
    }
    return completion
run_data = []
for push_notifications in push_notification_data:
    response = summarize_push_notification_responses(push_notifications)
    completion = transform_response_to_completion(response)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": completion
    })

report_response = openai.evals.runs.create(
    eval_id=eval_id,
    name="responses-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(report_response.report_url)
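Finally, when a run scores lower than expected, you can pull the individual graded rows to see exactly which summaries failed. A minimal sketch using the output items endpoint (the per-item attribute names here are assumptions; check your SDK version):

# Inspect the individual graded rows for a run.
output_items = openai.evals.runs.output_items.list(
    run_id=report_response.id,
    eval_id=eval_id,
)
for item in output_items:
    # assumption: each item carries a pass/fail status plus the grader's results
    print(item.status, item.results)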