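Evals example: monitoring a push notification summarizer
Evals are task-oriented and iterative; they are the best way to check how your LLM integration is performing and to improve it.
In the following eval, we focus on the task of detecting whether a prompt change introduced a regression.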
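Our use case is:
- We have been logging chat completion requests by setting store=True in our production chat completion requests. Note that you can also enable "on by default" logging in your admin panel (https://platform.openai.com/settings/organization/data-controls/data-retention).
- We want to see whether our prompt changes have introduced regressions.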
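Evals structure
Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An "Eval" can have many "Runs", each of which is evaluated against your testing criteria.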
from openai import AsyncOpenAI
import os
import asyncio
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")
client = AsyncOpenAI()
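Use case
We're testing the following integration: a push notification summarizer, which takes in multiple push notifications and collapses them into a single one. It is a chat completions call.
Generating our test data
I'm going to produce simulated production chat completion requests with two different prompt versions to test how each performs. The first is a "good" prompt; the second is a "bad" prompt. They carry different metadata, which we'll use later.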
push_notification_data = [
"""
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
"""
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
"""
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
"""
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
"""
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
"""
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
"""
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
"""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
"""
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
"""
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]
PROMPTS = [
(
"""
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
""",
"v1"
),
(
"""
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
The summary should be longer than it needs to be and include more information than is necessary.
Output only the final summary, nothing else.
""",
"v2"
)
]
tasks = []
for notifications in push_notification_data:
    for (prompt, version) in PROMPTS:
        tasks.append(client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "developer", "content": prompt},
                {"role": "user", "content": notifications},
            ],
            store=True,
            metadata={"prompt_version": version, "usecase": "push_notifications_summarizer"},
        ))
await asyncio.gather(*tasks)
You can view the completions you just created at https://platform.openai.com/logs.
Make sure the chat completions show up there, since they are required for the next step.
completions = await client.chat.completions.list()
assert completions.data, "No completions found. You may need to enable logs in your admin panel."
completions.data[0]
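Before moving on, it can help to confirm that both prompt versions actually produced stored completions. The following is a minimal sketch of such a check, not part of the original notebook. The helper name count_by_prompt_version is hypothetical, and it assumes each stored completion exposes a metadata mapping; stand-in objects are used so the sketch runs without calling the API.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_prompt_version(completions):
    """Count stored completions per `prompt_version` metadata value."""
    counts = Counter()
    for completion in completions:
        # Assumption: each stored completion exposes a `metadata` mapping.
        metadata = getattr(completion, "metadata", None) or {}
        counts[metadata.get("prompt_version", "unknown")] += 1
    return dict(counts)


# Stand-in objects so the sketch runs offline.
sample = [
    SimpleNamespace(metadata={"prompt_version": "v1"}),
    SimpleNamespace(metadata={"prompt_version": "v2"}),
    SimpleNamespace(metadata={"prompt_version": "v1"}),
    SimpleNamespace(metadata=None),
]
print(count_by_prompt_version(sample))
```

In the notebook, you would pass completions.data instead of the stand-in list. A large "unknown" bucket suggests that some requests were logged without the metadata the eval filters on.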
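Setting up your eval
An Eval holds the configuration that is shared across multiple Runs. It has two components:
- Data source configuration (data_source_config): the schema (columns) that your future Runs conform to. The data_source_config uses JSON Schema to define what variables are available in the Eval.
- Testing criteria (testing_criteria): how you'll determine whether your integration is working for each row of your data source.
For this use case, we're using stored completions, so we'll set up that data_source_config.
Important: you are likely to have many different stored completion use cases. Metadata is the best way to keep track of them so that each eval stays focused and task-oriented.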
# Select stored completions by metadata so that this eval only covers
# the push notification summarizer use case.
data_source_config = {
"type": "stored_completions",
"metadata": {
"usecase": "push_notifications_summarizer"
}
}
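This data_source_config defines what variables are available throughout the eval.
The stored completions config provides two variables for you to use throughout your eval:
- {{item.input}}: the messages sent to the completions call
- {{sample.output_text}}: the text response from the assistant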
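To make the variable syntax concrete, here is a small sketch of how a {{dotted.path}} placeholder could be filled from a context of item and sample values. This is only an illustration of the idea; it is not the platform's actual rendering code, and the helper name render_template is hypothetical. A plain string stands in for item.input here, whereas the real value is the list of messages sent to the model.

```python
import re

# Matches placeholders such as {{item.input}} or {{ sample.output_text }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_template(template, context):
    """Replace each {{dotted.path}} with the value found in `context`."""

    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError on an unknown variable
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)


# Illustrative values only; real values come from the stored completion.
context = {
    "item": {"input": "- Your package has been delivered!"},
    "sample": {"output_text": "Package delivered."},
}
rendered = render_template(
    "Push notifications: {{item.input}}\nSummary: {{sample.output_text}}",
    context,
)
print(rendered)
```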
We'll now use those variables to set up our testing criteria.
GRADER_DEVELOPER_PROMPT = """
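Label the following push notification summary as either correct or incorrect.
The push notifications and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct; if not, then incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.input}}
Summary: {{sample.output_text}}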
"""
push_notification_grader = {
"name": "Push Notification Summary Grader",
"type": "label_model",
"model": "o3-mini",
"input": [
{
"role": "developer",
"content": GRADER_DEVELOPER_PROMPT,
},
{
"role": "user",
"content": GRADER_TEMPLATE_PROMPT,
},
],
"passing_labels": ["correct"],
"labels": ["correct", "incorrect"],
}
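The push_notification_grader is a model grader (LLM-as-a-judge) that looks at the input {{item.input}} and the generated summary {{sample.output_text}} and labels it as "correct" or "incorrect".
Note: under the hood, this uses structured outputs so that labels are always valid.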
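As a rough picture of how labels and passing_labels relate, the sketch below turns a list of grader labels into a pass rate. It only illustrates the idea; the platform computes its own results, and both the helper name pass_rate and the example labels are made up.

```python
def pass_rate(labels, passing_labels):
    """Fraction of graded rows whose label counts as a pass."""
    if not labels:
        raise ValueError("no labels to score")
    passing = set(passing_labels)
    return sum(label in passing for label in labels) / len(labels)


# Hypothetical grader output for five rows of a run.
example_labels = ["correct", "correct", "incorrect", "correct", "incorrect"]
print(pass_rate(example_labels, ["correct"]))
```

Any label that is not listed in passing_labels counts as a failure for that row, which is why the grader above lists only "correct" there.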
Now we'll create our eval and start adding data to it.
eval_create_result = await client.evals.create(
name="Push Notification Completion Monitoring",
metadata={"description": "This eval monitors completions"},
data_source_config=data_source_config,
testing_criteria=[push_notification_grader],
)
eval_id = eval_create_result.id
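Creating runs
Now that we have our eval set up with our testing_criteria, we can start adding runs.
I want to compare the performance of my two prompt versions.
To do this, we just define our source as "stored_completions" with a metadata filter for each of our prompt versions.
# Grade prompt_version=v1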
eval_run_result = await client.evals.runs.create(
eval_id=eval_id,
name="v1-run",
data_source={
"type": "completions",
"source": {
"type": "stored_completions",
"metadata": {
"prompt_version": "v1",
}
}
}
)
print(eval_run_result.report_url)
# Grade prompt_version=v2
eval_run_result_v2 = await client.evals.runs.create(
eval_id=eval_id,
name="v2-run",
data_source={
"type": "completions",
"source": {
"type": "stored_completions",
"metadata": {
"prompt_version": "v2",
}
}
}
)
print(eval_run_result_v2.report_url)
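The two runs above differ only in their name and metadata filter, so the payload can be built by a small helper. This is a sketch under that assumption; stored_completions_source is a hypothetical name, not part of the SDK, and it only returns the same dictionary shape used in the calls above.

```python
def stored_completions_source(prompt_version):
    """Build a run `data_source` filtered to one prompt version."""
    return {
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {"prompt_version": prompt_version},
        },
    }


# One payload per prompt version, keyed by the run name used above.
run_payloads = {
    f"{version}-run": stored_completions_source(version)
    for version in ["v1", "v2"]
}
print(sorted(run_payloads))
```

Each payload could then be passed as data_source to client.evals.runs.create, together with the matching run name, exactly as in the two calls above.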
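To be thorough, let's see how this prompt would do with 4o instead of 4o-mini, using both prompt versions as the starting point.
All we have to do is reference the input messages ({{item.input}}) and set the model to 4o. Since we don't already have any stored completions for 4o, this eval run will generate new completions.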
tasks = []
for prompt_version in ["v1", "v2"]:
    tasks.append(client.evals.runs.create(
        eval_id=eval_id,
        name=f"post-fix-new-model-run-{prompt_version}",
        data_source={
            "type": "completions",
            "input_messages": {
                "type": "item_reference",
                "item_reference": "item.input",
            },
            "model": "gpt-4o",
            "source": {
                "type": "stored_completions",
                "metadata": {
                    "prompt_version": prompt_version,
                }
            }
        },
    ))
result = await asyncio.gather(*tasks)
for run in result:
    print(run.report_url)
If you view that report, you'll see that prompt_version=v2 has a regression!
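If you want to flag a regression like this programmatically rather than by reading the report, one option is to compare pass rates between a baseline run and the candidate runs. The sketch below assumes you already have passed and total counts for each run, for example copied from the report. The helper name find_regressions, the 0.05 tolerance, and the counts are all made up for illustration.

```python
def find_regressions(baseline, candidates, tolerance=0.05):
    """Return names of candidate runs whose pass rate trails the baseline.

    Each run is described by a dict with `passed` and `total` counts.
    """

    def rate(counts):
        return counts["passed"] / counts["total"] if counts["total"] else 0.0

    baseline_rate = rate(baseline)
    return [
        name
        for name, counts in candidates.items()
        if baseline_rate - rate(counts) > tolerance
    ]


# Made-up counts for illustration; real numbers come from the run reports.
v1_counts = {"passed": 9, "total": 10}
candidate_counts = {
    "v2-run": {"passed": 3, "total": 10},
    "v1-rerun": {"passed": 9, "total": 10},
}
print(find_regressions(v1_counts, candidate_counts))
```

The tolerance leaves room for grader noise between otherwise equivalent runs; how large it should be depends on how many rows each run grades.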