Evals Example: Push Notification Summarizer Monitoring

Evals are task-oriented and iterative; they're the best way to check how your LLM integration is doing and to improve it.

In the following eval, we'll focus on one task: detecting whether a prompt change introduces a regression.

Our use case is:

  1. We log chat completion requests in production by setting store=True on each request. Note that you can also enable logging "on by default" in your admin panel (https://platform.openai.com/settings/organization/data-controls/data-retention).
  2. We want to see whether our prompt changes introduced any regressions.

Evals structure

Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can contain many Runs, each of which is evaluated against your testing criteria.

from openai import AsyncOpenAI
import os
import asyncio

os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")
client = AsyncOpenAI()

Use case

The integration under test is a push notification summarizer: a single chat completions call that takes a list of push notifications and collapses them into one.

Generating our test data

We'll generate mock production chat completion requests with two different prompt versions to test how each performs. The first is a "good" prompt and the second is a "bad" prompt. They'll carry different metadata, which we'll use later.

push_notification_data = [
        """

- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
        """

- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
        """

- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
        """

- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
        """

- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
        """

- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
        """

- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
        """

- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
        """

- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
        """

- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]
PROMPTS = [
    (
        """
        You are a helpful assistant that summarizes push notifications.
        You are given a list of push notifications and you need to collapse them into a single one.
        Output only the final summary, nothing else.
        """,
        "v1"
    ),
    (
        """
        You are a helpful assistant that summarizes push notifications.
        You are given a list of push notifications and you need to collapse them into a single one.
        The summary should be longer than it needs to be and include more information than is necessary.
        Output only the final summary, nothing else.
        """,
        "v2"
    )
]

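# Create one stored completion per (notification batch, prompt version) pair,
# tagging each with metadata so we can filter on it later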
tasks = []
for notifications in push_notification_data:
    for (prompt, version) in PROMPTS:
        tasks.append(client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "developer", "content": prompt},
                {"role": "user", "content": notifications},
            ],
            store=True,
            metadata={"prompt_version": version, "usecase": "push_notifications_summarizer"},
        ))
await asyncio.gather(*tasks)

You can view the completions you just created at https://platform.openai.com/logs.

Make sure the completions show up there, since they're a prerequisite for the next step.

completions = await client.chat.completions.list()
assert completions.data, "No completions found. You may need to enable logs in your admin panel."
completions.data[0]
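If your project stores completions from other workloads too, you can narrow this check to just our use case. A sketch, assuming your installed openai SDK version exposes the metadata filter on the list call:

# Assumes chat.completions.list() accepts a metadata filter in your SDK version.
filtered = await client.chat.completions.list(
    metadata={"usecase": "push_notifications_summarizer"}
)
print(f"{len(filtered.data)} stored completions tagged for this use case")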

Setting up your eval

An Eval holds the configuration that is shared across multiple Runs. It has two components:

  1. Data source config (data_source_config) - the schema (columns) that your future Runs conform to.
    • The data_source_config uses JSON Schema to define what variables are available in the eval (a hypothetical custom-schema sketch follows this list).
  2. Testing criteria (testing_criteria) - how you'll determine whether your integration is working for each row of your data source.
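Here is the custom-schema sketch mentioned above: if you were uploading your own rows instead of grading stored completions, the data_source_config would carry an explicit JSON Schema per item. This is hypothetical for this guide (we don't use it below), and the "notifications" property is an illustrative column name:

# Hypothetical alternative: a "custom" data source where each row ("item")
# carries a raw "notifications" string rather than a stored completion.
custom_data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {"notifications": {"type": "string"}},
        "required": ["notifications"],
    },
    "include_sample_schema": True,  # also exposes {{sample.output_text}}
}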

For this use case, since we're using stored completions instead, we'll set up the data_source_config like this:

Important: You'll likely have many different stored-completions use cases; metadata is the best way to keep track of them so that your evals stay scoped and task-oriented.

# Grade only the completions that belong to this use case, filtering on
# the metadata we attached when storing them
data_source_config = {
    "type": "stored_completions",
    "metadata": {
        "usecase": "push_notifications_summarizer"
    }
}

The data_source_config defines the variables available throughout the rest of the eval.

The stored completions config provides two variables for use in your eval:

  1. {{item.input}} - the messages sent to the completions call
  2. {{sample.output_text}} - the text response from the assistant

Now we'll use those variables to set up our testing criteria.

GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct, if not, then incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.input}}
Summary: {{sample.output_text}}
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["correct"],
    "labels": ["correct", "incorrect"],
}

push_notification_grader is a model grader (LLM-as-a-judge) that looks at the input ({{item.input}}) and the generated summary ({{sample.output_text}}) and labels it as "correct" or "incorrect".

Note: under the hood, this uses structured outputs so that the labels are always valid.
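That guarantee is the same one you get when you constrain a model's output yourself with structured outputs. As a rough illustration of the idea (this is not the Evals API's internal code):

from typing import Literal
from pydantic import BaseModel

# Illustration only: a response model that can't emit anything but our labels,
# which is the property the label_model grader relies on under the hood.
class GraderVerdict(BaseModel):
    label: Literal["correct", "incorrect"]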

Now let's create our eval and start adding data to it!

eval_create_result = await client.evals.create(
    name="Push Notification Completion Monitoring",
    metadata={"description": "This eval monitors completions"},
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)

eval_id = eval_create_result.id

Creating runs

Now that we have our eval set up with its test_criteria, we can start adding runs. We want to compare performance between our two prompt versions.

To do this, we simply define the source as "stored_completions" with a metadata filter for each prompt version.

# Grade prompt_version=v1
eval_run_result = await client.evals.runs.create(
    eval_id=eval_id,
    name="v1-run",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {
                "prompt_version": "v1",
            }
        }
    }
)
print(eval_run_result.report_url)
# Grade prompt_version=v2
eval_run_result_v2 = await client.evals.runs.create(
    eval_id=eval_id,
    name="v2-run",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {
                "prompt_version": "v2",
            }
        }
    }
)
print(eval_run_result_v2.report_url)

Just to be thorough, let's also see how this prompt performs with 4o (instead of 4o-mini), using both prompt versions as a starting point.

All we have to do is reference the input messages ({{item.input}}) and set the model to 4o. Since we don't have any stored completions from 4o yet, this eval run will generate new completions.

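# Launch one run per prompt version; since no 4o completions are stored yet,
# each run samples fresh responses from gpt-4o before grading them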
tasks = []
for prompt_version in ["v1", "v2"]:
    tasks.append(client.evals.runs.create(
        eval_id=eval_id,
        name=f"post-fix-new-model-run-{prompt_version}",
        data_source={
            "type": "completions",
            "input_messages": {
                "type": "item_reference",
                "item_reference": "item.input",
            },
            "model": "gpt-4o",
            "source": {
                "type": "stored_completions",
                "metadata": {
                    "prompt_version": prompt_version,
                }
            }
        },
    ))
result = await asyncio.gather(*tasks)
for run in result:
    print(run.report_url)

If you check out that report, you'll see the regression introduced by prompt_version=v2!
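You can also confirm this from the API instead of the dashboard by polling each run and comparing its aggregate results. A minimal sketch, assuming run objects expose status and result_counts (passed/failed/total) as in current SDK versions:

# Poll a run until grading finishes, then inspect aggregate pass/fail counts.
async def wait_for_run(eval_id: str, run_id: str):
    while True:
        run = await client.evals.runs.retrieve(run_id, eval_id=eval_id)
        if run.status in ("completed", "failed", "canceled"):
            return run
        await asyncio.sleep(5)

finished = await wait_for_run(eval_id, eval_run_result_v2.id)
print(finished.status, finished.result_counts)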

Congratulations, you just caught a bug! Now you can revert the change, try another prompt revision, and so on.