Evals Example: Push Notification Summarizer Monitoring

Evals are task-oriented and iterative; they're the best way to check how your LLM integration is doing and to improve it.

In the following eval, we'll focus on one task: detecting whether a prompt change introduces a regression.

Our use case is:

  1. We log chat completion requests in production by setting store=True on each request. Note that you can also enable logging "on by default" in your admin panel (https://platform.openai.com/settings/organization/data-controls/data-retention).
  2. We want to see whether our prompt changes introduced any regressions.

Evals structure

Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can contain many Runs, each of which is evaluated against your testing criteria.

from openai import AsyncOpenAI
import os
import asyncio

os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")
client = AsyncOpenAI()

Use case

The integration under test is a push notification summarizer: a single chat completions call that takes a list of push notifications and collapses them into one.

Generating our test data

We'll generate mock production chat completion requests with two different prompt versions to test how each performs. The first is a "good" prompt and the second is a "bad" prompt. They'll carry different metadata, which we'll use later.

push_notification_data = [
        """

- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
        """

- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
        """

- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
        """

- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
        """

- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
        """

- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
        """

- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
        """

- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
        """

- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
        """

- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]
PROMPTS = [
    (
        """
        You are a helpful assistant that summarizes push notifications.
        You are given a list of push notifications and you need to collapse them into a single one.
        Output only the final summary, nothing else.
        """,
        "v1"
    ),
    (
        """
        You are a helpful assistant that summarizes push notifications.
        You are given a list of push notifications and you need to collapse them into a single one.
        The summary should be longer than it needs to be and include more information than is necessary.
        Output only the final summary, nothing else.
        """,
        "v2"
    )
]

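# Create one stored completion per (notification batch, prompt version) pair,
# tagging each with metadata so we can filter on it later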
tasks = []
for notifications in push_notification_data:
    for (prompt, version) in PROMPTS:
        tasks.append(client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "developer", "content": prompt},
                {"role": "user", "content": notifications},
            ],
            store=True,
            metadata={"prompt_version": version, "usecase": "push_notifications_summarizer"},
        ))
await asyncio.gather(*tasks)

You can view the completions you just created at https://platform.openai.com/logs.

Make sure the completions show up there, since they're a prerequisite for the next step.

completions = await client.chat.completions.list()
assert completions.data, "No completions found. You may need to enable logs in your admin panel."
completions.data[0]
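If your project stores completions from other workloads too, you can narrow this check to just our use case. A sketch, assuming your installed openai SDK version exposes the metadata filter on the list call:

# Assumes chat.completions.list() accepts a metadata filter in your SDK version.
filtered = await client.chat.completions.list(
    metadata={"usecase": "push_notifications_summarizer"}
)
print(f"{len(filtered.data)} stored completions tagged for this use case")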

Setting up your eval

An Eval holds the configuration that is shared across multiple Runs. It has two components:

  1. Data source config (data_source_config) - the schema (columns) that your future Runs conform to.
    • The data_source_config uses JSON Schema to define what variables are available in the eval (a hypothetical custom-schema sketch follows this list).
  2. Testing criteria (testing_criteria) - how you'll determine whether your integration is working for each row of your data source.
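Here is the custom-schema sketch mentioned above: if you were uploading your own rows instead of grading stored completions, the data_source_config would carry an explicit JSON Schema per item. This is hypothetical for this guide (we don't use it below), and the "notifications" property is an illustrative column name:

# Hypothetical alternative: a "custom" data source where each row ("item")
# carries a raw "notifications" string rather than a stored completion.
custom_data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {"notifications": {"type": "string"}},
        "required": ["notifications"],
    },
    "include_sample_schema": True,  # also exposes {{sample.output_text}}
}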

For this use case, since we're using stored completions instead, we'll set up the data_source_config like this:

Important: You'll likely have many different stored-completions use cases; metadata is the best way to keep track of them so that your evals stay scoped and task-oriented.

# Grade only the completions that belong to this use case, filtering on
# the metadata we attached when storing them
data_source_config = {
    "type": "stored_completions",
    "metadata": {
        "usecase": "push_notifications_summarizer"
    }
}

The data_source_config defines the variables available throughout the rest of the eval.

The stored completions config provides two variables for use in your eval:

  1. {{item.input}} - the messages sent to the completions call
  2. {{sample.output_text}} - the text response from the assistant

Now we'll use those variables to set up our testing criteria.

GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct, if not, then incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.input}}
Summary: {{sample.output_text}}
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["correct"],
    "labels": ["correct", "incorrect"],
}

push_notification_grader is a model grader (LLM-as-a-judge) that looks at the input ({{item.input}}) and the generated summary ({{sample.output_text}}) and labels it as "correct" or "incorrect".

Note: under the hood, this uses structured outputs so that the labels are always valid.
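That guarantee is the same one you get when you constrain a model's output yourself with structured outputs. As a rough illustration of the idea (this is not the Evals API's internal code):

from typing import Literal
from pydantic import BaseModel

# Illustration only: a response model that can't emit anything but our labels,
# which is the property the label_model grader relies on under the hood.
class GraderVerdict(BaseModel):
    label: Literal["correct", "incorrect"]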

Now let's create our eval and start adding data to it!

eval_create_result = await client.evals.create(
    name="Push Notification Completion Monitoring",
    metadata={"description": "This eval monitors completions"},
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)

eval_id = eval_create_result.id

Creating runs

Now that we have our eval set up with its test_criteria, we can start adding runs. We want to compare performance between our two prompt versions.

To do this, we simply define the source as "stored_completions" with a metadata filter for each prompt version.

# Grade prompt_version=v1
eval_run_result = await client.evals.runs.create(
    eval_id=eval_id,
    name="v1-run",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {
                "prompt_version": "v1",
            }
        }
    }
)
print(eval_run_result.report_url)
# Grade prompt_version=v2
eval_run_result_v2 = await client.evals.runs.create(
    eval_id=eval_id,
    name="v2-run",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {
                "prompt_version": "v2",
            }
        }
    }
)
print(eval_run_result_v2.report_url)

Just to be thorough, let's also see how this prompt performs with 4o (instead of 4o-mini), using both prompt versions as a starting point.

All we have to do is reference the input messages ({{item.input}}) and set the model to 4o. Since we don't have any stored completions from 4o yet, this eval run will generate new completions.

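# Launch one run per prompt version; since no 4o completions are stored yet,
# each run samples fresh responses from gpt-4o before grading them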
tasks = []
for prompt_version in ["v1", "v2"]:
    tasks.append(client.evals.runs.create(
        eval_id=eval_id,
        name=f"post-fix-new-model-run-{prompt_version}",
        data_source={
            "type": "completions",
            "input_messages": {
                "type": "item_reference",
                "item_reference": "item.input",
            },
            "model": "gpt-4o",
            "source": {
                "type": "stored_completions",
                "metadata": {
                    "prompt_version": prompt_version,
                }
            }
        },
    ))
result = await asyncio.gather(*tasks)
for run in result:
    print(run.report_url)

If you check out that report, you'll see the regression introduced by prompt_version=v2!
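You can also confirm this from the API instead of the dashboard by polling each run and comparing its aggregate results. A minimal sketch, assuming run objects expose status and result_counts (passed/failed/total) as in current SDK versions:

# Poll a run until grading finishes, then inspect aggregate pass/fail counts.
async def wait_for_run(eval_id: str, run_id: str):
    while True:
        run = await client.evals.runs.retrieve(run_id, eval_id=eval_id)
        if run.status in ("completed", "failed", "canceled"):
            return run
        await asyncio.sleep(5)

finished = await wait_for_run(eval_id, eval_run_result_v2.id)
print(finished.status, finished.result_counts)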

Congratulations, you just caught a bug! Now you can revert the change, try another prompt revision, and so on.