Evals example: Push Notifications Summarizer regression

Evals are task-oriented and iterative; they're the best way to check how your LLM integration is doing and then improve it.

In the following eval, we'll focus on the task of detecting whether changes to my prompt cause regressions.

Our use case is:

  1. I have an LLM integration that takes a list of push notifications and condenses them into a single summary statement.
  2. I want to detect whether prompt changes cause behavior regressions.

Evals structure

Evals have two parts: the "Eval" and the "Runs". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many Runs, each of which is evaluated by the testing criteria.

import openai
from openai.types.chat import ChatCompletion
import pydantic
import os

os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")

Use case

We're testing the following integration, a push notifications summarizer, which takes multiple push notifications and collapses them into a single one. It's a chat completions call.

class PushNotifications(pydantic.BaseModel):
    notifications: str

print(PushNotifications.model_json_schema())

DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you should collapse them into a single one.
Output only the final summary, nothing else.
"""

def summarize_push_notification(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

example_push_notifications_list = PushNotifications(notifications="""

- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)

Setting up your eval

An Eval holds the configuration that is shared across multiple Runs. It has two components:

  1. Data source config data_source_config - the schema (columns) that your future Runs will conform to.
    • The data_source_config uses JSON Schema to define what variables are available in the Eval.
  2. Testing criteria testing_criteria - how you'll determine whether your integration is working for each row of your data source.

For this use case, we want to test whether the push notification summary completions are good, so we'll set up our eval with that in mind.

# We want our input data to be available in our variables, so we set the
# item_schema to PushNotifications.model_json_schema()
data_source_config = {
    "type": "custom",
    "item_schema": PushNotifications.model_json_schema(),
    # We'll be uploading completions from the API, so we tell the Eval to expect this
    "include_sample_schema": True,
}

This data_source_config defines what variables are available throughout the eval.

This item schema:

{
  "properties": {
    "notifications": {
      "title": "Notifications",
      "type": "string"
    }
  },
  "required": ["notifications"],
  "title": "PushNotifications",
  "type": "object"
}

means that we'll have the variable {{item.notifications}} available in our eval.

"include_sample_schema": True 意味着我们将在评估中使用 {{sample.output_text}} 变量。

Now we'll use those variables to set up our testing criteria.

GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct; if not, then label it as incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.notifications}}
Summary: {{sample.output_text}}
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["correct"],
    "labels": ["correct", "incorrect"],
}

The push_notification_grader is a model grader (llm-as-a-judge) that looks at the input {{item.notifications}} and the generated summary {{sample.output_text}}, and labels it as "correct" or "incorrect". We then instruct the eval, via passing_labels, what constitutes a passing answer.

Note: under the hood, this uses structured outputs so that labels are always valid.
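As an illustration of that mechanism, here's a hedged sketch of how you could constrain a grader's labels yourself using the SDK's structured-output parse helper. This is not the Evals internals, just the same idea applied directly; GraderLabel and the inline user message are made up for the example:

import typing

class GraderLabel(pydantic.BaseModel):
    # The schema only admits these two labels, so the model can't return anything else.
    label: typing.Literal["correct", "incorrect"]

parsed = openai.beta.chat.completions.parse(
    model="o3-mini",
    messages=[
        {"role": "developer", "content": GRADER_DEVELOPER_PROMPT},
        {"role": "user", "content": "Push notifications: ...\nSummary: ..."},  # placeholder row
    ],
    response_format=GraderLabel,
)
print(parsed.choices[0].message.parsed.label)  # "correct" or "incorrect"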

Now, let's create our eval and start adding data to it!

eval_create_result = openai.evals.create(
    name="Push Notification Summary Workflow",
    metadata={
        "description": "This eval checks if the push notification summary is correct.",
    },
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)

eval_id = eval_create_result.id

Creating runs

Now that we have our eval set up with our testing_criteria, we can start adding a bunch of runs! We'll start with some push notification data.

push_notification_data = [
        """

- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
        """

- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
        """

- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
        """

- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
        """

- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
        """

- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
        """

- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
        """

- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
        """

- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
        """

- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]

Our first run will use the summarize_push_notification completion function we defined above. We'll loop over our dataset, make the completion calls, and then submit them to be graded.

run_data = []
for push_notifications in push_notification_data:
    result = summarize_push_notification(push_notifications)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": result.model_dump()
    })

eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="baseline-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(eval_run_result)
# Check out the results in the UI
print(eval_run_result.report_url)
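Runs are graded asynchronously, so the report may not be complete immediately. Here's a minimal polling sketch; it assumes the runs API exposes a retrieve method, a status field, and a result_counts summary on the run object (check the SDK version you're on):

import time

run_id = eval_run_result.id
while True:
    run = openai.evals.runs.retrieve(run_id, eval_id=eval_id)
    if run.status not in ("queued", "in_progress"):
        break
    time.sleep(5)

# result_counts summarizes how many rows passed the testing criteria.
print(run.status, run.result_counts)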

Now, let's simulate a regression. First, for reference, is our original prompt; then a version where a developer has broken it.

DEVELOPER_PROMPT = """
你是一个有用的助手,可以总结推送通知。
你将收到一系列推送通知,你需要将它们合并成一个。
只输出最终的摘要,不要输出其他任何内容。
"""
DEVELOPER_PROMPT = """
你是一个有用的助手,可以总结推送通知。
你将收到一系列推送通知,你需要将它们合并成一个。
你应该让摘要比需要的内容更长,并包含比必要信息更多的信息。
"""

def summarize_push_notification_bad(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

run_data = []
for push_notifications in push_notification_data:
    result = summarize_push_notification_bad(push_notifications)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": result.model_dump()
    })

eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="regression-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(eval_run_result.report_url)

If you check out the report, you'll see that it scores much lower than the baseline-run.

Congratulations, you just stopped a bug from making it to your users!
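If you want to catch regressions like this automatically (for example, in CI), one hedged approach is to compare pass rates across two runs. This sketch assumes run objects expose result_counts with passed/total tallies, and the run ids shown are hypothetical placeholders:

def pass_rate(run) -> float:
    # result_counts is assumed to carry passed/total tallies for the run.
    counts = run.result_counts
    return counts.passed / counts.total if counts.total else 0.0

baseline = openai.evals.runs.retrieve("evalrun_baseline_id", eval_id=eval_id)    # hypothetical run id
candidate = openai.evals.runs.retrieve("evalrun_candidate_id", eval_id=eval_id)  # hypothetical run id

# Fail the build if the candidate prompt regresses by more than 10 points.
assert pass_rate(candidate) >= pass_rate(baseline) - 0.10, "Prompt change regressed summaries"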

Quick note: Evals do not yet support the Responses API directly. However, you can use the following code to convert a response into the completions format.

def summarize_push_notification_responses(push_notifications: str):
    result = openai.responses.create(
        model="gpt-4o",
        input=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

def transform_response_to_completion(response):
    # Map the Responses API result onto the chat-completions shape the Eval expects.
    completion = {
        "model": response.model,
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": response.output_text,
            },
            "finish_reason": "stop",
        }],
    }
    return completion

run_data = []
for push_notifications in push_notification_data:
    response = summarize_push_notification_responses(push_notifications)
    completion = transform_response_to_completion(response)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": completion
    })

report_response = openai.evals.runs.create(
    eval_id=eval_id,
    name="responses-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(report_response.report_url)