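Evals structure

An evaluation has two parts: the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An "Eval" can have many "Runs", each of which is evaluated against your testing criteria.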
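
As a rough mental model, the relationship between the two parts can be sketched with plain dataclasses. The names `EvalSketch` and `RunSketch` and their fields are hypothetical illustrations chosen for this sketch; they are not the OpenAI SDK's types.

```python
from dataclasses import dataclass, field


@dataclass
class RunSketch:
    """One run: a named batch of rows evaluated under the parent eval."""
    name: str
    rows: list


@dataclass
class EvalSketch:
    """Shared configuration plus the runs that are graded against it."""
    data_source_config: dict
    testing_criteria: list
    runs: list = field(default_factory=list)

    def add_run(self, run: RunSketch) -> None:
        # Every run added here is graded with the same testing criteria.
        self.runs.append(run)


sketch = EvalSketch(data_source_config={"type": "custom"}, testing_criteria=[])
sketch.add_run(RunSketch(name="baseline", rows=[]))
sketch.add_run(RunSketch(name="variant", rows=[]))
print(len(sketch.runs))
```

The point of the sketch is only that the configuration lives on the eval once, while runs accumulate underneath it.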

import pydantic
import openai
from openai.types.chat import ChatCompletion
import os

os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")

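Use case

We're testing the following integration: a push notification summarizer that takes in multiple push notifications and collapses them into a single message.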

class PushNotifications(pydantic.BaseModel):
    notifications: str

print(PushNotifications.model_json_schema())

DEVELOPER_PROMPT = """
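You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.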
"""

def summarize_push_notification(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

example_push_notifications_list = PushNotifications(notifications="""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)

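Setting up your eval

An Eval holds the configuration that is shared across multiple Runs. It has two components:

- Data source configuration (data_source_config): the schema (columns) that your future Runs conform to.
  - The data_source_config uses JSON Schema to define what variables are available in the Eval.
- Testing criteria (testing_criteria): how you'll determine whether your integration is working for each row of your data source.

For this use case, we want to test whether the push notification summary completion is good, so we'll set up our eval with this in mind.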

# We want our input data to be available in our variables, so we set item_schema to
# PushNotifications.model_json_schema()
data_source_config = {
    "type": "custom",
    "item_schema": PushNotifications.model_json_schema(),
    # The completions will come from the API, so we tell the Eval to expect them
    "include_sample_schema": True,
}

This data_source_config defines what variables are available throughout the eval.
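
One way to read this: every row supplied to a run should fit the item schema. The sketch below is a deliberately small, standard-library-only check of required keys and simple types. The helper `row_matches_schema` is hypothetical, it is not a full JSON Schema validator, and it makes no claim about how the Evals API validates data.

```python
# Hand-copied from the schema printed by PushNotifications.model_json_schema().
item_schema = {
    "properties": {"notifications": {"title": "Notifications", "type": "string"}},
    "required": ["notifications"],
    "title": "PushNotifications",
    "type": "object",
}

# Only the JSON Schema types this sketch knows how to check.
_PYTHON_TYPES = {"string": str, "object": dict}


def row_matches_schema(item, schema: dict) -> bool:
    """Hypothetical helper: check required keys and simple property types."""
    if not isinstance(item, dict):
        return False
    if any(key not in item for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        expected = _PYTHON_TYPES.get(spec.get("type"))
        if key in item and expected is not None and not isinstance(item[key], expected):
            return False
    return True


# A row with the expected "notifications" string passes; a row without it does not.
print(row_matches_schema({"notifications": "- Your package has been delivered!"}, item_schema))
print(row_matches_schema({"alerts": []}, item_schema))
```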

This item schema:

{
  "properties": {
    "notifications": {
      "title": "Notifications",
      "type": "string"
    }
  },
  "required": ["notifications"],
  "title": "PushNotifications",
  "type": "object"
}

means that we'll have the variable {{item.notifications}} available in our eval.

"include_sample_schema": True means that we'll have the variable {{sample.output_text}} available in our eval.

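To make the two template variables concrete, here is a minimal local sketch of how double-brace placeholders could be filled from a row. The function `render_template` and the sample values are made up for illustration; the actual substitution happens on the API side and may differ in detail.

```python
import re

# Matches placeholders such as {{item.notifications}} or {{sample.output_text}}.
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_template(template: str, variables: dict) -> str:
    """Hypothetical local renderer: resolve dotted paths against nested dicts."""

    def resolve(match):
        value = variables
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError if the variable is missing
        return str(value)

    return _PLACEHOLDER.sub(resolve, template)


# Illustrative values only: "item" is a data row, "sample" is the model output.
variables = {
    "item": {"notifications": "- Your package has been delivered!"},
    "sample": {"output_text": "Package delivered."},
}
rendered = render_template(
    "<push_notifications>{{item.notifications}}</push_notifications>\n"
    "<summary>{{sample.output_text}}</summary>",
    variables,
)
print(rendered)
```

The "item" namespace comes from the item schema, and the "sample" namespace exists only because include_sample_schema is enabled.
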
Now, we'll use those variables to set up our testing criteria.

GRADER_DEVELOPER_PROMPT = """
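Categorize the following push notification summary into one of the following categories:

1. concise-and-snappy
2. drops-important-information
3. verbose
4. unclear
5. obscures-meaning
6. other

You'll be given the original list of push notifications and the summary like this:

<push_notifications>
...notificationlist...
</push_notifications>
<summary>
...summary...
</summary>

You should pick only one of the categories above. Pick the one that most closely matches and explain why.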
"""
GRADER_TEMPLATE_PROMPT = """
<push_notifications>{{item.notifications}}</push_notifications>
<summary>{{sample.output_text}}</summary>
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["concise-and-snappy"],
    "labels": [
        "concise-and-snappy",
        "drops-important-information",
        "verbose",
        "unclear",
        "obscures-meaning",
        "other",
    ],
}

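The push_notification_grader is a model grader (LLM-as-a-judge). It looks at the input {{item.notifications}} and the generated summary {{sample.output_text}} and assigns exactly one of the labels listed in "labels". We then indicate via "passing_labels" which labels count as a passing answer; here, only "concise-and-snappy" passes.

Note: under the hood, this uses structured outputs so that the label is always valid.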
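
The pass/fail rule can be sketched locally as a membership check. The helper `is_passing` is hypothetical and only mirrors the rule as described here; the grading itself runs on the API side.

```python
# Same label sets as in push_notification_grader.
LABELS = [
    "concise-and-snappy",
    "drops-important-information",
    "verbose",
    "unclear",
    "obscures-meaning",
    "other",
]
PASSING_LABELS = ["concise-and-snappy"]


def is_passing(label: str) -> bool:
    """Hypothetical helper: a row passes only if its label is a passing label."""
    if label not in LABELS:
        # Structured outputs are meant to rule this case out on the API side.
        raise ValueError(f"unknown label: {label!r}")
    return label in PASSING_LABELS


print(is_passing("concise-and-snappy"))
print(is_passing("verbose"))
```

Widening "passing_labels" is the lever for a more lenient eval; the label set itself stays fixed.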

Now we'll create our eval and start adding data to it!

eval_create_result = openai.evals.create(
    name="Push Notification Bulk Experimentation Eval",
    metadata={
        "description": "This eval tests many prompts and models to find the best performing combination.",
    },
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)
eval_id = eval_create_result.id

Creating runs

Now that we have our eval set up with our testing_criteria, we can start adding a bunch of runs! We'll start with some push notification data.

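push_notification_data = [
    """
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
    """
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
    """
- Breaking News: Local election results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
    """
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
    """
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
    """
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "Tech Talk" podcast is live!
""",
    """
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
    """
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera
""",
    """
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" is now streaming.
""",
    """
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
""",
]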

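Now we'll set up a set of prompts to test.

We want to test a basic prompt with a couple of variations:

- In one variation, we'll just have the basic prompt.
- In the next, we'll include some examples of what we want the summaries to look like.
- In the final one, we'll include both a positive and a negative example.

We'll also include a list of models to use.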

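PROMPT_PREFIX = """
You are a helpful assistant that takes in an array of push notifications and returns a collapsed summary of them.
The push notifications will be provided as follows:
<push_notifications>
...notificationlist...
</push_notifications>

You should return just the summary and nothing else.
"""

PROMPT_VARIATION_BASIC = f"""
{PROMPT_PREFIX}

You should return a summary that is concise and snappy.
"""

PROMPT_VARIATION_WITH_EXAMPLES = f"""
{PROMPT_VARIATION_BASIC}

Here is an example of a good summary:
<push_notifications>
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
</push_notifications>
<summary>
Traffic alert, package expected by 5 PM, suggestion for new friend (Emma).
</summary>
"""

PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES = f"""
{PROMPT_VARIATION_WITH_EXAMPLES}

Here is an example of a bad summary:
<push_notifications>
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
</push_notifications>
<summary>
Traffic alert reported on Main Street. You have a package that will arrive by 5 PM, Emma is a new friend suggestion for you.
</summary>
"""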

prompts = [
    ("basic", PROMPT_VARIATION_BASIC),
    ("with_examples", PROMPT_VARIATION_WITH_EXAMPLES),
    ("with_negative_examples", PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES),
]

models = ["gpt-4o", "gpt-4o-mini", "o3-mini"]

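Now we can loop through all prompts and all models to test a bunch of configurations at once!

We'll use the "completions" run data source with template variables for our push notification list.

OpenAI will handle making the completions calls for you and populating "sample.output_text".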
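
Before launching anything, it can help to enumerate the grid locally and confirm how many runs the loop will create. This sketch makes no API calls; the two lists repeat the prompt names and model names defined above, and the run names follow the f-string pattern used in the loop below.

```python
from itertools import product

# Same names as the lists defined above; the prompt texts are omitted here.
prompt_names = ["basic", "with_examples", "with_negative_examples"]
model_names = ["gpt-4o", "gpt-4o-mini", "o3-mini"]

# One run per (prompt, model) pair.
run_names = [f"bulk_{p}_{m}" for p, m in product(prompt_names, model_names)]

print(len(run_names))  # 3 prompts x 3 models = 9
for name in run_names:
    print(name)
```

Each added prompt or model multiplies the number of runs, and every run makes one completion call and one grader call per row, so the cost grows with the size of the grid.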

for prompt_name, prompt in prompts:
    for model in models:
        run_data_source = {
            "type": "completions",
            "input_messages": {
                "type": "template",
                "template": [
                    {
                        "role": "developer",
                        "content": prompt,
                    },
                    {
                        "role": "user",
                        "content": "<push_notifications>{{item.notifications}}</push_notifications>",
                    },
                ],
            },
            "model": model,
            "source": {
                "type": "file_content",
                "content": [
                    {
                        "item": PushNotifications(notifications=notification).model_dump()
                    }
                    for notification in push_notification_data
                ],
            },
        }

        run_create_result = openai.evals.runs.create(
            eval_id=eval_id,
            name=f"bulk_{prompt_name}_{model}",
            data_source=run_data_source,
        )
        print(f"Report URL {model}, {prompt_name}:", run_create_result.report_url)

Congratulations, you just tested 9 different prompt and model variations across your dataset!