Structured Outputs Evaluation Cookbook

This notebook walks you through a set of focused, runnable examples showing how to use the OpenAI Evals framework to test, grade, and iterate on tasks that require a large language model to produce structured outputs.

Why does this matter?
Production systems often depend on JSON, SQL, or domain-specific formats. Relying on spot checks or ad hoc prompt tweaks breaks down quickly. Instead, encode your expectations as automated evals, so your team can ship safely rather than building on sand.

Quick tour

  • Part 1 – Prerequisites: environment variables and package setup.
  • Part 2 – Walkthrough: code symbol extraction: an end-to-end demo that evaluates a model's ability to extract function and class names from source code. We keep the original logic intact and only add documentation around it.
  • Part 3 – Additional recipes: sketches of common production patterns, such as evaluating sentiment extraction as an extra code example.
  • Part 4 – Result exploration: lightweight helpers for pulling run output and drilling into failures.

Prerequisites

  1. Install the dependencies (minimum versions shown):
pip install --upgrade openai
  2. Authenticate by exporting your key (a quick sanity check follows this list):
export OPENAI_API_KEY="sk-..."
  3. Optional: if you plan to run evals in bulk, set up an organization-level key with appropriate rate limits.
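To confirm the key is actually visible to Python before going further, here is a minimal sketch (the assertion is our own sanity check, not part of the Evals framework):

import os

# Fail fast if the key was not exported into this shell/kernel.
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"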

Use case 1: Code symbol extraction

The goal is to extract all function, class, and constant symbols from Python files inside the OpenAI SDK.
For each file, we ask the model to emit structured JSON like this:

{
  "symbols": [
    {"name": "OpenAI", "symbol_type": "class"},
    {"name": "Evals", "symbol_type": "module"},
    ...
  ]
}

A grader model then scores each extraction on a 1-7 scale for completeness (did we capture every symbol?) and quality (are the symbol types correct?).
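To make the rubric concrete: the grader first scores completeness on a 1-7 scale, then multiplies it by a 0-1 quality modifier, and the product is compared against the pass threshold we configure below (5.5). A minimal sketch of the arithmetic (the helper name is ours, purely illustrative):

def rubric_score(completeness: float, quality_modifier: float) -> float:
    # Completeness on a 1-7 scale, scaled by a 0-1 quality multiplier.
    return completeness * quality_modifier

print(rubric_score(7, 1.0))  # 7.0 -> clears a 5.5 pass threshold
print(rubric_score(7, 0.5))  # 3.5 -> fails
print(rubric_score(6, 0.9))  # 5.4 -> just misses the threshold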

Evaluating code symbol extraction with a custom dataset

Let's walk through an example of using the OpenAI Evals framework with a custom in-memory dataset to evaluate a model's ability to extract symbols from code.

Initialize the SDK client

Create an openai.OpenAI client using the OPENAI_API_KEY we exported above. Nothing else in this notebook will run without it.

%pip install --upgrade openai pandas rich --quiet



import os
import time
import openai
from rich import print
import pandas as pd

client = openai.OpenAI(
    api_key=os.getenv("OPENAI_API_KEY") or os.getenv("_OPENAI_API_KEY"),
)

Dataset factory and grading rubric

  • get_dataset builds a small in-memory dataset by reading a few files from the installed OpenAI SDK.
  • structured_output_grader defines a detailed grading rubric.
  • client.evals.create(...) registers the eval with the platform.

Note the template variables below: {{item.input}} refers to a field of the dataset row, while {{sample.output_json.symbols}} refers to the model's JSON response for that row.

def get_dataset(limit=None):
    openai_sdk_file_path = os.path.dirname(openai.__file__)

    file_paths = [
        os.path.join(openai_sdk_file_path, "resources", "evals", "evals.py"),
        os.path.join(openai_sdk_file_path, "resources", "responses", "responses.py"),
        os.path.join(openai_sdk_file_path, "resources", "images.py"),
        os.path.join(openai_sdk_file_path, "resources", "embeddings.py"),
        os.path.join(openai_sdk_file_path, "resources", "files.py"),
    ]

    items = []
    for file_path in file_paths:
        # Read each SDK source file into memory as one dataset item.
        with open(file_path, "r") as f:
            items.append({"input": f.read()})
    if limit:
        return items[:limit]
    return items


structured_output_grader = """
You are a helpful assistant that evaluates the quality of information extracted from code files.
You will be given a code file and a list of extracted information.
You should evaluate the quality of the extracted information.

You should grade on a scale of 1 to 7.
You should apply the following criteria, and calculate your score as follows:
First check completeness, on a scale of 1 to 7.
Then apply a quality modifier.

The quality modifier is a multiplier from 0 to 1 that you apply to the completeness score.
If completeness coverage is 100% and everything extracted is high quality, return 7*1.
If completeness coverage is 100% but everything extracted is low quality, return 7*0.5.
And so on.
"""

structured_output_grader_user_prompt = """
<Code File>
{{item.input}}
</Code File>

<Extracted Information>
{{sample.output_json.symbols}}
</Extracted Information>
"""

logs_eval = client.evals.create(
    name="Code QA Eval",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {"input": {"type": "string"}},
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "score_model",
            "name": "General Evaluator",
            "model": "o3",
            "input": [
                {"role": "system", "content": structured_output_grader},
                {"role": "user", "content": structured_output_grader_user_prompt},
            ],
            "range": [1, 7],
            "pass_threshold": 5.5,
        }
    ],
)

Kick off model runs

Here we launch two runs against the same eval: one hitting the Completions endpoint and one hitting the Responses endpoint. Note that the two payloads declare the output schema differently: the Completions run nests it under sampling_params.response_format, while the Responses run uses sampling_params.text.format.

### Kick off model runs
gpt_4one_completions_run = client.evals.runs.create(
    name="gpt-4.1",
    eval_id=logs_eval.id,
    data_source={
        "type": "completions",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset(limit=1)],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {"type": "input_text", "text": "You are a helpful assistant."},
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Extract the symbols from the code file {{item.input}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 10000,
            "top_p": 0.9,
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "python_symbols",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "symbols": {
                                "type": "array",
                                "description": "A list of symbols extracted from Python code.",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "name": {"type": "string", "description": "The name of the symbol."},
                                        "symbol_type": {
                                            "type": "string", "description": "The type of the symbol, e.g., variable, function, class.",
                                        },
                                    },
                                    "required": ["name", "symbol_type"],
                                    "additionalProperties": False,
                                },
                            }
                        },
                        "required": ["symbols"],
                        "additionalProperties": False,
                    },
                    "strict": True,
                },
            },
        },
    },
)

gpt_4one_responses_run = client.evals.runs.create(
    name="gpt-4.1-mini",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset(limit=1)],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {"type": "input_text", "text": "You are a helpful assistant."},
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Extract the symbols from the code file {{item.input}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1-mini",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 10000,
            "top_p": 0.9,
            "text": {
                "format": {
                    "type": "json_schema",
                    "name": "python_symbols",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "symbols": {
                                "type": "array",
                                "description": "A list of symbols extracted from Python code.",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "name": {"type": "string", "description": "The name of the symbol."},
                                        "symbol_type": {
                                            "type": "string",
                                            "description": "The type of the symbol, e.g., variable, function, class.",
                                        },
                                    },
                                    "required": ["name", "symbol_type"],
                                    "additionalProperties": False,
                                },
                            }
                        },
                        "required": ["symbols"],
                        "additionalProperties": False,
                    },
                    "strict": True,
                },
            },
        },
    },
)

Utility poller

Next, a simple loop waits until every run has finished, then saves each run's JSON to disk so you can inspect it later or attach it to a CI artifact.

### Utility poller
def poll_runs(eval_id, run_ids):
    while True:
        runs = [client.evals.runs.retrieve(rid, eval_id=eval_id) for rid in run_ids]
        for run in runs:
            print(run.id, run.status, run.result_counts)
        if all(run.status in {"completed", "failed"} for run in runs):
            # dump results to file
            for run in runs:
                with open(f"{run.id}.json", "w") as f:
                    f.write(
                        client.evals.runs.output_items.list(
                            run_id=run.id, eval_id=eval_id
                        ).model_dump_json(indent=4)
                    )
            break
        time.sleep(5)

poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])

evalrun_68487dcc749081918ec2571e76cc9ef6 completed ResultCounts(errored=0, failed=1, passed=0, total=1)
evalrun_68487dcdaba0819182db010fe5331f2e completed ResultCounts(errored=0, failed=1, passed=0, total=1)

Load outputs for quick inspection

We fetch the output items for both runs so we can print or post-process them.

completions_output = client.evals.runs.output_items.list(
    run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id
)

responses_output = client.evals.runs.output_items.list(
    run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id
)
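Both runs above reported failed=1, so it is worth drilling into the per-item grader results before eyeballing the raw outputs. A minimal sketch, assuming each output item exposes a results list whose entries carry the grader name, numeric score, and passed flag; adjust the field access if your SDK version shapes these differently:

for item in completions_output:
    for result in getattr(item, "results", None) or []:
        # Each entry carries the grader name, numeric score, and pass/fail flag.
        r = result if isinstance(result, dict) else vars(result)
        print(item.id, r.get("name"), r.get("score"), r.get("passed"))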

Human-readable dump

Let's print the Completions and Responses outputs side by side.

from IPython.display import display, HTML

# Collect outputs for both runs
completions_outputs = [item.sample.output[0].content for item in completions_output]
responses_outputs = [item.sample.output[0].content for item in responses_output]

# Create DataFrame for side-by-side display (truncated to 250 chars for readability)
df = pd.DataFrame({
    "Completions Output": [c[:250].replace('\n', ' ') + ('...' if len(c) > 250 else '') for c in completions_outputs],
    "Responses Output": [r[:250].replace('\n', ' ') + ('...' if len(r) > 250 else '') for r in responses_outputs]
})

# Custom color scheme
custom_styles = [
    {'selector': 'th', 'props': [('font-size', '1.1em'), ('background-color', '#323C50'), ('color', '#FFFFFF'), ('border-bottom', '2px solid #1CA7EC')]},
    {'selector': 'td', 'props': [('font-size', '1em'), ('max-width', '650px'), ('background-color', '#F6F8FA'), ('color', '#222'), ('border-bottom', '1px solid #DDD')]},
    {'selector': 'tr:hover td', 'props': [('background-color', '#D1ECF1'), ('color', '#18647E')]},
    {'selector': 'tbody tr:nth-child(even) td', 'props': [('background-color', '#E8F1FB')]},
    {'selector': 'tbody tr:nth-child(odd) td', 'props': [('background-color', '#F6F8FA')]},
    {'selector': 'table', 'props': [('border-collapse', 'collapse'), ('border-radius', '6px'), ('overflow', 'hidden')]},
]

styled = (
    df.style
    .set_properties(**{'white-space': 'pre-wrap', 'word-break': 'break-word', 'padding': '8px'})
    .set_table_styles(custom_styles)
    .hide(axis="index")
)

display(HTML("""
<h4 style="color: #1CA7EC; font-weight: 600; letter-spacing: 1px; text-shadow: 0 1px 2px rgba(0,0,0,0.08), 0 0px 0px #fff;">
Completions vs Responses Output
</h4>
"""))
display(styled)

Completions vs Responses Output

Completions Output:
{"symbols":[{"name":"Evals","symbol_type":"class"},{"name":"AsyncEvals","symbol_type":"class"},{"name":"EvalsWithRawResponse","symbol_type":"class"},{"name":"AsyncEvalsWithRawResponse","symbol_type":"class"},{"name":"EvalsWithStreamingResponse","symb...

Responses Output:
{"symbols":[{"name":"Evals","symbol_type":"class"},{"name":"runs","symbol_type":"property"},{"name":"with_raw_response","symbol_type":"property"},{"name":"with_streaming_response","symbol_type":"property"},{"name":"create","symbol_type":"function"},{...

Visualizing the results

The visualizations below summarize the evaluation data and code output for this structured-output eval, offering insight into the data distribution and the evaluation workflow.

[Figure: evaluation data overview, parts 1 and 2]

[Figure: evaluation code workflow and structure]

Reviewing these visualizations helps you understand the structure of the evaluation dataset and the steps involved in evaluating structured outputs for QA tasks.

Use case 2: Multilingual sentiment extraction

In the same spirit, let's evaluate a multilingual sentiment extraction model using structured outputs.

# Sample in-memory dataset for sentiment extraction
sentiment_dataset = [
    {
        "text": "I love this product!",
        "channel": "twitter",
        "language": "en"
    },
    {
        "text": "This is the worst experience I've ever had.",
        "channel": "support_ticket",
        "language": "en"
    },
    {
        "text": "It's okay – not great but not bad either.",
        "channel": "app_review",
        "language": "en"
    },
    {
        "text": "No estoy seguro de lo que pienso sobre este producto.",
        "channel": "facebook",
        "language": "es"
    },
    {
        "text": "总体来说,我对这款产品很满意。",
        "channel": "wechat",
        "language": "zh"
    },
]
# Define output schema
sentiment_output_schema = {
    "type": "object",
    "properties": {
        "sentiment": {
            "type": "string",
            "description": "overall label: positive / negative / neutral"
        },
        "confidence": {
            "type": "number",
            "description": "confidence score 0-1"
        },
        "emotions": {
            "type": "array",
            "description": "list of dominant emotions (e.g. joy, anger)",
            "items": {"type": "string"}
        }
    },
    "required": ["sentiment", "confidence", "emotions"],
    "additionalProperties": False
}
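# For reference, a conforming model output for the first dataset row might
# look like this (illustrative values only, not actual model output):
# {"sentiment": "positive", "confidence": 0.95, "emotions": ["joy"]}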

# Grader prompts
sentiment_grader_system = """You are a strict grader for sentiment extraction.
Given the text and the model's JSON output, score correctness on a 1-5 scale."""

sentiment_grader_user = """Text: {{item.text}}
Model output:
{{sample.output_json}}
"""
# Register an eval for the richer sentiment task
sentiment_eval = client.evals.create(
    name="sentiment_extraction_eval",
    data_source_config={
        "type": "custom",
        "item_schema": {          # matches the new dataset fields
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "channel": {"type": "string"},
                "language": {"type": "string"},
            },
            "required": ["text"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "score_model",
            "name": "Sentiment Grader",
            "model": "o3",
            "input": [
                {"role": "system", "content": sentiment_grader_system},
                {"role": "user",   "content": sentiment_grader_user},
            ],
            "range": [1, 5],
            "pass_threshold": 3.5,
        }
    ],
)
# Run the sentiment eval
sentiment_run = client.evals.runs.create(
    name="gpt-4.1-sentiment",
    eval_id=sentiment_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in sentiment_dataset],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {"type": "input_text", "text": "You are a helpful assistant."},
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "{{item.text}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 100,
            "top_p": 0.9,
            "text": {
                "format": {
                    "type": "json_schema",
                    "name": "sentiment_output",
                    "schema": sentiment_output_schema,
                    "strict": True,
                },
            },
        },
    },
)
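We can reuse the poll_runs helper defined in the first use case to wait for this run to finish and dump its output items to disk:

# Reuse the poller from use case 1 to wait for the sentiment run.
poll_runs(sentiment_eval.id, [sentiment_run.id])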

Visualizing the eval data

[Figure: evaluation data visualization]

Summary and next steps

In this notebook, we demonstrated how to use the OpenAI Evals API to assess model performance on structured output tasks.

Next steps:

  • We encourage you to try the API with your own models and datasets.
  • You can also explore the API documentation for more details on how to use it.

For more information, see the OpenAI Evals documentation.