Structured Outputs Evaluation Cookbook
This notebook walks you through a series of focused, runnable examples that show how to use the OpenAI Evals framework to test, grade, and iterate on tasks that require a large language model to produce structured output.
Why does this matter?
Production systems often depend on JSON, SQL, or domain-specific formats. Relying on spot checks or ad-hoc prompt tweaks breaks down quickly. Instead, encode your expectations as automated evals so your team can ship with confidence rather than build on sand.
Quick tour
- Part 1 – Prerequisites: environment variables and package setup
- Part 2 – Walkthrough: code symbol extraction: an end-to-end demo that evaluates how well a model extracts function and class names from source code. We keep the original logic untouched and only add documentation around it.
- Part 3 – Other recipes: sketches of common production patterns, such as evaluating sentiment extraction as an additional code example.
- Part 4 – Result exploration: lightweight helpers for pulling run output and digging into failures.
Prerequisites
- Install the dependencies:
pip install --upgrade openai
- Authenticate by exporting your key (a quick sanity check is sketched right after this list):
export OPENAI_API_KEY="sk-..."
- Optional: if you plan to run evals at scale, set up an organization-level key with appropriate rate limits.
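Before going further, a minimal sanity check (assuming the same variable name exported above) confirms the key is visible to Python:
import os

# Fail fast if the key was not exported in this shell session.
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set; export it before running the notebook."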
Use case 1: Code symbol extraction
The goal is to extract every function, class, and constant symbol from Python files inside the OpenAI SDK.
For each file, we ask the model to emit structured JSON like the following:
{
  "symbols": [
    {"name": "OpenAI", "kind": "class"},
    {"name": "Evals", "kind": "module"},
    ...
  ]
}
A grader model then scores each result on a 1-7 scale for completeness (did we capture every symbol?) and quality (are the kinds correct?).
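As the grading rubric below spells out, the final score is the completeness score (1-7) multiplied by a quality modifier (0-1); a tiny, purely illustrative calculation:
# Illustrative numbers only; the grader model performs this reasoning itself.
completeness = 6        # 1-7: how many of the file's symbols were captured
quality_modifier = 0.8  # 0-1: how accurate the extracted kinds are
final_score = completeness * quality_modifier
print(final_score)      # 4.8, which would fall below the 5.5 pass threshold set later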
Evaluating code quality extraction with a custom dataset
Let's walk through an example of using the OpenAI Evals framework with a custom in-memory dataset to evaluate how well a model extracts symbols from code.
Initialize the SDK client
Create an openai.OpenAI client using the OPENAI_API_KEY we exported above. Nothing will run without it.
%pip install --upgrade openai pandas rich --quiet
import os
import time
import openai
from rich import print
import pandas as pd
client = openai.OpenAI(
    api_key=os.getenv("OPENAI_API_KEY") or os.getenv("_OPENAI_API_KEY"),
)
Dataset factory and grading rubric
get_dataset builds a small in-memory dataset by reading a few SDK files. structured_output_grader defines a detailed grading rubric, and client.evals.create(...) registers the eval with the platform.
def get_dataset(limit=None):
    openai_sdk_file_path = os.path.dirname(openai.__file__)
    file_paths = [
        os.path.join(openai_sdk_file_path, "resources", "evals", "evals.py"),
        os.path.join(openai_sdk_file_path, "resources", "responses", "responses.py"),
        os.path.join(openai_sdk_file_path, "resources", "images.py"),
        os.path.join(openai_sdk_file_path, "resources", "embeddings.py"),
        os.path.join(openai_sdk_file_path, "resources", "files.py"),
    ]
    items = []
    for file_path in file_paths:
        items.append({"input": open(file_path, "r").read()})
    if limit:
        return items[:limit]
    return items
structured_output_grader = """
You are a helpful assistant that evaluates the quality of information extracted from a code file.
You will be given a code file and a list of extracted information.
You should evaluate the quality of the extracted information.
You should grade on a scale of 1 to 7.
You should apply the following criteria and compute your score as follows:
First, check completeness on a scale of 1 to 7.
Then apply a quality modifier.
The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.
If completeness coverage is 100% and everything is of high quality, return 7*1.
If completeness coverage is 100% but everything is of low quality, return 7*0.5.
And so on.
"""
structured_output_grader_user_prompt = """
<Code File>
{{item.input}}
</Code File>
<Extracted Information>
{{sample.output_json.symbols}}
</Extracted Information>
"""
logs_eval = client.evals.create(
    name="Code QA Eval",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {"input": {"type": "string"}},
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "score_model",
            "name": "General Evaluator",
            "model": "o3",
            "input": [
                {"role": "system", "content": structured_output_grader},
                {"role": "user", "content": structured_output_grader_user_prompt},
            ],
            "range": [1, 7],
            "pass_threshold": 5.5,
        }
    ],
)
Launch model runs
Here we launch two runs against the same eval: one that calls the Completions endpoint and one that calls the Responses endpoint.
gpt_4one_completions_run = client.evals.runs.create(
    name="gpt-4.1",
    eval_id=logs_eval.id,
    data_source={
        "type": "completions",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset(limit=1)],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {"type": "input_text", "text": "You are a helpful assistant."},
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Extract the symbols from the code file {{item.input}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 10000,
            "top_p": 0.9,
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "python_symbols",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "symbols": {
                                "type": "array",
                                "description": "A list of symbols extracted from Python code.",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "name": {"type": "string", "description": "The name of the symbol."},
                                        "symbol_type": {
                                            "type": "string",
                                            "description": "The type of the symbol, e.g., variable, function, class.",
                                        },
                                    },
                                    "required": ["name", "symbol_type"],
                                    "additionalProperties": False,
                                },
                            }
                        },
                        "required": ["symbols"],
                        "additionalProperties": False,
                    },
                    "strict": True,
                },
            },
        },
    },
)
gpt_4one_responses_run = client.evals.runs.create(
    name="gpt-4.1-mini",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset(limit=1)],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {"type": "input_text", "text": "You are a helpful assistant."},
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Extract the symbols from the code file {{item.input}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1-mini",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 10000,
            "top_p": 0.9,
            "text": {
                "format": {
                    "type": "json_schema",
                    "name": "python_symbols",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "symbols": {
                                "type": "array",
                                "description": "A list of symbols extracted from Python code.",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "name": {"type": "string", "description": "The name of the symbol."},
                                        "symbol_type": {
                                            "type": "string",
                                            "description": "The type of the symbol, e.g., variable, function, class.",
                                        },
                                    },
                                    "required": ["name", "symbol_type"],
                                    "additionalProperties": False,
                                },
                            }
                        },
                        "required": ["symbols"],
                        "additionalProperties": False,
                    },
                    "strict": True,
                },
            },
        },
    },
)
Utility poller
Next, a simple loop waits for all runs to finish, then saves each run's JSON to disk so you can inspect it later or attach it as a CI artifact.
def poll_runs(eval_id, run_ids):
    while True:
        runs = [client.evals.runs.retrieve(rid, eval_id=eval_id) for rid in run_ids]
        for run in runs:
            print(run.id, run.status, run.result_counts)
        if all(run.status in {"completed", "failed"} for run in runs):
            # dump results to file
            for run in runs:
                with open(f"{run.id}.json", "w") as f:
                    f.write(
                        client.evals.runs.output_items.list(
                            run_id=run.id, eval_id=eval_id
                        ).model_dump_json(indent=4)
                    )
            break
        time.sleep(5)

poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])
evalrun_68487dcc749081918ec2571e76cc9ef6 completed ResultCounts(errored=0, failed=1, passed=0, total=1)
evalrun_68487dcdaba0819182db010fe5331f2e completed ResultCounts(errored=0, failed=1, passed=0, total=1)
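Because poll_runs dumps each run to disk, you can dig into failures offline. Here is a minimal sketch; the result field names ("name", "score", "passed") are assumptions, so print one item first if they don't match:
import json

# Load a dumped run file (written by poll_runs above) and print grader results per item.
def summarize_run_file(run_id):
    with open(f"{run_id}.json") as f:
        payload = json.load(f)
    for item in payload["data"]:
        for result in item.get("results", []):
            # "name", "score", and "passed" are assumed keys on each grader result.
            print(item["id"], result.get("name"), result.get("score"), result.get("passed"))

summarize_run_file(gpt_4one_completions_run.id)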
Load outputs for quick inspection
We fetch the output items for both runs so we can print or post-process them.
completions_output = client.evals.runs.output_items.list(
    run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id
)
responses_output = client.evals.runs.output_items.list(
    run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id
)
Human-readable dump
Let's print the Completions and Responses outputs side by side.
from IPython.display import display, HTML

# Collect outputs for both runs
completions_outputs = [item.sample.output[0].content for item in completions_output]
responses_outputs = [item.sample.output[0].content for item in responses_output]

# Create DataFrame for side-by-side display (truncated to 250 chars for readability)
df = pd.DataFrame({
    "Completions Output": [c[:250].replace('\n', ' ') + ('...' if len(c) > 250 else '') for c in completions_outputs],
    "Responses Output": [r[:250].replace('\n', ' ') + ('...' if len(r) > 250 else '') for r in responses_outputs],
})

# Custom color scheme
custom_styles = [
    {'selector': 'th', 'props': [('font-size', '1.1em'), ('background-color', '#323C50'), ('color', '#FFFFFF'), ('border-bottom', '2px solid #1CA7EC')]},
    {'selector': 'td', 'props': [('font-size', '1em'), ('max-width', '650px'), ('background-color', '#F6F8FA'), ('color', '#222'), ('border-bottom', '1px solid #DDD')]},
    {'selector': 'tr:hover td', 'props': [('background-color', '#D1ECF1'), ('color', '#18647E')]},
    {'selector': 'tbody tr:nth-child(even) td', 'props': [('background-color', '#E8F1FB')]},
    {'selector': 'tbody tr:nth-child(odd) td', 'props': [('background-color', '#F6F8FA')]},
    {'selector': 'table', 'props': [('border-collapse', 'collapse'), ('border-radius', '6px'), ('overflow', 'hidden')]},
]

styled = (
    df.style
    .set_properties(**{'white-space': 'pre-wrap', 'word-break': 'break-word', 'padding': '8px'})
    .set_table_styles(custom_styles)
    .hide(axis="index")
)

display(HTML("""
<h4 style="color: #1CA7EC; font-weight: 600; letter-spacing: 1px; text-shadow: 0 1px 2px rgba(0,0,0,0.08), 0 0px 0px #fff;">
Completions vs Responses Output
</h4>
"""))
display(styled)
Completions vs Responses Output

| Completions Output | Responses Output |
| --- | --- |
| {"symbols":[{"name":"Evals","symbol_type":"class"},{"name":"AsyncEvals","symbol_type":"class"},{"name":"EvalsWithRawResponse","symbol_type":"class"},{"name":"AsyncEvalsWithRawResponse","symbol_type":"class"},{"name":"EvalsWithStreamingResponse","symb... | {"symbols":[{"name":"Evals","symbol_type":"class"},{"name":"runs","symbol_type":"property"},{"name":"with_raw_response","symbol_type":"property"},{"name":"with_streaming_response","symbol_type":"property"},{"name":"create","symbol_type":"function"},{... |
Visualizing the results
The visualizations below illustrate the evaluation data and code outputs for the structured QA eval. They offer insight into the data distribution and the evaluation workflow.
Evaluation data overview
Evaluation code workflow
Reviewing these visualizations helps you understand the structure of the evaluation dataset and the steps involved in evaluating structured outputs for the QA task.
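If you prefer to explore the results programmatically rather than through static images, a minimal sketch (reusing only objects created above) tabulates each run's status and result counts:
# Minimal sketch: summarize both runs in a DataFrame.
runs = [
    client.evals.runs.retrieve(run_id, eval_id=logs_eval.id)
    for run_id in [gpt_4one_completions_run.id, gpt_4one_responses_run.id]
]
summary = pd.DataFrame({
    "run_id": [r.id for r in runs],
    "status": [r.status for r in runs],
    "passed": [r.result_counts.passed for r in runs],
    "failed": [r.result_counts.failed for r in runs],
    "errored": [r.result_counts.errored for r in runs],
})
display(summary)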
Use case 2: Multilingual sentiment extraction
In the same spirit, let's evaluate a multilingual sentiment extraction model using structured outputs.
# Sample in-memory dataset for sentiment extraction
sentiment_dataset = [
    {
        "text": "I love this product!",
        "channel": "twitter",
        "language": "en"
    },
    {
        "text": "This is the worst experience I've ever had.",
        "channel": "support_ticket",
        "language": "en"
    },
    {
        "text": "It's okay – not great but not bad either.",
        "channel": "app_review",
        "language": "en"
    },
    {
        "text": "No estoy seguro de lo que pienso sobre este producto.",
        "channel": "facebook",
        "language": "es"
    },
    {
        "text": "总体来说,我对这款产品很满意。",
        "channel": "wechat",
        "language": "zh"
    },
]
# Define output schema
sentiment_output_schema = {
    "type": "object",
    "properties": {
        "sentiment": {
            "type": "string",
            "description": "overall label: positive / negative / neutral"
        },
        "confidence": {
            "type": "number",
            "description": "confidence score 0-1"
        },
        "emotions": {
            "type": "array",
            "description": "list of dominant emotions (e.g. joy, anger)",
            "items": {"type": "string"}
        }
    },
    "required": ["sentiment", "confidence", "emotions"],
    "additionalProperties": False
}
# Grader prompts
sentiment_grader_system = """You are a strict grader for sentiment extraction.
Given the text and the model's JSON output, score correctness on a 1-5 scale."""
sentiment_grader_user = """Text: {{item.text}}
Model output:
{{sample.output_json}}
"""
# Register an eval for the richer sentiment task
sentiment_eval = client.evals.create(
    name="sentiment_extraction_eval",
    data_source_config={
        "type": "custom",
        "item_schema": {  # matches the new dataset fields
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "channel": {"type": "string"},
                "language": {"type": "string"},
            },
            "required": ["text"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "score_model",
            "name": "Sentiment Grader",
            "model": "o3",
            "input": [
                {"role": "system", "content": sentiment_grader_system},
                {"role": "user", "content": sentiment_grader_user},
            ],
            "range": [1, 5],
            "pass_threshold": 3.5,
        }
    ],
)
# Run the sentiment eval
sentiment_run = client.evals.runs.create(
    name="gpt-4.1-sentiment",
    eval_id=sentiment_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in sentiment_dataset],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {"type": "input_text", "text": "You are a helpful assistant."},
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "{{item.text}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 100,
            "top_p": 0.9,
            "text": {
                "format": {
                    "type": "json_schema",
                    "name": "sentiment_output",
                    "schema": sentiment_output_schema,
                    "strict": True,
                },
            },
        },
    },
)
Visualize the evaluation data
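To inspect the sentiment run locally, here is a minimal sketch that reuses poll_runs and the pandas pattern from above to wait for the run and print its structured outputs:
# Minimal sketch: wait for the sentiment run to finish, then inspect its outputs locally.
poll_runs(sentiment_eval.id, [sentiment_run.id])

sentiment_output = client.evals.runs.output_items.list(
    run_id=sentiment_run.id, eval_id=sentiment_eval.id
)
sentiment_outputs = [item.sample.output[0].content for item in sentiment_output]
display(pd.DataFrame({"Sentiment Output": sentiment_outputs}))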
Summary and next steps
In this notebook, we demonstrated how to use the OpenAI Evals API to assess model performance on structured output tasks.
Next steps:
- Try the API with your own models and datasets.
- Browse the API documentation for more detail on the available options.
For more information, see the OpenAI Evals documentation.