Evals API: Image Inputs
This guide demonstrates how to work with image-based tasks using OpenAI's Evals framework. Leveraging the Evals API, we will use sampling to generate model responses and model grading (LLM as a judge) to score those responses against the image, the prompt, and a reference answer.
In this example, we will evaluate how well our model can:
- Generate appropriate responses to user prompts about an image
- Align with reference answers that represent high-quality responses
Install dependencies + setup
# Install required packages
!pip install openai datasets pandas --quiet
# Import libraries
from datasets import load_dataset
from openai import OpenAI
import os
import json
import time
import pandas as pd
Dataset preparation
We use the VibeEval dataset hosted on Hugging Face. It contains user prompts, accompanying images, and reference answer data. First, we load the dataset.
dataset = load_dataset("RekaAI/VibeEval")
We extract the relevant fields and put them in a JSON-like format to serve as the data source for the Evals API. Input image data can be either a web URL or a base64-encoded string. Here, we use the provided web URLs.
evals_data_source = []

# Select the first 3 examples in the dataset for this guide
for example in dataset["test"].select(range(3)):
    evals_data_source.append({
        "item": {
            "media_url": example["media_url"],  # image web URL
            "reference": example["reference"],  # reference answer
            "prompt": example["prompt"]         # prompt
        }
    })
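If your images are local files rather than hosted URLs, a base64-encoded data URL works in place of the web URL. A minimal sketch, where the `encode_image` helper and the file path are hypothetical and for illustration only:

import base64

def encode_image(path: str) -> str:
    # Hypothetical helper: read a local image file and return a base64 data URL
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

# Then use it where we used the web URL above, e.g.:
# "media_url": encode_image("local_image.jpg")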
If you print the data source list, each item should look similar to the format below.
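For instance, a quick check using the `json` module imported earlier:

print(json.dumps(evals_data_source[0], indent=2))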
{
    "item": {
        "media_url": "https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_food1_7e5c2cb9c8200d70.jpg",
        "reference": "This appears to be a classic Margherita pizza, which has the following ingredients...",
        "prompt": "What ingredients do I need to make this?"
    }
}
Eval configuration
Now that we have our data source and task, we will create our evals. For the OpenAI Evals API documentation, visit the API docs.
client = OpenAI(
api_key=os.getenv("OPENAI_API_KEY")
)
Evals consist of two parts: the "Eval" and the "Run". In the "Eval", we define the expected structure of the data and the testing criteria (grader).
Data source config
Based on the data we have compiled, our data source config is as follows:
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "media_url": { "type": "string" },
            "reference": { "type": "string" },
            "prompt": { "type": "string" }
        },
        "required": ["media_url", "reference", "prompt"]
    },
    "include_sample_schema": True,  # enables sampling
}
Testing criteria
For our testing criteria, we set up our grader config. In this example, it is a model grader that takes in the image, the reference answer, and the model-generated response (under the `sample` namespace), then outputs a score between 0 and 1 based on how closely the model response matches the reference answer and how well it suits the conversation overall. For more info on model graders, visit the API grader docs.
Getting both the data and the grader right is key to an effective evaluation, so you will likely want to iteratively refine the prompts for your graders.
Note: the image URL field/template needs to be placed inside an input image object for it to be interpreted as an image; otherwise, it will be interpreted as a text string.
grader_config = {
    "type": "score_model",
    "name": "Score Model Grader",
    "input": [
        {
            "role": "system",
            "content": "You are an expert grader. Judge how well the model response suits the image and prompt as well as matches the meaning of the reference answer. Output a score of 1 if great. If it's somewhat compatible, output a score around 0.5. Otherwise, give a score of 0."
        },
        {
            "role": "user",
            "content": [
                { "type": "input_text", "text": "Prompt: {{ item.prompt }}." },
                { "type": "input_image", "image_url": "{{ item.media_url }}", "detail": "auto" },
                { "type": "input_text", "text": "Reference answer: {{ item.reference }}. Model response: {{ sample.output_text }}." }
            ]
        }
    ],
    "pass_threshold": 0.9,
    "range": [0, 1],
    "model": "o4-mini"  # model used for grading; check that the model you use supports image inputs
}
Now, we create the eval object.
eval_object = client.evals.create(
name="Image Grading",
data_source_config=data_source_config,
testing_criteria=[grader_config],
)
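The returned eval object's `id` is what the run below refers to; printing it can be useful if you want to create runs against the same eval in a later session:

print(eval_object.id)  # e.g. "eval_..."; note it down to reuse this eval later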
Eval run
To create a run, we pass in the eval object ID, the data source (i.e., the data we compiled earlier), and the sampling message input used to generate the model responses. Note that the Evals API also supports stored completions and responses containing images as a data source; for more info, see the Additional: logs data source section.
Here is the sampling message input we will use for this example.
sampling_messages = [
    {
        "role": "user",
        "type": "message",
        "content": {
            "type": "input_text",
            "text": "{{ item.prompt }}"
        }
    },
    {
        "role": "user",
        "type": "message",
        "content": {
            "type": "input_image",
            "image_url": "{{ item.media_url }}",
            "detail": "auto"
        }
    }
]
We now kick off an eval run.
eval_run = client.evals.runs.create(
    name="Image Input Eval Run",
    eval_id=eval_object.id,
    data_source={
        "type": "responses",  # sample using the Responses API
        "source": {
            "type": "file_content",
            "content": evals_data_source
        },
        "model": "gpt-4o-mini",  # model used to generate responses; check that the model you use supports image inputs
        "input_messages": {
            "type": "template",
            "template": sampling_messages
        }
    }
)
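The run object also carries a `report_url` pointing to its page in the dashboard, which is convenient to open while the poll below is running:

print(eval_run.report_url)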
Poll and display the results
Once the run is complete, we can view the results. You can also check the progress and results in your organization's OpenAI evals dashboard.
while True:
    run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
    if run.status == "completed" or run.status == "failed":  # check if the run is finished
        output_items = list(client.evals.runs.output_items.list(
            run_id=run.id, eval_id=eval_object.id
        ))
        df = pd.DataFrame({
            "prompt": [item.datasource_item["prompt"] for item in output_items],
            "reference": [item.datasource_item["reference"] for item in output_items],
            "model_response": [item.sample.output[0].content for item in output_items],
            "grading_results": [item.results[0]["sample"]["output"][0]["content"]
                                for item in output_items]
        })
        display(df)
        break
    time.sleep(5)
| | prompt | reference | model_response | grading_results |
| --- | --- | --- | --- | --- |
| 0 | Please provide latex code to replicate this table | Below is the latex code for your table: ```te... | Certainly! Below is the LaTeX code to replicat... | {"steps":[{"description":"Assess if the provid... |
| 1 | What ingredients do I need to make this? | This appears to be a classic Margherita pizza,... | To make a classic Margherita pizza like the on... | {"steps":[{"description":"Check if model ident... |
| 2 | Is this safe for a vegan to eat? | Based on the image, this dish appears to be a ... | To determine if the dish is safe for a vegan t... | {"steps":[{"description":"Compare model respon... |
Viewing individual output items
To view a full output item, we can do the following. The structure of an output item is specified in the API docs here.
first_item = output_items[0]
print(json.dumps(dict(first_item), indent=2, default=str))
{
"id": "outputitem_687833f102ec8191a6e53a5461b970c2",
"created_at": 1752708081,
"datasource_item": {
"prompt": "Please provide latex code to replicate this table",
"media_url": "https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png",
"reference": "Below is the latex code for your table:\n```tex\n\\begin{table}\n\\begin{tabular}{c c c c} \\hline & \\(S2\\) & Expert & Layman & PoelM \\\\ \\cline{2-4} \\(S1\\) & Expert & \u2013 & 54.0 & 62.7 \\\\ & Layman & 46.0 & \u2013 & 60.7 \\\\ &,PoelM,LM,LM,LM,LM,LM,,L,M,,L,M,,L,M,,L,M,,,\u2013&39.3 \\\\\n[-1ex] \\end{tabular}\n\\end{table}\n```."
},
"datasource_item_id": 1,
"eval_id": "eval_687833d68e888191bc4bd8b965368f22",
"object": "eval.run.output_item",
"results": [
{
"name": "Score Model Grader-73fe48a0-8090-46eb-aa8e-d426ad074eb3",
"sample": {
"input": [
{
"role": "system",
"content": "You are an expert grader. Judge how well the model response suits the image and prompt as well as matches the meaniing of the reference answer. Output a score of 1 if great. If it's somewhat compatible, output a score around 0.5. Otherwise, give a score of 0."
},
{
"role": "user",
"content": "Prompt: Please provide latex code to replicate this table. <image>https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png</image> Reference answer: Below is the latex code for your table:\n```tex\n\\begin{table}\n\\begin{tabular}{c c c c} \\hline & \\(S2\\) & Expert & Layman & PoelM \\\\ \\cline{2-4} \\(S1\\) & Expert & \u2013 & 54.0 & 62.7 \\\\ & Layman & 46.0 & \u2013 & 60.7 \\\\ &,PoelM,LM,LM,LM,LM,LM,,L,M,,L,M,,L,M,,L,M,,,\u2013&39.3 \\\\\n[-1ex] \\end{tabular}\n\\end{table}\n```.. Model response: Certainly! Below is the LaTeX code to replicate the table you provided:\n\n```latex\n\\documentclass{article}\n\\usepackage{array}\n\\usepackage{multirow}\n\\usepackage{booktabs}\n\n\\begin{document}\n\n\\begin{table}[ht]\n \\centering\n \\begin{tabular}{c|c|c|c}\n \\multirow{2}{*}{S1} & \\multirow{2}{*}{S2} & \\multicolumn{3}{c}{Methods} \\\\ \n \\cline{3-5}\n & & Expert & Layman & PoeLM \\\\\n \\hline\n Expert & & - & 54.0 & 62.7 \\\\\n Layman & & 46.0 & - & 60.7 \\\\\n PoeLM & & 37.3 & 39.3 & - \\\\\n \\end{tabular}\n \\caption{Comparison of different methods}\n \\label{tab:methods_comparison}\n\\end{table}\n\n\\end{document}\n```\n\n### Explanation:\n- The `multirow` package is used to create the multi-row header for `S1` and `S2`.\n- The `booktabs` package is used for improved table formatting (with `\\hline` for horizontal lines).\n- Adjust the table's caption and label as needed.."
}
],
"output": [
{
"role": "assistant",
"content": "{\"steps\":[{\"description\":\"Assess if the provided LaTeX code correctly matches the structure of the target table, including the diagonal header, column counts, and alignment.\",\"conclusion\":\"The code fails to create the diagonal split between S1 and S2 and mismatches column counts (defines 4 columns but uses 5).\"},{\"description\":\"Check the header layout: the target table has a single diagonal cell spanning two axes and three following columns labeled Expert, Layman, PoeLM. The model uses \\\\multirow and a \\\\multicolumn block named 'Methods', which does not replicate the diagonal or correct labeling.\",\"conclusion\":\"Header structure is incorrect and does not match the prompt's table.\"},{\"description\":\"Verify the data rows: the model code includes two empty cells after S1 and before the data, misaligning all data entries relative to the intended columns.\",\"conclusion\":\"Data rows are misaligned due to incorrect column definitions.\"},{\"description\":\"Overall compatibility: the code is syntactically flawed for the target table and conceptually does not replicate the diagonal header or correct column count.\",\"conclusion\":\"The response does not satisfy the prompt.\"}],\"result\":0.0}"
}
],
"finish_reason": "stop",
"model": "o4-mini-2025-04-16",
"usage": {
"total_tokens": 2185,
"completion_tokens": 712,
"prompt_tokens": 1473,
"cached_tokens": 0
},
"error": null,
"seed": null,
"temperature": 1.0,
"top_p": 1.0,
"reasoning_effort": null,
"max_completions_tokens": 4096
},
"passed": false,
"score": 0.0
}
],
"run_id": "evalrun_687833dbadd081919a0f9fbfb817baf4",
"sample": "Sample(error=None, finish_reason='stop', input=[SampleInput(content='Please provide latex code to replicate this table', role='user'), SampleInput(content='<image>https://storage.googleapis.com/reka-annotate.appspot.com/vibe-eval/difficulty-normal_table0_b312eea68bcd0de6.png</image>', role='user')], max_completion_tokens=None, model='gpt-4o-mini-2024-07-18', output=[SampleOutput(content=\"Certainly! Below is the LaTeX code to replicate the table you provided:\\n\\n```latex\\n\\\\documentclass{article}\\n\\\\usepackage{array}\\n\\\\usepackage{multirow}\\n\\\\usepackage{booktabs}\\n\\n\\\\begin{document}\\n\\n\\\\begin{table}[ht]\\n \\\\centering\\n \\\\begin{tabular}{c|c|c|c}\\n \\\\multirow{2}{*}{S1} & \\\\multirow{2}{*}{S2} & \\\\multicolumn{3}{c}{Methods} \\\\\\\\ \\n \\\\cline{3-5}\\n & & Expert & Layman & PoeLM \\\\\\\\\\n \\\\hline\\n Expert & & - & 54.0 & 62.7 \\\\\\\\\\n Layman & & 46.0 & - & 60.7 \\\\\\\\\\n PoeLM & & 37.3 & 39.3 & - \\\\\\\\\\n \\\\end{tabular}\\n \\\\caption{Comparison of different methods}\\n \\\\label{tab:methods_comparison}\\n\\\\end{table}\\n\\n\\\\end{document}\\n```\\n\\n### Explanation:\\n- The `multirow` package is used to create the multi-row header for `S1` and `S2`.\\n- The `booktabs` package is used for improved table formatting (with `\\\\hline` for horizontal lines).\\n- Adjust the table's caption and label as needed.\", role='assistant')], seed=None, temperature=1.0, top_p=1.0, usage=SampleUsage(cached_tokens=0, completion_tokens=295, prompt_tokens=14187, total_tokens=14482), max_completions_tokens=4096)",
"status": "fail",
"_datasource_item_content_hash": "bb2090df47ea2ca0aa67337709ce2ff7382d639118d3358068b0cc7031c12f82"
}
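Beyond inspecting items one by one, you can aggregate the grader verdicts across the run. A minimal sketch, assuming each output item's first result carries the `passed` and `score` fields shown above:

# Aggregate pass/fail verdicts and scores across all output items
passed = [item.results[0]["passed"] for item in output_items]
scores = [item.results[0]["score"] for item in output_items]
print(f"pass rate: {sum(passed)}/{len(passed)}, mean score: {sum(scores) / len(scores):.2f}")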
Additional: logs data source
As mentioned earlier, the Evals API also supports logs (i.e., stored completions or responses) containing images as a data source. To use this, change your eval configuration as follows:
Eval creation
- Set `data_source_config = { "type": "logs" }`
- Modify the templates in `grader_config` to use `{{item.input}}` and/or `{{sample.output_text}}`, which represent the input and output of the log, respectively

Eval run creation
- In the `data_source` field, specify the filters used to fetch the corresponding logs (see the docs for more info); a sketch follows this list
Conclusion
In this guide, we covered the evaluation workflow for image-based tasks using the OpenAI Evals API. By using image input capabilities for both sampling and model grading, we were able to streamline our evals process for the task.
We're excited to see you extend this to your own image-based use cases, whether it's OCR accuracy, grading image generation, or beyond!