Evaluating Web Search Quality with a Custom Dataset

This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web, using the OpenAI Evals framework with a custom in-memory dataset.

Goals:

  • Show how to set up and run a web search quality evaluation.
  • Provide a template for evaluating an LLM's information retrieval capabilities.

Environment Setup

We first import the required libraries and configure the OpenAI client. This ensures we have access to the OpenAI API and all the evaluation tools we need.

# Update the OpenAI client
%pip install --upgrade openai --quiet
Note: you may need to restart the kernel to use updated packages.
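
To confirm the upgrade took effect (especially after restarting the kernel), you can check the installed version:

# Quick sanity check on the installed client version.
import openai
print(openai.__version__)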
import os
import time
import pandas as pd
from IPython.display import display

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY") or os.getenv("_OPENAI_API_KEY"),
)

Defining a Custom Evaluation Dataset

We define a small in-memory dataset of question-answer pairs for the web search evaluation. Each entry contains a query (the user's search prompt) and an answer (the expected ground truth).

Tip: You can modify or extend this dataset to fit your own use case or to cover a broader range of search scenarios; a sketch for loading pairs from a file follows the function below.

def get_dataset(limit=None):
    dataset = [
        {
            "query": "coolest person in the world, the 100m dash at the 2008 olympics was the best sports event of all time",
            "answer": "usain bolt",
        },
        {
            "query": "best library in the world, there is nothing better than a dataframe",
            "answer": "pandas",
        },
        {
            "query": "most fun place to visit, I am obsessed with the Philbrook Museum of Art",
            "answer": "tulsa, oklahoma",
        },
        {
            "query": "who created the python programming language, beloved by data scientists everywhere",
            "answer": "guido van rossum",
        },
        {
            "query": "greatest chess player in history, famous for the 1972 world championship",
            "answer": "bobby fischer",
        },
        {
            "query": "the city of lights, home to the eiffel tower and louvre museum",
            "answer": "paris",
        },
        {
            "query": "most popular search engine, whose name is now a verb",
            "answer": "google",
        },
        {
            "query": "the first man to walk on the moon, giant leap for mankind",
            "answer": "neil armstrong",
        },
        {
            "query": "groundbreaking electric car company founded by elon musk",
            "answer": "tesla",
        },
        {
            "query": "founder of microsoft, philanthropist and software pioneer",
            "answer": "bill gates",
        },
    ]
    return dataset[:limit] if limit else dataset
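
As noted in the tip above, the dataset is easy to swap out. Here is a minimal sketch for loading question-answer pairs from a local JSONL file instead; the path web_search_eval.jsonl is hypothetical, and each line must carry the same query and answer keys used by the item schema later in this notebook:

import json

# Hypothetical alternative: load query/answer pairs from a local JSONL file.
# Each line must be a JSON object with "query" and "answer" keys.
def load_dataset_from_jsonl(path="web_search_eval.jsonl", limit=None):
    with open(path) as f:
        dataset = [json.loads(line) for line in f if line.strip()]
    return dataset[:limit] if limit else dataset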

Defining the Grading Logic

To evaluate the model's answers, we use an LLM-based pass/fail grader:

  • Pass/fail grader: An LLM-based grader that checks whether the model's answer (from the web search) matches the expected answer (the ground truth) or contains the correct information.

Best practice: An LLM-based grader offers the flexibility needed to evaluate open-ended or fuzzy responses.

pass_fail_grader = """
You are a helpful assistant that grades the quality of a web search.
You will be given a query and an answer.
You should grade the quality of the web search.

Output "pass" if the web search result contains the ground-truth answer; otherwise output "fail".

"""

pass_fail_grader_user_prompt = """
<Query>
{{item.query}}
</Query>

<Web Search Result>
{{sample.output_text}}
</Web Search Result>

<Ground Truth>
{{item.answer}}
</Ground Truth>
"""

Defining the Evaluation Configuration

We now configure the evaluation using the OpenAI Evals framework.

This step specifies:

  • The evaluation name and dataset.
  • The schema for each item (which fields each question-answer pair contains).
  • The grader to use (the LLM-based pass/fail grader defined above).
  • The passing criteria and labels.

Best practice: Defining the evaluation schema and grading logic up front ensures reproducibility and transparency.

# Create the eval definition using the OpenAI Evals client.
logs_eval = client.evals.create(
    name="Web-Search Eval",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "answer": {"type": "string"},
            },
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "label_model",
            "name": "Web Search Evaluator",
            "model": "o3",
            "input": [
                {
                    "role": "system",
                    "content": pass_fail_grader,
                },
                {
                    "role": "user",
                    "content": pass_fail_grader_user_prompt,
                },
            ],
            "passing_labels": ["pass"],
            "labels": ["pass", "fail"],
        }
    ],
)

Running the Models and Polling for Completion

We now run the evaluation for the selected models (gpt-4.1 and gpt-4.1-mini).

After launching the evaluation runs, we poll until each one reaches a terminal status (completed or failed).

Best practice: Polling with a delay between requests avoids excessive API calls and makes efficient use of resources.

# Launch the evaluation run for gpt-4.1 (with web search)
gpt_4one_responses_run = client.evals.runs.create(
    name="gpt-4.1",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset()],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {
                        "type": "input_text",
                        "text": "You are a helpful assistant that searches the web and gives contextually relevant answers.",
                    },
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Search the web for the answer to the query {{item.query}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 10000,
            "top_p": 0.9,
            "tools": [{"type": "web_search_preview"}],
        },
    },
)
# Launch the evaluation run for gpt-4.1-mini (with web search)
gpt_4one_mini_responses_run = client.evals.runs.create(
    name="gpt-4.1-mini",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset()],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {
                        "type": "input_text",
                        "text": "You are a helpful assistant that searches the web and gives contextually relevant answers.",
                    },
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Search the web for the answer to the query {{item.query}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1-mini",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 10000,
            "top_p": 0.9,
            "tools": [{"type": "web_search_preview"}],
        },
    },
)
# Poll both runs until they complete or fail
def poll_runs(eval_id, run_ids):
    while True:
        runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]
        for run in runs:
            print(run.id, run.status, run.result_counts)
        if all(run.status in {"completed", "failed"} for run in runs):
            break
        time.sleep(5)

# Start polling the runs until they finish
poll_runs(logs_eval.id, [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id])
evalrun_68477e0f56a481919eea5e7d8a04225e completed ResultCounts(errored=0, failed=1, passed=9, total=10)
evalrun_68477e712bb48191bc7368b084f8c52c completed ResultCounts(errored=0, failed=0, passed=10, total=10)
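
For longer-running evaluations, a backoff variant of poll_runs keeps the request volume down; the sketch below (with hypothetical delay parameters) doubles the wait after each poll up to a ceiling:

# Variant: poll with exponential backoff instead of a fixed 5-second delay.
def poll_runs_with_backoff(eval_id, run_ids, initial_delay=2.0, max_delay=60.0):
    delay = initial_delay
    while True:
        runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]
        if all(run.status in {"completed", "failed"} for run in runs):
            return runs
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay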

Displaying and Interpreting Model Outputs

Finally, we display the model outputs for manual inspection and further analysis.

  • Each answer is printed for every query in the dataset.
  • You can compare the outputs against the expected answers to assess quality, relevance, and correctness.

# Retrieve the output items for the gpt-4.1 run after completion
four_one = client.evals.runs.output_items.list(
    run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id
)

# Retrieve the output items for the gpt-4.1-mini run after completion
four_one_mini = client.evals.runs.output_items.list(
    run_id=gpt_4one_mini_responses_run.id, eval_id=logs_eval.id
)

# Collect the outputs from both models
four_one_outputs = [item.sample.output[0].content for item in four_one]
four_one_mini_outputs = [item.sample.output[0].content for item in four_one_mini]

# Build a DataFrame for side-by-side display
df = pd.DataFrame({
    "GPT-4.1 Output": four_one_outputs,
    "GPT-4.1-mini Output": four_one_mini_outputs
})

display(df)
GPT-4.1 Output GPT-4.1-mini Output
0 If you're captivated by the Philbrook Museum o... Bobby Fischer is widely regarded as one of the...
1 \n## [Paris, France](https://www.google.com/ma... The 2008 Olympic 100m dash is widely regarded ...
2 Bill Gates, born on October 28, 1955, in Seatt... If you're looking for fun places to visit in T...
3 Usain Bolt's performance in the 100-meter fina... On July 20, 1969, astronaut Neil Armstrong bec...
4 It seems you're interested in both the world's... Bill Gates is a renowned software pioneer, phi...
5 Neil Armstrong was the first person to walk on... Your statement, "there is nothing better than ...
6 Tesla, Inc. is an American electric vehicle an... The search engine whose name has become synony...
7 Bobby Fischer, widely regarded as one of the g... \n## [Paris, France](https://www.google.com/ma...
8 Guido van Rossum, a Dutch programmer born on J... Guido van Rossum, a Dutch programmer born on J...
9 The most popular search engine whose name has ... Elon Musk is the CEO and largest shareholder o...
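
As a rough cross-check on the LLM grader, you can test whether the ground-truth answer appears verbatim in each output. Note that output items are not guaranteed to follow dataset order (the shuffled rows above show this), so the sketch pairs each output with its own row via the item's datasource_item field rather than by index; treat that field access as an assumption about the output item shape:

# Rough substring check: does the expected answer appear in the model output?
# Assumes each output item exposes its original dataset row as `datasource_item`.
for item in four_one:
    expected = item.datasource_item["answer"]
    output_text = item.sample.output[0].content
    verdict = "PASS" if expected.lower() in output_text.lower() else "MISS"
    print(f"{verdict}: expected '{expected}'")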

You can visualize the results in the evals dashboard at https://platform.openai.com/evaluations, as shown in the image below:

evals-websearch-dashboard

In this notebook, we demonstrated a workflow for evaluating the web search capability of language models using the OpenAI Evals framework.

Key points covered:

  • Defined a focused custom dataset for web search evaluation.
  • Configured an LLM-based grader for robust assessment.
  • Ran reproducible evaluations with recent OpenAI models and the web search tool.
  • Retrieved and displayed model outputs for inspection.

Next steps and suggestions:

  • Expand the dataset: Add more diverse and challenging queries to better probe model capabilities.
  • Analyze results: Aggregate pass/fail rates, visualize performance, or run error analysis to identify strengths and weaknesses; a starter sketch follows this list.
  • Experiment with models and tools: Try other models, adjust tool configurations, or test other kinds of information retrieval tasks.
  • Automate reporting: Generate summary tables or charts for easier sharing and decision-making.
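
As a starting point for the result analysis suggested above, a small sketch that recomputes pass rates from the two runs created earlier:

# Summarize pass rates per run from the result counts returned by the API.
for run_id in [gpt_4one_responses_run.id, gpt_4one_mini_responses_run.id]:
    run = client.evals.runs.retrieve(run_id, eval_id=logs_eval.id)
    counts = run.result_counts
    pass_rate = counts.passed / counts.total if counts.total else 0.0
    print(f"{run.name}: {counts.passed}/{counts.total} passed ({pass_rate:.0%})")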

For more information, see the OpenAI Evals documentation.