GPT-5 Prompt Migration and Improvement: Using the New Prompt Optimizer

The GPT-5 model family is the most intelligent we have released to date and represents a step change in overall model capability. GPT-5 is especially strong at agentic task performance, coding, and steerability, making it a great fit for everyone from curious users to advanced researchers.

GPT-5 benefits from all of the traditional prompting best practices, and to help you build the best prompts we have introduced the GPT-5 prompting guide, which explains how to get the most out of its state-of-the-art capabilities. In addition, we have introduced a GPT-5-specific Prompt Optimizer in the Playground to help users get started improving existing prompts and migrating prompts to GPT-5 and other OpenAI models.

In this guide, we cover how to get up and running quickly on tasks with GPT-5. We share measurable improvements on common tasks and walk through how to use the Prompt Optimizer to achieve the same results.

Migrating and optimizing prompts

Crafting effective prompts is a key skill when working with LLMs. The goal of the Prompt Optimizer is to bring your prompts in line with the best practices and formatting that work best with our models. The optimizer also removes common prompt failure modes, such as:

  • Contradictions in the prompt instructions
  • Missing or unclear format specifications
  • Inconsistencies between the prompt and few-shot examples

Beyond tuning the prompt for the target model, the optimizer also identifies the specific task you are trying to accomplish and applies key practices that improve performance on agentic workflows, coding, and multimodality. Let's walk through some before-and-after comparisons to see where prompt optimization shines.

Keep in mind that prompting is not a one-size-fits-all experience, so we recommend thorough experimentation and iteration to find the best solution for your problem.

Make sure you have set your OpenAI API key as OPENAI_API_KEY and have access to GPT-5.

import os

required = ('OPENAI_API_KEY',)
missing = [k for k in required if not os.getenv(k)]
print('OPENAI_API_KEY is set!' if not missing else 'Missing environment variable: ' + ', '.join(missing) + '. Please set them before running the workflow.')
OPENAI_API_KEY is set!
## Let's install our required packages
%pip install -r requirements.txt --quiet

Coding & Analysis: Streaming Top-K Frequent Tokens

We start in an area where the model shows dramatic improvements: coding and analysis. We will ask the model to generate a Python script that computes the exact Top-K most frequent tokens from a large text stream using a specific tokenization spec. Tasks like this are very sensitive to poor prompting, because it can push the model toward the wrong algorithms and approaches (approximate sketches versus exact multi-pass/disk-backed solutions), which significantly affects accuracy and runtime.

For this task, we will evaluate:

  1. Compile/execution success rate over 30 runs
  2. Average runtime (successful runs)
  3. Average peak memory (successful runs)
  4. Exactness: the output matches the ground-truth Top-K, with ties broken by count descending, then token ascending

Note: evaluated on an M4 Max MacBook Pro; adjust the constraints as needed.
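As a reference for criterion 4, the ground truth under this spec can be computed with a few lines of standard-library Python. This is a sketch of the specification, not the evaluation harness itself.

```python
import re
from collections import Counter

def reference_top_k(text: str, k: int) -> list[tuple[str, int]]:
    # Spec: ASCII [a-z0-9]+ tokens over lowercased text; deterministic
    # ordering by count descending, then token ascending.
    tokens = re.findall(r"[a-z0-9]+", text.lower(), flags=re.ASCII)
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]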

Our baseline prompt

For our example, let's look at a typical starting prompt that contains some mild contradictions, ambiguities, and under-specified instructions. Contradictions in the instructions generally degrade performance and increase latency, especially in reasoning models like GPT-5, while ambiguous instructions can lead to undesired behavior.

baseline_prompt = """
Write Python to solve the task on a MacBook Pro (M4 Max). Keep it fast and lightweight.

- Prefer the standard library; use external packages if they make things simpler.
- Stream the input in a single pass to keep memory low; re-read or cache if it makes the solution clearer.
- Aim for exact results; approximations are fine when they don't change the outcome in practice.
- Avoid global state; expose a convenient global like top_k for easy inspection.
- Keep comments minimal; add brief explanations where helpful.
- Sort the results in a natural, human-friendly way; follow strict tie rules where applicable.

Output only one self-contained Python script in a single Python code block, with all imports, runnable as-is.
"""

This baseline prompt is the kind of thing you might get from asking ChatGPT to write a prompt for you, or from chatting with a friend who knows coding but isn't particularly focused on your specific use case. Our baseline prompt is intentionally short and friendly, but it hides mixed signals that can push the model toward inconsistent families of solutions.

First, we say to prefer the standard library, then immediately allow external packages "if they make things simpler." That soft permission can nudge the model toward non-portable dependencies or heavier imports, changing performance and even execution success across environments.

Next, we encourage a single streaming pass to keep memory low, but we also allow re-reading or caching "if it makes the solution clearer." That ambiguity opens the door to multi-pass designs or in-memory caches that defeat the original streaming constraint and can change the runtime and memory profile.

We also ask for exact results while permitting approximations "when they don't change the outcome in practice." That is a judgment the model cannot reliably verify. It may introduce sketches or heuristics that subtly shift counts near the Top-K boundary, producing results that look right but fail a strict evaluation.

We advise avoiding global state, yet suggest exposing a convenient global like top_k. That mixes interface conventions: should functions return data, or should callers read a global? The model may implement both, creating side effects that complicate evaluation and reproducibility.

The documentation guidance is similarly split: "keep comments minimal" but "add brief explanations." Depending on how the model interprets this, you may get under-explained code or prose interwoven with the logic that sometimes spills outside the required output format.

Finally, we ask for "natural, human-friendly" sorting while also mentioning strict tie rules. These are not always the same. The model may pick a convenient ordering (e.g., Counter.most_common) and deviate from the evaluator's canonical (-count, token) ordering, especially on ties, leading to subtle correctness errors; the quick demo below shows how the two orderings can differ.
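A small illustration of that divergence: Counter.most_common sorts by count only, so ties fall back to insertion order rather than the alphabetical tie-break the evaluator expects.

```python
from collections import Counter

cnt = Counter(["pear", "apple", "pear", "apple", "kiwi"])

# most_common breaks the pear/apple tie by first-seen order, not alphabetically.
print(cnt.most_common(2))
# [('pear', 2), ('apple', 2)]

# The evaluator's canonical ordering: count desc, then token asc.
print(sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0]))[:2])
# [('apple', 2), ('pear', 2)]
```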

Why this matters: softened constraints make the prompt feel easy to satisfy, but they create forks in the road. Across runs, the model may pick different branches (standard library versus external dependencies, single pass versus re-read/cache, exact versus approximate), leading to variability in correctness, latency, and memory.

Our evaluator remains strict: fixed [a-z0-9]+ tokenization on lowercased text and deterministic (-count, token) ordering. Any deviation here is penalized on exactness even when the rest of the solution looks reasonable.

Let's see how it performs: generating 30 code scripts with the baseline prompt

Using the OpenAI Responses API, we call the model 30 times with our baseline prompt and save each response as a Python file in results_topk_baseline. This can take a while.

from scripts.gen_baseline import generate_baseline_topk

MODEL = "gpt-5"
N_RUNS = 30
CONCURRENCY = 10
OUTPUT_DIR = "results_topk_baseline"

USER_PROMPT = """
Task:
Given globals text (str) and k (int), produce the Top-K most frequent tokens.

Tokenization:

- Case-insensitive tokenization using an ASCII regex; produce lowercase tokens. Whole-string lowercasing is not required.
- Tokens are ASCII [a-z0-9]+ sequences; treat all other characters as separators.

Output:

- Define top_k as a list of (token, count) tuples.
- Sort by count desc, then token asc.
- Length = min(k, number of unique tokens).

Notes:

- Run as-is with the provided globals; no file or network I/O.
"""

generate_baseline_topk(
    model=MODEL,
    n_runs=N_RUNS,
    concurrency=CONCURRENCY,
    output_dir=OUTPUT_DIR,
    dev_prompt=baseline_prompt,
    user_prompt=USER_PROMPT,
)
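For reference, each of those runs boils down to roughly one Responses API call like the sketch below. The actual generation logic lives in scripts/gen_baseline.py, so treat this as an illustration rather than the exact implementation.

```python
from openai import OpenAI

client = OpenAI()

# One generation, roughly what each run performs under the hood (sketch).
resp = client.responses.create(
    model=MODEL,
    input=[
        {"role": "developer", "content": baseline_prompt},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(resp.output_text[:200])  # preview the generated script
```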

Evaluate Generated Scripts - Baseline Prompt

We then benchmark each script in results_topk_baseline. This evaluation is intentionally heavy, running on a larger dataset, and can take several minutes.
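To give a feel for what such a benchmark measures, here is a minimal sketch of timing one generated script and recording its peak Python-level memory with tracemalloc. The real harness is scripts/topk_eval.py (invoked below) and may measure memory differently, so the function below is illustrative.

```python
import time
import tracemalloc
import runpy

def run_one(script_path: str, text: str, k: int):
    # Execute a generated script with the required globals injected, timing it
    # and tracking peak Python allocations (an approximation of peak memory).
    tracemalloc.start()
    t0 = time.perf_counter()
    globs = runpy.run_path(script_path, init_globals={"text": text, "k": k})
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return globs.get("top_k"), elapsed, peak / 1024  # (result, seconds, KB)
```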

from scripts.topk_eval import evaluate_folder

evaluate_folder(
    folder_path="results_topk_baseline",
    k=500,
    scale_tokens=5_000_000,
    csv_path="run_results_topk_baseline.csv",
)

Optimizing our prompt

Now let's use the prompt optimization tool in the console to improve our prompt, then look at the results. We can start by visiting the OpenAI Optimize Playground and pasting our existing prompt into the developer message section.

Then press the "Optimize" button. This opens the optimization panel. At this stage you can provide specific edits you would like reflected in the prompt, or simply press "Optimize" to tune it against best practices for the target model and task. To start, let's do just that.

[Screenshot: Prompt Optimizer]

Once it's done, you will see the results of the prompt optimization. In our example below, you'll notice a number of changes to the prompt, along with a summary of what was changed and why. You can interact with the changes by opening the annotations or by using the inline reviewer mode.

We will add one additional change:

  • Enforce single-pass streaming

The iterative process of using the Prompt Optimizer makes this easy.

[Screenshot: Prompt Optimizer]

Once we are happy with the optimized version of the prompt, we can save it as a Prompt Object using the button in the top right of the optimizer. We can then use this object in API calls, which helps with future iteration, version management, and reuse across different applications.

[Screenshot: Prompt Optimizer]
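As a sketch of what reuse looks like, a saved prompt can be referenced in the Responses API by its ID and version instead of pasting the text inline; the ID below is a placeholder.

```python
from openai import OpenAI

client = OpenAI()

# Reference the saved Prompt Object by ID/version (placeholder ID shown).
resp = client.responses.create(
    model="gpt-5",
    prompt={"id": "pmpt_REPLACE_WITH_YOUR_ID", "version": "1"},
    input=USER_PROMPT,
)
print(resp.output_text[:200])
```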

Let's see how it performs: evaluating our improved prompt

For visibility, we provide our new optimized prompt inline here, but you could also pass the prompt_id and version. Let's start by writing out our optimized prompt.

optimized_prompt = """# Objective
Generate a single, self-contained Python script that exactly solves the specified task on a MacBook Pro (M4 Max).

# Hard requirements

- Use only Python stdlib. No approximate algorithms.
- Tokenization: ASCII [a-z0-9]+ on the original text; match case-insensitively and lowercase tokens individually. Do NOT call text.lower() on the full string.
- Exact Top‑K semantics: sort by count desc, then token asc. No reliance on Counter.most_common tie behavior.
- Define `top_k` as a list of (token, count) tuples with length = min(k, number of unique tokens).
- When globals `text` (str) and `k` (int) exist, do not reassign them; set `top_k` from those globals. If you include a `__main__` demo, guard it to run only when globals are absent.
- No file I/O, stdin, or network access, except optionally printing `top_k` as the last line.

# Performance & memory constraints

- Do NOT materialize the entire token stream or any large intermediate list.
- Do NOT sort all unique (token, count) items unless k >= 0.3 * number_of_unique_tokens.
- When k < number_of_unique_tokens, compute Top‑K using a bounded min‑heap of size k over counts.items(), maintaining the correct tie-break (count desc, then token asc).
- Target peak additional memory beyond the counts dict to O(k). Avoid creating `items = sorted(counts.items(), ...)` for large unique sets.

# Guidance

- Build counts via a generator over re.finditer with re.ASCII | re.IGNORECASE; lowercase each matched token before counting.
- Prefer heapq.nsmallest(k, cnt.items(), key=lambda kv: (-kv[1], kv[0])) for exact selection without full sort; avoid heapq.nlargest.
- Do NOT wrap tokens in custom comparator classes (e.g., reverse-lex __lt__) or rely on tuple tricks for heap ordering.
- Keep comments minimal; include a brief complexity note (time and space).

# Output format

- Output only one Python code block; no text outside the block.

# Examples 
```python
import re, heapq
from collections import Counter
from typing import List, Tuple, Iterable

_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE)

def _tokens(s: str) -> Iterable[str]:
    # Case-insensitive match; lowercase per token to avoid copying the whole string
    for m in _TOKEN.finditer(s):
        yield m.group(0).lower()

def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]:
    if k <= 0:
        return []
    cnt = Counter(_tokens(text))
    u = len(cnt)
    key = lambda kv: (-kv[1], kv[0])
    if k >= u:
        return sorted(cnt.items(), key=key)
    # Exact selection with bounded memory
    return heapq.nsmallest(k, cnt.items(), key=key)

# Compute from provided globals when available; demo only if missing and running as main
try:
    text; k  # type: ignore[name-defined]
except NameError:
    if __name__ == "__main__":
        demo_text = "A a b b b c1 C1 c1 -- d! d? e"
        demo_k = 3
        top_k = top_k_tokens(demo_text, demo_k)
        print(top_k)
else:
    top_k = top_k_tokens(text, k)  # type: ignore[name-defined]
# Complexity: counting O(N tokens), selection O(U log k) via heapq.nsmallest; extra space O(U + k)

"""

Generating 30 code scripts with the Optimized prompt

from scripts.gen_optimized import generate_optimized_topk

MODEL = "gpt-5"
N_RUNS = 30
CONCURRENCY = 10
OUTPUT_DIR = "results_topk_optimized"

USER_PROMPT = """
Task:
Given globals text (str) and k (int), produce the Top-K most frequent tokens.

Tokenization:

- Case-insensitive tokenization using an ASCII regex; produce lowercase tokens. Whole-string lowercasing is not required.
- Tokens are ASCII [a-z0-9]+ sequences; treat all other characters as separators.

Output:

- Define top_k as a list of (token, count) tuples.
- Sort by count desc, then token asc.
- Length = min(k, number of unique tokens).

Notes:

- Run as-is with the provided globals; no file or network I/O.
"""

generate_optimized_topk(
    model=MODEL,
    n_runs=N_RUNS,
    concurrency=CONCURRENCY,
    output_dir=OUTPUT_DIR,
    dev_prompt=optimized_prompt,
    user_prompt=USER_PROMPT,
)

Evaluate Generated Scripts - Optimized Prompt

We run the same evaluation as above, but now with our optimized prompt, to see if there were any improvements.

from scripts.topk_eval import evaluate_folder

evaluate_folder(
    folder_path="results_topk_optimized",
    k=500,
    scale_tokens=5_000_000,
    csv_path="run_results_topk_optimized.csv",
)

Adding LLM-as-a-Judge Grading

Along with more quantitative evaluations, we can measure the model's performance on more qualitative metrics like code quality and task adherence. We have created a sample prompt for this called llm_as_judge.txt.

from scripts.llm_judge import judge_folder
# Run LLM-as-judge for baseline results
judge_folder(
    results_dir="results_topk_baseline",
    out_dir=None,  # auto-map to results_llm_as_judge_baseline
    model="gpt-5",
    system_prompt_path="llm_as_judge.txt",
    task_text=None,  # use default task description
    concurrency=6,
)
# Run LLM-as-judge for optimized results
judge_folder(
    results_dir="results_topk_optimized",
    out_dir=None,  # auto-map to results_llm_as_judge_optimized
    model="gpt-5",
    system_prompt_path="llm_as_judge.txt",
    task_text=None,
    concurrency=6,
)

Summarizing the results

We can now summarize the improvements from both a quantitative standpoint and a qualitative standpoint using our LLM-as-Judge results.

from pathlib import Path
import importlib
import scripts.results_summarizer as rs
from IPython.display import Markdown, display

importlib.reload(rs)

fig = rs.render_charts(
    quant_baseline=Path("results_topk_baseline")/"run_results_topk_baseline.csv",
    quant_optimized=Path("results_topk_optimized")/"run_results_topk_optimized.csv",
    judge_baseline=Path("results_llm_as_judge_baseline")/"judgement_summary.csv",
    judge_optimized=Path("results_llm_as_judge_optimized")/"judgement_summary.csv",
    auto_display=True,
    close_after=True,
)
md = rs.build_markdown_summary(
    quant_baseline=Path("results_topk_baseline")/"run_results_topk_baseline.csv",
    quant_optimized=Path("results_topk_optimized")/"run_results_topk_optimized.csv",
    judge_baseline=Path("results_llm_as_judge_baseline")/"judgement_summary.csv",
    judge_optimized=Path("results_llm_as_judge_optimized")/"judgement_summary.csv",
)

display(Markdown(md))

print(md)


Prompt Optimization Results - Coding Tasks

| Metric              | Baseline | Optimized | Δ (Opt − Base) |
|---------------------|---------:|----------:|---------------:|
| Avg Time (s)        |    7.906 |     6.977 |         -0.929 |
| Peak Memory (KB)    |   3626.3 |     577.5 |        -3048.8 |
| Exact (%)           |    100.0 |     100.0 |            0.0 |
| Sorted (%)          |    100.0 |     100.0 |            0.0 |
| LLM Adherence (1–5) |     4.40 |      4.90 |          +0.50 |
| Code Quality (1–5)  |     4.73 |      4.90 |          +0.16 |

Even though GPT-5 already produced correct code, prompt optimization tightened the constraints and cleared up ambiguities, showing an overall improvement in the results!


Context & Retrieval: Simulated Financial Q&A

Most production use cases face imperfect queries and noisy context. FailSafeQA is an excellent benchmark that deliberately perturbs the query (misspellings, incompleteness, out-of-domain phrasing) and the context (missing, OCR-corrupted, or irrelevant documents), and reports robustness and context-grounding compliance, i.e., whether the model can answer when the signal is present and abstain when it is not.

FailSafeQA diagram

Links

  • Paper (arXiv): Expect the Unexpected: FailSafe Long Context QA for Finance — https://arxiv.org/abs/2502.06329
  • Dataset (Hugging Face): https://huggingface.co/datasets/Writer/FailSafeQA
  • Authors/creators: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh (Writer.ai) — see the author list on the arXiv page above

We will run the FailSafeQA evaluation via a helper script and compare the baseline prompt against the optimized prompt side by side.
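If you would like to inspect the benchmark data directly before running the helper script, a minimal sketch for pulling it from Hugging Face is shown below. It assumes the `datasets` package is installed; check the dataset card for the exact split and field names.

```python
from datasets import load_dataset

# Load FailSafeQA from the Hugging Face Hub and peek at its structure.
ds = load_dataset("Writer/FailSafeQA")
print(ds)                          # available splits and sizes
first_split = next(iter(ds))
print(ds[first_split][0].keys())   # field names of one datapoint
```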

# Define the Baseline FailSafeQA system prompt here for reuse
baseline_prompt_fsqa = (
    "You are a finance QA assistant. Answer ONLY using the provided context.\n"
    "If the context is missing or irrelevant, politely refuse and state that you need the relevant document."
)

We again use the Prompt Optimizer to build a new prompt better suited to this use case. Drawing on best practices for long-context question answering, we know we should remind the answering model to rely on the information in the context section and to refuse when the context is insufficient. Using the Optimize button once, without any extra edits, gives us a reasonable prompt structure and an optimized prompt like the one below.

[Screenshot: Prompt Optimizer]

optimized_fsqa_prompt = """You are a finance document QA assistant.

Behavioral priorities (in order):

1) Grounding: Use ONLY the text inside [Context]. Do NOT use outside knowledge or assumptions.
2) Evidence check: Before answering, verify that the answer text (numbers, entities, dates, phrasing) is explicitly present or directly entailed by [Context]. If not, refuse (see Refusal policy).
3) Robustness to query noise: The user question may contain misspellings, missing words, or non-financial phrasing. Infer intent using the context and answer if the meaning is clear and supported by the context.
4) OCR noise handling: The context may include OCR artifacts (repeated characters, stray symbols, broken words). Ignore junk characters and reconstruct meaning when the underlying sentence is still recoverable. Do not guess beyond what the context supports.

Refusal policy:

- If [Context] is empty or lacks the information to answer, reply with a brief refusal and guidance. Do NOT attempt a general-knowledge answer.
- If the question is unrelated to the content of [Context] (out of scope), reply with a brief refusal and guidance. Do NOT speculate.
- If the question is incomplete but the correct answer is unambiguous from [Context], infer the intent and answer exactly; do NOT refuse.

Answer style:

- Default to the **shortest exact answer** needed to satisfy the question (e.g., the precise number/string/date as written). Preserve units, signs, casing, currency symbols, commas, and parentheses from the context. Do NOT round numbers unless asked.
- If the user explicitly asks to “write”, “draft”, or “generate” content, you may produce multi-sentence or formatted text—but still source every factual claim strictly from [Context].
- If the question is ambiguous, state the needed clarification in one short sentence, then provide the best supported answer if possible.

Output format:

- If answerable from the context:
  FINAL: <exact answer here>
  (optional) EVIDENCE: "<very short quoted span from the context that contains the answer>"

- If refusing:
  FINAL: Insufficient information in the provided context to answer this question. Please upload the relevant document or refine your question to include the necessary details."""

Now let's run our evaluation. For demonstration we will show the results of a single comparison, but you can also run the full evaluation. Note: this will take time.

import importlib
import run_FailSafeQA
import pandas as pd
import matplotlib.pyplot as plt
from openai import OpenAI

# Ensure latest function signature is used after code edits
importlib.reload(run_FailSafeQA)
run_failsafeqa = run_FailSafeQA.run_failsafeqa

# Set idx to an integer for a quick single-example comparison; set to None for full run
idx = 0  # e.g., 0 for a single datapoint

#Helper functions:
class OpenAIAnswer:
    def __init__(self):
        self.client = OpenAI()

    def __call__(self, system_prompt: str, user_prompt: str, model: str) -> str:
        resp = self.client.responses.create(
            model=model,
            input=[
                {"role": "developer", "content": [{"type": "input_text", "text": system_prompt}]},
                {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
            ],
            text={"format": {"type": "text"}, "verbosity": "medium"},
            reasoning={"effort": "medium", "summary": "auto"},
            tools=[],
        )
        return resp.output_text
class OpenAIJudge:
    def __init__(self):
        self.client = OpenAI()

    def __call__(self, prompt: str, model: str) -> str:
        resp = self.client.responses.create(
            model=model,
            input=[{"role": "user", "content": [{"type": "input_text", "text": prompt}]}],
            text={"format": {"type": "text"}, "verbosity": "medium"},
            reasoning={"effort": "medium", "summary": "auto"},
            tools=[],
        )
        return resp.output_text

if idx is not None:
    # Single example mode (with detailed prompt/response logging)
    run_failsafeqa(
        out="results_failsafeqa_baseline.csv",
        system_prompt=baseline_prompt_fsqa,
        indices=[idx],
        log_prompts=True,
        log_chars=800,
        log_file="failsafeqa_debug.log",
    )
    run_failsafeqa(
        out="results_failsafeqa_optimized.csv",
        system_prompt=optimized_fsqa_prompt,
        indices=[idx],
        log_prompts=True,
        log_chars=800,
        log_file="failsafeqa_debug.log",
    )

    base_df = pd.read_csv("results_failsafeqa_baseline.csv")
    opt_df = pd.read_csv("results_failsafeqa_optimized.csv")

    b_one = base_df[base_df["idx"] == idx]
    o_one = opt_df[opt_df["idx"] == idx]

    comparison_df = pd.concat([b_one, o_one], ignore_index=True)

    # Keep only relevant columns
    comparison_df = comparison_df[["run", "kind", "rating", "compliance"]]

    # Display as table
    display(comparison_df)

else:
    # Full run mode
    run_failsafeqa(out="results_failsafeqa_baseline.csv", system_prompt=baseline_prompt_fsqa)
    run_failsafeqa(out="results_failsafeqa_optimized.csv", system_prompt=optimized_fsqa_prompt)

    base_df = pd.read_csv("results_failsafeqa_baseline.csv")
    opt_df = pd.read_csv("results_failsafeqa_optimized.csv")

    def per_kind_summary(df: pd.DataFrame) -> pd.DataFrame:
        out = df.groupby("kind").agg(
            mean_rating=("rating", lambda x: pd.to_numeric(x, errors="coerce").mean()),
            compliance_rate=("compliance", lambda x: pd.to_numeric(x, errors="coerce").fillna(0).mean()),
            count=("rating", "count"),
        )
        return out.round(3)

    base_summary = per_kind_summary(base_df)
    opt_summary = per_kind_summary(opt_df)

    summary = base_summary.join(opt_summary, lsuffix="_base", rsuffix="_opt").fillna("NA")

    print("Per-kind comparison (baseline vs optimized):")
    display(summary)

    # Plot compliance rate comparison per kind
    kinds = summary.index.tolist()
    x = range(len(kinds))
    base_vals = summary["compliance_rate_base"].astype(float).tolist()
    opt_vals = summary["compliance_rate_opt"].astype(float).tolist()

    fig, ax = plt.subplots(figsize=(10, 4))
    width = 0.35
    ax.bar([i - width/2 for i in x], base_vals, width=width, label="Baseline", color="#cbd5e1")
    ax.bar([i + width/2 for i in x], opt_vals, width=width, label="Optimized", color="#60a5fa")
    ax.set_xticks(list(x))
    ax.set_xticklabels(kinds, rotation=45, ha="right")
    ax.set_ylim(0, 1)
    ax.set_ylabel("Compliance rate")
    ax.set_title("FailSafeQA — Per-kind Compliance (Baseline vs Optimized)")
    ax.legend()
    plt.tight_layout()
    plt.show()

    # Overall metrics
    def overall(df: pd.DataFrame):
        return {
            "mean_rating": float(pd.to_numeric(df["rating"], errors="coerce").mean()),
            "mean_compliance": float(pd.to_numeric(df["compliance"], errors="coerce").fillna(0).mean()),
        }

    print("Overall — Baseline:", overall(base_df))
    print("Overall — Optimized:", overall(opt_df))
from IPython.display import Markdown, display

def build_markdown_summary_from_metrics(
    robust_base: float, ground_base: float,
    robust_opt: float, ground_opt: float,
    threshold: int = 6,
    src_base: str = "results_failsafeqa.csv",
    src_opt: str = "results_failsafeqa.csv",
) -> str:
    d_r = robust_opt - robust_base
    d_g = ground_opt - ground_base

    # Data rows
    rows = [
        ["Metric", "Baseline", "Optimized", "Δ (Opt − Base)"],
        ["Robustness (avg across datapoints)", f"{robust_base:.3f}", f"{robust_opt:.3f}", f"{d_r:+.3f}"],
        ["Context Grounding (avg across datapoints)", f"{ground_base:.3f}", f"{ground_opt:.3f}", f"{d_g:+.3f}"],
    ]

    # Calculate column widths for alignment
    col_widths = [max(len(str(row[i])) for row in rows) for i in range(len(rows[0]))]

    # Build table lines with padding
    lines = []
    for i, row in enumerate(rows):
        padded = [str(cell).ljust(col_widths[j]) for j, cell in enumerate(row)]
        lines.append("| " + " | ".join(padded) + " |")
        if i == 0:  # after header
            sep = ["-" * col_widths[j] for j in range(len(row))]
            lines.append("| " + " | ".join(sep) + " |")

    table = "\n".join(lines)

    return f"""
## FailSafeQA — Summary

**Compliance threshold:** ≥ {threshold}

{table}

_Source files:_ `{src_base}` · `{src_opt}`
""".strip()

# Usage
md = build_markdown_summary_from_metrics(
    robust_base=0.320, ground_base=0.800,
    robust_opt=0.540, ground_opt=0.950,
    threshold=6,
    src_base="results_failsafeqa.csv",
    src_opt="results_failsafeqa.csv",
)

# Notebook pretty
display(Markdown(md))

print(md)

FailSafeQA — Summary

**Compliance threshold:** ≥ 6

| Metric                                     | Baseline | Optimized | Δ (Opt − Base) |
|--------------------------------------------|---------:|----------:|---------------:|
| Robustness (avg across datapoints)         |    0.320 |     0.540 |         +0.220 |
| Context Grounding (avg across datapoints)  |    0.800 |     0.950 |         +0.150 |

_Source files:_ `results_failsafeqa.csv` · `results_failsafeqa.csv`

GPT-5-mini performs well on this task, so even the baseline prompt almost always earns a score >= 4. However, if we compare the percentage of perfect judge scores (6/6), we find that the optimized prompt produces more perfect answers across both of FailSafeQA's answer-quality categories: robustness and context grounding.
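To back that comparison, here is a quick sketch for computing the share of perfect judge scores from the result CSVs. It assumes the same `rating` column used in the summaries above and a 6-point scale.

```python
import pandas as pd

def perfect_rate(path: str, perfect: int = 6) -> float:
    # Fraction of judge ratings that hit the maximum score.
    ratings = pd.to_numeric(pd.read_csv(path)["rating"], errors="coerce")
    return float((ratings >= perfect).mean())

for name, path in [("baseline", "results_failsafeqa_baseline.csv"),
                   ("optimized", "results_failsafeqa_optimized.csv")]:
    print(f"{name}: {perfect_rate(path):.1%} perfect (6/6) judge ratings")
```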

Conclusion

We're excited for you to try prompt optimization for GPT-5 in the OpenAI Playground. GPT-5 brings state-of-the-art intelligence, and strong prompting helps it reason more reliably, follow constraints, and produce cleaner, higher-quality results.

Try the Prompt Optimizer on your own tasks today.