Meta Prompting: A Guide to Automated Prompt Optimization

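Welcome to our cookbook on meta prompting! In this guide, we'll explore how refining a basic prompt can improve the quality of a language model's output, using news summarization as the running example.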

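Meta prompting is a technique in which an LLM is used to generate or improve prompts. Typically, a more capable model optimizes prompts for a less capable one. In other words, prompts are used to guide, structure, and optimize other prompts, so that they steer the LLM more effectively toward high-quality, relevant output. We'll use o1-preview, a more capable model with advanced reasoning abilities, to improve a prompt for a gpt-4o model (the code below generates summaries with gpt-4o-mini and scores them with gpt-4o).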

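We're committed to making your development journey with LLMs smoother and more accessible through this technique. Don't forget to check out the Generate Anything feature in the Playground; it's a great starting point for exploring meta prompting.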

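In this example, we'll start with a simple prompt for summarizing news articles and then refine it to see how the output improves. We'll use o1-preview to analyze and refine the prompt, adding detail and clarity along the way. Finally, we'll evaluate the outputs systematically to understand the impact of the refinements.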

import pandas as pd
import openai
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from pydantic import BaseModel
from datasets import load_dataset

client = openai.Client()

Importing the Data

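Let's start by importing the bbc_news_alltime dataset from HuggingFace (https://huggingface.co/datasets/RealTimeData/bbc_news_alltime). The dataset contains BBC News articles published each month from 2017 up to the most recent complete month. For our experiment, we'll focus on a sample from August 2024 to keep things current and manageable.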

ds = load_dataset("RealTimeData/bbc_news_alltime", "2024-08")
df = pd.DataFrame(ds['train']).sample(n=100, random_state=1)
df.head()

	title	published_date	authors	description	section	content	link	top_image
2662	Laura Whitmore: I was gaslighted after raising...	2024-08-04	https://www.facebook.com/bbcnews	The former Love Island host said that things s...	Culture	Television presenter Laura Whitmore has said t...	http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o	https://ichef.bbci.co.uk/ace/standard/2560/cps...
1865	Errollyn Wallen appointed as Master of the Kin...	2024-08-25	https://www.facebook.com/bbcnews	She is best known for her work on the 2012 Par...	Culture	Celebrated composer and singer-songwriter Erro...	http://www.bbc.co.uk/news/articles/c4gl758g7zgo	https://ichef.bbci.co.uk/ace/standard/2560/cps...
2554	SDLP: Matthew O'Toole endorses Claire Hanna fo...	2024-08-30	https://www.facebook.com/bbcnews	Matthew O'Toole had been named by some as a po...	Northern Ireland Politics	Matthew O'Toole leads his party's official opp...	http://www.bbc.co.uk/news/articles/cvg41j7xrzdo	https://ichef.bbci.co.uk/ace/standard/3840/cps...
1338	Rotherham rioters among those jailed - BBC News	2024-08-20	https://www.facebook.com/bbcnews	Two men who were part of a mob targeting a Hol...	South Yorkshire	Rotherham pair among those jailed for UK rioti...	http://www.bbc.co.uk/news/articles/cwywggd7qw6o	https://ichef.bbci.co.uk/ace/standard/2560/cps...
1232	BBC News - BBC iPlayer	2024-08-02	None	None	None	JavaScript seems to be disabled. Please enable...	http://www.bbc.co.uk/news/10318089

Iterating on Prompts

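Let's start with a simple prompt and then use o1-preview to enhance it for better results. We want to summarize news articles, so that is exactly what we'll ask the model to do.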

simple_prompt = "Summarize this news article: {article}"

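To improve the prompt, we need to give o1-preview the context and goals we want to achieve. We can then ask it to generate a more detailed prompt that should produce richer, more comprehensive news summaries.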

meta_prompt = """
Improve the following prompt to generate a more detailed summary.
Adhere to prompt engineering best practices.
Make sure the structure is clear and intuitive and contains the type of news, tags and sentiment analysis.

{simple_prompt}

Only return the prompt.
"""

def get_model_response(messages, model="o1-preview"):
    response = client.chat.completions.create(
        messages=messages,
        model=model,
    )
    return response.choices[0].message.content


complex_prompt = get_model_response([{"role": "user", "content": meta_prompt.format(simple_prompt=simple_prompt)}])
complex_prompt

'Please read the following news article and provide a comprehensive summary that includes:\n\n1. **Type of News**: Specify the category of the news article (e.g., Politics, Technology, Health, Sports, etc.).\n2. **Summary**: Write a concise and clear summary of the main points, ensuring the structure is logical and intuitive.\n3. **Tags**: List relevant keywords or tags associated with the article.\n4. **Sentiment Analysis**: Analyze the overall sentiment of the article (positive, negative, or neutral) and briefly explain your reasoning.\n\n**Article:**\n\n{article}'

Generating the Summaries

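Now that we have both prompts, let's generate the summaries! For each entry in the dataset, we'll use both the simple and the enhanced prompt so we can compare them. This lets us see firsthand how the refinements made with o1-preview lead to richer, more detailed summaries.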

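def generate_response(prompt):
    # Send a single-turn prompt to the summarization model
    messages = [{"role": "user", "content": prompt}]
    return get_model_response(messages, model="gpt-4o-mini")

def generate_summaries(row):
    article = row["content"]
    simple_summary = generate_response(simple_prompt.format(article=article))
    # The generated prompt ends with an {article} placeholder: fill it in rather
    # than appending after it. str.replace leaves any other braces untouched.
    if "{article}" in complex_prompt:
        complex_input = complex_prompt.replace("{article}", article)
    else:
        # Fall back to appending if the model omitted the placeholder
        complex_input = complex_prompt + "\n\n" + article
    complex_summary = generate_response(complex_input)
    return simple_summary, complex_summary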

Let's check that everything works and that we can generate a summary for the first news article.

generate_summaries(df.iloc[0])

('Television presenter Laura Whitmore has spoken out about her experiences on Strictly Come Dancing, revealing that issues she attempted to address during her tenure on the show are now coming to light. In an interview with The Irish Times, she described feeling "gaslit" and suggested that her concerns, which she raised eight years ago, were not taken seriously at the time. Whitmore recalled that her participation left her feeling "broken" and criticized how she was portrayed during the show. She mentioned contributing evidence to an ongoing review involving incidents of alleged inappropriate behavior during her time on the show, although she did not make an official complaint. The BBC, which has been navigating its own controversy related to the treatment of contestants, stated it is taking these claims seriously and plans to enhance welfare measures on the show, including the introduction of a chaperone at rehearsals. Recent allegations from other celebrities have further intensified the scrutiny of Strictly Come Dancing.\n',
 '1. **Type of News**: Entertainment/Television\n\n2. **Summary**: Laura Whitmore, a television presenter, has revealed that issues she raised eight years ago during her time on Strictly Come Dancing are now surfacing, describing her experience as "gaslighting." She stated that her concerns about inappropriate behavior were not taken seriously at the time, leading her to feel "broken" and negatively portrayed. Whitmore is reportedly providing evidence for a BBC investigation into the show, though she has not filed an official complaint. The BBC is facing scrutiny over contestant treatment and plans to implement new welfare measures, including chaperones during rehearsals. Other celebrities have also made allegations, increasing the pressure on the show.\n\n3. **Tags**: Laura Whitmore, Strictly Come Dancing, BBC, allegations, inappropriate behavior, gaslighting, welfare measures, entertainment controversy\n\n4. **Sentiment Analysis**: The overall sentiment is negative. The article details serious allegations of mistreatment and inappropriate behavior, along with Whitmore\'s personal account of emotional distress and professional difficulties. The tone is critical and highlights a concerning atmosphere regarding contestant treatment in the entertainment industry.')

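Comparing the summaries generated by the simple and enhanced prompts, we can already see a clear difference. The first summary gives a general overview of the article, whereas the enhanced output goes further: alongside the summary, it categorizes the type of news, lists relevant tags, and includes a sentiment analysis.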

Now let's test it on the entire dataset!

# Add new columns to the dataframe for storing summaries
df['simple_summary'] = None
df['complex_summary'] = None

# Use ThreadPoolExecutor to generate summaries concurrently
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(generate_summaries, row): index for index, row in df.iterrows()}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Generating Itineraries"):
        index = futures[future]
        simple_summary, complex_summary = future.result()
        df.at[index, 'simple_summary'] = simple_summary
        df.at[index, 'complex_summary'] = complex_summary

df.head()

Generating Itineraries: 100%|██████████| 100/100 [00:50<00:00,  1.98it/s]

	title	published_date	authors	description	section	content	link	top_image	simple_summary	complex_summary
2662	Laura Whitmore: I was gaslighted after raising...	2024-08-04	https://www.facebook.com/bbcnews	The former Love Island host said that things s...	Culture	Television presenter Laura Whitmore has said t...	http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o	https://ichef.bbci.co.uk/ace/standard/2560/cps...	Television presenter Laura Whitmore has spoken...	1. Type of News: Entertainment/Television\...
1865	Errollyn Wallen appointed as Master of the Kin...	2024-08-25	https://www.facebook.com/bbcnews	She is best known for her work on the 2012 Par...	Culture	Celebrated composer and singer-songwriter Erro...	http://www.bbc.co.uk/news/articles/c4gl758g7zgo	https://ichef.bbci.co.uk/ace/standard/2560/cps...	Errollyn Wallen has been appointed Master of t...	1. Type of News: Arts/Music\n\n2. **Summar...
2554	SDLP: Matthew O'Toole endorses Claire Hanna fo...	2024-08-30	https://www.facebook.com/bbcnews	Matthew O'Toole had been named by some as a po...	Northern Ireland Politics	Matthew O'Toole leads his party's official opp...	http://www.bbc.co.uk/news/articles/cvg41j7xrzdo	https://ichef.bbci.co.uk/ace/standard/3840/cps...	Matthew O'Toole, the leader of the official op...	1. Type of News: Politics\n\n2. *Summary...
1338	Rotherham rioters among those jailed - BBC News	2024-08-20	https://www.facebook.com/bbcnews	Two men who were part of a mob targeting a Hol...	South Yorkshire	Rotherham pair among those jailed for UK rioti...	http://www.bbc.co.uk/news/articles/cwywggd7qw6o	https://ichef.bbci.co.uk/ace/standard/2560/cps...	Two men, Nathan Palmer (29) and Niven Matthewm...	1. Type of News: Politics / Crime and Just...
1232	BBC News - BBC iPlayer	2024-08-02	None	None	None	JavaScript seems to be disabled. Please enable...	http://www.bbc.co.uk/news/10318089		The article discusses the need to enable JavaS...	I cannot provide a summary of the article as t...
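
In the loop above, a single failed API call makes `future.result()` re-raise the exception and stops the run. A minimal sketch of one way to keep going and collect failures separately; `run_concurrently` is a hypothetical helper (not part of the original notebook), and `fake_summarize` is a stand-in worker used here instead of a real API call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrently(worker, items, max_workers=8):
    """Run worker(item) for each (key, item) pair; collect results and errors separately.

    Hypothetical helper, not part of the original notebook.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, item): key for key, item in items}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole run
                errors[key] = exc
    return results, errors


def fake_summarize(text):
    # Stand-in for generate_summaries; fails on empty input to simulate an API error
    if not text:
        raise ValueError("empty article")
    return text.upper()


results, errors = run_concurrently(fake_summarize, [(0, "a"), (1, ""), (2, "c")])
```

The keys left in `errors` identify the rows that could be retried in a second pass.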

Evaluating the Results

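To assess the difference in performance between the two prompts, we'll use a structured evaluation approach with an LLM acting as a judge. This means a language model evaluates and compares the outputs against specific criteria.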

What does "LLM as a Judge" mean?

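Using an LLM as a judge means having a language model evaluate its own outputs or those of other models. It applies predefined criteria to assess aspects such as accuracy, clarity, and relevance. This approach helps us obtain consistent assessments at scale without relying on human raters, which makes it easier to identify improvements between prompts. Our cookbook Getting Started with OpenAI Evals offers some pointers on how to get started with this approach.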

Here is the prompt we'll use for evaluation:

evaluation_prompt = """
You are an expert editor tasked with evaluating the quality of a news article summary. Below is the original article and the summary to be evaluated:

**Original Article**:
{original_article}

**Summary**:
{summary}

Please evaluate the summary based on the following criteria, using a scale of 1 to 5 (1 being the lowest and 5 being the highest). Be critical in your evaluation and only give high scores for exceptional summaries:

1. **Categorization and Context**: Does the summary clearly identify the type or category of news (e.g., Politics, Technology, Sports) and provide appropriate context?
2. **Keyword and Tag Extraction**: Does the summary include relevant keywords or tags that accurately capture the main topics and themes of the article?
3. **Sentiment Analysis**: Does the summary accurately identify the overall sentiment of the article and provide a clear, well-supported explanation for this sentiment?
4. **Clarity and Structure**: Is the summary clear, well-organized, and structured in a way that makes it easy to understand the main points?
5. **Detail and Completeness**: Does the summary provide a detailed account that includes all necessary components (type of news, tags, sentiment) comprehensively?

Provide your scores and justifications for each criterion, ensuring a rigorous and detailed evaluation.
"""

class ScoreCard(BaseModel):
    justification: str
    categorization: int
    keyword_extraction: int
    sentiment_analysis: int
    clarity_structure: int
    detail_completeness: int
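
The `ScoreCard` schema types each score as an integer, but it does not itself enforce the 1 to 5 scale requested in the evaluation prompt. A minimal sketch of a post-hoc check, assuming a hypothetical helper `out_of_range_scores` that works on the dictionary form of a score card (for example, the output of `model_dump()`):

```python
SCORE_FIELDS = (
    "categorization",
    "keyword_extraction",
    "sentiment_analysis",
    "clarity_structure",
    "detail_completeness",
)


def out_of_range_scores(card: dict, low: int = 1, high: int = 5) -> dict:
    """Return the score fields whose values are missing or fall outside [low, high].

    Hypothetical helper, not part of the original notebook.
    """
    bad = {}
    for field in SCORE_FIELDS:
        value = card.get(field)
        if not isinstance(value, int) or not low <= value <= high:
            bad[field] = value
    return bad
```

An empty result means every score is on the expected scale; anything else flags an evaluation worth re-running or inspecting.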

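Here's a pro tip: you can use meta prompting to refine your evaluation prompt as well! Applying the same iterative improvement to the prompt that instructs the LLM judge can make your evaluations more precise and insightful.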
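
One way to apply this tip is to wrap the evaluation prompt in a meta-prompt, as was done for the summarization prompt, and then check that the rewritten prompt still contains the placeholders the evaluation code fills in. A minimal sketch: the meta-prompt wording and the helpers `build_eval_meta_prompt` and `missing_placeholders` are illustrative assumptions, and no API call is made here.

```python
REQUIRED_PLACEHOLDERS = ("{original_article}", "{summary}")


def build_eval_meta_prompt(evaluation_prompt: str) -> str:
    """Wrap an evaluation prompt in a request to improve it (illustrative wording)."""
    return (
        "Improve the following evaluation prompt so that each scoring criterion "
        "is unambiguous.\n"
        "Keep the placeholders {original_article} and {summary} exactly as written.\n\n"
        f"{evaluation_prompt}\n\n"
        "Only return the prompt."
    )


def missing_placeholders(prompt: str, required=REQUIRED_PLACEHOLDERS) -> list:
    """List required placeholders that a rewritten prompt no longer contains."""
    return [p for p in required if p not in prompt]
```

The string returned by `build_eval_meta_prompt(evaluation_prompt)` could be sent through `get_model_response`, and the rewritten prompt adopted only when `missing_placeholders` returns an empty list.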

Let's use this prompt to evaluate our summaries!

def score_summary(article, summary):
    # Ask the judge model to score one summary and parse the reply into a ScoreCard
    messages = [{"role": "user", "content": evaluation_prompt.format(original_article=article, summary=summary)}]
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=messages,
        response_format=ScoreCard)
    return completion.choices[0].message.parsed

def evaluate_summaries(row):
    # Score the summaries produced by the simple and the enhanced prompt
    simple_evaluation = score_summary(row["content"], row['simple_summary'])
    complex_evaluation = score_summary(row["content"], row['complex_summary'])
    return simple_evaluation, complex_evaluation

# Add new columns to the dataframe for storing evaluations
df['simple_evaluation'] = None
df['complex_evaluation'] = None

# Use ThreadPoolExecutor to evaluate summaries concurrently
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(evaluate_summaries, row): index for index, row in df.iterrows()}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Evaluating Summaries"):
        index = futures[future]
        simple_evaluation, complex_evaluation = future.result()
        df.at[index, 'simple_evaluation'] = simple_evaluation
        df.at[index, 'complex_evaluation'] = complex_evaluation

df.head()

Evaluating Summaries: 100%|██████████| 100/100 [01:42<00:00,  1.02s/it]

	title	published_date	authors	description	section	content	link	top_image	simple_summary	complex_summary	simple_evaluation	complex_evaluation
2662	Laura Whitmore: I was gaslighted after raising...	2024-08-04	https://www.facebook.com/bbcnews	The former Love Island host said that things s...	Culture	Television presenter Laura Whitmore has said t...	http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o	https://ichef.bbci.co.uk/ace/standard/2560/cps...	Television presenter Laura Whitmore has spoken...	1. Type of News: Entertainment/Television\...	categorization=4 keyword_extraction=3 sentimen...	categorization=5 keyword_extraction=5 sentimen...
1865	Errollyn Wallen appointed as Master of the Kin...	2024-08-25	https://www.facebook.com/bbcnews	She is best known for her work on the 2012 Par...	Culture	Celebrated composer and singer-songwriter Erro...	http://www.bbc.co.uk/news/articles/c4gl758g7zgo	https://ichef.bbci.co.uk/ace/standard/2560/cps...	Errollyn Wallen has been appointed Master of t...	1. Type of News: Arts/Music\n\n2. **Summar...	categorization=4 keyword_extraction=4 sentimen...	categorization=5 keyword_extraction=5 sentimen...
2554	SDLP: Matthew O'Toole endorses Claire Hanna fo...	2024-08-30	https://www.facebook.com/bbcnews	Matthew O'Toole had been named by some as a po...	Northern Ireland Politics	Matthew O'Toole leads his party's official opp...	http://www.bbc.co.uk/news/articles/cvg41j7xrzdo	https://ichef.bbci.co.uk/ace/standard/3840/cps...	Matthew O'Toole, the leader of the official op...	1. Type of News: Politics\n\n2. *Summary...	categorization=5 keyword_extraction=4 sentimen...	categorization=5 keyword_extraction=5 sentimen...
1338	Rotherham rioters among those jailed - BBC News	2024-08-20	https://www.facebook.com/bbcnews	Two men who were part of a mob targeting a Hol...	South Yorkshire	Rotherham pair among those jailed for UK rioti...	http://www.bbc.co.uk/news/articles/cwywggd7qw6o	https://ichef.bbci.co.uk/ace/standard/2560/cps...	Two men, Nathan Palmer (29) and Niven Matthewm...	1. Type of News: Politics / Crime and Just...	categorization=3 keyword_extraction=3 sentimen...	categorization=5 keyword_extraction=4 sentimen...
1232	BBC News - BBC iPlayer	2024-08-02	None	None	None	JavaScript seems to be disabled. Please enable...	http://www.bbc.co.uk/news/10318089		The article discusses the need to enable JavaS...	I cannot provide a summary of the article as t...	categorization=2 keyword_extraction=3 sentimen...	categorization=1 keyword_extraction=1 sentimen...

import matplotlib.pyplot as plt

df["simple_scores"] = df["simple_evaluation"].apply(lambda x: [score for key, score in x.model_dump().items() if key != 'justification'])
df["complex_scores"] = df["complex_evaluation"].apply(lambda x: [score for key, score in x.model_dump().items() if key != 'justification'])


# Display labels for the criteria, in the same order as the ScoreCard score fields
criteria = [
    'Categorisation',
    'Keywords and Tags',
    'Sentiment Analysis',
    'Clarity and Structure',
    'Detail and Completeness'
]

# Calculate the average score for each criterion, per prompt
simple_avg_scores = df['simple_scores'].apply(pd.Series).mean()
complex_avg_scores = df['complex_scores'].apply(pd.Series).mean()


# Prepare data for plotting
avg_scores_df = pd.DataFrame({
    'Criteria': criteria,
    'Original Prompt': simple_avg_scores,
    'Improved Prompt': complex_avg_scores
})

# Plotting
ax = avg_scores_df.plot(x='Criteria', kind='bar', figsize=(6, 4))
plt.ylabel('Average Score')
plt.title('Simple vs Complex Prompt: Average Score by Criterion')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.show()

[Figure: bar chart of the average score per criterion for the original and improved prompts]

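After evaluating the results, we found that while the basic prompt performed well on clarity and structure, the enhanced prompt markedly improved output quality on several other key criteria: categorization, keywords and tags, sentiment analysis, and detail and completeness. The summaries generated by the complex prompt were more informative, better organized, and richer in content.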

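This shows that refining a prompt can substantially improve the quality of the generated summaries. Although this is a simplified example, the benefits of prompt optimization are expected to be even more pronounced in real-world production applications, yielding outputs better aligned with specific goals and user needs.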
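
To put a number on the improvement, the per-criterion gap between the two prompts can be computed directly from the score lists. A minimal sketch in plain Python, assuming a hypothetical helper `mean_improvement`; the scores in the usage line are toy values, not results from this run:

```python
from statistics import mean


def mean_improvement(simple_scores, complex_scores, criteria):
    """Average (complex - simple) score per criterion.

    Each *_scores argument is a list of per-article score lists, ordered like `criteria`.
    Hypothetical helper, not part of the original notebook.
    """
    deltas = {}
    for i, name in enumerate(criteria):
        simple_avg = mean(row[i] for row in simple_scores)
        complex_avg = mean(row[i] for row in complex_scores)
        deltas[name] = complex_avg - simple_avg
    return deltas


# Toy values for two articles and two criteria (illustrative only)
toy = mean_improvement([[3, 4], [1, 4]], [[5, 4], [4, 5]], ["Categorisation", "Clarity"])
```

With the notebook's dataframe, `df["simple_scores"].tolist()` and `df["complex_scores"].tolist()` could be passed in together with the `criteria` list.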

Conclusion

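Meta prompting is a powerful technique that can significantly improve the quality of language model outputs. Our exploration showed that starting from a simple prompt and refining it with o1-preview produced summaries that were more informative, better organized, and richer in content, with gains in categorization, keywords and tags, sentiment analysis, and completeness. The exercise underscores the value of prompt optimization: even in this simplified example, the benefits are evident. In real-world applications, meta prompting and tools like o1-preview can raise language model performance and better meet your specific goals and user needs.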