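Leveraging model distillation to fine-tune a model

OpenAI recently released Distillation, which lets you use the outputs of a (large) model to fine-tune another (smaller) model. Moving to a smaller model this way can significantly reduce cost and latency for a specific task. In this cookbook we will take a dataset, distill the outputs of gpt-4o into gpt-4o-mini, and show how this yields much better results than a generic, non-distilled gpt-4o-mini.

We will also use Structured Outputs to frame a classification problem with an enumerated list of labels. We will see how fine-tuned models benefit from Structured Outputs and how it affects performance, and show that Structured Outputs works with all of these models, including the distilled one.

We will first analyze the dataset, get the outputs of both gpt-4o and gpt-4o-mini, highlight the difference in performance between the two models, then run the distillation and analyze the performance of the distilled model.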

Prerequisites

Let's install and load the dependencies. Make sure your OpenAI API key is defined in your environment as "OPENAI_API_KEY"; the client will load it directly.
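Before creating the client, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `api_key_is_configured` is hypothetical and not part of the OpenAI SDK.

```python
import os

def api_key_is_configured(env=os.environ, name="OPENAI_API_KEY"):
    """Return True when the variable is set to a non-empty value.

    Hypothetical helper for illustration; it does not check that the key is valid.
    """
    return bool(env.get(name, "").strip())

if not api_key_is_configured():
    print("OPENAI_API_KEY is not set; the client will not be able to authenticate.")
```

This only checks that the variable is present; whether the key is accepted is only known once a request is sent.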

! pip install openai tiktoken numpy pandas tqdm --quiet

import openai
import json
import tiktoken
from tqdm import tqdm
from openai import OpenAI
import numpy as np
import concurrent.futures
import pandas as pd

client = OpenAI()

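Loading and understanding the dataset

For this cookbook, we will load data from the following Kaggle challenge: https://www.kaggle.com/datasets/zynicide/wine-reviews.

This dataset has many rows and you are free to run this cookbook on all of it, but as a biased French wine lover, I will narrow it down to French wines only, to focus on fewer rows and fewer grape varieties.

We are looking at a classification problem: we want to guess the grape variety from all the other available criteria, including the description, sub-region, and province, which we will include in the prompt. That gives the model a lot of information; feel free to remove some of it, such as the wine's region, and see how well the model still finds the grape variety.

Let's filter out the grape varieties that appear fewer than 5 times in the reviews.

Let's then continue with a random subset of 500 rows from this dataset.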

df = pd.read_csv('data/winemag/winemag-data-130k-v2.csv')
df_france = df[df['country'] == 'France']

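# We also filter out wines whose grape variety is referenced fewer than 5 times. Even though we would like to find those,
# they are outliers we do not want to optimize for: they would make our enum list too long
# and could add noise to the rest of the dataset we are trying to guess, eventually reducing our accuracy.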

varieties_less_than_five_list = df_france['variety'].value_counts()[df_france['variety'].value_counts() < 5].index.tolist()
df_france = df_france[~df_france['variety'].isin(varieties_less_than_five_list)]

df_france_subset = df_france.sample(n=500)
df_france_subset.head()

	Unnamed: 0	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
95206	95206	France	Full, fat, ripe, perfumed wine that is full of...	Château de Mercey Premier Cru	91	35.0	Burgundy	Mercurey	NaN	Roger Voss	@vossroger	Antonin Rodet 2010 Château de Mercey Premier C...	Pinot Noir	Antonin Rodet
66403	66403	France	For simple Chablis, this is impressive, rich, ...	Domaine	89	26.0	Burgundy	Chablis	NaN	Roger Voss	@vossroger	William Fèvre 2005 Domaine (Chablis)	Chardonnay	William Fèvre
71277	71277	France	This 50-50 blend of Marselan and Merlot opens ...	La Remise	84	13.0	France Other	Vin de France	NaN	Lauren Buzzeo	@laurbuzz	Domaine de la Mordorée 2014 La Remise Red (Vin...	Red Blend	Domaine de la Mordorée
27484	27484	France	The medium-intense nose of this solid and easy...	Authentic & Chic	86	10.0	France Other	Vin de France	NaN	Lauren Buzzeo	@laurbuzz	Romantic 2014 Authentic & Chic Cabernet Sauvig...	Cabernet Sauvignon	Romantic
124917	124917	France	Fresh, pure notes of Conference pear peel enti...	NaN	89	30.0	Alsace	Alsace	NaN	Anne Krebiehl MW	@AnneInVino	Domaine Vincent Stoeffler 2015 Pinot Gris (Als...	Pinot Gris	Domaine Vincent Stoeffler
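The filter-then-sample step can be tried on a toy frame first. This is a sketch under stated assumptions: `drop_rare_classes` is a hypothetical helper, the toy data is made up, and `random_state=42` is an arbitrary seed added so that the sample is reproducible (the cookbook's own run did not fix a seed).

```python
import pandas as pd

def drop_rare_classes(df, column, min_count):
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[column].isin(keep)]

# Toy data standing in for the wine reviews (illustrative only)
toy = pd.DataFrame({"variety": ["Gamay"] * 5 + ["Syrah"] * 2 + ["Merlot"] * 6})

filtered = drop_rare_classes(toy, "variety", min_count=5)

# A fixed seed returns the same rows on every run
sample = filtered.sample(n=4, random_state=42)
print(filtered["variety"].value_counts())
```

Fixing the seed makes it easier to compare runs, since accuracy figures otherwise shift with each new random subset.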

Let's retrieve all the grape varieties so we can include them in the prompt and in our Structured Outputs enum list.

varieties = np.array(df_france['variety'].unique()).astype('str')
varieties

array(['Gewürztraminer', 'Pinot Gris', 'Gamay',
       'Bordeaux-style White Blend', 'Champagne Blend', 'Chardonnay',
       'Petit Manseng', 'Riesling', 'White Blend', 'Pinot Blanc',
       'Alsace white blend', 'Bordeaux-style Red Blend', 'Malbec',
       'Tannat-Cabernet', 'Rhône-style Red Blend', 'Ugni Blanc-Colombard',
       'Savagnin', 'Pinot Noir', 'Rosé', 'Melon',
       'Rhône-style White Blend', 'Pinot Noir-Gamay', 'Colombard',
       'Chenin Blanc', 'Sylvaner', 'Sauvignon Blanc', 'Red Blend',
       'Chenin Blanc-Chardonnay', 'Cabernet Sauvignon', 'Cabernet Franc',
       'Syrah', 'Sparkling Blend', 'Duras', 'Provence red blend',
       'Tannat', 'Merlot', 'Malbec-Merlot', 'Chardonnay-Viognier',
       'Cabernet Franc-Cabernet Sauvignon', 'Muscat', 'Viognier',
       'Picpoul', 'Altesse', 'Provence white blend', 'Mondeuse',
       'Grenache-Syrah', 'G-S-M', 'Pinot Meunier', 'Cabernet-Syrah',
       'Vermentino', 'Marsanne', 'Colombard-Sauvignon Blanc',
       'Gros and Petit Manseng', 'Jacquère', 'Negrette', 'Mauzac',
       'Pinot Auxerrois', 'Grenache', 'Roussanne', 'Gros Manseng',
       'Tannat-Merlot', 'Aligoté', 'Chasselas', "Loin de l'Oeil",
       'Malbec-Tannat', 'Carignan', 'Colombard-Ugni Blanc', 'Sémillon',
       'Syrah-Grenache', 'Sciaccerellu', 'Auxerrois', 'Mourvèdre',
       'Tannat-Cabernet Franc', 'Braucol', 'Trousseau',
       'Merlot-Cabernet Sauvignon'], dtype='<U33')

Generating the prompt

Let's build a function that generates our prompt, and try it on the first wine in the list.

def generate_prompt(row, varieties):
    # Format the list of varieties as a comma-separated string
    variety_list = ', '.join(varieties)

    prompt = f"""
    Based on this wine review, guess the grape variety:
    This wine is produced by {row['winery']} in the {row['province']} region of {row['country']}.
    It was grown in {row['region_1']}. It is described as: "{row['description']}".
    The wine has been reviewed by {row['taster_name']} and received {row['points']} points.
    The price is {row['price']}.

    Here is a list of possible grape varieties to choose from: {variety_list}.

    What is the likely grape variety? Answer only with the grape variety name or blend from the list.
    """
    return prompt

# Example usage with a specific row
prompt = generate_prompt(df_france.iloc[0], varieties)
prompt

'\n    Based on this wine review, guess the grape variety:\n    This wine is produced by Trimbach in the Alsace region of France.\n    It was grown in Alsace. It is described as: "This dry and restrained wine offers spice in profusion. Balanced with acidity and a firm texture, it\'s very much for food.".\n    The wine has been reviewed by Roger Voss and received 87 points.\n    The price is 24.0.\n\n    Here is a list of possible grape varieties to choose from: Gewürztraminer, Pinot Gris, Gamay, Bordeaux-style White Blend, Champagne Blend, Chardonnay, Petit Manseng, Riesling, White Blend, Pinot Blanc, Alsace white blend, Bordeaux-style Red Blend, Malbec, Tannat-Cabernet, Rhône-style Red Blend, Ugni Blanc-Colombard, Savagnin, Pinot Noir, Rosé, Melon, Rhône-style White Blend, Pinot Noir-Gamay, Colombard, Chenin Blanc, Sylvaner, Sauvignon Blanc, Red Blend, Chenin Blanc-Chardonnay, Cabernet Sauvignon, Cabernet Franc, Syrah, Sparkling Blend, Duras, Provence red blend, Tannat, Merlot, Malbec-Merlot, Chardonnay-Viognier, Cabernet Franc-Cabernet Sauvignon, Muscat, Viognier, Picpoul, Altesse, Provence white blend, Mondeuse, Grenache-Syrah, G-S-M, Pinot Meunier, Cabernet-Syrah, Vermentino, Marsanne, Colombard-Sauvignon Blanc, Gros and Petit Manseng, Jacquère, Negrette, Mauzac, Pinot Auxerrois, Grenache, Roussanne, Gros Manseng, Tannat-Merlot, Aligoté, Chasselas, Loin de l\'Oeil, Malbec-Tannat, Carignan, Colombard-Ugni Blanc, Sémillon, Syrah-Grenache, Sciaccerellu, Auxerrois, Mourvèdre, Tannat-Cabernet Franc, Braucol, Trousseau, Merlot-Cabernet Sauvignon.\n    \n    What is the likely grape variety? Answer only with the grape variety name or blend from the list.\n    '
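Some columns in the dataset contain NaN (see `designation` and `region_2` in the sample rows). If a field used in the prompt is missing, the f-string renders it literally as "nan". One possible way to handle this, sketched here with a hypothetical `clean_field` helper and an arbitrary "unknown" fallback, is to normalize each value before interpolating it:

```python
import math

def clean_field(value, fallback="unknown"):
    """Return a printable string, replacing None/NaN with a fallback.

    Hypothetical helper; the fallback wording is an assumption.
    """
    if value is None:
        return fallback
    if isinstance(value, float) and math.isnan(value):
        return fallback
    return str(value)

print(clean_field(float("nan")))  # missing price or region
print(clean_field("Alsace"))
```

Inside `generate_prompt`, a field could then be written as `{clean_field(row['region_1'])}`. Whether this changes accuracy was not measured in this cookbook.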

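To get a sense of the cost before running the queries, you can use tiktoken to count the tokens we will send and estimate the associated cost. This only estimates the cost of running the completions, not the fine-tuning process (used later in this cookbook when we run the distillation), which depends on other factors such as the number of epochs and the size of the training set.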

# Load the encoding for the GPT-4o model
enc = tiktoken.encoding_for_model("gpt-4o")

# Initialize a variable to store the total number of tokens
total_tokens = 0

for index, row in df_france_subset.iterrows():
    prompt = generate_prompt(row, varieties)

    # Tokenize the input text and count the tokens
    tokens = enc.encode(prompt)
    token_count = len(tokens)

    # Add the token count to the total
    total_tokens += token_count

print(f"Total number of tokens in the dataset: {total_tokens}")
print(f"Total number of prompts: {len(df_france_subset)}")

Total number of tokens in the dataset: 245439
Total number of prompts: 500

# Estimated input cost (prices in USD as of 2024/10/16)

gpt4o_token_price = 2.50 / 1_000_000  # $2.50 per 1M tokens
gpt4o_mini_token_price = 0.150 / 1_000_000  # $0.150 per 1M tokens

total_gpt4o_cost = gpt4o_token_price*total_tokens
total_gpt4o_mini_cost = gpt4o_mini_token_price*total_tokens

print(total_gpt4o_cost)
print(total_gpt4o_mini_cost)

0.6135975
0.03681585
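The same arithmetic can be wrapped in a small function to compare models side by side. This is a sketch: `estimate_input_cost` is a hypothetical helper, the prices are the ones quoted in this cookbook, and only prompt tokens are counted (output tokens are billed separately and are not included).

```python
def estimate_input_cost(total_tokens, usd_per_million_tokens):
    """Cost in USD of sending `total_tokens` prompt tokens at the given price."""
    return total_tokens * usd_per_million_tokens / 1_000_000

# Token count measured in this cookbook for the 500 prompts
cost_4o = estimate_input_cost(245_439, 2.50)
cost_mini = estimate_input_cost(245_439, 0.150)

print(f"gpt-4o: ${cost_4o:.4f}")
print(f"gpt-4o-mini: ${cost_mini:.4f}")
print(f"ratio: {cost_4o / cost_mini:.1f}x")
```

At these prices the prompt side of the gpt-4o run costs about 16.7 times as much as the gpt-4o-mini run, which is the gap distillation aims to close.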

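Preparing functions to store completions

Since we are working with a limited list of responses (an enumerated list of grape varieties), let's use Structured Outputs to make sure the model answers from this list. This also lets us compare the model's answer directly with the actual grape variety and get an answer in a predictable format (as opposed to a model that might answer "I think the grape is Pinot Noir" rather than just "Pinot Noir"), while also improving performance by ruling out grape varieties that are not in our dataset.

If you want to learn more about Structured Outputs, you can read the dedicated cookbook and the documentation guide.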

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "grape-variety",
        "schema": {
            "type": "object",
            "properties": {
                "variety": {
                    "type": "string",
                    "enum": varieties.tolist()
                }
            },
            "additionalProperties": False,
            "required": ["variety"],
        },
        "strict": True
    }
}
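The same schema shape can be generated for any label list, and the returned JSON can be checked locally before it is compared with the ground truth. The following is a sketch: `build_enum_response_format` and `parse_enum_answer` are hypothetical helpers, not part of the OpenAI SDK, and the two-item label list is a toy example.

```python
import json

def build_enum_response_format(name, field, choices):
    """Build a strict JSON-schema response format with a single enum field."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "schema": {
                "type": "object",
                "properties": {field: {"type": "string", "enum": list(choices)}},
                "additionalProperties": False,
                "required": [field],
            },
            "strict": True,
        },
    }

def parse_enum_answer(raw_json, response_format, field):
    """Parse the model's JSON string and confirm the value is an allowed label."""
    allowed = response_format["json_schema"]["schema"]["properties"][field]["enum"]
    value = json.loads(raw_json)[field]
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of the allowed labels")
    return value

rf = build_enum_response_format("grape-variety", "variety", ["Gamay", "Syrah"])
print(parse_enum_answer('{"variety": "Gamay"}', rf, "variety"))
```

With strict mode the local check should rarely fail, but it is a cheap guard when the same parsing code is reused with models or settings that do not enforce the schema.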

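To distill a model, you need to store all of the model's completions so that you can give them to the smaller model as a reference to fine-tune on. We therefore add the store=True parameter to the client.chat.completions.create call, so that the completions from gpt-4o are stored.

We will store all completions (including those from gpt-4o-mini and from our future fine-tuned model) so that we can run Evals directly from the OpenAI platform.

When storing these completions, it is useful to attach a metadata tag: this lets you filter them on the OpenAI platform and run distillation and evals on the specific set of completions you care about.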

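# Metadata tag used to find these stored completions later
metadata_value = "wine-distillation" # this is a fun metadata tag :-)

# Function that calls the API and returns a single model result (a blocking call in this case)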
def call_model(model, prompt):
    response = client.chat.completions.create(
        model=model,
        store=True,
        metadata={
            "distillation": metadata_value,
        },
        messages=[
            {
                "role": "system",
                "content": "You're a sommelier expert and you know everything about wine. You answer precisely with the name of the variety/blend."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        response_format=response_format
    )
    return json.loads(response.choices[0].message.content.strip())['variety']

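Parallel processing

Since we will run this on many rows, let's run the completions in parallel using concurrent.futures. We will iterate over our dataframe and report progress with a tqdm progress bar. Using the column name {model}-variety, we store the results of each model we run in the same dataframe.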

def process_example(index, row, model, df, progress_bar):
    try:
        # Generate the prompt from the row
        prompt = generate_prompt(row, varieties)

        # Store the prediction in the {model}-variety column of the same dataframe
        df.at[index, model + "-variety"] = call_model(model, prompt)

        # Update the progress bar
        progress_bar.update(1)
    except Exception as e:
        print(f"Error processing model {model}: {str(e)}")

def process_dataframe(df, model):
    # Create a progress bar with tqdm
    with tqdm(total=len(df), desc="Processing rows") as progress_bar:
        # Process each example concurrently with a ThreadPoolExecutor
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = {executor.submit(process_example, index, row, model, df, progress_bar): index for index, row in df.iterrows()}

            for future in concurrent.futures.as_completed(futures):
                try:
                    future.result()  # Wait for each example to be processed
                except Exception as e:
                    print(f"Error processing example: {str(e)}")

    return df
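The submit / as_completed pattern can be tried without any API call by swapping in a stub worker. This is a sketch: `fake_model_call` and `run_in_parallel` are hypothetical names, and `max_workers=8` is an arbitrary choice (the cookbook's code leaves the pool size at its default).

```python
import concurrent.futures

def fake_model_call(prompt):
    """Stand-in for a blocking API call (illustrative only)."""
    return prompt.upper()

def run_in_parallel(prompts, worker, max_workers=8):
    """Run `worker` on every prompt concurrently and return results in input order."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the position of its prompt
        futures = {executor.submit(worker, p): i for i, p in enumerate(prompts)}
        for future in concurrent.futures.as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole batch
                results[i] = None
                print(f"Prompt {i} failed: {exc}")
    return [results[i] for i in range(len(prompts))]

print(run_in_parallel(["gamay", "syrah", "merlot"], fake_model_call))
```

Capping `max_workers` is one way to stay under API rate limits; a suitable value depends on your account's limits and is not specified here.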

Let's try calling our model function and check the output before processing the whole dataframe.

answer = call_model('gpt-4o', generate_prompt(df_france_subset.iloc[0], varieties))
answer

'Pinot Noir'

Great! We confirmed that we get a grape variety as output. Now let's process the dataset with both gpt-4o and gpt-4o-mini and compare the results.

df_france_subset = process_dataframe(df_france_subset, "gpt-4o")

Processing rows: 100%|███████████████████████████████████████████████| 500/500 [00:41<00:00, 12.09it/s]

df_france_subset = process_dataframe(df_france_subset, "gpt-4o-mini")

Processing rows: 100%|███████████████████████████████████████████████| 500/500 [01:31<00:00,  5.45it/s]

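Comparing gpt-4o and gpt-4o-mini

Now that we have chat completions from both models, let's compare them against the expected grape variety and assess how accurately each model finds it. We do this directly in Python here since we only have a simple string check to run, but if your task involves more complex evals you can use OpenAI Evals or our open-source eval framework.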

models = ['gpt-4o', 'gpt-4o-mini']

def get_accuracy(model, df):
    return np.mean(df['variety'] == df[model + '-variety'])

for model in models:
    print(f"{model} accuracy: {get_accuracy(model, df_france_subset) * 100:.2f}%")

gpt-4o accuracy: 81.80%
gpt-4o-mini accuracy: 69.00%

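We can see that gpt-4o is better at finding the grape variety than gpt-4o-mini (12.80 percentage points higher, or roughly 18.6% relative to gpt-4o-mini!). Now I am wondering whether we had gpt-4o drink wine during training!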
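The gap between two accuracy figures can be expressed either in percentage points or relative to the baseline, and the two are easy to mix up. A small sketch of the arithmetic, using the accuracies printed for this run (`improvement` is a hypothetical helper):

```python
def improvement(baseline_pct, new_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = new_pct - baseline_pct
    relative = points / baseline_pct * 100
    return points, relative

# gpt-4o-mini (baseline) vs gpt-4o on the 500-row subset
points, relative = improvement(69.00, 81.80)
print(f"{points:.2f} points, {relative:.1f}% relative")
```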

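Distilling gpt-4o outputs into gpt-4o-mini

Suppose we want to run this prediction often: we want completions to be faster and cheaper while keeping this level of accuracy. It would be great to distill gpt-4o's accuracy into gpt-4o-mini, wouldn't it? Let's do it!

We now go to the stored completions page on the OpenAI platform: https://platform.openai.com/chat-completions.

Let's select the gpt-4o model (make sure to do this; you do not want to distill the outputs of the gpt-4o-mini runs). Let's also select the metadata distillation: wine-distillation to get only the stored completions produced by this cookbook.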

[Image: Filtering out completions]

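Once you have selected the completions, click "Distill" in the top right corner to fine-tune a model on them. A file used to run the fine-tuning process is then created automatically. We then pick gpt-4o-mini as the base model and keep the default parameters (feel free to change them or iterate on them to improve performance).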

[Image: Distilling modal]

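Once the fine-tuning job has started, you can retrieve its job ID from the fine-tuning page. We will use it to monitor the status of the job and, once it has finished, to retrieve the ID of the fine-tuned model.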

[Image: Fine tuning job]

# Copy and paste your fine-tuning job ID below
finetune_job = client.fine_tuning.jobs.retrieve("ftjob-pRyNWzUItmHpxmJ1TX7FOaWe")

if finetune_job.status == 'succeeded':
    fine_tuned_model = finetune_job.fine_tuned_model
    print('finetuned model: ' + fine_tuned_model)
else:
    print('finetuned job status: ' + finetune_job.status)

finetuned model: ft:gpt-4o-mini-2024-07-18:distillation-test:wine-distillation:AIZntSyE
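Fine-tuning can take a while, so instead of re-running the cell by hand you may prefer to poll until the job reaches a final state. The following is a sketch: `wait_for_job` is a hypothetical helper, the polling interval and limit are arbitrary, and the retrieval function is passed in so that the loop can be exercised without network access.

```python
import time

# Statuses after which a fine-tuning job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve_job, job_id, poll_seconds=30, max_polls=240, sleep=time.sleep):
    """Call `retrieve_job(job_id)` until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

# With the client from this cookbook (not executed here):
# finetune_job = wait_for_job(client.fine_tuning.jobs.retrieve, "<your fine-tuning job ID>")
# if finetune_job.status == "succeeded":
#     fine_tuned_model = finetune_job.fine_tuned_model
```

Always check the returned status before using `fine_tuned_model`, since a failed or cancelled job ends the loop as well.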

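Running completions with the distilled model

Now that we have a fine-tuned model, we can use it to run completions and compare it against both gpt-4o and gpt-4o-mini. Let's take another subset of French wines (since we restricted the outputs to French grape varieties without outliers, the validation dataset needs to focus on those as well). Let's run 300 entries for each model.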

validation_dataset = df_france.sample(n=300)

models.append(fine_tuned_model)

for model in models:
    another_subset = process_dataframe(validation_dataset, model)

Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:20<00:00, 14.69it/s]
Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:27<00:00, 10.99it/s]
Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:37<00:00,  8.08it/s]
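Note that `validation_dataset` is sampled from all of `df_france`, so some of its rows may also be among the 500 rows whose gpt-4o completions were used for distillation. If you want a strictly held-out set, one option is to drop the training rows before sampling. This is a sketch on toy data; `sample_holdout` is a hypothetical helper and it assumes the dataframe index is unique, as it is when loaded with `pd.read_csv`.

```python
import pandas as pd

def sample_holdout(full_df, train_df, n, random_state=None):
    """Sample n rows from full_df, excluding any row already in train_df (by index)."""
    remaining = full_df.drop(index=train_df.index)
    return remaining.sample(n=n, random_state=random_state)

# Toy data standing in for df_france and df_france_subset (illustrative only)
full = pd.DataFrame({"variety": ["Gamay", "Syrah"] * 10})
train = full.sample(n=8, random_state=0)

holdout = sample_holdout(full, train, n=6, random_state=1)
print(holdout.index.intersection(train.index).empty)
```

In this cookbook the equivalent call would be `sample_holdout(df_france, df_france_subset, n=300)`. The accuracy figures reported here were produced without this exclusion.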

Let's compare the accuracy of the models.

for model in models:
    print(f"{model} accuracy: {get_accuracy(model, another_subset) * 100:.2f}%")

gpt-4o accuracy: 79.67%
gpt-4o-mini accuracy: 64.67%
ft:gpt-4o-mini-2024-07-18:distillation-test:wine-distillation:AIZntSyE accuracy: 79.33%

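That is a relative improvement of about 22.7% over the non-distilled gpt-4o-mini (14.66 percentage points)! 🎉

Our fine-tuned model performs far better than gpt-4o-mini while sharing the same base model, and on this validation set it comes within about a third of a percentage point of gpt-4o (79.33% vs 79.67%). We will be able to use this model to run inference at lower cost and lower latency for future grape variety predictions.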
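Besides accuracy against the true labels, it can be informative to measure how often the distilled model gives the same answer as its teacher. This metric was not computed in this cookbook; the sketch below uses a hypothetical `agreement_rate` helper and toy answers.

```python
def agreement_rate(teacher_answers, student_answers):
    """Fraction of positions where the student gives the same answer as the teacher."""
    pairs = list(zip(teacher_answers, student_answers))
    if not pairs:
        raise ValueError("No answers to compare")
    return sum(t == s for t, s in pairs) / len(pairs)

# Toy answers (illustrative only)
teacher = ["Gamay", "Syrah", "Merlot", "Chardonnay"]
student = ["Gamay", "Syrah", "Malbec", "Chardonnay"]
print(f"{agreement_rate(teacher, student) * 100:.0f}% agreement")
```

Since the function accepts any two iterables of equal length, it can be called on dataframe columns, for example `agreement_rate(another_subset["gpt-4o-variety"], another_subset[fine_tuned_model + "-variety"])`.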