Batch processing with the Batch API

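The new Batch API lets you create asynchronous batch jobs at a lower price and with higher rate limits.

Batch jobs complete within 24 hours, and may be processed sooner depending on global usage.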

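Ideal use cases for the Batch API include:

Tagging, captioning, or enriching content on a marketplace or blog
Categorizing support tickets and suggesting answers
Performing sentiment analysis on large datasets of customer feedback
Generating summaries or translations for collections of documents or articles

and much more!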

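This guide walks through a couple of practical examples of how to use the Batch API.

We will start with an example that categorizes movies using gpt-4o-mini, then show how to use the same model's vision capabilities to caption images.

Note that the Batch API supports multiple models, and that a Batch API call accepts the same parameters as the Chat Completions endpoint.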

Setup

# Make sure you have the latest version of the SDK to use the Batch API
%pip install openai --upgrade

import json
from openai import OpenAI
import pandas as pd
from IPython.display import Image, display

# Initialize the OpenAI client - see https://platform.openai.com/docs/quickstart?context=python
client = OpenAI()

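First example: Categorizing movies

In this example, we will use gpt-4o-mini to extract movie categories from a movie description. We will also extract a one-sentence summary from the same description.

We will use JSON mode to get the categories as an array of strings and the one-sentence summary in a structured format.

For each movie, we want a result that looks like this:

{
    categories: ['category1', 'category2', 'category3'],
    summary: '1-sentence summary'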
}
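
JSON mode is about producing parseable JSON; it does not by itself enforce a particular set of keys, so it can be worth checking each reply against the shape above before using it. The following is a minimal sketch of such a check. The helper name parse_movie_result and the normalization step are assumptions made for illustration, not part of the API.

```python
import json


def parse_movie_result(raw: str) -> dict:
    """Parse one JSON-mode reply and check it matches the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError if the text is not valid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    categories = data.get("categories")
    summary = data.get("summary")
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        raise ValueError("'categories' must be a list of strings")
    if not isinstance(summary, str):
        raise ValueError("'summary' must be a string")
    # Normalize category names so "Crime" and "crime " compare equal downstream.
    return {
        "categories": [c.strip().lower() for c in categories],
        "summary": summary.strip(),
    }


# Hypothetical reply text, used only to exercise the helper.
example = '{"categories": ["Crime", "drama "], "summary": "An aging crime lord hands over his empire."}'
parsed = parse_movie_result(example)
print(parsed["categories"])  # ['crime', 'drama']
```

Raising on a malformed reply, rather than silently skipping it, makes it easier to spot prompts that need adjusting before a large batch is submitted.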

Loading data

We will use the IMDB top 1000 movies dataset for this example.

dataset_path = "data/imdb_top_1000.csv"

df = pd.read_csv(dataset_path)
df.head()

	Poster_Link	Series_Title	Released_Year	Certificate	Runtime	Genre	IMDB_Rating	Overview	Meta_score	Director	Star1	Star2	Star3	Star4	No_of_Votes	Gross
0	https://m.media-amazon.com/images/M/MV5BMDFkYT...	The Shawshank Redemption	1994	A	142 min	Drama	9.3	Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.	80.0	Frank Darabont	Tim Robbins	Morgan Freeman	Bob Gunton	William Sadler	2343110	28,341,469
1	https://m.media-amazon.com/images/M/MV5BM2MyNj...	The Godfather	1972	A	175 min	Crime, Drama	9.2	An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.	100.0	Francis Ford Coppola	Marlon Brando	Al Pacino	James Caan	Diane Keaton	1620367	134,966,411
2	https://m.media-amazon.com/images/M/MV5BMTMxNT...	The Dark Knight	2008	UA	152 min	Action, Crime, Drama	9.0	When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.	84.0	Christopher Nolan	Christian Bale	Heath Ledger	Aaron Eckhart	Michael Caine	2303232	534,858,444
3	https://m.media-amazon.com/images/M/MV5BMWMwMG...	The Godfather: Part II	1974	A	202 min	Crime, Drama	9.0	The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.	90.0	Francis Ford Coppola	Al Pacino	Robert De Niro	Robert Duvall	Diane Keaton	1129952	57,300,000
4	https://m.media-amazon.com/images/M/MV5BMWU4N2...	12 Angry Men	1957	U	96 min	Crime, Drama	9.0	A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.	96.0	Sidney Lumet	Henry Fonda	Lee J. Cobb	Martin Balsam	John Fiedler	689845	4,360,000

Processing step

Here, we will first try out our requests with the Chat Completions endpoint.

Once we are happy with the results, we can move on to creating the batch file.

categorize_system_prompt = '''
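Your goal is to extract movie categories from movie descriptions, as well as a 1-sentence summary for these movies.
You will be provided with a movie description, and you will output a json object containing the following information:

{
    categories: string[] // Array of categories based on the movie description,
    summary: string // 1-sentence summary of the movie based on the movie description
}

Categories refer to the genre or type of the movie, like "action", "romance", "comedy", etc. Keep category names simple and use only lower case letters.
Movies can have several categories, but try to keep it under 3-4. Only mention the categories that are the most obvious based on the description.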
'''

def get_categories(description):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,
        # This enables JSON mode, which ensures the response is a valid json object
        response_format={
            "type": "json_object"
        },
        messages=[
            {
                "role": "system",
                "content": categorize_system_prompt
            },
            {
                "role": "user",
                "content": description
            }
        ],
    )

    return response.choices[0].message.content

# Test on a few examples
for _, row in df[:5].iterrows():
    description = row['Overview']
    title = row['Series_Title']
    result = get_categories(description)
    print(f"Title: {title}\nOverview: {description}\n\nResult: {result}")
    print("\n\n----------------------------\n\n")

Title: The Shawshank Redemption
Overview: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.

Result: {
    "categories": ["drama"],
    "summary": "Two imprisoned men develop a deep bond over the years, ultimately finding redemption through their shared acts of kindness."
}


----------------------------


Title: The Godfather
Overview: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Result: {
    "categories": ["crime", "drama"],
    "summary": "An aging crime lord hands over his empire to his hesitant son."
}


----------------------------


Title: The Dark Knight
Overview: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

Result: {
    "categories": ["action", "thriller", "superhero"],
    "summary": "Batman faces a formidable challenge as the Joker unleashes chaos on Gotham City."
}


----------------------------


Title: The Godfather: Part II
Overview: The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.

Result: {
    "categories": ["crime", "drama"],
    "summary": "The film depicts the early life of Vito Corleone and the rise of his son Michael within the family crime syndicate in 1920s New York City."
}


----------------------------


Title: 12 Angry Men
Overview: A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.

Result: {
    "categories": ["drama", "thriller"],
    "summary": "A jury holdout fights to ensure justice is served by challenging his fellow jurors to reevaluate the evidence."
}


----------------------------

Creating the batch file

The batch file is in jsonl format and should contain one request (a json object) per line. Each request is defined as follows:

{
    "custom_id": <REQUEST_ID>,
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": <MODEL>,
        "messages": <MESSAGES>,
        // other parameters
    }
}

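Note: the request ID should be unique within the batch. You can use it to match results back to the initial input file, since results are not returned in the same order as the requests.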
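
Because uniqueness of custom_id is what makes results matchable, a quick local check before uploading can catch mistakes early. The sketch below wraps the request shape described above in a hypothetical build_task helper and scans jsonl lines for repeated IDs. Both function names are assumptions for illustration, and the request bodies are placeholders.

```python
import json


def build_task(custom_id: str, body: dict) -> dict:
    """Wrap a Chat Completions request body in the batch request envelope."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    }


def find_duplicate_ids(jsonl_lines) -> list:
    """Return the sorted custom_id values that appear more than once."""
    seen, duplicates = set(), set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        custom_id = json.loads(line)["custom_id"]
        if custom_id in seen:
            duplicates.add(custom_id)
        seen.add(custom_id)
    return sorted(duplicates)


# Placeholder request bodies, used only to exercise the helpers.
bodies = [
    {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": text}]}
    for text in ("first description", "second description")
]
lines = [json.dumps(build_task(f"task-{i}", body)) for i, body in enumerate(bodies)]

print(find_duplicate_ids(lines))              # []
print(find_duplicate_ids(lines + lines[:1]))  # ['task-0']
```

The same check can be run on a file that has already been written by passing the open file object, since iterating over it yields one line at a time.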

# Create an array of json tasks

tasks = []

for index, row in df.iterrows():

    description = row['Overview']

    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-4o-mini",
            "temperature": 0.1,
            "response_format": { 
                "type": "json_object"
            },
            "messages": [
                {
                    "role": "system",
                    "content": categorize_system_prompt
                },
                {
                    "role": "user",
                    "content": description
                }
            ],
        }
    }

    tasks.append(task)

# Create the file

file_name = "data/batch_tasks_movies.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')

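Uploading the file

# Open the file in a context manager so the handle is closed after the upload
with open(file_name, "rb") as file:
    batch_file = client.files.create(
        file=file,
        purpose="batch"
    )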

print(batch_file)

FileObject(id='file-lx16f1KyIxQ2UHVvkG3HLfNR', bytes=1127310, created_at=1721144107, filename='batch_tasks_movies.jsonl', object='file', purpose='batch', status='processed', status_details=None)

Creating the batch job

batch_job = client.batches.create(
  input_file_id=batch_file.id,
  endpoint="/v1/chat/completions",
  completion_window="24h"
)

Checking the batch status

Note: this can take up to 24 hours, but it usually completes faster.

You can keep checking until the status is "completed".

batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)
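
Rather than re-running the cell by hand, the check can be wrapped in a small polling loop. This is a minimal sketch, not an official helper: wait_for_batch, its parameters, and the 60-second default interval are assumptions. It takes the retrieve function as an argument so it can be tried out without calling the API, and it stops on the terminal statuses "completed", "failed", "expired" and "cancelled".

```python
import time

# Statuses after which the batch will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, poll_seconds=60.0, max_polls=None, sleep=time.sleep):
    """Call retrieve(batch_id) until the returned job reaches a terminal status.

    retrieve is any callable returning an object with a .status attribute,
    for example client.batches.retrieve. Raises TimeoutError if max_polls
    checks pass without reaching a terminal status.
    """
    polls = 0
    while True:
        job = retrieve(batch_id)
        if job.status in TERMINAL_STATUSES:
            return job
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"batch {batch_id} still {job.status!r} after {polls} checks")
        sleep(poll_seconds)


# Intended usage in this notebook (requires the client and batch_job defined earlier):
# batch_job = wait_for_batch(client.batches.retrieve, batch_job.id)
```

Callers should still inspect the returned status, since a job that ends as "failed" or "expired" will not have a complete output file.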

Retrieving results

result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content

result_file_name = "data/batch_job_results_movies.jsonl"

with open(result_file_name, 'wb') as file:
    file.write(result)

# Load the data from the saved file
results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parse the JSON string into a dict and append it to the list of results
        json_object = json.loads(line.strip())
        results.append(json_object)

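Reading results

Reminder: the results are not in the same order as in the input file. Make sure to check the custom_id to match each result against its input request.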

# Read only the first few results
for res in results[:5]:
    task_id = res['custom_id']
    # Get the index from the task id
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    movie = df.iloc[int(index)]
    description = movie['Overview']
    title = movie['Series_Title']
    print(f"Title: {title}\nOverview: {description}\n\nResult: {result}")
    print("\n\n----------------------------\n\n")

Title: American Psycho
Overview: A wealthy New York City investment banking executive, Patrick Bateman hides his alternate psychopathic ego from his co-workers and friends as he delves deeper into his violent, hedonistic fantasies.

Result: {
    "categories": ["thriller", "psychological", "drama"],
    "summary": "A wealthy investment banker in New York City conceals his psychopathic alter ego while indulging in violent and hedonistic fantasies."
}


----------------------------


Title: Lethal Weapon
Overview: Two newly paired cops who are complete opposites must put aside their differences in order to catch a gang of drug smugglers.

Result: {
    "categories": ["action", "comedy", "crime"],
    "summary": "An action-packed comedy about two mismatched cops teaming up to take down a drug smuggling gang."
}


----------------------------


Title: A Star Is Born
Overview: A musician helps a young singer find fame as age and alcoholism send his own career into a downward spiral.

Result: {
    "categories": ["drama", "music"],
    "summary": "A musician's career spirals downward as he helps a young singer find fame amidst struggles with age and alcoholism."
}


----------------------------


Title: From Here to Eternity
Overview: In Hawaii in 1941, a private is cruelly punished for not boxing on his unit's team, while his captain's wife and second-in-command are falling in love.

Result: {
    "categories": ["drama", "romance", "war"],
    "summary": "A drama set in Hawaii in 1941, where a private faces punishment for not boxing on his unit's team, amidst a forbidden love affair between his captain's wife and second-in-command."
}


----------------------------


Title: The Jungle Book
Overview: Bagheera the Panther and Baloo the Bear have a difficult time trying to convince a boy to leave the jungle for human civilization.

Result: {
    "categories": ["adventure", "animation", "family"],
    "summary": "An adventure-filled animated movie about a panther and a bear trying to persuade a boy to leave the jungle for human civilization."
}


----------------------------
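
For a full run it can be handy to index every result by its row number once, and to set aside requests that did not succeed instead of assuming each line holds a completion. The following is a sketch under stated assumptions: index_results is a hypothetical helper, the sample lines are made up, and it assumes each output line carries a "response" object with a "status_code" plus a top-level "error" field alongside the fields read above.

```python
import json


def index_results(jsonl_lines):
    """Map row index -> message content, and row index -> failure details."""
    contents, failures = {}, {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # custom_id was built as f"task-{index}", so the row index is the last part.
        row_index = int(record["custom_id"].rsplit("-", 1)[-1])
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failures[row_index] = record.get("error") or response.get("body")
            continue
        contents[row_index] = response["body"]["choices"][0]["message"]["content"]
    return contents, failures


# Made-up output lines, deliberately out of order, with one failed request.
sample = [
    json.dumps({
        "custom_id": "task-7",
        "response": {"status_code": 200, "body": {"choices": [{"message": {"content": "{\"categories\": [\"drama\"]}"}}]}},
        "error": None,
    }),
    json.dumps({
        "custom_id": "task-2",
        "response": {"status_code": 400, "body": {"error": {"message": "bad request"}}},
        "error": None,
    }),
]

contents, failures = index_results(sample)
print(sorted(contents))  # [7]
print(sorted(failures))  # [2]
```

With such a lookup, a row's result can be fetched with contents.get(row_index), and the keys of failures give the rows that may need to be resubmitted.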

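Second example: Captioning images

In this example, we will use gpt-4o-mini to caption images of furniture items.

We will use the model's vision capabilities to analyze the images and generate the captions.

Loading data

We will use the Amazon furniture dataset for this example.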

dataset_path = "data/amazon_furniture_dataset.csv"
df = pd.read_csv(dataset_path)
df.head()

	asin	url	title	brand	price	availability	categories	primary_image	images	upc	...	color	material	style	important_information	product_overview	about_item	description	specifications	uniq_id	scraped_at
0	B0CJHKVG6P	https://www.amazon.com/dp/B0CJHKVG6P	GOYMFK 1pc Free Standing Shoe Rack, Multi-laye...	GOYMFK	$24.99	Only 13 left in stock - order soon.	['Home & Kitchen', 'Storage & Organization', '...	https://m.media-amazon.com/images/I/416WaLx10j...	['https://m.media-amazon.com/images/I/416WaLx1...	NaN	...	White	Metal	Modern	[]	[{'Brand': ' GOYMFK '}, {'Color': ' White '}, ...	['Multiple layers: Provides ample storage spac...	multiple shoes, coats, hats, and other items E...	['Brand: GOYMFK', 'Color: White', 'Material: M...	02593e81-5c09-5069-8516-b0b29f439ded	2024-02-02 15:15:08
1	B0B66QHB23	https://www.amazon.com/dp/B0B66QHB23	subrtex Leather ding Room, Dining Chairs Set o...	subrtex	NaN	NaN	['Home & Kitchen', 'Furniture', 'Dining Room F...	https://m.media-amazon.com/images/I/31SejUEWY7...	['https://m.media-amazon.com/images/I/31SejUEW...	NaN	...	Black	Sponge	Black Rubber Wood	[]	NaN	['【Easy Assembly】: Set of 2 dining room chairs...	subrtex Dining chairs Set of 2	['Brand: subrtex', 'Color: Black', 'Product Di...	5938d217-b8c5-5d3e-b1cf-e28e340f292e	2024-02-02 15:15:09
2	B0BXRTWLYK	https://www.amazon.com/dp/B0BXRTWLYK	Plant Repotting Mat MUYETOL Waterproof Transpl...	MUYETOL	$5.98	In Stock	['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo...	https://m.media-amazon.com/images/I/41RgefVq70...	['https://m.media-amazon.com/images/I/41RgefVq...	NaN	...	Green	Polyethylene	Modern	[]	[{'Brand': ' MUYETOL '}, {'Size': ' 26.8*26.8 ...	['PLANT REPOTTING MAT SIZE: 26.8" x 26.8", squ...	NaN	['Brand: MUYETOL', 'Size: 26.8*26.8', 'Item We...	b2ede786-3f51-5a45-9a5b-bcf856958cd8	2024-02-02 15:15:09
3	B0C1MRB2M8	https://www.amazon.com/dp/B0C1MRB2M8	Pickleball Doormat, Welcome Doormat Absorbent ...	VEWETOL	$13.99	Only 10 left in stock - order soon.	['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo...	https://m.media-amazon.com/images/I/61vz1Igler...	['https://m.media-amazon.com/images/I/61vz1Igl...	NaN	...	A5589	Rubber	Modern	[]	[{'Brand': ' VEWETOL '}, {'Size': ' 16*24INCH ...	['Specifications: 16x24 Inch ', " High-Quality...	The decorative doormat features a subtle textu...	['Brand: VEWETOL', 'Size: 16*24INCH', 'Materia...	8fd9377b-cfa6-5f10-835c-6b8eca2816b5	2024-02-02 15:15:10
4	B0CG1N9QRC	https://www.amazon.com/dp/B0CG1N9QRC	JOIN IRON Foldable TV Trays for Eating Set of ...	JOIN IRON Store	$89.99	Usually ships within 5 to 6 weeks	['Home & Kitchen', 'Furniture', 'Game & Recrea...	https://m.media-amazon.com/images/I/41p4d4VJnN...	['https://m.media-amazon.com/images/I/41p4d4VJ...	NaN	...	Grey Set of 4	Iron	X Classic Style	[]	NaN	['Includes 4 Folding Tv Tray Tables And one Co...	Set of Four Folding Trays With Matching Storag...	['Brand: JOIN IRON', 'Shape: Rectangular', 'In...	bdc9aa30-9439-50dc-8e89-213ea211d66a	2024-02-02 15:15:11

5 rows × 25 columns

Processing step

As in the first example, we will first prepare our requests with the Chat Completions endpoint, and then create the batch file.

caption_system_prompt = '''
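Your goal is to generate short, descriptive captions for images of items.
You will be provided with an item image and the name of that item and you will output a caption that captures the most important information about the item.
If there are multiple items depicted, refer to the name provided to understand which item you should describe.
Your generated caption should be short (1 sentence), and include only the most important information about the item.
The most important information could be: the type of item, the style (if mentioned), the material or color if especially relevant, and/or any distinctive features.
Keep it short and to the point.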
'''

def get_caption(img_url, title):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        max_tokens=300,
        messages=[
            {
                "role": "system",
                "content": caption_system_prompt
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": title
                    },
                    # The content type should be "image_url" to use gpt-4o-mini's vision capabilities
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": img_url
                        }
                    },
                ],
            }
        ]
    )

    return response.choices[0].message.content

# Test on a few images
for _, row in df[:5].iterrows():
    img_url = row['primary_image']
    caption = get_caption(img_url, row['title'])
    img = Image(url=img_url)
    display(img)
    print(f"Caption: {caption}\n\n")

Caption: A stylish white free-standing shoe rack featuring multiple layers and eight double hooks, perfect for organizing shoes and accessories in living rooms, bathrooms, or hallways.

Caption: Set of 2 black leather dining chairs featuring a sleek design with vertical stitching and sturdy wooden legs.

Caption: The MUYETOL Plant Repotting Mat is a waterproof, portable, and foldable gardening work mat measuring 26.8" x 26.8", designed for easy soil changing and indoor transplanting.

Caption: Absorbent non-slip doormat featuring the phrase "It's a good day to play PICKLEBALL" with paddle graphics, measuring 16x24 inches.

Caption: Set of 4 foldable TV trays in grey, featuring a compact design with a stand for easy storage, perfect for small spaces.
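
The user message in get_caption pairs a text part with an image_url part that points at a public URL. As a sketch of how the same message could be assembled for images stored locally, the hypothetical helpers below pass URLs through unchanged and encode a local file as a base64 data URL, a form the image_url field is generally understood to accept. The helper names are assumptions for illustration.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(source: str) -> dict:
    """Build an image_url content part from a URL or a local file path."""
    if source.startswith(("http://", "https://", "data:")):
        url = source  # already a URL, pass it through unchanged
    else:
        path = Path(source)
        mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        url = f"data:{mime_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}


def caption_user_message(title: str, image_source: str) -> dict:
    """Assemble the user message: the item name followed by its image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": title},
            image_part(image_source),
        ],
    }


message = caption_user_message("Example chair", "https://example.com/chair.jpg")
print(message["content"][1]["image_url"]["url"])  # https://example.com/chair.jpg
```

Embedding images as data URLs makes each request, and therefore the batch file, considerably larger, so public URLs are usually the lighter option when they are available.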

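Creating the batch job

As in the first example, we will create an array of json tasks to generate a jsonl file, and use it to create the batch job.

# Create an array of json tasks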

tasks = []

for index, row in df.iterrows():

    title = row['title']
    img_url = row['primary_image']

    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-4o-mini",
            "temperature": 0.2,
            "max_tokens": 300,
            "messages": [
                {
                    "role": "system",
                    "content": caption_system_prompt
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": title
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": img_url
                            }
                        },
                    ],
                }
            ]            
        }
    }

    tasks.append(task)

# Create the file

file_name = "data/batch_tasks_furniture.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')

# Upload the file (the context manager closes the handle after the upload)

with open(file_name, "rb") as file:
    batch_file = client.files.create(
        file=file,
        purpose="batch"
    )

# Create the job

batch_job = client.batches.create(
  input_file_id=batch_file.id,
  endpoint="/v1/chat/completions",
  completion_window="24h"
)

batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)

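Getting results

As in the first example, we can retrieve the results once the batch job is done.

Reminder: the results are not in the same order as in the input file. Make sure to check the custom_id to match each result against its input request.

# Retrieve the result file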

result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content

result_file_name = "data/batch_job_results_furniture.jsonl"

with open(result_file_name, 'wb') as file:
    file.write(result)

# Load the data from the saved file

results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parse the JSON string into a dict and append it to the list of results
        json_object = json.loads(line.strip())
        results.append(json_object)

# Read only the first few results
for res in results[:5]:
    task_id = res['custom_id']
    # Get the index from the task id
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    item = df.iloc[int(index)]
    img_url = item['primary_image']
    img = Image(url=img_url)
    display(img)
    print(f"Caption: {result}\n\n")

Caption: Brushed brass pedestal towel rack with a sleek, modern design, featuring multiple bars for hanging towels, measuring 25.75 x 14.44 x 32 inches.

Caption: Black round end table featuring a tempered glass top and a metal frame, with a lower shelf for additional storage.

Caption: Black collapsible and height-adjustable telescoping stool, portable and designed for makeup artists and hairstylists, shown in various stages of folding for easy transport.

Caption: Ergonomic pink gaming chair featuring breathable fabric, adjustable height, lumbar support, a footrest, and a swivel recliner function.

Caption: A set of two Glitzhome adjustable bar stools featuring a mid-century modern design with swivel seats, PU leather upholstery, and wooden backrests.

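Wrapping up

In this guide, we have seen two examples of how to use the new Batch API. Keep in mind that the Batch API works the same way as the Chat Completions endpoint, supporting the same parameters and most of the recent models (gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo...).

Using this API can significantly reduce costs, so we recommend moving every workload that can run asynchronously to batch jobs built on it.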