Using the File Search tool with the Responses API

While RAG can be overwhelming, searching across PDF files shouldn't be complicated. One of the most widely adopted options today is to parse your PDFs, define your chunking strategy, upload those chunks to a storage provider, run embeddings on those chunks of text, and store those embeddings in a vector database. And that's only the setup; retrieving content in our LLM workflows also takes multiple steps.

This is where File Search, a hosted tool available in the Responses API, comes in. It lets you search a knowledge base and generate an answer based on the retrieved content. In this guide, we will upload a set of PDFs extracted from OpenAI's blog (openai.com/news) to an OpenAI vector store, and use File Search to fetch additional context from that vector store when answering questions. We will then generate a small set of evaluation questions from those same PDFs and measure how well retrieval performs on them.

File Search was previously available on the Assistants API. It is now available on the new Responses API, which can be used statefully or statelessly, and it comes with new features such as metadata filtering.
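As a quick illustration of metadata filtering, here is a minimal sketch of what an attribute filter on the file_search tool can look like (assuming a configured client and a populated vector store; the category attribute and its value are hypothetical and would have to be set on the files when attaching them to the store):

# Hypothetical sketch: restrict file_search to files whose "category"
# attribute equals "release_notes" (attributes must be set at upload time)
response = client.responses.create(
    input="What's new in GPT-4.5?",
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_..."],
        "filters": {"type": "eq", "key": "category", "value": "release_notes"},
    }],
)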

Creating a vector store with our PDFs

!pip install PyPDF2 pandas tqdm openai -q
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
import concurrent
import PyPDF2
import os
import pandas as pd
import base64

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
dir_pdfs = 'openai_blog_pdfs' # the PDFs are expected to be stored locally in this directory
pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]

We will create a vector store through the OpenAI API and upload our PDFs to it. OpenAI will read those PDFs, split the content into chunks of text, run embeddings on those chunks, and store both the embeddings and the text in the vector store. That will enable us to query this vector store and get relevant content back for a given query.
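By default, OpenAI picks the chunking strategy automatically. If you need control over it, a chunking_strategy can be passed when attaching a file to the vector store; the sketch below shows the shape of that parameter (the token sizes are illustrative values, not tuned recommendations):

# Optional sketch: explicit chunking when attaching a file to a vector store
client.vector_stores.files.create(
    vector_store_id="vs_...",
    file_id="file-...",
    chunking_strategy={
        "type": "static",
        "static": {"max_chunk_size_tokens": 800, "chunk_overlap_tokens": 400},
    },
)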

def upload_single_pdf(file_path: str, vector_store_id: str):
    file_name = os.path.basename(file_path)
    try:
        with open(file_path, 'rb') as f:  # close the file handle once the upload is done
            file_response = client.files.create(file=f, purpose="assistants")
        attach_response = client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_response.id
        )
        return {"file": file_name, "status": "success"}
    except Exception as e:
        print(f"Error with {file_name}: {str(e)}")
        return {"file": file_name, "status": "failed", "error": str(e)}

def upload_pdf_files_to_vector_store(vector_store_id: str):
    pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]
    stats = {"total_files": len(pdf_files), "successful_uploads": 0, "failed_uploads": 0, "errors": []}

    print(f"{len(pdf_files)} PDF files to process. Uploading in parallel...")

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(upload_single_pdf, file_path, vector_store_id): file_path for file_path in pdf_files}
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(pdf_files)):
            result = future.result()
            if result["status"] == "success":
                stats["successful_uploads"] += 1
            else:
                stats["failed_uploads"] += 1
                stats["errors"].append(result)

    return stats

def create_vector_store(store_name: str) -> dict:
    try:
        vector_store = client.vector_stores.create(name=store_name)
        details = {
            "id": vector_store.id,
            "name": vector_store.name,
            "created_at": vector_store.created_at,
            "file_count": vector_store.file_counts.completed
        }
        print("Vector store created:", details)
        return details
    except Exception as e:
        print(f"Error creating vector store: {e}")
        return {}
store_name = "openai_blog_store"
vector_store_details = create_vector_store(store_name)
upload_pdf_files_to_vector_store(vector_store_details["id"])
Vector store created: {'id': 'vs_67d06b9b9a9c8191bafd456cf2364ce3', 'name': 'openai_blog_store', 'created_at': 1741712283, 'file_count': 0}
21 PDF files to process. Uploading in parallel...
100%|███████████████████████████████| 21/21 [00:09<00:00,  2.32it/s]

{'total_files': 21,
 'successful_uploads': 21,
 'failed_uploads': 0,
 'errors': []}
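Ingestion runs asynchronously after the upload returns, so it can be worth confirming that every file has finished processing before querying the store. A small sketch using the vector store retrieval endpoint (we assume all counts end up in completed):

# Check ingestion status: file_counts should report all files as completed
vector_store = client.vector_stores.retrieve(vector_store_details["id"])
print(vector_store.file_counts)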

Standalone vector search

Now that our vector store is ready, we can query it directly and retrieve relevant content for a specific query. Using the new vector search API, we can find relevant items from our knowledge base without necessarily integrating it into an LLM query.

query = "What's Deep Research?"
search_results = client.vector_stores.search(
    vector_store_id=vector_store_details['id'],
    query=query
)
for result in search_results.data:
    print(str(len(result.content[0].text)) + ' characters of content from ' + result.filename + ' with a relevance score of ' + str(result.score))
3502 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.9813588865322393
3493 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.9522476825143714
3634 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.9397930296526796
2774 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.9101975747303771
3474 characters of content from Deep research System Card _ OpenAI.pdf with a relevance score of 0.9036647613464299
3123 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.887120981288272
3343 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.8448454849432881
3262 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.791345286655509
3271 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.7485530025091963
2721 characters of content from Introducing deep research _ OpenAI.pdf with a relevance score of 0.734033360849088

We can see that the search query returned chunks of content of different sizes (and with different underlying text). They all have different relevance scores, computed by our ranker, which uses hybrid search.
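If you want tighter control over what comes back, the search endpoint also accepts parameters such as max_num_results and ranking_options. The snippet below is a sketch under that assumption; the 0.8 threshold is an arbitrary illustrative value:

# Sketch: cap the number of results and drop low-scoring chunks
search_results = client.vector_stores.search(
    vector_store_id=vector_store_details['id'],
    query=query,
    max_num_results=5,
    ranking_options={"score_threshold": 0.8},
)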

Integrating search results with an LLM in a single API call

However, rather than querying the vector store and then passing the results into a Responses or Chat Completions API call, an even more convenient way to use these search results in an LLM query is to plug the file_search tool into a Responses API call.

query = "What's Deep Research?"
response = client.responses.create(
    input=query,
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store_details['id']],
    }]
)

# Extract annotations from the response
annotations = response.output[1].content[0].annotations

# Get top-k retrieved filenames
retrieved_files = set([result.filename for result in annotations])

print(f'Files used: {retrieved_files}')
print('Response:')
print(response.output[1].content[0].text) # output[0] is the file_search call; output[1] holds the model's message
Files used: {'Introducing deep research _ OpenAI.pdf'}
Response:
Deep Research is a new capability introduced by OpenAI that allows users to conduct complex, multi-step research tasks on the internet efficiently. Key features include:

1. **Autonomous Research**: Deep Research acts as an independent agent that synthesizes vast amounts of information across the web, enabling users to receive comprehensive reports similar to those produced by a research analyst.

2. **Multi-Step Reasoning**: It performs deep analysis by finding, interpreting, and synthesizing data from various sources, including text, images, and PDFs.

3. **Application Areas**: Especially useful for professionals in fields such as finance, science, policy, and engineering, as well as for consumers seeking detailed information for purchases.

4. **Efficiency**: The output is fully documented with citations, making it easy to verify information, and it significantly speeds up research processes that would otherwise take hours for a human to complete.

5. **Limitations**: While Deep Research enhances research capabilities, it is still subject to limitations, such as potential inaccuracies in information retrieval and challenges in distinguishing authoritative data from unreliable sources.

Overall, Deep Research marks a significant advancement toward automated general intelligence (AGI) by improving access to thorough and precise research outputs.

We can see that gpt-4o-mini was able to answer a query that required recent, specialized knowledge about OpenAI's Deep Research. It used content from the file Introducing deep research _ OpenAI.pdf, which contained the most relevant chunks of text. If we want to dig deeper into the chunks of text that were retrieved, we can also analyze the different texts returned by the search engine by adding include=["output[*].file_search_call.search_results"] to our query.
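For reference, a sketch of that call with the include parameter. The retrieved chunks should then be attached to the file_search call item in the output; treat the exact access path as an assumption about the SDK's object layout:

# Sketch: also return the raw retrieved chunks alongside the answer
response = client.responses.create(
    input=query,
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store_details['id']],
    }],
    include=["output[*].file_search_call.search_results"],
)
# output[0] is expected to be the file_search call item carrying the results
print(response.output[0].results)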

Evaluating performance

What is critical with these information-retrieval systems is also measuring the relevance and quality of the files retrieved to produce those answers. The following steps of this guide consist of generating an evaluation dataset and computing different metrics over it. This is an imperfect approach, and we would always recommend a human-verified evaluation dataset for your own use case, but it shows the methodology for evaluating such systems. It is imperfect because some of the generated questions may be generic (e.g., "What did the main stakeholder say in this document?"), and our retrieval test will struggle to determine which document that question was generated for.

Generating evals

We will create functions that read our local PDF files and generate, for each one, a question that can only be answered by that document. This gives us an evaluation dataset we can use later.

def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
    return text

def generate_questions(pdf_path):
    text = extract_text_from_pdf(pdf_path)

    prompt = (
        "Can you generate a question that can only be answered from this document?:\n"
        f"{text}\n\n"
    )

    response = client.responses.create(
        input=prompt,
        model="gpt-4o",
    )

    question = response.output[0].content[0].text

    return question

If we run the generate_questions function on the first PDF file, we can see the kind of question it generates.

generate_questions(pdf_files[0])
'What new capabilities will ChatGPT have as a result of the partnership between OpenAI and Schibsted Media Group?'

We can now generate a question for each of the PDF files we stored and collect them in a dictionary.

# Generate questions for each PDF and store in a dictionary
questions_dict = {}
for pdf_path in pdf_files:
    questions = generate_questions(pdf_path)
    questions_dict[os.path.basename(pdf_path)] = questions
questions_dict
{'OpenAI partners with Schibsted Media Group _ OpenAI.pdf': 'What is the purpose of the partnership between Schibsted Media Group and OpenAI announced on February 10, 2025?',
 'OpenAI and the CSU system bring AI to 500,000 students & faculty _ OpenAI.pdf': 'What significant milestone did the California State University system achieve by partnering with OpenAI, making it the first of its kind in the United States?',
 '1,000 Scientist AI Jam Session _ OpenAI.pdf': 'What was the specific AI model used during the "1,000 Scientist AI Jam Session" event across the nine national labs?',
 'Announcing The Stargate Project _ OpenAI.pdf': 'What are the initial equity funders and lead partners in The Stargate Project announced by OpenAI, and who holds the financial and operational responsibilities?',
 'Introducing Operator _ OpenAI.pdf': 'What is the name of the new model that powers the Operator agent introduced by OpenAI?',
 'Introducing NextGenAI _ OpenAI.pdf': 'What major initiative did OpenAI launch on March 4, 2025, and which research institution from Europe is involved as a founding partner?',
 'Introducing the Intelligence Age _ OpenAI.pdf': "What is the name of the video generation tool used by OpenAI's creative team to help produce their Super Bowl ad?",
 'Operator System Card _ OpenAI.pdf': 'What is the preparedness score for the "Cybersecurity" category according to the Operator System Card?',
 'Strengthening America’s AI leadership with the U.S. National Laboratories _ OpenAI.pdf': "What is the purpose of OpenAI's agreement with the U.S. National Laboratories as described in the document?",
 'OpenAI GPT-4.5 System Card _ OpenAI.pdf': 'What is the Preparedness Framework rating for "Cybersecurity" for GPT-4.5 according to the system card?',
 'Partnering with Axios expands OpenAI’s work with the news industry _ OpenAI.pdf': "What is the goal of OpenAI's new content partnership with Axios as announced in the document?",
 'OpenAI and Guardian Media Group launch content partnership _ OpenAI.pdf': 'What is the main purpose of the partnership between OpenAI and Guardian Media Group announced on February 14, 2025?',
 'Introducing GPT-4.5 _ OpenAI.pdf': 'What is the release date of the GPT-4.5 research preview?',
 'Introducing data residency in Europe _ OpenAI.pdf': 'What are the benefits of data residency in Europe for new ChatGPT Enterprise and Edu customers according to the document?',
 'The power of personalized AI _ OpenAI.pdf': 'What is the purpose of the "Model Spec" document published by OpenAI for ChatGPT?',
 'Disrupting malicious uses of AI _ OpenAI.pdf': "What is OpenAI's mission as stated in the document?",
 'Sharing the latest Model Spec _ OpenAI.pdf': 'What is the release date of the latest Model Spec mentioned in the document?',
 'Deep research System Card _ OpenAI.pdf': "What specific publication date is mentioned in the Deep Research System Card for when the report on deep research's preparedness was released?",
 'Bertelsmann powers creativity and productivity with OpenAI _ OpenAI.pdf': 'What specific AI-powered solutions is Bertelsmann planning to implement for its divisions RTL Deutschland and Penguin Random House according to the document?',
 'OpenAI’s Economic Blueprint _ OpenAI.pdf': 'What date and location is scheduled for the kickoff event of OpenAI\'s "Innovating for America" initiative as mentioned in the Economic Blueprint document?',
 'Introducing deep research _ OpenAI.pdf': 'What specific model powers the "deep research" capability in ChatGPT that is discussed in this document, and what are its main features designed for?'}

We now have a dictionary of filename:question pairs that we can loop through, asking gpt-4o-mini without providing the documents; the model should be able to find the relevant document in the vector store.

Evaluating

We will convert our dictionary into rows of queries and expected filenames and process them with gpt-4o-mini, keeping an eye out for the expected file in the retrieval results.

rows = []
for filename, query in questions_dict.items():
    rows.append({"query": query, "_id": filename.replace(".pdf", "")})

# Metrics evaluation parameters
k = 5
total_queries = len(rows)
correct_retrievals_at_k = 0
reciprocal_ranks = []
average_precisions = []

def process_query(row):
    query = row['query']
    expected_filename = row['_id'] + '.pdf'
    # Call file_search via Responses API
    response = client.responses.create(
        input=query,
        model="gpt-4o-mini",
        tools=[{
            "type": "file_search",
            "vector_store_ids": [vector_store_details['id']],
            "max_num_results": k,
        }],
        tool_choice="required" # it will force the file_search, while not necessary, it's better to enforce it as this is what we're testing
    )
    # Extract annotations from the response
    annotations = None
    if hasattr(response.output[1], 'content') and response.output[1].content:
        annotations = response.output[1].content[0].annotations
    elif hasattr(response.output[1], 'annotations'):
        annotations = response.output[1].annotations

    if annotations is None:
        print(f"No annotations for query: {query}")
        return False, 0, 0

    # Get top-k retrieved filenames
    retrieved_files = [result.filename for result in annotations[:k]]
    if expected_filename in retrieved_files:
        rank = retrieved_files.index(expected_filename) + 1
        rr = 1 / rank
        correct = True
    else:
        rr = 0
        correct = False

    # Calculate Average Precision
    precisions = []
    num_relevant = 0
    for i, fname in enumerate(retrieved_files):
        if fname == expected_filename:
            num_relevant += 1
            precisions.append(num_relevant / (i + 1))
    avg_precision = sum(precisions) / len(precisions) if precisions else 0

    if expected_filename not in retrieved_files:
        print("Expected file NOT found in the retrieved files!")

    if retrieved_files and retrieved_files[0] != expected_filename:
        print(f"Query: {query}")
        print(f"Expected file: {expected_filename}")
        print(f"First retrieved file: {retrieved_files[0]}")
        print(f"Retrieved files: {retrieved_files}")
        print("-" * 50)


    return correct, rr, avg_precision
process_query(rows[0])
(True, 1.0, 1.0)

Recall and precision are both 1 for this example, and our file was ranked first, so MRR and MAP are 1 as well.
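For reference, with N queries and exactly one relevant document per query, the metrics computed below reduce to:

Recall@k = (number of queries whose expected file appears in the top k) / N
MRR = (1/N) * sum_i (1 / rank_i), where rank_i is the position of the expected file (the term is 0 if it is not retrieved)
MAP = (1/N) * sum_i AP_i, where AP_i averages the precision at each rank where the expected file appears

Because each query has a single relevant document, Precision@k works out to the same value as Recall@k here.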

We can now run this processing on our whole set of questions.

with ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_query, rows), total=total_queries))

correct_retrievals_at_k = 0
reciprocal_ranks = []
average_precisions = []

for correct, rr, avg_precision in results:
    if correct:
        correct_retrievals_at_k += 1
    reciprocal_ranks.append(rr)
    average_precisions.append(avg_precision)

recall_at_k = correct_retrievals_at_k / total_queries
precision_at_k = recall_at_k  # in this context (one relevant document per query), we treat precision@k as equal to recall@k
mrr = sum(reciprocal_ranks) / total_queries
map_score = sum(average_precisions) / total_queries
 62%|███████████████████▏           | 13/21 [00:07<00:03,  2.57it/s]

Expected file NOT found in the retrieved files!
Query: What is OpenAI's mission as stated in the document?
Expected file: Disrupting malicious uses of AI _ OpenAI.pdf
First retrieved file: Introducing the Intelligence Age _ OpenAI.pdf
Retrieved files: ['Introducing the Intelligence Age _ OpenAI.pdf']
--------------------------------------------------

 71%|██████████████████████▏        | 15/21 [00:14<00:06,  1.04s/it]

Expected file NOT found in the retrieved files!
Query: What is the purpose of the "Model Spec" document published by OpenAI for ChatGPT?
Expected file: The power of personalized AI _ OpenAI.pdf
First retrieved file: Sharing the latest Model Spec _ OpenAI.pdf
Retrieved files: ['Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf']
--------------------------------------------------

100%|███████████████████████████████| 21/21 [00:15<00:00,  1.38it/s]

The output logged above shows queries where the expected file was either not ranked first (whereas our evaluation dataset expected it to be) or not retrieved at all. As anticipated from our imperfect evaluation dataset, some questions were generic enough to match another document, and our retrieval system did not surface the document they were generated from.

# Print the metrics with k
print(f"Metrics at k={k}:")
print(f"Recall@{k}: {recall_at_k:.4f}")
print(f"Precision@{k}: {precision_at_k:.4f}")
print(f"Mean Reciprocal Rank (MRR): {mrr:.4f}")
print(f"Mean Average Precision (MAP): {map_score:.4f}")
Metrics at k=5:
Recall@5: 0.9048
Precision@5: 0.9048
Mean Reciprocal Rank (MRR): 0.9048
Mean Average Precision (MAP): 0.8954

With this cookbook, we were able to see how to:

  • Generate an evaluation dataset, using PDF context stuffing (leveraging 4o's vision modality) as well as traditional PDF readers.
  • Create a vector store and populate it with PDFs.
  • Get an LLM to answer queries using the out-of-the-box RAG system available in OpenAI's Responses API, via the file_search tool call.
  • Understand how chunks of text are retrieved, ranked, and used as part of the Responses API.
  • Measure accuracy, precision, recall, MRR, and MAP on the previously generated evaluation dataset.

By combining File Search with the Responses API, you can simplify your RAG architecture and leverage it in a single API call. File storage, embeddings, and retrieval are all handled by one tool!