Long Document Content Extraction

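GPT models can help us extract key figures, dates, or other important pieces of content from documents that are too large to fit into the context window. One way to solve this is to split the document into chunks, process each chunk separately, and then combine the results into a single list of answers.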

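In this notebook we walk through this approach:

- Load a long PDF and pull out its text
- Create a prompt for extracting key pieces of information
- Chunk the document and process each chunk to pull out any answers
- Combine the answers at the end
- Extend this simple approach to three more difficult questions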

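Approach

Setup: Take a PDF, a Formula 1 Financial Regulation document on Power Units, and extract its text for entity extraction. We'll use it to try to extract answers that are buried in the content.
Simple Entity Extraction: Extract key pieces of information from chunks of the document by:
- Creating a template prompt with our questions and an example of the format we expect
- Creating a function that takes a chunk of text as input, combines it with the prompt, and gets a response
- Running a script to chunk the text, extract answers, and output them for parsing
Complex Entity Extraction: Ask some more difficult questions that require tougher reasoning to work out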

Setup

!pip install textract
!pip install tiktoken

import textract
import os
import openai
import tiktoken

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

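# Extract the raw text from the PDF using textract
text = textract.process('data/fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf', method='pdfminer').decode('utf-8')
# Note: newlines become "; " and every ";" is then replaced by a space, so the cleaned
# text contains no newlines or semicolons and may still contain runs of spaces
clean_text = text.replace("  ", " ").replace("\n", "; ").replace(';',' ')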
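The chained `replace` calls above turn each newline into two spaces, so the cleaned text can still contain runs of spaces. As a minimal sketch of an alternative, the helper below (`normalize_whitespace` is a hypothetical name, not part of the notebook) collapses all whitespace with a single regular expression. Note that, unlike the original cleaning step, it keeps semicolons, so chunk boundaries and model outputs may differ from the results shown in this notebook.

```python
import re

def normalize_whitespace(raw_text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", raw_text).strip()

# Small made-up sample standing in for text extracted from a PDF
sample = "Article 1\nScope;  applies to\n\nPower Unit Manufacturers."

# The notebook's cleaning step, applied to the sample for comparison
chained = sample.replace("  ", " ").replace("\n", "; ").replace(";", " ")

# The regex-based alternative
normalized = normalize_whitespace(sample)

print(repr(chained))
print(repr(normalized))
```

Comparing the two printed strings shows where the chained version leaves extra spaces and drops the semicolon.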

Simple Entity Extraction

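# Example prompt
document = '<document>'
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:

0. Who is the author
1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR
2. What is the value of External Manufacturing Costs in USD
3. What is the Capital Expenditure Limit in USD

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.'''
print(template_prompt)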

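Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:

0. Who is the author
1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR
2. What is the value of External Manufacturing Costs in USD
3. What is the Capital Expenditure Limit in USD

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.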

# Split a text into smaller chunks of roughly n tokens, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Yield successive chunks of about n tokens from text, each as a list of token ids."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for a full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence was found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j

def extract_chunk(document,template_prompt):
    prompt = template_prompt.replace('<document>',document)

    messages = [
            {"role": "system", "content": "You help extract information from documents."},
            {"role": "user", "content": prompt}
            ]

    response = client.chat.completions.create(
            model='gpt-4', 
            messages=messages,
            temperature=0,
            max_tokens=1500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
    return "1." + response.choices[0].message.content

# Initialise the tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

results = []

chunks = create_chunks(clean_text,1000,tokenizer)
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))
    #print(chunk)
    print(results[-1])

1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR: USD 95,000,000 (Page 2); GBP 76,459,000 (Page 2); EUR 90,210,000 (Page 2)
2. What is the value of External Manufacturing Costs in USD: US Dollars 20,000,000 in respect of each of the Full Year Reporting Periods ending on 31 December 2023, 31 December 2024 and 31 December 2025, adjusted for Indexation (Page 10)
3. What is the Capital Expenditure Limit in USD: US Dollars 30,000,000 (Page 32)

groups = [r.split('\n') for r in results]

# Zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped

['1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR: USD 95,000,000 (Page 2); GBP 76,459,000 (Page 2); EUR 90,210,000 (Page 2)',
 '2. What is the value of External Manufacturing Costs in USD: US Dollars 20,000,000 in respect of each of the Full Year Reporting Periods ending on 31 December 2023, 31 December 2024 and 31 December 2025, adjusted for Indexation (Page 10)',
 '3. What is the Capital Expenditure Limit in USD: US Dollars 30,000,000 (Page 32)']
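Since the answers come back as numbered lines, they can be turned into structured records for downstream use. The sketch below assumes every answer line follows the `N. question: answer (Page P)` shape seen in the output above; `parse_answer_line` and the two regular expressions are hypothetical helpers, not part of the notebook, and lines that do not match are returned as `None` rather than guessed at.

```python
import re

# Matches lines such as '3. Some question: some answer (Page 32)'
ANSWER_LINE = re.compile(r"^(?P<number>\d+)\.\s*(?P<question>[^:]+):\s*(?P<answer>.+)$")
# Finds every page reference inside an answer
PAGE_REF = re.compile(r"\(Page (\d+)\)")

def parse_answer_line(line):
    """Split one answer line into its number, question, answer text and page numbers."""
    match = ANSWER_LINE.match(line.strip())
    if match is None:
        return None
    answer = match.group("answer").strip()
    return {
        "number": int(match.group("number")),
        "question": match.group("question").strip(),
        "answer": answer,
        "pages": [int(page) for page in PAGE_REF.findall(answer)],
    }

example = parse_answer_line(
    '3. What is the Capital Expenditure Limit in USD: US Dollars 30,000,000 (Page 32)'
)
print(example)
```

In the notebook this could be applied as `[parse_answer_line(x) for x in zipped]`, filtering out any `None` entries.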

Complex Entity Extraction

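# Example prompt
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:

0. Who is the author
1. How is a Minor Overspend Breach calculated
2. How is a Major Overspend Breach calculated
3. Which years do these financial regulations apply to

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.'''
print(template_prompt)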

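Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:

0. Who is the author
1. How is a Minor Overspend Breach calculated
2. How is a Major Overspend Breach calculated
3. Which years do these financial regulations apply to

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.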

results = []

for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))

groups = [r.split('\n') for r in results]

# Zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped

['1. How is a Minor Overspend Breach calculated: A Minor Overspend Breach arises when a Power Unit Manufacturer submits its Full Year Reporting Documentation and Relevant Costs reported therein exceed the Power Unit Cost Cap by less than 5% (Page 24)',
 '2. How is a Major Overspend Breach calculated: A Material Overspend Breach arises when a Power Unit Manufacturer submits its Full Year Reporting Documentation and Relevant Costs reported therein exceed the Power Unit Cost Cap by 5% or more (Page 25)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 1)',
 '3. Which years do these financial regulations apply to: 2023, 2024, 2025, 2026 and subsequent Full Year Reporting Periods (Page 2)',
 '3. Which years do these financial regulations apply to: 2022-2025 (Page 6)',
 '3. Which years do these financial regulations apply to: 2023, 2024, 2025, 2026 and subsequent Full Year Reporting Periods (Page 10)',
 '3. Which years do these financial regulations apply to: 2022 (Page 14)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)',
 '3. Which years do these financial regulations apply to: 2022 (Page 19)',
 '3. Which years do these financial regulations apply to: 2022 (Page 21)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 26)',
 '3. Which years do these financial regulations apply to: 2026 (Page 2)',
 '3. Which years do these financial regulations apply to: 2022 (Page 30)',
 '3. Which years do these financial regulations apply to: 2022 (Page 32)',
 '3. Which years do these financial regulations apply to: 2023, 2024 and 2025 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 37)',
 '3. Which years do these financial regulations apply to: 2026 onwards (Page 40)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2026 to 2030 seasons (Page 46)',
 '3. Which years do these financial regulations apply to: 2022 (Page 47)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 56)',
 '3. Which years do these financial regulations apply to: 2022 (Page 1)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)',
 '3. Which years do these financial regulations apply to: 2022 (Page 16)']

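Consolidation

We've been able to extract the first two answers safely, while the third was confounded by the date that appears on every page, though the correct answer is in there as well.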
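When one question returns many conflicting answers, as question 3 does here, a quick tally can make the spread easier to inspect. The sketch below uses a hypothetical helper, `tally_answers`, and a few lines copied from the output above; it assumes each line has the form `N. question: answer (Page P)`. The most frequent answer is not necessarily the correct one, since a date repeated on every page can dominate the count, so treat this as a review aid rather than a voting rule.

```python
import re
from collections import Counter

def tally_answers(lines):
    """Count how often each distinct answer appears, ignoring the trailing page reference."""
    counts = Counter()
    for line in lines:
        # Everything after the first ': ' is treated as the answer
        _, _, answer = line.partition(": ")
        answer = re.sub(r"\s*\(Page \d+\)\s*$", "", answer).strip()
        if answer:
            counts[answer] += 1
    return counts

# A few lines copied from the question 3 output above
sample_lines = [
    '3. Which years do these financial regulations apply to: 2026 onwards (Page 1)',
    '3. Which years do these financial regulations apply to: 2022-2025 (Page 6)',
    '3. Which years do these financial regulations apply to: 2022 (Page 14)',
    '3. Which years do these financial regulations apply to: 2022 (Page 16)',
    '3. Which years do these financial regulations apply to: 2026 onwards (Page 26)',
]

tally = tally_answers(sample_lines)
for answer, count in tally.most_common():
    print(f"{count} x {answer}")
```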

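To tune this further, you can consider experimenting with:

- A more descriptive or specific prompt
- If you have sufficient training data, fine-tuning a model to find a given set of outputs well
- The way you chunk your data: we chose 1000 tokens with no overlap, but more intelligent chunking that breaks the information into sections, cuts by tokens, or similar may give better results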
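As one illustration of a different chunking scheme, the sketch below produces fixed-size windows in which consecutive chunks share a few tokens, so an answer that straddles a boundary appears whole in at least one chunk. `chunk_with_overlap` is a hypothetical helper, not part of the notebook; it works on any list of tokens, and the sizes used are arbitrary demonstration values.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Yield windows of chunk_size tokens; consecutive windows share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_size]
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break

# Integers stand in for token ids here
demo_tokens = list(range(10))
demo_chunks = list(chunk_with_overlap(demo_tokens, chunk_size=4, overlap=1))
print(demo_chunks)
```

With the notebook's tokenizer, the token list would come from `tokenizer.encode(clean_text)` and each chunk would be turned back into text with `tokenizer.decode(chunk)`. Overlap increases the number of API calls and can produce duplicate answers, which then need de-duplicating when the results are combined.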

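However, with minimal tuning we have now answered 6 questions of varying difficulty using the contents of a long document, and we have a reusable approach that can be applied to any long document requiring entity extraction. We look forward to seeing what you can do with it!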