Summarization with Claude

Introduction

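Summarization is a core task in natural language processing: condensing a large body of text into a shorter, more digestible form while preserving the key information. In today's information-rich world, the ability to quickly extract and synthesize the essential points of a long document is valuable across many industries and applications.

This guide focuses on Claude's summarization capabilities, with particular emphasis on legal documents. Legal documents are often long and tedious to read, especially when they contain extensive fine print and legal terminology. We will explore techniques for summarizing such documents effectively, methods for evaluating summary quality, and strategies for systematically improving summarization performance.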

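Key aspects we will cover include:

Crafting effective summarization prompts
Extracting specific metadata from documents
Handling long documents that exceed typical token limits
Evaluating summary quality with automated methods (for example, ROUGE scores and custom Promptfoo methods)
Iteratively improving summarization performance
General concluding tips on optimizing a summarization workflow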

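By the end of this guide, you will have a solid understanding of how to implement and optimize summarization tasks with Claude, along with a framework for applying these techniques to your own use cases.

Before we begin, a word about evaluation in this guide. Evaluating summary quality is notoriously hard. Unlike many other natural language processing tasks, summarization evaluation often lacks clear, objective metrics. The process can be highly subjective, with different readers valuing different aspects of a summary. Traditional empirical methods such as ROUGE scores are useful but limited in their ability to capture nuances such as coherence, factual accuracy, and relevance. Moreover, the "best" summary can vary with the specific use case, the target audience, and the desired level of detail. Despite these challenges, this guide explores several usable approaches that combine automated metrics, regular expressions, and task-specific criteria. Throughout, we recognize that the most effective approach is usually a combination of techniques tailored to the summarization task at hand.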
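
To make the ROUGE idea concrete, here is a minimal sketch of ROUGE-1 written from scratch: unigram overlap between a candidate summary and a reference. It is a simplification that assumes whitespace tokenization and lowercasing only. The function name rouge1 is ours for illustration, not part of any library; in practice the rouge-score package listed in the setup section is the better tool.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(
    "the sublease term is five years",
    "the term of the sublease is five years",
)
print(scores)
```

Here every candidate word appears in the reference, so precision is perfect even though the word order differs. Overlap metrics of this kind say nothing about coherence or factual accuracy, which is why this guide pairs them with other checks.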

Setup

To complete this guide, you will need to install the following packages:

anthropic
pypdf
pandas
matplotlib
scikit-learn
numpy
rouge-score
nltk
seaborn
promptfoo (for evaluation)

You will also need an Anthropic API key.

Let's start by installing the required packages and setting up our environment:

# install packages
!pip install anthropic pypdf pandas matplotlib numpy rouge-score nltk seaborn --quiet

import os
import re
import anthropic
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from typing import List, Dict, Tuple
import json
import seaborn as sns

# Set up Anthropic client
# You can set up a .env file with your API key to keep it private, and import it like so:
# from dotenv import load_dotenv
# load_dotenv()

# or add your key directly
api_key = 'ANTHROPIC_API_KEY' # Replace ANTHROPIC_API_KEY with your actual API key
client = anthropic.Anthropic(api_key=api_key)

print("Setup complete!")

Setup complete!

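Data Preparation

Before we can start summarizing documents, we need to prepare our data. This involves extracting text from a PDF, cleaning it, and making sure it is ready to pass to the language model. For this demonstration, we use a publicly available sublease agreement obtained from the sec.gov website.

If you have a PDF of your own to test, feel free to place it in this directory and change the file path below. If you would rather paste in a block of text, skip this step and define text = <text blob>.

Here is a set of functions that handles this process: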

import pypdf
import re

pdf_path = "data/Sample Sublease Agreement.pdf"

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = pypdf.PdfReader(file)
        text = ""
        for page in reader.pages:
            text += page.extract_text() + "\n"
    return text

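def clean_text(text):
    # Remove standalone page numbers first, while line breaks are still present
    text = re.sub(r'\n\s*\d+\s*\n', '\n', text)
    # Collapse all remaining whitespace (including newlines) into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()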

def prepare_for_llm(text, max_tokens=180000):
    # Truncate text to fit within token limit (approximate)
    return text[:max_tokens * 4]  # Assuming average of 4 characters per token

def get_llm_text(path):
    extracted_text = extract_text_from_pdf(path)
    cleaned_text = clean_text(extracted_text)
    llm_ready_text = prepare_for_llm(cleaned_text)
    return llm_ready_text

# You can now use get_llm_text in your LLM prompt
text = get_llm_text(pdf_path)
print(text[:500])

EX-10.32 7 dex1032.htm SUBLEASE AGREEMENT Exhibit 10.32 SUBLEASE AGREEMENT THIS SUBLEASE AGREEMENT (“Sublease ”), is dated as of April 1, 2006, by and between COHEN BROTHERS, LLC d/b/a COHEN & COMP ANY (“Sublessor ”) and TABERNA CAPIT AL MANAGEMENT , LLC (“Sublessee ”), collectively , the “ Parties ” and each a “ Party ”. WHEREAS, Sublessor is the lessee under a written lease agreement dated June 22, 2005 wherein Brandywine Cira, L.P ., a Delaware limited partnership (“ Lessor ”), leased Suite N

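This setup lets us easily process a PDF document and prepare it for summarization. In the next section, we start with a basic summarization approach and then improve on it with more advanced techniques.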
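
The truncation in prepare_for_llm assumes roughly 4 characters per token. The sketch below, which is not part of the original notebook, makes that heuristic explicit so you can check whether a document would be cut off before sending it. The helper names estimate_tokens and truncation_report are hypothetical, and the estimate is only approximate, since real token counts depend on the tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate using the 4-characters-per-token heuristic."""
    return -(-len(text) // chars_per_token)  # ceiling division


def truncation_report(text: str, max_tokens: int = 180000, chars_per_token: int = 4) -> dict:
    """Report whether character-based truncation would cut the text."""
    char_budget = max_tokens * chars_per_token
    return {
        "estimated_tokens": estimate_tokens(text, chars_per_token),
        "char_budget": char_budget,
        "truncated": len(text) > char_budget,
        "chars_dropped": max(0, len(text) - char_budget),
    }


# A 1000-character dummy document checked against a deliberately small budget.
report = truncation_report("x" * 1000, max_tokens=200)
print(report)
```

If truncated comes back True, the tail of the document would be dropped without any warning, which is a good reason to switch to the chunking approach covered later in this guide.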

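Basic Summarization

Let's start with a simple summarization function built on Claude. This is a first, naive attempt at summarizing the document text above; we will refine the approach as the guide progresses.

Although this looks simple, it already relies on some important Claude functionality, most notably the assistant role and stop sequences. The assistant prefill positions Claude to write the summary immediately after the final phrase, <summary>. The stop sequence </summary> then tells Claude to stop generating. We will keep using this pattern throughout the guide.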

def basic_summarize(text, max_tokens=1000):

    # Prompt the model to summarize the text
    prompt = f"""Summarize the following text in bullet points. Focus on the main ideas and key details:
    {text}
    """

    response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=max_tokens,
            system="You are a legal analyst known for highly accurate and detailed summaries of legal documents.",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                },
                {
                    "role": "assistant",
                    "content": "Here is the summary of the legal document: <summary>"
                }
            ],
            stop_sequences=["</summary>"]
        )

    return response.content[0].text

basic_response = basic_summarize(text, max_tokens=1000)

print(basic_response)

Key Points:
•Between parties: COHEN BROTHERS, LLC d/b/a COHEN & COMPANY (Sublessor) and TABERNA CAPITAL MANAGEMENT, LLC (Sublessee).
•Signed on April 1, 2006.
•Premises: 2,000 square feet of office space in Suite 1703 in the Cira Center at 2929 Arch Street, Philadelphia.

Major Terms:
•Term: 5 years starting April 1, 2006
•Payment: Fixed rent increases annually from $34.50/sf to $37.34/sf over the term.
•Utilities: Tenant pays for electricity and pro rata share of building expenses.
•Use: General office use only.
•Assignment/Subletting: Requires prior written consent of landlord.

Key Obligations:
•Tenant must maintain insurance including liability and property insurance.
•Tenant responsible for interior maintenance/condition of the premises.
•Tenant must comply with all building rules and regulations.
•Tenant must maintain premises in good order and repair.

Notable Provisions:
•Sublessor can recapture premises if tenant tries to assign/sublet without proper consent.
•Default provisions give sublessor multiple remedies including termination and accelerated rent.
•Tenant must indemnify landlord for claims related to tenant's use or actions.
•Tenant responsible for maintaining and repairing interior of premises.

This appears to be a fairly standard commercial office sublease with typical provisions regarding tenant obligations, insurance requirements, default remedies, etc. The sublessor retains significant control and remedies while the sublessee has standard obligations for an office tenant.

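This basic approach gives us a simple summary, but for legal or financial documents it may not capture all the nuance we need. As you will notice if you rerun the cell above, there is no standard, formalized output. Instead, we get a basic summary of the document without much structure to parse. That makes it harder to read and harder to trust (how do we know it hasn't missed something?), and therefore harder to use in any real-world application.

Let's see whether we can adjust our prompt to get a more structured summary.

Multi-Shot Basic Summarization

It's great that we can summarize large documents this quickly, but we can do better. Let's add a few examples to the prompt and see whether that improves the output and gives it some structure before we move on to more advanced techniques.

Note that we do not change the format of the request itself; we only add two things:

We tell the model "Do not preamble." This is usually a good idea when you want to restrict the output to just the answer, without the conversational opening you may be familiar with from using Claude. It becomes especially important later in this guide, when we add further instructions.
We append 3 examples of summarized documents. This is called few-shot or multi-shot learning, and it helps the model understand what we are looking for.

Let's see how the output changes: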

# We import from our data directory to save space in our notebook
from data.multiple_subleases import document1, document2, document3, sample1, sample2, sample3

def basic_summarize_multishot(text, max_tokens=1000):

    # Prompt the model to summarize the text
    prompt = f"""Summarize the following text in bullet points. Focus on the main ideas and key details:
        {text}

    Do not preamble.

    Use these examples for guidance in summarizing:

    <example1>
        <original1>
            {document1}
        </original1>

        <summary1>
            {sample1}
        </summary1>
    </example1>

    <example2>
        <original2>
            {document2}
        </original2>

        <summary2>
            {sample2}
        </summary2>
    </example2>

    <example3>
        <original3>
            {document3}
        </original3>

        <summary3>
            {sample3}
        </summary3>
    </example3>
    """

    response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=max_tokens,
            system="You are a legal analyst known for highly accurate and detailed summaries of legal documents.",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                },
                {
                    "role": "assistant",
                    "content": "Here is the summary of the legal document: <summary>"
                }
            ],
            stop_sequences=["</summary>"]
        )

    return response.content[0].text

basic_multishot_response = basic_summarize_multishot(text, max_tokens=1000)

print(basic_multishot_response)

Description: This is a sublease agreement between Cohen Brothers, LLC (Sublessor) and Taberna Capital Management, LLC (Sublessee) for office space in Philadelphia.

<parties involved>
Sublessor: Cohen Brothers, LLC d/b/a Cohen & Company
Sublessee: Taberna Capital Management, LLC
Original lessor: Brandywine Cira, L.P.
</parties involved>

<property details> 
Address: 2929 Arch Street, Suite 1703, Philadelphia, PA
Description: 2,000 square feet of office space with access to file space, printers, copiers, kitchen, conference rooms
Permitted use: General office use
</property details>

<term and rent>
Start date: April 1, 2006
End date: 5 years from start date
Monthly rent:
• Months 1-12: $5,750
• Months 13-24: $5,865
• Months 25-36: $5,981.67
• Months 37-48: $6,101.67
• Months 49-60: $6,223.33
</term and rent>

<responsibilities>
Utilities: Not explicitly specified
Maintenance: Not explicitly specified
Repairs: Tenant responsible for damage repairs
Insurance: Tenant required to maintain liability insurance with $3M limit and workers compensation insurance
</responsibilities>

<special provisions>
Default: Detailed events of default and remedies specified
Holdover: Double rent for unauthorized holdover period
Assignment/Subletting: Not permitted without landlord consent
Alterations: Require landlord consent
Access to services: Includes file space, copiers, conference rooms, receptionist services
</special provisions>

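If you look at the examples we provided, you will see that the output above follows the same format (see data/<any .txt file>). This is interesting: we never explicitly told Claude to follow the format of the examples, yet it picked up on it anyway. It illustrates the power of few-shot learning and Claude's ability to generalize from a handful of examples to new input.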
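
The example block in the multi-shot prompt is written out by hand. As a rough sketch (the helper name build_examples_block is an assumption, not something from the notebook), the same XML structure can be generated from a list of document and summary pairs, which makes it easier to add or swap examples:

```python
def build_examples_block(pairs):
    """Render (document, summary) pairs as numbered XML example blocks."""
    blocks = []
    for i, (document, summary) in enumerate(pairs, start=1):
        blocks.append(
            f"<example{i}>\n"
            f"<original{i}>\n{document}\n</original{i}>\n\n"
            f"<summary{i}>\n{summary}\n</summary{i}>\n"
            f"</example{i}>"
        )
    return "\n\n".join(blocks)


# Placeholder strings stand in for the real documents and reference summaries.
examples = build_examples_block([
    ("Doc A text", "Summary A"),
    ("Doc B text", "Summary B"),
])
print(examples)
```

The resulting string could be interpolated into the prompt in place of the hand-written <example1> to <example3> blocks, for instance with the pairs (document1, sample1), (document2, sample2), and (document3, sample3).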

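Advanced Summarization Techniques

Guided Summarization

Guided summarization means explicitly defining a framework for the model to follow in its summarization task. We can do all of this by changing the details of the prompt: steering Claude toward more or less detail, including or excluding technical terminology, or providing a higher- or lower-level summary of the context. For legal documents, we can direct the summary to focus on specific aspects.

Note that we could very likely achieve the same formatted output as below through examples, as explored above.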

def guided_legal_summary(text, max_tokens=1000):

    # Prompt the model to summarize the text
    prompt = f"""Summarize the following legal document. Focus on these key aspects:

    1. Parties involved
    2. Main subject matter
    3. Key terms and conditions
    4. Important dates or deadlines
    5. Any unusual or notable clauses

    Provide the summary in bullet points under each category.

    Document text:
    {text}

    """

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_tokens,
        system="You are a legal analyst known for highly accurate and detailed summaries of legal documents.",
        messages=[
            {
                "role": "user",
                "content": prompt
            },
            {
                "role": "assistant",
                "content": "Here is the summary of the legal document: <summary>"
            }
        ],
        stop_sequences=["</summary>"]
    )

    return response.content[0].text

# Example usage
legal_summary = guided_legal_summary(text)
print(legal_summary)

1. Parties Involved
- Sublessor: Cohen Brothers, LLC d/b/a Cohen & Company
- Sublessee: Taberna Capital Management, LLC
- Original Landlord: Brandywine Cira, L.P. (Master Lease landlord)

2. Main Subject Matter
- Sublease agreement for Suite 1703 at Cira Centre, 2929 Arch Street, Philadelphia, PA
- 2,000 square feet of office space within the Master Premises of 13,777 rentable square feet
- Includes furniture, file space, printers, copiers, kitchen, conference room facilities and receptionist/secretarial services

3. Key Terms and Conditions
- Initial Term: 5 years from April 1, 2006 
- Fixed Rent: Escalating annual rent schedule starting at $34.50/sq ft in Year 1 ($69,000 annually) up to $37.34/sq ft in Year 5 ($74,680 annually)
- Pro rata share of operating expenses and utilities
- No assignment or subletting without Sublessor's prior written consent
- Sublessee takes premises "AS IS"
- Sublessee must maintain required insurance coverage
- Default provisions for non-payment, breach of lease terms, bankruptcy, etc.

4. Important Dates/Deadlines  
- Commencement Date: April 1, 2006
- Expiration Date: 5 years from Commencement Date
- Fixed Rent payable monthly in advance on 1st of each month
- 5-day grace period for late payments before default

5. Notable Clauses
- Indemnification requirements for both parties
- Holdover rent at 2x monthly rate if Sublessee remains after term ends
- Sublessor not liable for utilities/services interruption
- Sublessee responsible for any construction liens
- Confession of judgment provision
- Waiver of jury trial provision

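This does make it easier to pick out the most relevant parts of the document and to understand the meaning of specific items and important terms.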

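Domain-Specific Guided Summarization

The guided summarization prompt above can be applied to any type of document, but we can make it more powerful by tailoring it to a specific document type. For example, if we know we are dealing with a sublease agreement, we can direct the model to focus on the legal terms and concepts most relevant to that type of document. This is most useful when we are using Claude for a specific use case and know exactly which values we want to extract.

Here is an example of how we might modify the guided summarization function for sublease agreements. Note that we also add model as a parameter, so that we can more easily choose a different model for summarization depending on the task: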

def guided_sublease_summary(text, model="claude-3-5-sonnet-20241022", max_tokens=1000):

    # Prompt the model to summarize the sublease agreement
    prompt = f"""Summarize the following sublease agreement. Focus on these key aspects:

    1. Parties involved (sublessor, sublessee, original lessor)
    2. Property details (address, description, permitted use)
    3. Term and rent (start date, end date, monthly rent, security deposit)
    4. Responsibilities (utilities, maintenance, repairs)
    5. Consent and notices (landlord's consent, notice requirements)
    6. Special provisions (furniture, parking, subletting restrictions)

    Provide the summary in bullet points nested within the XML header for each section. For example:

    <parties involved>

    - Sublessor: [Name]
    // Add more details as needed
    </parties involved>

    If any information is not explicitly stated in the document, note it as "Not specified". Do not preamble.

    Sublease agreement text:
    {text}

    """

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system="You are a legal analyst specializing in real estate law, known for highly accurate and detailed summaries of sublease agreements.",
        messages=[
            {
                "role": "user",
                "content": prompt
            },
            {
                "role": "assistant",
                "content": "Here is the summary of the sublease agreement: <summary>"
            }
        ],
        stop_sequences=["</summary>"]
    )

    return response.content[0].text

# Example usage
sublease_summary = guided_sublease_summary(text)
print(sublease_summary)

<parties involved>

- Sublessor: Cohen Brothers, LLC (d/b/a Cohen & Company) 
- Sublessee: Taberna Capital Management, LLC
- Original Lessor: Brandywine Cira, L.P. (Master Lease holder)
</parties involved>

<property details>

- Address: 2929 Arch Street, Suite 1703, Philadelphia, PA
- Description: 2,000 square feet of office space in Suite 1703 
- Permitted Use: General office use
- Includes: Access to file space, printers, copiers, kitchen, conference rooms, receptionist/secretarial services
</property details>

<term and rent>

- Start Date: April 1, 2006
- End Date: 5 years from commencement
- Monthly Rent: Escalating schedule starting at $5,750 in year 1 up to $6,223.33 in year 5
- Security Deposit: Not specified
</term and rent>

<responsibilities>

- Utilities: Sublessee pays proportional share of utilities and operating expenses
- Maintenance: Sublessor responsible for base building maintenance
- Repairs: Sublessee responsible for repairs due to its use
- Insurance: Sublessee must maintain general liability and property insurance
</responsibilities>

<consent and notices>

- Landlord's Consent: Required for assignment/subletting
- Notice Requirements: All notices must be in writing and delivered to specified addresses
- Sublessor's Consent: Required for alterations, improvements, signage
</consent and notices>

<special provisions>

- Furniture: Included in lease
- Parking: Not included
- Assignment: No assignment/subletting without Sublessor's consent
- Default Remedies: Specified remedies including termination and accelerated rent
</special provisions>

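Because we chose to output each section of the summary inside XML tags, we can now parse the sections individually like this (the same could be done with JSON or any other format):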

import re

def parse_sections_regex(text):
    pattern = r'<(.*?)>(.*?)</\1>'
    matches = re.findall(pattern, text, re.DOTALL)

    parsed_sections = {}
    for tag, content in matches:
        items = [item.strip('- ').strip() for item in content.strip().split('\n') if item.strip()]
        parsed_sections[tag] = items

    return parsed_sections


# Parse the sections
parsed_sections = parse_sections_regex(sublease_summary)

# Check if parsing was successful
if isinstance(parsed_sections, dict) and 'parties involved' in parsed_sections:
    print("Parties involved:")
    for item in parsed_sections['parties involved']:
        print(f"- {item}")
else:
    print("Error: Parsing failed or 'parties involved' section not found.")
    print("Parsed result:", parsed_sections)

Parties involved:

- Sublessor: Cohen Brothers, LLC (d/b/a Cohen & Company)
- Sublessee: Taberna Capital Management, LLC
- Original Lessor: Brandywine Cira, L.P. (Master Lease holder)
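
As noted above, the same information could be carried as JSON. Here is a minimal sketch, assuming each bullet has the form "Key: value" as in the output shown here. sections_to_records is a hypothetical helper, and bullets without a colon are collected under a "_notes" key rather than dropped.

```python
import json


def sections_to_records(parsed_sections: dict) -> dict:
    """Turn {"tag": ["Key: value", ...]} into {"tag": {"Key": "value", ...}}."""
    records = {}
    for tag, items in parsed_sections.items():
        fields = {}
        for item in items:
            # Split on the first colon only, so values may contain colons themselves.
            key, sep, value = item.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
            else:
                fields.setdefault("_notes", []).append(item)
        records[tag] = fields
    return records


# Sample input in the shape returned by parse_sections_regex.
example_sections = {
    "parties involved": [
        "Sublessor: Cohen Brothers, LLC (d/b/a Cohen & Company)",
        "Sublessee: Taberna Capital Management, LLC",
    ]
}
records = sections_to_records(example_sections)
print(json.dumps(records, indent=2))
```

Keyed records like these are easier to load into a DataFrame or compare across documents than lists of free-text bullets.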

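Including Context from Multiple Documents (Meta-Summarization)

What if we have many documents related to the same client? We can handle this with a chunking approach: break the documents into smaller, manageable chunks, process each chunk separately, and then combine the per-chunk summaries into a meta-summary of the whole. This is particularly useful when we want to summarize a large number of documents or a single very long document.

Here is an example of how to do this: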

from data.multiple_subleases import document1, document2, document3

def chunk_text(text, chunk_size=2000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_long_document(text, max_tokens=2000):

    chunks = chunk_text(text)

    # Iterate over chunks and summarize each one
    # We use guided_sublease_summary here, but you can use basic_summarize or any other summarization function
    # Note that we use Haiku for the interim summaries and 3.5 Sonnet for the final summary
    chunk_summaries = [guided_sublease_summary(chunk, model="claude-3-haiku-20240307", max_tokens=max_tokens) for chunk in chunks]

    final_summary_prompt = f"""

    You are looking at the chunked summaries of multiple documents that are all related. Combine the following summaries of the document from different truthful sources into a coherent overall summary:

    {"".join(chunk_summaries)}

    1. Parties involved (sublessor, sublessee, original lessor)
    2. Property details (address, description, permitted use)
    3. Term and rent (start date, end date, monthly rent, security deposit)
    4. Responsibilities (utilities, maintenance, repairs)
    5. Consent and notices (landlord's consent, notice requirements)
    6. Special provisions (furniture, parking, subletting restrictions)

    Provide the summary in bullet points nested within the XML header for each section. For example:

    <parties involved>

    - Sublessor: [Name]
    // Add more details as needed
    </parties involved>

    If any information is not explicitly stated in the document, note it as "Not specified".

    Summary:
    """

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_tokens,
        system="You are a legal expert that summarizes notes on one document.",
        messages=[
            {
                "role": "user",
                "content": final_summary_prompt
            },
            {
                "role": "assistant",
                "content": "Here is the summary of the legal document: <summary>"
            }
        ],
        stop_sequences=["</summary>"]
    )

    return response.content[0].text

# Example usage
# combine 3 documents (all related) together
text = document1 + document2 + document3
long_summary = summarize_long_document(text)
print(long_summary)

<parties involved>

- Sublessor: Apex Innovations, Inc. (Delaware corporation) later identified as TechHub Enterprises, LLC
- Sublessee: NanoSphere Solutions, Inc. and Quantum Dynamics, LLC (California LLC)
- Original Lessor: Innovate Properties, LLP
</parties involved>

<property details>

- Address: 9876 Innovation Park, Building C, San Francisco, CA 94107
- Description: Approximately 15,000-25,000 square feet of office and laboratory space
- Permitted Use: General office purposes, research and development, and laboratory uses consistent with BSL-2 facility requirements
</property details>

<term and rent>

- Start Date: September 1, 2023
- End Date: August 31, 2026 (with option to extend for 3-5 additional years)
- Monthly Rent: Starting at $75,000/month with annual 3% increases
- Security Deposit: $450,000-$787,500
</term and rent>

<responsibilities>

- Utilities: Sublessee responsible for all utilities and services, including electricity, gas, water, sewer, telephone, internet, and janitorial
- Maintenance: Sublessee responsible for interior maintenance, repairs and replacements, including walls, floors, ceilings, doors, windows, fixtures
- Repairs: Sublessee responsible for repairs except building structure, exterior walls, roof which are Sublessor's responsibility
</responsibilities>

<consent and notices>

- Landlord's Consent: Required for assignments, subletting, alterations
- Notice Requirements: 30 days written notice for defaults, insurance changes; 9-12 months notice for term extensions
</consent and notices>

<special provisions>

- Furniture: Right to install furniture/equipment 15-30 days before commencement
- Parking: Non-exclusive right to use common parking facilities
- Subletting Restrictions: No assignment/subletting without Sublessor's consent, except to affiliated entities
- Additional: Hazardous materials restrictions, OFAC compliance requirements, jury trial waiver
</special provisions>
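
chunk_text splits at fixed character offsets, so a clause can be cut in half at a chunk boundary. One possible variation, shown here only as a sketch, is to let consecutive chunks overlap so that text near a boundary appears intact in at least one chunk. The function name and the default overlap of 200 characters are assumptions chosen for illustration, not values from the notebook.

```python
def chunk_text_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 200) -> list:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


# Small demo: a 50-character string, 20-character chunks, 5 characters of overlap.
demo = "".join(str(i % 10) for i in range(50))
chunks = chunk_text_with_overlap(demo, chunk_size=20, overlap=5)
print([len(c) for c in chunks])
```

Overlap increases the number of chunks, and therefore the number of API calls, so it trades cost for continuity. Splitting on paragraph or section boundaries is another option worth trying.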

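Summary-Indexed Documents: An Advanced RAG Approach

Summary-indexed documents is an advanced retrieval-augmented generation (RAG) approach that operates at the document level.

Compared with traditional RAG techniques, it offers a number of advantages, especially in scenarios involving large documents or requiring precise information retrieval.

How It Works

Document summarization: generate a concise summary for each document in the corpus (take a subset of the text and summarize it quickly).
Context window optimization: make sure all the summaries fit within the language model's context window.
Relevance scoring: ask the model to rate the relevance of each summary to the query.
Reranking (optional): apply a reranking technique to further refine and compress the top-K results.
Answer the query at hand.

This approach has some clear advantages:

More efficient ranking of documents for retrieval, using less context than traditional RAG methods.
Stronger performance on specific tasks: it can outperform other RAG methods at ranking the correct document first.
Optimized information retrieval: reranking helps compress the results so that the model is shown the most concise, relevant information.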

class LegalSummaryIndexedDocuments:

    def __init__(self, client):
        self.client = client # Claude client
        self.documents: List[Dict[str, str]] = [] # List of docs to store
        self.summaries: List[str] = []

    def add_document(self, doc_id: str, content: str):
        # Adds a document to the index
        self.documents.append({"id": doc_id, "content": content})

    def generate_summaries(self):
        # Generates summaries for all documents in the index
        for doc in self.documents:
            summary = self._generate_legal_summary(doc["content"])
            self.summaries.append(summary)

    def _generate_legal_summary(self, content: str) -> str:

        # Note that we limit the content to its first 2000 characters. We don't need more than that for the initial ranking.
        prompt = f"""
        Summarize the following sublease agreement. Focus on these key aspects:

        1. Parties involved (sublessor, sublessee, original lessor)
        2. Property details (address, description, permitted use)
        3. Term and rent (start date, end date, monthly rent, security deposit)
        4. Responsibilities (utilities, maintenance, repairs)
        5. Consent and notices (landlord's consent, notice requirements)
        6. Special provisions (furniture, parking, subletting restrictions)

        Provide the summary in bullet points nested within the XML header for each section. For example:

        <parties involved>

        - Sublessor: [Name]
        // Add more details as needed
        </parties involved>

        If any information is not explicitly stated in the document, note it as "Not specified".

        Sublease agreement text:
        {content[:2000]}...

        Summary:
        """

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            temperature=0.2,
            messages=[{"role": "user", "content": prompt},
                      {"role": "assistant", "content": "Here is the summary of the legal document: <summary>"}],
            stop_sequences=["</summary>"]        
        )
        return response.content[0].text

    def rank_documents(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        """
        Rank documents based on their relevance to the given query.
        We use Haiku here as a cheaper, faster model for ranking. 
        """
        ranked_scores = []
        for summary in self.summaries:

            prompt=f"Legal document summary: {summary}\n\nLegal query: {query}\n\nRate the relevance of this legal document to the query on a scale of 0 to 10. Only output the numeric value:"

            response = self.client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=2,
                temperature=0,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )
            ranked_score = float(response.content[0].text)
            ranked_scores.append(ranked_score)

        ranked_indices = np.argsort(ranked_scores)[::-1][:top_k]
        return [(self.documents[i]["id"], ranked_scores[i]) for i in ranked_indices]

    def extract_relevant_clauses(self, doc_id: str, query: str) -> List[str]:
        """
        Extracts relevant clauses from a document based on a query.
        """
        doc_content = next(doc["content"] for doc in self.documents if doc["id"] == doc_id)

        prompt = f"""
        Given the following legal query and document content, extract the most relevant clauses or sections and write the answer to the query. 
        Provide each relevant clause or section separately, preserving the original legal language:

        Legal query: {query}

        Document content: {doc_content}...

        Relevant clauses or sections (separated by '---'):"""

        response = self.client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1000,
                temperature=0,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )

        clauses = re.split(r'\n\s*---\s*\n', response.content[0].text.strip())
        return [clause.strip() for clause in clauses if clause.strip()]

from data.multiple_subleases import document1, document2, document3

lsid = LegalSummaryIndexedDocuments(client=client)

# Add documents
lsid.add_document("doc1", document1)
lsid.add_document("doc2", document2)
lsid.add_document("doc3", document3)

# Generate summaries - this would happen at ingestion
lsid.generate_summaries()

# Rank documents for a legal query
legal_query = "What contract is for the sublessor Apex Innovations, LLC?"
ranked_results = lsid.rank_documents(legal_query)

print("Initial ranking:", ranked_results)

# Extract relevant clauses from the top-ranked document
top_doc_id = ranked_results[0][0]
relevant_clauses = lsid.extract_relevant_clauses(top_doc_id, legal_query)

print("\nRelevant clauses from the top-ranked document:")
for i, clause in enumerate(relevant_clauses[1:], 1):
    print(f"Clause {i}: {clause}")

Initial ranking: [('doc1', 8.0), ('doc3', 0.0), ('doc2', 0.0)]

Relevant clauses from the top-ranked document:
Clause 1: COMMERCIAL SUBLEASE AGREEMENT

THIS COMMERCIAL SUBLEASE AGREEMENT (hereinafter referred to as the "Sublease") is made and entered into on this 15th day of August, 2023 (the "Effective Date"), by and between:

SUBLESSOR: Apex Innovations, Inc., a Delaware corporation with its principal place of business at 1234 Tech Boulevard, Suite 5000, San Francisco, CA 94105 (hereinafter referred to as the "Sublessor")
Clause 2: WHEREAS, Sublessor is the Tenant under that certain Master Lease Agreement dated January 1, 2020 (hereinafter referred to as the "Master Lease"), wherein Innovate Properties, LLP (hereinafter referred to as the "Master Lessor") leased to Sublessor those certain premises consisting of approximately 50,000 square feet of office space located at 9876 Innovation Park, Building C, Floors 10-12, San Francisco, CA 94107 (hereinafter referred to as the "Master Premises");
Clause 3: Answer: There appears to be an error in the legal query. The query refers to "Apex Innovations, LLC" but the document shows that the sublessor is actually "Apex Innovations, Inc.", a Delaware corporation. The contract for Apex Innovations, Inc. is:

1. A Commercial Sublease Agreement dated August 15, 2023, where they are the Sublessor
2. A Master Lease Agreement dated January 1, 2020, where they are the Tenant under Innovate Properties, LLP
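
rank_documents converts the model's reply with float(...), which raises an exception if the reply contains anything besides a number. A defensive parser along these lines could be substituted. parse_relevance_score is a hypothetical helper, and treating an unparseable reply as the lowest score is a design assumption you may want to change.

```python
import re


def parse_relevance_score(raw: str, low: float = 0.0, high: float = 10.0) -> float:
    """Extract the first number from a model reply and clamp it to [low, high]."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return low  # treat an unparseable reply as "not relevant"
    return min(high, max(low, float(match.group())))


# Made-up replies covering clean, padded, prefixed, missing, and out-of-range values.
for reply in ["8", " 7.5\n", "Score: 9", "N/A", "12"]:
    print(repr(reply), "->", parse_relevance_score(reply))
```

Inside the class, this would replace the float(response.content[0].text) call, so that a single malformed reply no longer aborts the ranking of the whole corpus.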

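Best Practices for Summary-Indexed RAG

Optimal summary length: experiment with different summary lengths to find the right balance between conciseness and informativeness.
Iterative reranking: consider multiple rounds of reranking for more precise results, especially with large document sets.
Caching: implement caching for summaries and initial rankings to improve performance on repeated queries.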
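
For the caching suggestion, a minimal in-memory sketch might look like the following. SummaryCache and fake_summarize are hypothetical names; the fake function stands in for a call such as _generate_legal_summary so that the example runs without an API key. A production setup would more likely persist the cache to disk or a database.

```python
import hashlib


class SummaryCache:
    """In-memory cache keyed by a hash of the document content."""

    def __init__(self, summarize_fn):
        self.summarize_fn = summarize_fn  # any callable mapping str -> str
        self._store = {}

    def get(self, content: str) -> str:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only call the (expensive) summarizer on a cache miss.
            self._store[key] = self.summarize_fn(content)
        return self._store[key]


calls = []


def fake_summarize(content: str) -> str:
    # Stand-in for a real Claude call; records each invocation.
    calls.append(content)
    return content[:20]


cache = SummaryCache(fake_summarize)
first = cache.get("THIS SUBLEASE AGREEMENT is dated ...")
second = cache.get("THIS SUBLEASE AGREEMENT is dated ...")
print(len(calls))
```

Keying on a content hash rather than a document ID means an edited document is summarized again automatically, while unchanged documents are served from the cache.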

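Summary-indexed documents offer a powerful approach to RAG, and they perform especially well in scenarios involving large documents or requiring precise information retrieval. By combining document summaries, model-rated relevance scoring, and optional reranking, the method gives the language model an efficient way to retrieve and present relevant information.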

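Evaluation

As mentioned in the introduction to this guide, evaluating summary quality is hard. There are many ways to summarize a document, and different summaries can be equally valid. Depending on the use case, different aspects of a summary may matter more or less.

You can read more about our empirical approach to prompt engineering here. A Jupyter Notebook is a great way to start prompt engineering, but as your datasets and your number of prompts grow, it becomes important to use tooling that scales with you.

In this part of the guide, we explore Promptfoo, an open-source LLM evaluation toolkit. To get started, go to the ./evaluation directory and read ./evaluation/README.md.

Once you have run an evaluation successfully, come back here to look at the results. After some results have been generated, you can also browse them interactively with the command npx promptfoo@latest view.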

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

%matplotlib inline
plt.style.use('seaborn-v0_8')  # the plain 'seaborn' style name is deprecated since Matplotlib 3.6

# Load the data
df = pd.read_csv('data/results.csv')

# Function to extract PASS/FAIL and score
def extract_result(text):
    match = re.search(r'\[(PASS|FAIL)\]\s*\((\d+\.\d+)\)', str(text))
    if match:
        return match.group(1), float(match.group(2))
    return 'UNKNOWN', 0.0

# Apply the extraction to relevant columns
for col in df.columns[2:]:
    df[f'{col}_result'], df[f'{col}_score'] = zip(*df[col].apply(extract_result))

# Prepare data for grouped accuracy score
models = ['3.5 Sonnet', '3.0 Haiku']
prompts = ['basic_summarize', 'guided_legal_summary', 'summarize_long_document']

results = []
for model in models:
    for prompt in prompts:
        col = f'[{model}] prompts.py:{prompt}_result'
        if col in df.columns:
            pass_rate = (df[col] == 'PASS').mean()
            results.append({'Model': model, 'Prompt': prompt, 'Pass Rate': pass_rate})

result_df = pd.DataFrame(results)

# 1. Grouped bar chart for accuracy scores
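result_pivot = result_df.pivot(index='Prompt', columns='Model', values='Pass Rate')
# Pass figsize to DataFrame.plot, which creates its own figure; a separate plt.figure() call would leave an empty figure behind
result_pivot.plot(kind='bar', figsize=(12, 6))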
plt.title('Pass Rate by Model and Prompt')
plt.ylabel('Pass Rate')
plt.legend(title='Model')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# 2. Box plot of scores
plt.figure(figsize=(8, 8))
score_cols = [col for col in df.columns if col.endswith('_score')]
score_data = df[score_cols].melt()
sns.boxplot(x='variable', y='value', data=score_data)
plt.title('Distribution of Scores')
plt.xticks(rotation=90)
plt.xlabel('Model and Prompt')
plt.ylabel('Score')
plt.tight_layout()
plt.show()

# Display summary statistics
summary_stats = df[[col for col in df.columns if col.endswith('_score')]].describe()
display(summary_stats)

[Figures: bar chart "Pass Rate by Model and Prompt" and box plot "Distribution of Scores"]

	[3.0 Haiku] prompts.py:basic_summarize_score	[3.0 Haiku] prompts.py:guided_legal_summary_score	[3.0 Haiku] prompts.py:summarize_long_document_score	[3.5 Sonnet] prompts.py:basic_summarize_score	[3.5 Sonnet] prompts.py:guided_legal_summary_score	[3.5 Sonnet] prompts.py:summarize_long_document_score
count	9.000000	9.000000	9.000000	9.000000	9.000000	9.000000
mean	1.423333	1.443333	1.522222	1.088889	1.330000	1.475556
std	0.135647	0.146969	0.092030	0.535547	0.086458	0.285750
min	1.190000	1.270000	1.440000	0.000000	1.230000	0.750000
25%	1.400000	1.300000	1.460000	1.210000	1.280000	1.450000
50%	1.440000	1.450000	1.460000	1.290000	1.310000	1.600000
75%	1.510000	1.490000	1.630000	1.400000	1.340000	1.640000
max	1.600000	1.660000	1.660000	1.490000	1.480000	1.650000

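From these results, our best performer appears to be 3.5 Sonnet, with a pass rate of 66% across all evaluations and only 3 failed tests (a test counts as failed as soon as any one of its checks fails). And this is only the beginning: the sample data we used was either (a) generated by Claude or (b) taken from the SEC website. With real data we can do better, because we will know more about the specific set of problems we are working with.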
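
To see what the extract_result step in the analysis cell is doing, here it is applied to a few made-up cell values. The "[PASS] (1.45)" format is inferred from the regular expression in that cell; the sample strings are assumptions, not rows from data/results.csv.

```python
import re


def extract_result(text):
    """Parse a cell such as '[PASS] (1.45) ...' into (status, score)."""
    match = re.search(r"\[(PASS|FAIL)\]\s*\((\d+\.\d+)\)", str(text))
    if match:
        return match.group(1), float(match.group(2))
    return "UNKNOWN", 0.0


# Hypothetical cell values; the real strings come from the Promptfoo CSV export.
cells = ["[PASS] (1.45) summary text", "[FAIL] (0.75) summary text", "no marker here"]
parsed = [extract_result(c) for c in cells]

# As in the analysis cell, anything that is not PASS counts against the pass rate.
pass_rate = sum(status == "PASS" for status, _ in parsed) / len(parsed)
print(parsed, pass_rate)
```

Note that cells without a recognizable marker fall back to ("UNKNOWN", 0.0), so malformed rows lower both the pass rate and the mean score. It is worth counting the UNKNOWN rows before reading too much into the summary statistics.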

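Iterative Improvement

Looking more closely at the evaluation results, there is still room for improvement. This is where the iterative part of prompt engineering comes in. Here are some steps we could take to improve our results:

Analyze the Promptfoo results to identify strengths and weaknesses. For example, our contains evaluation seems to fail often. This may be because some documents do not include the information the XML tags call for. If we want to measure performance accurately, this evaluation should be improved (though it is only an example).
Refine the prompts to address specific problems (for example, improving conciseness or completeness). We saw that multi-shot was a very good first attempt; it is something we should combine with some of the advanced techniques to push performance further.
Experiment with different chunking strategies for long documents.
Fine-tune the temperature and max_tokens parameters.
Implement post-processing steps to improve summary quality.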
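
As one illustration of a post-processing step, the sketch below normalizes bullet markers and drops blank or repeated lines from a summary. It is deliberately simple, and postprocess_summary is a hypothetical helper; whether deduplication is appropriate depends on your output format.

```python
def postprocess_summary(summary: str) -> str:
    """Normalize bullet markers and drop blank or repeated lines."""
    seen = set()
    cleaned = []
    for line in summary.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line[0] in "•*-":
            # Rewrite any bullet marker as "- " followed by the bullet text.
            line = "- " + line[1:].strip()
        key = line.lower()
        if key in seen:
            continue  # skip exact repeats (case-insensitive)
        seen.add(key)
        cleaned.append(line)
    return "\n".join(cleaned)


raw = "•Term: 5 years\n\n- Term: 5 years\n* Use: General office use only.\n<term and rent>"
cleaned_summary = postprocess_summary(raw)
print(cleaned_summary)
```

Lines that are not bullets, such as XML section tags, pass through unchanged, so the output can still be parsed with parse_sections_regex.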

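Conclusion and Best Practices

In this guide, we covered a range of techniques for summarizing documents with Claude, with a focus on legal documents. Building a good summarization system and a framework for evaluating it is something of an art: it takes a combination of these methods to succeed. As we said at the start, summarization is a highly subjective topic, but we have made a solid attempt at finding workable evaluation methods and are happy with our results. Always remember that you are not benchmarking your results against 100% accuracy. You are benchmarking against your own ability to perform this complex task; with the speed and efficiency of Claude shown in this guide, you can start to appreciate the benefits of this methodology and spend your time on actual decision-making.

To wrap up, here are some best practices for reference:

Craft clear, specific prompts. Use phrases such as "Do not preamble" to constrain the output.
Use at least 2 examples.
Use guided summarization for domain-specific documents.
Implement effective advanced strategies for long documents.
Evaluate and refine your approach regularly.
Consider the ethical implications and limitations of AI-generated summaries.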