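Prompt caching through the Anthropic API

Prompt caching allows you to store and reuse context within your prompt. This makes it more practical to include additional information in your prompt, such as detailed instructions and example responses, which help improve every response Claude generates.

In addition, by fully leveraging prompt caching within your prompt, you can reduce latency by more than 2x and costs by up to 90%. This can yield significant savings when building solutions that involve repetitive tasks around detailed book content.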
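To make the cost figure concrete, here is a rough sketch of how repeated calls over the same prefix add up. It assumes cache writes are billed at 1.25x the base input price and cache reads at 0.1x, and that every call lands while the cache entry is still alive; those multipliers are assumptions to check against current pricing, and `relative_input_cost` is a hypothetical helper, not part of the SDK.

```python
# Assumed billing multipliers relative to the base input-token price.
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: first call writes the cache
CACHE_READ_MULTIPLIER = 0.10   # assumption: later calls read from the cache


def relative_input_cost(num_calls, cached=True):
    """Input cost of sending the same prefix num_calls times,
    measured in units of one uncached pass over that prefix."""
    if num_calls <= 0:
        return 0.0
    if not cached:
        return float(num_calls)
    # One cache write, then (num_calls - 1) cache reads.
    return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (num_calls - 1)


for n in (1, 2, 10, 100):
    with_cache = relative_input_cost(n)
    without_cache = relative_input_cost(n, cached=False)
    saved = 1 - with_cache / without_cache
    print(f"{n:>3} calls: {with_cache:.2f} vs {without_cache:.2f} ({saved:.1%} saved)")
```

Under these assumptions a single call costs slightly more with caching than without, and the saving approaches 90% only as the number of reuses grows.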

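In this guide, we'll demonstrate how to use prompt caching in a single turn and across a multi-turn conversation.

Setup

First, let's set up our environment with the necessary imports and initializations: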

%pip install anthropic bs4 --quiet

Note: you may need to restart the kernel to use updated packages.

import anthropic
import time
import requests
from bs4 import BeautifulSoup

client = anthropic.Anthropic()
MODEL_NAME = "claude-3-5-sonnet-20241022"

Now let's fetch some text content to use in our examples. We'll use the text of Jane Austen's Pride and Prejudice, which is around 187,000 tokens long.
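If you want to check the token count yourself once the text has been fetched, a minimal sketch follows. It assumes your installed SDK version exposes `client.messages.count_tokens` and that the result carries an `input_tokens` field; `count_prompt_tokens` is a hypothetical helper name.

```python
def count_prompt_tokens(client, model, text):
    """Return the input-token count the API reports for `text`
    when it is sent as a single user message."""
    # Assumption: the SDK in use provides messages.count_tokens.
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens
```

With the objects defined in this guide, the call would look like `count_prompt_tokens(client, MODEL_NAME, book_content)`.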

def fetch_article_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Get the text
    text = soup.get_text()

    # Split into lines and strip leading and trailing whitespace from each
    lines = (line.strip() for line in text.splitlines())
    # Break lines that contain double-space-separated phrases into one chunk each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop blank chunks
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text

# Fetch the content of the book
book_url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
book_content = fetch_article_content(book_url)

print(f"Fetched {len(book_content)} characters.")
print("First 500 characters:")
print(book_content[:500])

Fetched 737525 characters.
First 500 characters:
The Project Gutenberg eBook of Pride and Prejudice
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.
Title:
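The clean-up steps inside fetch_article_content can be tried in isolation. The sketch below repeats the same three steps (strip each line, split on double spaces, drop blanks) on a made-up sample string; `normalize_text` and the sample are illustrative only. Splitting on double spaces is presumably why "Title:" ends up on a line of its own in the output above.

```python
def normalize_text(text):
    """Apply the same whitespace clean-up used after soup.get_text()."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line wherever two consecutive spaces occur.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only non-empty chunks, one per output line.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample input, not taken from the real file.
sample = "  Title:  Pride and Prejudice  \n\n   Author: Jane Austen\n"
print(normalize_text(sample))
```

This prints three lines: "Title:", "Pride and Prejudice", and "Author: Jane Austen".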

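Example 1: Single turn

Let's demonstrate prompt caching with a large document, comparing the performance and cost of cached and non-cached API calls.

Part 1: Non-cached API call

First, let's make an API call against a cold cache. Nothing can be read from the cache yet, so the full prompt is processed. Because the book block already carries cache_control, this call also writes the prompt to the cache, so that our subsequent cached API call can benefit from it.

We will ask for a short output string to keep the output response time low, since the benefit of prompt caching applies only to the input processing time.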

def make_non_cached_api_call():
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<book>" + book_content + "</book>",
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": "What is the title of this book? Only output the title."
                }
            ]
        }
    ]

    start_time = time.time()
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=messages,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
    )
    end_time = time.time()

    return response, end_time - start_time

non_cached_response, non_cached_time = make_non_cached_api_call()

print(f"Non-cached API call time: {non_cached_time:.2f} seconds")
print(f"Non-cached API call input tokens: {non_cached_response.usage.input_tokens}")
print(f"Non-cached API call output tokens: {non_cached_response.usage.output_tokens}")

print("\nSummary (non-cached):")
print(non_cached_response.content)

Non-cached API call time: 20.37 seconds
Non-cached API call input tokens: 17
Non-cached API call output tokens: 8

Summary (non-cached):
[TextBlock(text='Pride and Prejudice', type='text')]

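Part 2: Cached API call

Now, let's make the cached API call. The request again sets the "cache_control": {"type": "ephemeral"} attribute on the book content block and sends the "prompt-caching-2024-07-31" beta header. Because the previous call already wrote that prefix to the cache, this call can read it back instead of processing it again.

To keep the output latency constant, we will ask Claude the same question as before. Note that this question is not part of the cached content.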

def make_cached_api_call():
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<book>" + book_content + "</book>",
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": "What is the title of this book? Only output the title."
                }
            ]
        }
    ]

    start_time = time.time()
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=messages,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
    )
    end_time = time.time()

    return response, end_time - start_time

cached_response, cached_time = make_cached_api_call()

print(f"Cached API call time: {cached_time:.2f} seconds")
print(f"Cached API call input tokens: {cached_response.usage.input_tokens}")
print(f"Cached API call output tokens: {cached_response.usage.output_tokens}")

print("\nSummary (cached):")
print(cached_response.content)

Cached API call time: 2.92 seconds
Cached API call input tokens: 17
Cached API call output tokens: 8

Summary (cached):
[TextBlock(text='Pride and Prejudice', type='text')]

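As you can see, the cached API call took only 2.92 seconds in total, compared to 20.37 seconds for the non-cached API call. This is a significant improvement in overall latency due to caching.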
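Timing alone does not prove a cache hit. A more direct check is to look at the cache fields on the response's usage object, the same `cache_read_input_tokens` and `cache_creation_input_tokens` fields read in Example 2 of this guide. The sketch below classifies a usage object; `cache_status` is a hypothetical helper, and the stand-in objects carry made-up numbers so the snippet runs without an API call.

```python
from types import SimpleNamespace


def cache_status(usage):
    """Classify a usage object as a cache 'read', 'write', or 'none'.
    A read takes precedence when both counters are non-zero."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "read"
    if written > 0:
        return "write"
    return "none"


# Stand-in usage objects with hypothetical values, not real API output.
cold_call = SimpleNamespace(input_tokens=17, cache_creation_input_tokens=1000, cache_read_input_tokens=0)
warm_call = SimpleNamespace(input_tokens=17, cache_creation_input_tokens=0, cache_read_input_tokens=1000)

print(cache_status(cold_call))  # write
print(cache_status(warm_call))  # read
```

With real responses you would pass `non_cached_response.usage` and `cached_response.usage`. It also explains why both calls report only 17 input tokens: `input_tokens` appears to count just the part of the prompt after the cache breakpoint.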

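Example 2: Multi-turn conversation with incremental caching

Now, let's look at a multi-turn conversation where we add cache breakpoints as the conversation progresses.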

class ConversationHistory:
    def __init__(self):
        # Initialize an empty list to store conversation turns
        self.turns = []

    def add_turn_assistant(self, content):
        # Append an assistant turn to the conversation history
        self.turns.append({
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": content
                }
            ]
        })

    def add_turn_user(self, content):
        # Append a user turn to the conversation history
        self.turns.append({
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": content
                }
            ]
        })

    def get_turns(self):
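        # Retrieve the conversation turns in the format expected by the API
        result = []
        user_turns_processed = 0
        # Iterate through the turns in reverse order
        for turn in reversed(self.turns):
            if turn["role"] == "user" and user_turns_processed < 1:
                # Append the most recent user turn with ephemeral cache control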
                result.append({
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": turn["content"][0]["text"],
                            "cache_control": {"type": "ephemeral"}
                        }
                    ]
                })
                user_turns_processed += 1
            else:
                # Append all other turns as-is
                result.append(turn)
        # Return the turns in their original order
        return list(reversed(result))
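The effect of get_turns is easier to see on a tiny history. The sketch below reproduces the same idea as a standalone function: only the most recent user turn receives a cache breakpoint, and the stored history is left unmodified. `mark_last_user_turn` is a hypothetical name, and it is an illustration of the technique rather than a drop-in replacement for the class above.

```python
import copy


def mark_last_user_turn(turns):
    """Return a copy of `turns` with cache_control set on the
    most recent user turn only."""
    marked = copy.deepcopy(turns)
    for turn in reversed(marked):
        if turn["role"] == "user":
            # Put the breakpoint on the last content block of that turn.
            turn["content"][-1]["cache_control"] = {"type": "ephemeral"}
            break
    return marked


# Made-up three-turn history.
turns = [
    {"role": "user", "content": [{"type": "text", "text": "Q1"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "A1"}]},
    {"role": "user", "content": [{"type": "text", "text": "Q2"}]},
]

for turn in mark_last_user_turn(turns):
    print(turn["role"], "cache_control" in turn["content"][-1])
```

Only the final user turn reports True. Because the breakpoint moves forward on every request, each new request can reuse everything up to the previous turn's breakpoint.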

# Initialize the conversation history
conversation_history = ConversationHistory()

# System message containing the book content
# Note: 'book_content' is defined in the setup section above
system_message = f"<file_contents> {book_content} </file_contents>"

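# Predefined questions for our simulation
questions = [
    "What is the title of this novel?",
    "Who are Mr. and Mrs. Bennet?",
    "What is Netherfield Park?",
    "What is the main theme of this novel?"
]

def simulate_conversation():
    for i, question in enumerate(questions, 1):
        print(f"\nTurn {i}:")
        print(f"User: {question}")

        # Add the user input to the conversation history
        conversation_history.add_turn_user(question)

        # Record the start time for performance measurement
        start_time = time.time()

        # Make an API call to the assistant
        response = client.messages.create(
            model=MODEL_NAME,
            extra_headers={
                "anthropic-beta": "prompt-caching-2024-07-31"
            },
            max_tokens=300,
            system=[
                {"type": "text", "text": system_message, "cache_control": {"type": "ephemeral"}},
            ],
            messages=conversation_history.get_turns(),
        )

        # Record the end time
        end_time = time.time()

        # Extract the assistant's reply
        assistant_reply = response.content[0].text
        print(f"Assistant: {assistant_reply}")

        # Print token usage information
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
        input_tokens_cache_read = getattr(response.usage, 'cache_read_input_tokens', '---')
        input_tokens_cache_create = getattr(response.usage, 'cache_creation_input_tokens', '---')
        print(f"User input tokens: {input_tokens}")
        print(f"Output tokens: {output_tokens}")
        print(f"Input tokens (cache read): {input_tokens_cache_read}")
        print(f"Input tokens (cache write): {input_tokens_cache_create}")

        # Calculate the elapsed time
        elapsed_time = end_time - start_time

        # Calculate the percentage of the input prompt that was read from cache
        # (cache-write tokens are not included in this total)
        total_input_tokens = input_tokens + (int(input_tokens_cache_read) if input_tokens_cache_read != '---' else 0)
        percentage_cached = (int(input_tokens_cache_read) / total_input_tokens * 100 if input_tokens_cache_read != '---' and total_input_tokens > 0 else 0)

        print(f"{percentage_cached:.1f}% of input prompt cached ({total_input_tokens} tokens)")
        print(f"Time taken: {elapsed_time:.2f} seconds")

        # Add the assistant's reply to the conversation history
        conversation_history.add_turn_assistant(assistant_reply)

# Run the simulated conversation
simulate_conversation()

Turn 1:
User: What is the title of this novel?
Assistant: The title of this novel is "Pride and Prejudice" by Jane Austen.
User input tokens: 4
Output tokens: 22
Input tokens (cache read): 0
Input tokens (cache write): 187354
0.0% of input prompt cached (4 tokens)
Time taken: 20.37 seconds

Turn 2:
User: Who are Mr. and Mrs. Bennet?
Assistant: Mr. and Mrs. Bennet are the parents of five daughters (Jane, Elizabeth, Mary, Kitty, and Lydia) in "Pride and Prejudice."

Mr. Bennet is an intelligent but detached father who often retreats to his library to avoid his wife's dramatic behavior. He has a sarcastic sense of humor and is often amused by the follies of others, including his own family. He is especially fond of his second daughter, Elizabeth, who shares his sharp mind and wit.

Mrs. Bennet is a woman focused mainly on marrying her five daughters to wealthy men. She is described as having "poor nerves" and is often anxious, dramatic, and somewhat foolish. Her main goal in life is to see her daughters married well, particularly because the family estate is entailed to a male heir (Mr. Collins), which means her daughters will have little financial security after Mr. Bennet's death. She is often portrayed as lacking maturity and good judgment, which sometimes embarrasses her more sensible daughters, especially Elizabeth.

Their marriage is portrayed as a mismatch: Mr. Bennet married his wife in his youth for her beauty, only to find that they were incompatible in intellect and character. It serves as a cautionary tale about marrying without regard to compatibility of character.
User input tokens: 4
Output tokens: 297
Input tokens (cache read): 187354
Input tokens (cache write): 36
100.0% of input prompt cached (187358 tokens)
Time taken: 7.53 seconds

Turn 3:
User: What is Netherfield Park?
Assistant: Netherfield Park is an estate near Longbourn, the Bennet family home, in the novel. It becomes important to the plot when Mr. Bingley, a wealthy young man, moves to the neighborhood and rents it.

Mr. Bingley's arrival sets much of the novel's plot in motion, as he quickly develops a romantic interest in Jane Bennet, the eldest Bennet daughter. It is also through Netherfield that Elizabeth Bennet first meets Mr. Darcy, who is Mr. Bingley's friend and is also staying there.

Netherfield is an important setting for several key scenes in the novel, including:

- The ball at which Mr. Darcy first slights Elizabeth
- Jane's illness and subsequent stay at Netherfield (with Elizabeth coming to care for her)
- Various social interactions between the Bennets and the Bingley-Darcy party

Netherfield symbolizes wealth and social status in the novel, and Mr. Bingley's tenancy represents the possibility for the Bennet family to gain social and financial standing through marriage. When Mr. Bingley suddenly leaves Netherfield, it causes great disappointment and disruption to the Bennet family's hopes, particularly for Jane.
User input tokens: 4
Output tokens: 289
Input tokens (cache read): 187390
Input tokens (cache write): 308
100.0% of input prompt cached (187394 tokens)
Time taken: 6.76 seconds

Turn 4:
User: What is the main theme of this novel?
Assistant: The main theme of "Pride and Prejudice" is the interplay of pride and prejudice in human relationships, shown particularly through the central romance between Elizabeth Bennet and Mr. Darcy. However, there are several other important related themes:

1. Pride and Prejudice:
- Darcy's pride in his social standing initially makes him appear arrogant and dismissive
- Elizabeth's prejudice against Darcy, based on first impressions and Wickham's false account
- Both protagonists must overcome these flaws to find happiness

2. Marriage and Social Class:
- The pressure on young women to marry for financial security
- The conflict between marrying for love and marrying for social position
- The novel portrays different types of marriages (Elizabeth/Darcy, Jane/Bingley, Lydia/Wickham, Charlotte/Collins)

3. Reputation and Social Expectations:
- The importance of reputation in Regency society
- How behavior affects family honor
- The constraints placed on women in this period

4. Personal Growth and Self-Knowledge:
- Both Elizabeth and Darcy learn to recognize their own faults
- The importance of overcoming first impressions
- Character development through experience and reflection

5. Family and Society:
- The role of family connections in determining social standing
- The effect of family behavior on individual prospects
- Balance
User input tokens: 4
Output tokens: 300
Input tokens (cache read): 187698
Input tokens (cache write): 301
100.0% of input prompt cached (187702 tokens)
Time taken: 7.13 seconds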

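As you can see in this example, after the initial cache setup the response time dropped from about 20 seconds to roughly 7 seconds per turn, while maintaining the same level of quality in the answers. Most of the remaining latency is due to the time it takes to generate the response, which is not affected by prompt caching.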
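As a quick arithmetic check on those timings, the sketch below averages the three warm turns and compares them with the cold first turn. The numbers are copied from the run shown above; a different run will give different values.

```python
from statistics import mean

# Timings in seconds, copied from the four turns printed above.
cold_turn_seconds = 20.37
warm_turn_seconds = [7.53, 6.76, 7.13]

warm_mean = mean(warm_turn_seconds)
speedup = cold_turn_seconds / warm_mean

print(f"Mean warm turn: {warm_mean:.2f} s")
print(f"Speedup over the cold turn: {speedup:.1f}x")
```

Note that the warm turns also generate far more output tokens than the first turn (about 290 to 300 versus 22), so the speedup on input processing alone is likely larger than this end-to-end ratio suggests.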

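And since nearly 100% of the input tokens were cached in subsequent turns as we kept moving the cache breakpoint forward, we were able to read the next user message almost instantly.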
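The percentage printed by simulate_conversation leaves cache-write tokens out of the total. A slightly fuller accounting, sketched below with the token counts from the run above, includes them. `cached_fraction` is a hypothetical helper. The data also shows the incremental pattern: each turn's cache read equals the previous turn's cache read plus its cache write (187354 + 36 = 187390, and 187390 + 308 = 187698).

```python
def cached_fraction(input_tokens, cache_read, cache_write):
    """Share of all input tokens in a request that were read from cache."""
    total = input_tokens + cache_read + cache_write
    return cache_read / total if total else 0.0


# (input_tokens, cache_read, cache_write) for turns 1 to 4, copied from the output above.
turn_usage = [
    (4, 0, 187354),
    (4, 187354, 36),
    (4, 187390, 308),
    (4, 187698, 301),
]

for i, (inp, read, write) in enumerate(turn_usage, 1):
    print(f"Turn {i}: {cached_fraction(inp, read, write):.2%} of input read from cache")

# Each cache read equals the previous turn's read plus the previous turn's write.
for prev, cur in zip(turn_usage, turn_usage[1:]):
    assert cur[1] == prev[1] + prev[2]
```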