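Extracting Structured JSON with Claude and Tool Use
In this guide, we'll work through several examples of using Claude and tool use to extract structured JSON data from different kinds of input. We'll define custom tools that prompt Claude to generate well-structured JSON output for tasks such as summarization, entity extraction, and sentiment analysis.
If you want to get structured JSON data without using tools, see our "How to enable JSON mode" guide.
Set up the environment
First, let's install the required libraries and set up the Anthropic API client.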
%pip install anthropic requests beautifulsoup4
from anthropic import Anthropic
import requests
from bs4 import BeautifulSoup
import json
client = Anthropic()
MODEL_NAME = "claude-3-haiku-20240307"
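Example 1: Article summarization
In this example, we'll use Claude to generate a JSON summary of an article, with fields for the author, topics, summary, coherence score, and persuasion score.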
tools = [
    {
        "name": "print_summary",
        "description": "Prints a summary of the article.",
        "input_schema": {
            "type": "object",
            "properties": {
                "author": {"type": "string", "description": "Name of the article author"},
                "topics": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": 'Array of topics, e.g. ["tech", "politics"]. Should be as specific as possible, and can overlap.'
                },
                "summary": {"type": "string", "description": "Summary of the article. One or two paragraphs max."},
                "coherence": {"type": "integer", "description": "Coherence of the article's key points, 0-100 (inclusive)"},
                "persuasion": {"type": "number", "description": "Persuasion score of the article, 0.0-1.0 (inclusive)"}
            },
            # Every required key must also be defined under "properties".
            "required": ["author", "topics", "summary", "coherence", "persuasion"]
        }
    }
]
url = "https://www.anthropic.com/news/third-party-testing"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
article = " ".join([p.text for p in soup.find_all("p")])
query = f"""
<article>
{article}
</article>
Use the `print_summary` tool.
"""
response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)
# Find the tool_use block for print_summary; its input is the structured JSON.
json_summary = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_summary":
        json_summary = content.input
        break
if json_summary:
    print("JSON Summary:")
    print(json.dumps(json_summary, indent=2))
else:
    print("No JSON summary found in the response.")
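JSON Summary:
{
  "author": "Anthropic",
  "topics": [
    "AI policy",
    "AI safety",
    "third-party testing"
  ],
  "summary": "The article argues that the AI sector needs effective third-party testing of frontier AI systems to avoid societal harm, whether deliberate or accidental. It discusses what third-party testing looks like, why it is needed, and the research Anthropic has done to arrive at this policy position. The article states that such a testing regime is necessary because frontier AI systems like large-scale generative models don't fit neatly into use-case and sector-specific frameworks, and may pose risks of serious misuse or AI-caused accidents. Though Anthropic and other organizations have implemented self-governance systems, the article argues that industry-wide third-party testing is ultimately required for broad trust. It outlines key components of an effective third-party testing regime, including identifying national security risks, and discusses how it could be accomplished by a diverse ecosystem of organizations. Anthropic plans to advocate for increased funding and public sector infrastructure for AI testing and evaluation, as well as developing tests for specific capabilities.",
  "coherence": 90,
  "persuasion": 0.8
}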
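The ranges for coherence and persuasion are stated only in the schema's description strings, and a JSON Schema description is an annotation rather than a validation rule. If your application depends on those ranges, you may want to check them after extraction. The following is a minimal sketch under that assumption; the helper name check_summary_ranges is hypothetical and not part of the Anthropic SDK.

```python
def check_summary_ranges(summary: dict) -> list:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # coherence is described as an integer in 0-100 (inclusive).
    coherence = summary.get("coherence")
    if (
        not isinstance(coherence, int)
        or isinstance(coherence, bool)  # bool is a subclass of int; reject it
        or not 0 <= coherence <= 100
    ):
        problems.append(f"coherence out of range or wrong type: {coherence!r}")

    # persuasion is described as a number in 0.0-1.0 (inclusive).
    persuasion = summary.get("persuasion")
    if (
        not isinstance(persuasion, (int, float))
        or isinstance(persuasion, bool)
        or not 0.0 <= persuasion <= 1.0
    ):
        problems.append(f"persuasion out of range or wrong type: {persuasion!r}")

    return problems


# Values taken from the sample output above.
print(check_summary_ranges({"coherence": 90, "persuasion": 0.8}))
```

An empty list means the values fall inside the documented ranges; anything else can be logged or used to trigger a retry.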
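Example 2: Named entity recognition
In this example, we'll use Claude to perform named entity recognition on a given text and return the entities in a structured JSON format.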
tools = [
    {
        "name": "print_entities",
        "description": "Prints the extracted named entities.",
        "input_schema": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "The extracted entity name."},
                            "type": {"type": "string", "description": "The entity type (e.g., PERSON, ORGANIZATION, LOCATION)."},
                            "context": {"type": "string", "description": "The context in which the entity appears in the text."}
                        },
                        "required": ["name", "type", "context"]
                    }
                }
            },
            "required": ["entities"]
        }
    }
]
text = "John works at Google in New York. He met with Sarah, the CEO of Acme Inc., last week in San Francisco."
query = f"""
<document>
{text}
</document>
Use the print_entities tool.
"""
response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)
# Find the tool_use block for print_entities; its input holds the entity list.
json_entities = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_entities":
        json_entities = content.input
        break
if json_entities:
    print("Extracted Entities (JSON):")
    print(json_entities)
else:
    print("No entities found in the response.")
Extracted Entities (JSON):
{'entities': [{'name': 'John', 'type': 'PERSON', 'context': 'John works at Google in New York.'}, {'name': 'Google', 'type': 'ORGANIZATION', 'context': 'John works at Google in New York.'}, {'name': 'New York', 'type': 'LOCATION', 'context': 'John works at Google in New York.'}, {'name': 'Sarah', 'type': 'PERSON', 'context': 'He met with Sarah, the CEO of Acme Inc., last week in San Francisco.'}, {'name': 'Acme Inc.', 'type': 'ORGANIZATION', 'context': 'He met with Sarah, the CEO of Acme Inc., last week in San Francisco.'}, {'name': 'San Francisco', 'type': 'LOCATION', 'context': 'He met with Sarah, the CEO of Acme Inc., last week in San Francisco.'}]}
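Because the tool input is already a plain Python dict, downstream processing needs no extra parsing. As an illustration, the sketch below groups entity names by their type. The helper name group_entities_by_type is hypothetical, and the sample data is a shortened copy of the output above.

```python
from collections import defaultdict


def group_entities_by_type(result: dict) -> dict:
    """Map each entity type to the list of entity names of that type."""
    grouped = defaultdict(list)
    for entity in result.get("entities", []):
        grouped[entity["type"]].append(entity["name"])
    return dict(grouped)


# Shortened copy of the sample output above.
sample = {
    "entities": [
        {"name": "John", "type": "PERSON", "context": "John works at Google in New York."},
        {"name": "Google", "type": "ORGANIZATION", "context": "John works at Google in New York."},
        {"name": "Sarah", "type": "PERSON", "context": "He met with Sarah, the CEO of Acme Inc., last week in San Francisco."},
    ]
}

print(group_entities_by_type(sample))
```

Note that the schema leaves "type" as a free-form string, so the set of keys in the grouped result depends on what the model returns.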
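Example 3: Sentiment analysis
In this example, we'll use Claude to perform sentiment analysis on a given text and return the sentiment scores in a structured JSON format.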
tools = [
    {
        "name": "print_sentiment_scores",
        "description": "Prints the sentiment scores of a given text.",
        "input_schema": {
            "type": "object",
            "properties": {
                "positive_score": {"type": "number", "description": "The positive sentiment score, ranging from 0.0 to 1.0."},
                "negative_score": {"type": "number", "description": "The negative sentiment score, ranging from 0.0 to 1.0."},
                "neutral_score": {"type": "number", "description": "The neutral sentiment score, ranging from 0.0 to 1.0."}
            },
            "required": ["positive_score", "negative_score", "neutral_score"]
        }
    }
]
text = "The product was okay, but the customer service was terrible. I probably won't buy from them again."
query = f"""
<text>
{text}
</text>
Use the print_sentiment_scores tool.
"""
response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)
# Find the tool_use block for print_sentiment_scores; its input holds the scores.
json_sentiment = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_sentiment_scores":
        json_sentiment = content.input
        break
if json_sentiment:
    print("Sentiment Analysis (JSON):")
    print(json.dumps(json_sentiment, indent=2))
else:
    print("No sentiment analysis found in the response.")
Sentiment Analysis (JSON):
{
  "negative_score": 0.6,
  "neutral_score": 0.3,
  "positive_score": 0.1
}
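Applications often need a single label rather than three scores. One simple way to derive it, sketched below, is to pick the key with the highest score. The helper name dominant_sentiment is hypothetical, and picking the maximum is an assumption about how you want to interpret the scores, not something the tool schema prescribes.

```python
def dominant_sentiment(scores: dict) -> str:
    """Return the label ("positive", "negative" or "neutral") with the highest score."""
    # max() returns the first key encountered when several scores tie.
    top_key = max(scores, key=scores.get)
    return top_key.replace("_score", "")


# Values taken from the sample output above.
sample_scores = {"negative_score": 0.6, "neutral_score": 0.3, "positive_score": 0.1}
print(dominant_sentiment(sample_scores))
```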
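Example 4: Text classification
In this example, we'll use Claude to classify a given text into predefined categories and return the classification results in a structured JSON format.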
tools = [
    {
        "name": "print_classification",
        "description": "Prints the classification results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "categories": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "The category name."},
                            "score": {"type": "number", "description": "The classification score for the category, ranging from 0.0 to 1.0."}
                        },
                        "required": ["name", "score"]
                    }
                }
            },
            "required": ["categories"]
        }
    }
]
text = "The new quantum computing breakthrough could revolutionize the tech industry."
query = f"""
<document>
{text}
</document>
Use the print_classification tool. The categories can be Politics, Sports, Technology, Entertainment, Business.
"""
response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)
# Find the tool_use block for print_classification; its input holds the categories.
json_classification = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_classification":
        json_classification = content.input
        break
if json_classification:
    print("Text Classification (JSON):")
    print(json.dumps(json_classification, indent=2))
else:
    print("No text classification found in the response.")
Text Classification (JSON):
{
  "categories": [
    {
      "name": "Politics",
      "score": 0.1
    },
    {
      "name": "Sports",
      "score": 0.1
    },
    {
      "name": "Technology",
      "score": 0.7
    },
    {
      "name": "Entertainment",
      "score": 0.1
    },
    {
      "name": "Business",
      "score": 0.5
    }
  ]
}
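The scores in this sample sum to 1.5, so they read more like independent per-category scores than a probability distribution. Under that reading, one reasonable way to consume the result is to keep every category at or above a cutoff. The sketch below does this; the helper name labels_above and the 0.5 threshold are illustrative assumptions.

```python
def labels_above(result: dict, threshold: float = 0.5) -> list:
    """Return category names scoring at or above threshold, highest score first."""
    ranked = sorted(result["categories"], key=lambda c: c["score"], reverse=True)
    return [c["name"] for c in ranked if c["score"] >= threshold]


# Values taken from the sample output above.
sample_classification = {
    "categories": [
        {"name": "Politics", "score": 0.1},
        {"name": "Sports", "score": 0.1},
        {"name": "Technology", "score": 0.7},
        {"name": "Entertainment", "score": 0.1},
        {"name": "Business", "score": 0.5},
    ]
}

print(labels_above(sample_classification))
```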
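Example 5: Working with unknown keys
In some cases you may not know the exact shape of the JSON object up front. In this example, we provide an open-ended input_schema and use the prompt to instruct Claude how to interact with the tool.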
tools = [
    {
        "name": "print_all_characteristics",
        "description": "Prints all characteristics which are provided.",
        "input_schema": {
            "type": "object",
            "additionalProperties": True
        }
    }
]
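query = f"""Given a description of a character, your task is to extract all the characteristics of the character and print them using the print_all_characteristics tool.
The print_all_characteristics tool takes an arbitrary number of inputs where the key is the characteristic name and the value is the characteristic value (age: 28 or eye_color: green).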
<description>
The man is tall, with a beard and a scar on his left cheek. He has a deep voice and wears a black leather jacket.
</description>
Now use the print_all_characteristics tool."""
response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    # Force Claude to call this specific tool instead of answering in plain text.
    tool_choice={"type": "tool", "name": "print_all_characteristics"},
    messages=[{"role": "user", "content": query}]
)
tool_output = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_all_characteristics":
        tool_output = content.input
        break
if tool_output:
    print("Characteristics (JSON):")
    print(json.dumps(tool_output, indent=2))
else:
    print("Something went wrong.")
Characteristics (JSON):
{
  "height": "tall",
  "facial_hair": "beard",
  "facial_features": "scar on left cheek",
  "voice": "deep voice",
  "clothing": "black leather jacket"
}
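Example 5 is the only request in this guide that passes tool_choice; examples 1 through 4 rely on the prompt alone to get Claude to call the tool. If you want the same guarantee elsewhere, the request arguments could be assembled as in the sketch below. The helper name forced_tool_request is hypothetical; the tool_choice shape is the one used in the request above.

```python
def forced_tool_request(model, tools, tool_name, prompt, max_tokens=4096):
    """Build keyword arguments for client.messages.create that force one tool."""
    # Fail early if the forced tool is not among the tools being sent.
    if tool_name not in {tool["name"] for tool in tools}:
        raise ValueError(f"unknown tool: {tool_name}")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "tools": tools,
        "tool_choice": {"type": "tool", "name": tool_name},
        "messages": [{"role": "user", "content": prompt}],
    }


demo_tools = [
    {
        "name": "print_all_characteristics",
        "description": "Prints all characteristics which are provided.",
        "input_schema": {"type": "object", "additionalProperties": True},
    }
]
kwargs = forced_tool_request(
    "claude-3-haiku-20240307",
    demo_tools,
    "print_all_characteristics",
    "Describe the character.",
)
print(kwargs["tool_choice"])
# With a configured client: response = client.messages.create(**kwargs)
```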
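These examples demonstrate how to use Claude and tool use to extract structured JSON data for a variety of natural language processing tasks. By defining custom tools with specific input schemas, you can guide Claude to generate well-structured JSON output that is easy to parse and use in your applications.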
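Each example repeats the same loop over response.content to find the matching tool_use block. That loop could be factored into a small helper, sketched below. The name extract_tool_input is hypothetical, and the SimpleNamespace objects only stand in for the SDK's content blocks so that the snippet runs without an API key.

```python
from types import SimpleNamespace


def extract_tool_input(content_blocks, tool_name):
    """Return the input of the first tool_use block named tool_name, or None."""
    for block in content_blocks:
        if (
            getattr(block, "type", None) == "tool_use"
            and getattr(block, "name", None) == tool_name
        ):
            return block.input
    return None


# Stand-ins for response.content: one text block followed by one tool_use block.
fake_content = [
    SimpleNamespace(type="text", text="Here are the scores."),
    SimpleNamespace(
        type="tool_use",
        name="print_sentiment_scores",
        input={"positive_score": 0.1, "negative_score": 0.6, "neutral_score": 0.3},
    ),
]

scores = extract_tool_input(fake_content, "print_sentiment_scores")
print(scores)
```

Returning None when nothing matches lets callers test with `is not None`, which distinguishes a missing tool call from a tool call whose input is an empty object.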