How to Build a RAG System with Claude 3 and MongoDB

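This tutorial implements a chatbot that is prompted to take on the role of a venture capital tech analyst. The chatbot is a simple RAG system that uses a collection of tech news articles as its knowledge source. This notebook covers the following:

Setting up your development environment, from installing the necessary libraries to configuring a MongoDB database.
Processing data efficiently, including creating a vector search index and preparing data for ingestion and query processing.
Using the Claude 3 model within the RAG system to generate responses grounded in the context retrieved from the database.

You will need the following:

Anthropic API key
VoyageAI API key
Hugging Face access token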

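Step 1: Library installation, data loading, and preparation

Below is a brief description of the tools and libraries used in the implementation code:

anthropic: Anthropic's official Python library. It provides access to the Claude 3 family of models, which can understand both text and images.
datasets: This library is part of the Hugging Face ecosystem. Installing "datasets" gives access to many preprocessed, ready-to-use datasets that are useful for training and fine-tuning machine learning models or benchmarking their performance.
pandas: This data science library provides robust data structures and methods for data manipulation, processing, and analysis.
voyageai: The official Python client library for accessing VoyageAI's suite of embedding models.
pymongo: PyMongo is the Python toolkit for MongoDB. It enables interaction with a MongoDB database.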

!pip install pymongo datasets pandas anthropic voyageai

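The code snippet below performs the following steps:

Import the necessary libraries:
- os for interacting with the operating system,
- requests for making HTTP requests,
- BytesIO from the io module for handling file-like byte objects in memory,
- pandas (aliased as pd) for data manipulation and analysis, and
- userdata from google.colab for accessing environment variables stored in Google Colab Secrets.
Function definition: the download_and_combine_parquet_files function takes two parameters:
- parquet_file_urls: a list of string URLs, each pointing to a Parquet file that contains a subset of the tech-news-embeddings dataset.
- hf_token: a string holding a Hugging Face authorization token. Access tokens can be created or copied on the Hugging Face platform.
Download and read the Parquet files: the function iterates over each URL in parquet_file_urls. For each URL, it:
- makes a GET request with requests.get, passing the URL and the authorization header,
- checks whether the response status code is 200 (OK), which indicates that the request succeeded,
- on success, reads the response content into a BytesIO object (so it can be treated as a file in memory) and uses pandas.read_parquet to load the Parquet file from that object into a pandas DataFrame,
- appends the DataFrame to the all_dataframes list.
Combine the DataFrames: after all Parquet files have been downloaded and read, the function checks that all_dataframes is not empty. If there are DataFrames to process, pd.concat joins them into a single DataFrame, with ignore_index=True so that the combined DataFrame gets a fresh index. This combined DataFrame is the return value of download_and_combine_parquet_files; if nothing was downloaded, the function returns None.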

import os
import requests
from io import BytesIO
import pandas as pd
from google.colab import userdata

def download_and_combine_parquet_files(parquet_file_urls, hf_token):
    """
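    Download Parquet files from the provided URLs using the given Hugging Face token
    and return a combined DataFrame.

    Parameters:

    - parquet_file_urls: list of strings, the URLs of the Parquet files.
    - hf_token: string, the Hugging Face authorization token.

    Returns:

    - combined_df: a pandas DataFrame containing the combined data from all Parquet files,
      or None if no file could be downloaded.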
    """
    headers = {"Authorization": f"Bearer {hf_token}"}
    all_dataframes = []

    for parquet_file_url in parquet_file_urls:
        response = requests.get(parquet_file_url, headers=headers)
        if response.status_code == 200:
            parquet_bytes = BytesIO(response.content)
            df = pd.read_parquet(parquet_bytes)
            all_dataframes.append(df)
        else:
            print(f"Failed to download Parquet file from {parquet_file_url}: {response.status_code}")

    if all_dataframes:
        combined_df = pd.concat(all_dataframes, ignore_index=True)
        return combined_df
    else:
        print("No dataframes to concatenate.")
        return None

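Below is the list of Parquet files required for this tutorial. The complete list of files is available on the dataset's Hugging Face page (the URL is given in the code comment below). Each Parquet file contains roughly 45,000 data points.

In the code snippet below, a subset of the tech-news-embeddings dataset is gathered into a single DataFrame, which is then assigned to the variable combined_df.

# Uncomment the links below to load more data
# For the complete list of data files, visit: https://huggingface.co/datasets/MongoDB/tech-news-embeddings/tree/refs%2Fconvert%2Fparquet/default/train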
parquet_files = [
    "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0000.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0001.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0002.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0003.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0004.parquet",
    # "https://huggingface.co/api/datasets/AIatMongoDB/tech-news-embeddings/parquet/default/train/0005.parquet",
]

hf_token = userdata.get("HF_TOKEN")
combined_df = download_and_combine_parquet_files(parquet_files, hf_token)

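As the final stage of data preparation, the code snippet below removes the _id column from the combined dataset, since it is not needed for the rest of this tutorial. It also removes the original embedding column, because new embeddings are generated with a VoyageAI embedding model in the cells that follow.

# Remove the _id column from the initial dataset
combined_df = combined_df.drop(columns=['_id'])

# Remove the initial embedding column, since we will create new embeddings with a VoyageAI embedding model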
combined_df = combined_df.drop(columns=['embedding'])

combined_df.head()

# Limit the number of documents to 500 for this demo because of VoyageAI API rate limits
# Read more about VoyageAI rate limits: https://docs.voyageai.com/docs/rate-limits
max_documents = 500

if len(combined_df) > max_documents:
    combined_df = combined_df[:max_documents]

import voyageai
import time

vo = voyageai.Client(api_key=userdata.get("VOYAGE_API_KEY"))

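def get_embedding(text: str) -> list[float]:
    # Skip blank input; callers should treat an empty list as "no embedding"
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    # Pass the text as a one-element list and read back the single resulting vector
    embedding = vo.embed([text], model="voyage-large-2", input_type="document")

    return embedding.embeddings[0]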

combined_df["embedding"] = combined_df["description"].apply(get_embedding)

combined_df.head()
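
Calling the embedding API once per row, as above, is simple but issues one request per document. The following is a minimal sketch of batching those calls with an optional pause between requests. The helper names (batched, embed_in_batches, fake_embed_batch) and the default batch size are illustrative assumptions, not part of the original notebook, and the stand-in embedder exists only so the sketch runs without an API key.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,        # assumed value; check the provider's limits
    pause_seconds: float = 0.0,  # sleep between requests to stay under rate limits
) -> List[List[float]]:
    """Embed `texts` one batch at a time and return the vectors in input order."""
    embeddings: List[List[float]] = []
    for i, batch in enumerate(batched(texts, batch_size)):
        if i > 0 and pause_seconds > 0:
            time.sleep(pause_seconds)
        embeddings.extend(embed_batch(list(batch)))
    return embeddings


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in embedder: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

With the VoyageAI client from this notebook, embed_batch could plausibly be a small wrapper such as lambda batch: vo.embed(batch, model="voyage-large-2", input_type="document").embeddings; treat that as an assumption and confirm the batch size limits in the VoyageAI documentation.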

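Step 2: Database and collection creation

To create a new MongoDB database, first set up a database cluster:

Register for a free MongoDB Atlas account, or sign in to MongoDB Atlas if you are an existing user.
Select the "Database" option in the left-hand pane. This opens the Database Deployments page, which lists the deployment specifications of any existing clusters. Create a new database cluster by clicking the "+ Create" button.
For help with setting up a database cluster and obtaining the URI, refer to our guide on setting up a MongoDB cluster and getting the connection string. Note: when building a proof of concept, do not forget to add the IP address of your Python host to the IP access list, or allow 0.0.0.0/0 to permit access from any IP.
Once the cluster has been created and deployed, it becomes available on the "Database Deployments" page.
Click the cluster's "Connect" button to view the options for connecting to the cluster through the various language drivers.
This tutorial only requires the cluster's URI (Uniform Resource Identifier). Copy the URI into Google Colab Secrets under a variable named MONGO_URI, or place it in a .env file or equivalent.

After creating the cluster, navigate to the cluster page and create a database and collection in the MongoDB Atlas cluster by clicking "+ Create Database". Name the database tech_news and the collection hacker_noon_tech_news.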

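Step 3: Vector search index creation

At this point, you have created a cluster, a database, and a collection.

The procedure in this section is what makes vector search possible: queries entered into the chatbot are matched against the records in the hacker_noon_tech_news collection. The goal of this step is to create a vector search index. To do so, refer to the official vector search index creation guide.

When creating the vector search index with the JSON editor on MongoDB Atlas, make sure the index is named vector_index and that its definition is as follows: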

{
 "fields": [{
     "numDimensions": 1536,
     "path": "embedding",
     "similarity": "cosine",
     "type": "vector"
   }]
}
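
As a small illustration, the same definition can be produced and sanity-checked in Python before it is pasted into the Atlas JSON editor. This is only a sketch: build_vector_index_definition is a hypothetical helper, not part of the tutorial or of any MongoDB library, and the set of similarity names reflects the options Atlas documents for vector fields.

```python
import json

# Similarity functions accepted for "vector" fields, per the Atlas documentation.
ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def build_vector_index_definition(
    num_dimensions: int,
    path: str = "embedding",
    similarity: str = "cosine",
) -> dict:
    """Return a vector search index definition shaped like the one above."""
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be a positive integer")
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


# 1536 matches the numDimensions value used in this tutorial's index definition.
definition = build_vector_index_definition(1536)
print(json.dumps(definition, indent=2))
```

Recent PyMongo releases also expose search index helpers (SearchIndexModel and Collection.create_search_index) that may allow creating the index programmatically; check the PyMongo documentation for your installed version before relying on them.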

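Step 4: Data ingestion

Ingest the data into the MongoDB database created in the previous steps. The following operations are required:

Connect to the database and collection
Clear any existing records from the collection
Convert the dataset's pandas DataFrame into a list of dictionaries before ingestion
Ingest the dictionaries into MongoDB with a bulk operation

This step requires the cluster's URI (Uniform Resource Identifier). Copy the URI into Google Colab Secrets under a variable named MONGO_URI, or place it in a .env file or equivalent.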

import pymongo
from google.colab import userdata

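def get_mongo_client(mongo_uri):
  """Establish a connection to MongoDB."""
  try:
    client = pymongo.MongoClient(mongo_uri)
    # MongoClient connects lazily, so ping the server to confirm it is reachable
    client.admin.command("ping")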
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.ConnectionFailure as e:
    print(f"Connection failed: {e}")
    return None

mongo_uri = userdata.get('MONGO_URI')
if not mongo_uri:
  print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

DB_NAME="tech_news"
COLLECTION_NAME="hacker_noon_tech_news"

db = mongo_client[DB_NAME]
collection = db[COLLECTION_NAME]

Connection to MongoDB successful

# To make sure we are working with a fresh collection,
# delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 228012, 'electionId': ObjectId('7fffffff000000000000000e'), 'opTime': {'ts': Timestamp(1709660559, 7341), 't': 14}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1709660559, 7341), 'signature': {'hash': b'jT\xf1\xb4\xa9\xd3\xe3suu\x03`\x15(}\x8f\x00\x9f\xe9\x8a', 'keyId': 7320226449804230661}}, 'operationTime': Timestamp(1709660559, 7341)}, acknowledged=True)

# Data ingestion
combined_df_json = combined_df.to_dict(orient='records')
collection.insert_many(combined_df_json)
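
For larger subsets of the dataset, it can help to send the records in smaller groups rather than in one very large insert_many call. The following is a minimal sketch of that idea. The chunked and insert_in_chunks helpers and the chunk size are illustrative assumptions, and FakeCollection is a stand-in used only so the example runs without a database; a real PyMongo collection exposes the same insert_many method and a result with inserted_ids.

```python
from typing import Any, Dict, Iterator, List


def chunked(records: List[Dict[str, Any]], chunk_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of `records` with at most `chunk_size` items."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def insert_in_chunks(collection: Any, records: List[Dict[str, Any]], chunk_size: int = 1000) -> int:
    """Insert `records` chunk by chunk and return how many documents were inserted."""
    inserted = 0
    for chunk in chunked(records, chunk_size):
        result = collection.insert_many(chunk)
        inserted += len(result.inserted_ids)
    return inserted


class FakeInsertResult:
    """Stand-in for the result object returned by insert_many."""

    def __init__(self, inserted_ids: List[int]) -> None:
        self.inserted_ids = inserted_ids


class FakeCollection:
    """Hypothetical in-memory collection that records each insert_many call."""

    def __init__(self) -> None:
        self.documents: List[Dict[str, Any]] = []
        self.calls = 0

    def insert_many(self, docs: List[Dict[str, Any]]) -> FakeInsertResult:
        self.calls += 1
        first_id = len(self.documents)
        self.documents.extend(docs)
        return FakeInsertResult(list(range(first_id, first_id + len(docs))))


fake_collection = FakeCollection()
sample_records = [{"title": f"article {i}"} for i in range(5)]
total_inserted = insert_in_chunks(fake_collection, sample_records, chunk_size=2)
print(total_inserted, fake_collection.calls)
```

In the notebook, the call would look like insert_in_chunks(collection, combined_df_json), assuming the collection and combined_df_json variables defined above.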

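Step 5: Vector search

This section shows how to create a custom vector search function that accepts a user query, which corresponds to the input given to the chatbot. The function also takes a second parameter, collection, which points to the database collection containing the records that the vector search should run against.

The vector_search function produces a vector search result from the series of operations laid out in a MongoDB aggregation pipeline. The pipeline consists of the $vectorSearch and $project stages and runs the search based on the vector embedding of the user query. It then formats the results, omitting any record attributes that are not needed in later steps.

The code snippet below performs the following operations to implement semantic search over the tech news articles:

Define the vector_search function, which takes the user query and a MongoDB collection as input and returns a list of matching documents based on vector similarity search.
Generate the embedding for the user query by calling the previously defined get_embedding function, which converts the query string into a vector representation.
Build a pipeline for MongoDB's aggregate function with two main stages: $vectorSearch and $project.
The $vectorSearch stage performs the actual vector search. The index field specifies the vector index to use, and it must match the name entered in the vector search index definition in the previous step. The queryVector field takes the embedding of the user query. The path field corresponds to the document field that contains the embeddings. numCandidates specifies the number of candidate documents to consider, and limit specifies the number of results to return.
The $project stage formats the results: it excludes the _id and embedding fields and adds the vector search score.
aggregate executes the defined pipeline to obtain the vector search results. The final step converts the cursor returned by the database into a list.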

def vector_search(user_query, collection):
    """
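    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents (empty if no embedding could be generated).
    """

    # Generate an embedding for the user query
    query_embedding = get_embedding(user_query)

    # get_embedding returns an empty list for blank input, so test for emptiness
    # and keep the return type a list so that callers can always iterate over it
    if not query_embedding:
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return the top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "embedding": 0,  # Exclude the embedding field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Run the search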
    results = collection.aggregate(pipeline)
    return list(results)
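
To build intuition for the score field returned by the pipeline, the sketch below computes cosine similarity between two small vectors by brute force. It is not how Atlas executes the search (Atlas uses an approximate nearest-neighbor index), and the normalized_score mapping of (1 + cosine) / 2 reflects my understanding of how Atlas reports cosine scores; treat it as an assumption and verify it against the Atlas Vector Search documentation. Both function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def normalized_score(cosine: float) -> float:
    """Map a cosine in [-1, 1] to [0, 1] (assumed Atlas-style normalization)."""
    return (1 + cosine) / 2


query_vector = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(normalized_score(cosine_similarity(query_vector, same_direction)))
print(normalized_score(cosine_similarity(query_vector, orthogonal)))
```

Under this mapping, a document pointing in the same direction as the query scores close to 1, while an unrelated (orthogonal) document scores around 0.5.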

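Step 6: Handling user queries with the Claude 3 model

The final section of this tutorial carries out the following sequence of operations:

Accept a user query as a string.
Generate an embedding for the user query with the VoyageAI embedding model.
Load Anthropic Claude 3, specifically the "claude-3-opus-20240229" model, as the base model of the RAG system.
Run a vector search with the embedding of the user query to fetch relevant information from the knowledge base, giving the base model additional context.
Submit the user query and the gathered additional information to the base model to generate a response.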

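An important point: the dimensionality of the user query embedding must match the dimensionality set in the vector search index definition on MongoDB Atlas (1536 in this tutorial).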
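
A mismatch between the two is easy to catch before the query reaches the database. Below is a minimal sketch of such a check. check_embedding_dimensions and EXPECTED_DIMENSIONS are hypothetical names introduced for illustration, and 1536 is taken from the index definition in Step 3.

```python
from typing import Sequence

# Must equal numDimensions in the vector search index definition (Step 3).
EXPECTED_DIMENSIONS = 1536


def check_embedding_dimensions(
    embedding: Sequence[float],
    expected: int = EXPECTED_DIMENSIONS,
) -> None:
    """Raise ValueError if the embedding length differs from the index dimensions."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, but the index expects {expected}"
        )


# A vector of the right length passes silently.
check_embedding_dimensions([0.0] * 1536)

# A vector of the wrong length is reported before any search is attempted.
try:
    check_embedding_dimensions([0.0] * 1024)
except ValueError as error:
    print(f"Dimension check failed: {error}")
```

In the notebook, this check could be called on the output of get_embedding(user_query) inside vector_search, before the pipeline is built.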

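The next step is to import the anthropic library and create a client, which gives access to Anthropic's methods for handling messages and calling the Claude models. Make sure you have an Anthropic API key, which is available from the settings page on the official Anthropic website.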

import anthropic
client = anthropic.Client(api_key=userdata.get("ANTHROPIC_API_KEY"))

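Here is a more detailed description of the operations in the code snippet below:

Vector search execution: the function first calls vector_search with the user's query and the specified collection as arguments. This runs a search within the collection, using vector embeddings to find information relevant to the query.
Compiling search results: search_result is initialized as an empty string that accumulates the search information. The function iterates over the results returned by vector_search and formats each item's details (title, company name, company URL, publication date, article URL, and description) into a human-readable string, ending each entry with a newline character \n.
Generating a response with the Anthropic client: the function then builds a request to the Anthropic API through the client object, the anthropic.Client instance created earlier. It specifies:
- the model to use ("claude-3-opus-20240229"), which identifies a specific version of the Claude 3 model,
- the maximum token limit for the generated response (max_tokens=1024),
- a system prompt that instructs the model to act as a "Venture Capital Tech Analyst" with access to tech company articles and information, and to use this context to provide advice,
- the actual message to process, which combines the user query with the compiled search results as context.
Returning the generated response and search results: the function extracts the response text from the first item of the response content and returns it together with the compiled search results.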

def handle_user_query(query, collection):

  get_knowledge = vector_search(query, collection)

  search_result = ''
  for result in get_knowledge:
    search_result += (
        f"Title: {result.get('title', 'N/A')}, "
        f"Company Name: {result.get('companyName', 'N/A')}, "
        f"Company URL: {result.get('companyUrl', 'N/A')}, "
        f"Date Published: {result.get('published_at', 'N/A')}, "
        f"Article URL: {result.get('url', 'N/A')}, "
        f"Description: {result.get('description', 'N/A')}, \n"
    )

  response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system="You are a Venture Capital Tech Analyst with access to some tech company articles and information. You use the information you are given to provide advice.",
    messages=[
        {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
    ]
  )

  return (response.content[0].text), search_result

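The final step of this tutorial is to define a query, pass it to the handle_user_query function, and print the returned response.

# Run the query and retrieve the sources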
query = "Give me the best tech stock to invest in and tell me why"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
print(f"\nSource Information: \n{source_information}")

Response: Based on the information provided in the article titles and descriptions, Alibaba Group Holding Limited appears to be a top technology stock pick for 2023 according to renowned investor Ray Dalio. The article "Top 10 Technology Stocks to Buy in 2023 According to Ray Dalio" suggests that Alibaba is one of Dalio's favored tech investments for the year.

As a venture capital tech analyst, I would recommend considering an investment in Alibaba for the following reasons:

1. Endorsement from a respected investor: Ray Dalio, known for his successful investment strategies, has included Alibaba in his top 10 technology stock picks for 2023. His backing lends credibility to the investment potential of the company.

2. Strong market position: Alibaba is a leading e-commerce company in China with a significant market share. It has a diversified business model spanning e-commerce, cloud computing, digital media, and entertainment.

3. Growth potential: With China's large and growing middle class, Alibaba is well-positioned to benefit from increasing consumer spending and the shift towards online shopping.

However, it's essential to consider the following points as well:

1. Regulatory risks: Chinese tech companies, including Alibaba, have faced increased regulatory scrutiny in recent times. Changes in government policies could impact the company's growth and profitability.

2. Competition: While Alibaba is a dominant player, it faces competition from other tech giants like Tencent and JD.com in various business segments.

3. Geopolitical tensions: Ongoing tensions between the U.S. and China could lead to market volatility and impact investor sentiment towards Chinese stocks.

As with any investment, it's crucial to conduct thorough research, consider your risk tolerance, and diversify your portfolio. Keep in mind that the information provided here is based on limited data points, and stock prices can be influenced by various factors beyond the scope of this context.
Source Information: 
Title: Top 10 Technology Stocks to Buy in 2023 According to Ray Dalio, Company Name: ALIBABA GROUP HOLDING LIMITED, Company URL: https://hackernoon.com/company/alibabagroupholdinglimited, Date Published: 2023-04-21 11:58:00, Article URL: https://uk.finance.yahoo.com/news/top-10-technology-stocks-buy-155830366.html, Description: In this article we discuss the top 10 technology stocks to buy in 2023 according to Ray Dalio. If you want to skip our detailed analysis of Dalio’s investment philosophy and portfolio construction, 
Title: 3 Tech Stocks I Love Right Now, Company Name: 10Clouds, Company URL: https://hackernoon.com/company/10clouds, Date Published: 2023-04-02 11:30:00, Article URL: https://www.msn.com/en-xl/money/other/3-tech-stocks-i-love-right-now/ar-AA19n9Ht, Description: These are tech giants but they''re also great investments., 
Title: 3 Millionaire-Maker Hydrogen Stocks to Buy Before the Window Closes, Company Name: Air Products & Chemicals, Company URL: https://hackernoon.com/company/airproductschemicals, Date Published: 2023-07-28 12:18:00, Article URL: https://www.msn.com/en-us/money/topstocks/3-millionaire-maker-hydrogen-stocks-to-buy-before-the-window-closes/ar-AA1etN8O, Description: These are the best hydrogen stocks to buy with multibagger returns potential.More From InvestorPlace Buy This $5 Stock BEFORE This Apple Project Goes Live Wall Street Titan: Here’s My #1 Stock for 2023 The $1 Investment You MUST Take Advantage of Right Now It doesn’t matter if you have $500 or $5 million., 
Title: Why it may be time to sell the pop in tech stocks: BlackRock, Company Name: BlackRock, Company URL: https://hackernoon.com/company/blackrock, Date Published: 2023-02-13 19:06:00, Article URL: https://news.yahoo.com/why-it-may-be-time-to-sell-the-pop-in-tech-stocks-blackrock-190606866.html, Description: Household tech names like Apple Meta and Netflix have soared so far in 2023 but one strategist says the gains aren''t likely to last. It''s time to take profits on tech stocks — the early sector winner of 2023 — as the Federal Reserve may soon dash hopes for a pivot on interest rates, 
Title: The harsh reality for investors eyeing tech stocks in 2023: Morning Brief, Company Name: 10Clouds, Company URL: https://hackernoon.com/company/10clouds, Date Published: 2023-01-02 11:18:00, Article URL: https://news.yahoo.com/the-harsh-reality-for-investors-eyeing-tech-stocks-in-2023-morning-brief-111854804.html, Description: Curious on how to buy battered tech stocks? Here''s a quick tip. More on that and what else to watch in business on Monday.,