Using Pinecone for Embeddings Search

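This notebook takes you through a simple flow to download some data, embed it, and then index and search it using the Pinecone vector database. This is a common requirement for customers who want to store and search our embeddings in a secure environment to support production use cases such as chatbots and topic modelling.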

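What is a Vector Database

A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.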
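To make "searching embedding vectors" concrete, here is a minimal sketch of the core operation: ranking stored vectors by cosine similarity to a query vector. The toy vectors and the names `cosine_similarity` and `top_k` are illustrative assumptions, not part of Pinecone. A real vector database uses approximate indexes rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(results)
```

The two vectors pointing in roughly the same direction as the query rank first, which is the behaviour semantic search relies on.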

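Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and run them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.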

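Demo Flow

The demo flow is:

Setup: Import packages and set any required variables
Load data: Load a dataset of articles along with their precomputed OpenAI embeddings
Pinecone
- Setup: Here we'll set up the Python client for Pinecone. For more details, see the Pinecone documentation
- Index Data: We'll create an index with namespaces for titles and content
- Search Data: We'll test out both namespaces with search queries to confirm it works

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases that make use of our embeddings.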

Setup

Import the required libraries and set the embedding model that we'd like to use.

# We'll need to install the Pinecone client
!pip install pinecone-client

# Install wget to pull the zip file
!pip install wget

Requirement already satisfied: pinecone-client in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (2.2.2)
Requirement already satisfied: wget in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (3.2)

import openai

from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval

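# Pinecone's client library for Python
import pinecone

# This is set to our new embeddings model; change it to the embedding model of your choice.
# Queries should be embedded with the same model that produced the stored vectors.
EMBEDDING_MODEL = "text-embedding-3-small"

# Ignore unclosed SSL socket warnings - optional in case you get these errors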
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)


Load data

In this section we'll load embedded data that we prepared ahead of time in a separate article.

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")

article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')

article_df.head()

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April is the fourth month of the year in the J...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) is the eighth month of the year ...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art is a creative activity that expresses imag...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A or a is the first letter of the English alph...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air refers to the Earth's atmosphere. Air is a...	[0.02224554680287838, -0.02044147066771984, -0...	[0.021524671465158463, 0.018522677943110466, -...	4

# Read vectors from strings back into lists
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)

article_df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB
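The `literal_eval` step above is needed because a CSV stores each vector as text. The sketch below shows the conversion on a single shortened, made-up cell value (`raw_cell` is an illustrative assumption, not a row from the dataset), and shows that `literal_eval` refuses anything that is not a plain Python literal.

```python
from ast import literal_eval

# A vector as it might appear in one CSV cell (values shortened for illustration)
raw_cell = "[0.001, -0.0207, 0.5]"

vector = literal_eval(raw_cell)
print(type(raw_cell).__name__, "->", type(vector).__name__)
print(vector)

# literal_eval only accepts Python literals, so function calls are rejected
rejected = False
try:
    literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print("expression rejected:", rejected)
```

This is why `literal_eval` is a safer choice than `eval` when parsing data read from a file.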

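Pinecone

Next we'll look at Pinecone, a managed vector database that offers a cloud-native option.

Before you proceed with this step you'll need to navigate to Pinecone, sign up, and then save your API key as an environment variable named PINECONE_API_KEY.

For this section we will:

Create an index with multiple namespaces for article titles and content
Store our data in the index with separate searchable "namespaces" for article titles and content
Fire some similarity search queries to verify our setup is working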

api_key = os.getenv("PINECONE_API_KEY")
pinecone.init(api_key=api_key)
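Because `os.getenv` returns `None` when a variable is missing, a forgotten key only shows up later as a less obvious client error. One possible guard is sketched below. The helper `require_env` and the variable name `EXAMPLE_API_KEY` are hypothetical and exist only for this illustration.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value

# Demonstration with a hypothetical variable name so no real key is touched
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook, the same helper could be called with "PINECONE_API_KEY" before initialising the client.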

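Create Index

First we will need to create an index, which we'll call wikipedia-articles. Once we have an index, we can create multiple namespaces, which let a single index be searched for several different use cases. For more details, consult the Pinecone documentation.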
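As a rough mental model of namespaces, the sketch below uses an in-memory class, `ToyIndex`, which is a hypothetical stand-in and not the Pinecone client. It illustrates two properties this notebook relies on: the same vector ID can live in both the title and content namespaces, and upserting an existing ID overwrites it.

```python
from collections import defaultdict

class ToyIndex:
    """In-memory stand-in for an index that partitions vectors by namespace."""

    def __init__(self):
        # namespace name -> {vector id -> vector}
        self._namespaces = defaultdict(dict)

    def upsert(self, vectors, namespace):
        """Insert or overwrite (id, vector) pairs inside one namespace."""
        for vec_id, vector in vectors:
            self._namespaces[namespace][vec_id] = vector

    def vector_counts(self):
        """Number of stored vectors per namespace."""
        return {name: len(items) for name, items in self._namespaces.items()}

toy_index = ToyIndex()
toy_index.upsert([("0", [0.1, 0.2]), ("1", [0.3, 0.4])], namespace="title")
toy_index.upsert([("0", [0.9, 0.8]), ("1", [0.7, 0.6])], namespace="content")
# Upserting an existing ID overwrites it rather than adding a duplicate
toy_index.upsert([("0", [0.5, 0.5])], namespace="title")

print(toy_index.vector_counts())
```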

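If you want to batch insert to your index in parallel to increase insertion speed, there is a great guide on parallel batch insertion in the Pinecone documentation.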
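The general shape of parallel batch insertion can be sketched with a thread pool. This is a generic pattern under stated assumptions, not the approach from Pinecone's guide: `fake_upsert` is a hypothetical stand-in for a network call, and the toy vectors replace real embeddings.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_upsert(batch):
    """Hypothetical stand-in for a network upsert; returns how many vectors were sent."""
    return len(batch)

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy (id, vector) pairs standing in for real embeddings
vectors = [(str(i), [float(i), 0.0]) for i in range(1000)]

# Hand every batch to a small thread pool and collect the results in order
with ThreadPoolExecutor(max_workers=4) as pool:
    upserted_counts = list(pool.map(fake_upsert, chunked(vectors, 300)))

print(upserted_counts, sum(upserted_counts))
```

Threads suit this kind of work because each upsert spends most of its time waiting on the network rather than computing.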

# Models a simple batch generator that makes chunks out of an input DataFrame
class BatchGenerator:

    def __init__(self, batch_size: int = 10) -> None:
        self.batch_size = batch_size

    # Makes chunks out of an input DataFrame
    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:
        splits = self.splits_num(df.shape[0])
        if splits <= 1:
            yield df
        else:
            for chunk in np.array_split(df, splits):
                yield chunk

    # Determines how many chunks the DataFrame contains; because of rounding,
    # chunk sizes only approximate batch_size
    def splits_num(self, elements: int) -> int:
        return round(elements / self.batch_size)

    __call__ = to_batches

df_batcher = BatchGenerator(300)
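It may help to see what batch sizes `BatchGenerator(300)` actually produces for the 25,000-row DataFrame. The sketch below reproduces, in plain Python, the chunk-size rule that `np.array_split` follows (the first `n_rows % n_splits` chunks get one extra row). The helper name `array_split_sizes` is an assumption made for this illustration.

```python
def array_split_sizes(n_rows: int, n_splits: int) -> list:
    """Chunk sizes from an array_split-style split: early chunks absorb the remainder."""
    base, extra = divmod(n_rows, n_splits)
    return [base + 1] * extra + [base] * (n_splits - extra)

n_rows, batch_size = 25000, 300
n_splits = round(n_rows / batch_size)  # same rule as splits_num
sizes = array_split_sizes(n_rows, n_splits)

print(n_splits, max(sizes), min(sizes), sum(sizes))
```

Because the number of splits is rounded down here, each batch holds 301 or 302 rows, slightly more than the nominal 300. Treat `batch_size` as a target rather than a hard cap.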

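# Pick a name for the new index
index_name = 'wikipedia-articles'

# Check whether an index with the same name already exists - if so, delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

# Create the new index; its dimension must match the length of the embedding vectors
pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))
index = pinecone.Index(index_name=index_name)

# Confirm our index was created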
pinecone.list_indexes()

['podcasts', 'wikipedia-articles']

# Upsert content vectors in the content namespace - this can take a few minutes
print("Uploading vectors to content namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')

Uploading vectors to content namespace..

# Upsert title vectors in the title namespace - this can also take a few minutes
print("Uploading vectors to title namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')

Uploading vectors to title namespace..

# Check the index size for each namespace to confirm all of our docs have loaded
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'content': {'vector_count': 25000},
                'title': {'vector_count': 25000}},
 'total_vector_count': 50000}
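The stats above can also be checked programmatically instead of by eye. The sketch below copies the reported values into a plain dict and compares each namespace against the expected row count. `check_namespace_counts` is a hypothetical helper, and the real client may return an object that is not a plain dict, so adapt the lookups as needed.

```python
def check_namespace_counts(stats: dict, expected: int) -> dict:
    """Map each namespace to True if it holds the expected number of vectors."""
    return {
        name: info["vector_count"] == expected
        for name, info in stats["namespaces"].items()
    }

# The stats reported above, copied into a plain dict for illustration
stats = {
    "dimension": 1536,
    "index_fullness": 0.1,
    "namespaces": {
        "content": {"vector_count": 25000},
        "title": {"vector_count": 25000},
    },
    "total_vector_count": 50000,
}

checks = check_namespace_counts(stats, expected=25000)
print(checks)
```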

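Search data

Now we'll enter some dummy searches and check that we get sensible results back.

# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results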
titles_mapped = dict(zip(article_df.vector_id,article_df.title))
content_mapped = dict(zip(article_df.vector_id,article_df.text))

def query_article(query, namespace, top_k=5):
    '''Embeds the query, searches the specified namespace, prints the top matches and returns them as a DataFrame.'''

    # Create a vector embedding of the query text
    # (openai.Embedding.create is the interface of openai SDK versions before 1.0)
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )["data"][0]['embedding']

    # Query the namespace passed in, using the embedded query vector
    query_result = index.query(embedded_query,
                               namespace=namespace,
                               top_k=top_k)

    # Print query results
    print(f'\nMost similar results to {query} in "{namespace}" namespace:\n')
    if not query_result.matches:
        print('no query result')

    matches = query_result.matches
    ids = [res.id for res in matches]
    scores = [res.score for res in matches]
    df = pd.DataFrame({'id': ids,
                       'score': scores,
                       'title': [titles_mapped[_id] for _id in ids],
                       'content': [content_mapped[_id] for _id in ids],
                       })

    for _, row in df.iterrows():
        print(f'{row.title} (score = {row.score})')

    print('\n')

    return df

query_output = query_article('modern art in Europe','title')

Most similar results to modern art in Europe in "title" namespace:

Museum of Modern Art (score = 0.875177085)
Western Europe (score = 0.867441177)
Renaissance art (score = 0.864156306)
Pop art (score = 0.860346854)
Northern Europe (score = 0.854658186)

content_query_output = query_article("Famous battles in Scottish history",'content')

Most similar results to Famous battles in Scottish history in "content" namespace:

Battle of Bannockburn (score = 0.869336188)
Wars of Scottish Independence (score = 0.861470938)
1651 (score = 0.852588475)
First War of Scottish Independence (score = 0.84962213)
Robert I of Scotland (score = 0.846214116)
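The last step of `query_article`, turning matched IDs and scores into readable lines through a lookup table, can be isolated as follows. The sketch uses made-up matches and a tiny mapping shaped like `titles_mapped`. `format_matches` is a hypothetical helper which, unlike the notebook code, skips IDs missing from the mapping rather than raising a `KeyError`.

```python
def format_matches(matches, id_to_title):
    """Turn (id, score) pairs into printable lines, skipping IDs with no known title."""
    lines = []
    for vec_id, score in matches:
        title = id_to_title.get(vec_id)
        if title is None:
            continue  # the ID is missing from the local mapping
        lines.append(f"{title} (score = {score:.3f})")
    return lines

# Hypothetical lookup table and matches, shaped like titles_mapped and the query results
id_to_title = {"0": "April", "1": "August", "2": "Art"}
matches = [("2", 0.875177085), ("0", 0.8611), ("99", 0.80)]

formatted = format_matches(matches, id_to_title)
for line in formatted:
    print(line)
```

Skipping unknown IDs is a design choice: it keeps a search usable when the index and the local DataFrame have drifted apart, at the cost of hiding that mismatch.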