Using Tair as a vector database for OpenAI embeddings

This notebook guides you step by step on using Tair as a vector database for OpenAI embeddings.

This notebook presents an end-to-end process of:

  1. Using precomputed embeddings created by the OpenAI API.
  2. Storing the embeddings in a cloud instance of Tair.
  3. Converting a raw text query to an embedding with the OpenAI API.
  4. Using Tair to perform a nearest neighbour search in the created collection.

What is Tair

Tair is a cloud-native in-memory database service developed by Alibaba Group. Tair is compatible with open-source Redis and provides a variety of data models and enterprise-grade capabilities to support your real-time online scenarios. Tair also introduces persistent-memory-optimized instances that are based on the new non-volatile memory (NVM) storage medium. These instances can reduce costs by 30%, ensure data persistence, and provide almost the same performance as in-memory databases. Tair has been widely used in areas such as government affairs, finance, manufacturing, healthcare, and pan-internet to meet their high-speed query and computing requirements.

TairVector is an in-house data structure that provides high-performance real-time storage and retrieval of vectors. TairVector provides two indexing algorithms: Hierarchical Navigable Small World (HNSW) and Flat Search. It also supports multiple distance functions, such as Euclidean distance, inner product, and Jaccard distance (a minimal sketch of these options follows the list below). Compared with traditional vector retrieval services, TairVector has the following advantages:

  • Stores all data in memory and supports real-time index updates, reducing the latency of read and write operations.
  • Uses optimized data structures in memory to make better use of storage capacity.
  • Functions as an out-of-the-box data structure in a simple and efficient architecture, without complex modules or dependencies.
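
As a rough illustration of how these index algorithms and distance functions surface in the tair Python client, the sketch below creates one HNSW and one FLAT index with different distance functions. This is a minimal sketch under assumptions: the connection URL, the demo_* index names, and the 4-dimensional vectors are hypothetical, and the "IP" token for inner product follows TairVector's documentation; the real indexes for this walkthrough are created further below.

from tair import Tair as TairClient

# Assumed local Tair/Redis-compatible instance, for illustration only
demo = TairClient.from_url("redis://localhost:6379/0")

# HNSW index with inner-product ("IP") distance
demo.tvs_create_index(name="demo_hnsw", dim=4, distance_type="IP",
                      index_type="HNSW", data_type="FLOAT32")
# FLAT (brute-force) index with Euclidean ("L2") distance
demo.tvs_create_index(name="demo_flat", dim=4, distance_type="L2",
                      index_type="FLAT", data_type="FLOAT32")

# Insert one vector and run an exact nearest-neighbour search against it
demo.tvs_hset(index="demo_flat", key="doc1", vector=[0.1, 0.2, 0.3, 0.4])
print(demo.tvs_knnsearch(index="demo_flat", k=1, vector=[0.1, 0.2, 0.3, 0.4]))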

Deployment options

Prerequisites

For the purposes of this exercise we need to prepare a couple of things:

  1. A Tair cloud server instance.
  2. The 'tair' library to interact with the Tair database.
  3. An OpenAI API key.

Install requirements

This notebook obviously requires the openai and tair packages, but we will also use some other additional libraries. The following command installs them all:

! pip install openai redis tair pandas wget
Looking in indexes: http://sg.mirrors.cloud.aliyuncs.com/pypi/simple/
Requirement already satisfied: openai in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (0.28.0)
Requirement already satisfied: redis in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (5.0.0)
Requirement already satisfied: tair in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (1.3.6)
Requirement already satisfied: pandas in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (2.1.0)
Requirement already satisfied: wget in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (3.2)
Requirement already satisfied: requests>=2.20 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (2.31.0)
Requirement already satisfied: tqdm in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (4.66.1)
Requirement already satisfied: aiohttp in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (3.8.5)
Requirement already satisfied: async-timeout>=4.0.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from redis) (4.0.3)
Requirement already satisfied: numpy>=1.22.4 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (1.25.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2023.7.22)
Requirement already satisfied: attrs>=17.3.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (22.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (6.0.4)
Requirement already satisfied: yarl<2.0,>=1.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.4.0)
Requirement already satisfied: aiosignal>=1.1.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.3.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Prepare your OpenAI API key

The OpenAI API key is used for vectorization of the documents and queries.

If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.

Once you get your key, please add it using getpass.

import getpass
import openai

openai.api_key = getpass.getpass("Input your OpenAI API key:")
Input your OpenAI API key:········
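
Alternatively, if the key is already exported as an environment variable, you can read it from there instead. This is a minimal sketch; the variable name OPENAI_API_KEY is a common convention, not something this notebook requires:

import os
import openai

# Assumes the key was exported beforehand, e.g. `export OPENAI_API_KEY=sk-...`
openai.api_key = os.environ.get("OPENAI_API_KEY")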

Connect to Tair

First add the Tair connection URL to your environment variables.

Connecting to a running Tair server instance is easy with the official Python library.

# The format of the url: redis://[[username]:[password]]@localhost:6379/0
TAIR_URL = getpass.getpass("Input your tair url:")
Input your tair url:········
from tair import Tair as TairClient

# Connect to tair from the url and create a client

url = TAIR_URL
client = TairClient.from_url(url)

We can test the connection by pinging it:

client.ping()
True
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
100% [......................................................................] 698933052 / 698933052

'vector_database_wikipedia_articles_embedded (1).zip'

The downloaded file then has to be extracted:

import zipfile
import os
import re
import tempfile

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)


# Check if the csv file exists
file_name = "vector_database_wikipedia_articles_embedded.csv"
data_directory = os.path.join(current_directory, "../../data")
file_path = os.path.join(data_directory, file_name)


if os.path.exists(file_path):
    print(f"The file {file_name} exists in the data directory.")
else:
    print(f"The file {file_name} does not exist in the data directory.")
The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.

Create Index

Tair stores data in indexes, where every object is described by one key. Each key contains a vector and multiple attribute keys.

We will start by creating two indexes, one for title_vector and one for content_vector, and then we will fill them with our precomputed embeddings.

# Set index parameters
index = "openai_test"
embedding_dim = 1536
distance_type = "L2"
index_type = "HNSW"
data_type = "FLOAT32"

# Create two indexes, one for title_vector and one for content_vector; skip if the indexes already exist
index_names = [index + "_title_vector", index+"_content_vector"]
for index_name in index_names:
    index_connection = client.tvs_get_index(index_name)
    if index_connection is not None:
        print("Index already exists")
    else:
        client.tvs_create_index(name=index_name, dim=embedding_dim, distance_type=distance_type,
                                index_type=index_type, data_type=data_type)
Index already exists
Index already exists
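
The indexes above use Tair's default HNSW build parameters. If you need to trade index build time and memory for recall, TairVector's TVS.CREATEINDEX also documents HNSW construction parameters such as M and ef_construct. Passing them as extra keyword arguments of tvs_create_index is an assumption about the tair client, and the index name below is hypothetical; verify both against your tair version before relying on this sketch:

# Hedged sketch: an HNSW index with explicit build parameters. The index
# name is hypothetical; M and ef_construct follow the TVS.CREATEINDEX docs.
client.tvs_create_index(name=index + "_tuned_demo", dim=embedding_dim,
                        distance_type=distance_type, index_type="HNSW",
                        data_type=data_type,
                        M=32,              # maximum out-degree per graph node
                        ef_construct=200)  # candidate list size during build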

Load data

In this section we are going to load the data prepared previously for this session, so you don't have to recompute the embeddings of Wikipedia articles with your own credits.

import pandas as pd
from ast import literal_eval
# Path to the local CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'
article_df = pd.read_csv(csv_file_path)

# Read vectors from strings back into lists
article_df['title_vector'] = article_df.title_vector.apply(literal_eval).values
article_df['content_vector'] = article_df.content_vector.apply(literal_eval).values

# Add or update data in the indexes
for i in range(len(article_df)):
    # Add the data to the index with title_vector
    client.tvs_hset(index=index_names[0], key=article_df.id[i].item(), vector=article_df.title_vector[i], is_binary=False,
                    **{"url": article_df.url[i], "title": article_df.title[i], "text": article_df.text[i]})
    # Add the data to the index with content_vector
    client.tvs_hset(index=index_names[1], key=article_df.id[i].item(), vector=article_df.content_vector[i], is_binary=False,
                    **{"url": article_df.url[i], "title": article_df.title[i], "text": article_df.text[i]})
# Check the data count to make sure all the points have been stored
for index_name in index_names:
    stats = client.tvs_get_index(index_name)
    count = int(stats["current_record_count"]) - int(stats["delete_record_count"])
    print(f"Count in {index_name}:{count}")
Count in openai_test_title_vector:25000
Count in openai_test_content_vector:25000
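
To spot-check that the attribute keys were stored alongside a vector, you can fetch a single record back. The sketch below uses tvs_hgetall, which mirrors Tair's TVS.HGETALL command; treat the exact client signature as an assumption, and note that the key "1" is just an example id from the dataset:

# Fetch every field stored under one key; values come back as bytes
record = client.tvs_hgetall(index_names[0], "1")
for field, value in record.items():
    # The VECTOR field is long, so truncate values for display
    print(field, str(value)[:80])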

Search data

Once the data is put into Tair we will start querying the collection for the closest vectors. We can provide an additional parameter vector_name to switch from title-based to content-based search. Since the precomputed embeddings were created with the text-embedding-3-small OpenAI model, we also have to use it during search.

import numpy as np

def query_tair(client, query, vector_name="title_vector", top_k=5):

    # Create an embedding vector from the user query
    embedded_query = openai.Embedding.create(
        input= query,
        model="text-embedding-3-small",
    )["data"][0]['embedding']
    embedded_query = np.array(embedded_query)

    # Search for the top k approximate nearest neighbours of the vector in the index
    query_result = client.tvs_knnsearch(index=index+"_"+vector_name, k=top_k, vector=embedded_query)

    return query_result

query_result = query_tair(client=client, query="modern art in Europe", vector_name="title_vector")
for i in range(len(query_result)):
    title = client.tvs_hmget(index+"_"+"content_vector", query_result[i][0].decode('utf-8'), "title")
    print(f"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1],3)})")
1. Museum of Modern Art (Distance: 0.125)
2. Western Europe (Distance: 0.133)
3. Renaissance art (Distance: 0.136)
4. Pop art (Distance: 0.14)
5. Northern Europe (Distance: 0.145)
# This time we will query using the content vector
query_result = query_tair(client=client, query="Famous battles in Scottish history", vector_name="content_vector")
for i in range(len(query_result)):
    title = client.tvs_hmget(index+"_"+"content_vector", query_result[i][0].decode('utf-8'), "title")
    print(f"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1],3)})")
1. Battle of Bannockburn (Distance: 0.131)
2. Wars of Scottish Independence (Distance: 0.139)
3. 1651 (Distance: 0.147)
4. First War of Scottish Independence (Distance: 0.15)
5. Robert I of Scotland (Distance: 0.154)
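
TairVector's KNN search can also be combined with a filter over the attribute keys stored with each vector. The sketch below restricts the content-vector query to a single article by title. The filter_str argument name and the title == "..." syntax follow TairVector's TVS.KNNSEARCH documentation, so treat both as assumptions and verify them against your tair version:

# Hedged sketch: an attribute-filtered KNN search on the content index
embedded_query = np.array(openai.Embedding.create(
    input="Famous battles in Scottish history",
    model="text-embedding-3-small",
)["data"][0]["embedding"])

filtered_result = client.tvs_knnsearch(
    index=index + "_content_vector",
    k=5,
    vector=embedded_query,
    filter_str='title == "Battle of Bannockburn"',  # example filter expression
)
print(filtered_result)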