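Running Hybrid VSS Queries with Redis and OpenAI
This notebook introduces Redis as a vector database for OpenAI embeddings and shows how to run hybrid queries that combine vector similarity search (VSS) with lexical search using the Redis Query and Search capability. Redis is a scalable, real-time database that can serve as a vector database when the RediSearch module is loaded. The Redis Query and Search capability lets you index and search vectors in Redis. This notebook shows how to use it to index and search vectors that were created with the OpenAI API and stored in Redis.
Hybrid queries combine vector similarity with traditional Redis Query and Search filters on GEO, NUMERIC, TAG or TEXT data, which simplifies application code. A common example in an e-commerce use case is finding items that are visually similar to a given query image, restricted to items available in a geographic location and within a price range.
Prerequisites
Before we start this project, we need to set up the following:
- Start a Redis database with RediSearch (redis-stack)
- Install libraries
- Get your OpenAI API key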
===========================================================
Start Redis
To keep this example simple, we will use the Redis Stack docker container, which can be started as follows:
$ docker-compose up -d
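This also includes the RedisInsight GUI for managing your Redis database, which you can view at http://localhost:8001 once the docker container is running.
You're all set up and ready to go! Next, we import and create the client for communicating with the Redis database we just created.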
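Before creating a client, it can help to confirm that the container is actually accepting connections. The following is a minimal sketch that uses only the Python standard library. The helper name `wait_for_port` and its timeout values are illustrative assumptions and not part of the original notebook; port 6379 is the Redis default used later in this notebook.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Usage (assumes the Redis Stack container publishes the default port):
# if not wait_for_port("localhost", 6379):
#     raise RuntimeError("Redis is not reachable on localhost:6379")
```

A successful TCP connection only shows that the port is open. The `ping()` call made later in the notebook is still the real check that Redis is answering commands.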
Install Requirements
Redis-Py is the Python client for communicating with Redis. We will use it to communicate with our Redis-stack database.
! pip install redis pandas openai
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: redis in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (4.5.4)
Requirement already satisfied: pandas in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (2.0.1)
Requirement already satisfied: openai in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (0.27.6)
===========================================================
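Prepare your OpenAI API key
The OpenAI API key is used to vectorize the query data.
If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.
Once you have your key, add it to your environment variables as OPENAI_API_KEY using the following command:
# Test that your OpenAI API key is correctly set as an environment variable
# Note: if you run this notebook locally, you will need to reload your terminal and the notebook for the environment variables to take effect.
import os
import openai
os.environ["OPENAI_API_KEY"] = '<YOUR_OPENAI_API_KEY>'
if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")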
OPENAI_API_KEY is ready
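Writing the key into a notebook cell makes it easy to leak by accident. As a hedged alternative, the sketch below reads the key from a mapping such as `os.environ` and fails early with a clear message. The helper name `load_api_key` is hypothetical, and the placeholder check is an assumption based on the cell above.

```python
import os
from typing import Mapping

PLACEHOLDER = "<YOUR_OPENAI_API_KEY>"  # the placeholder string used in the cell above


def load_api_key(env: Mapping[str, str] = os.environ, name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored under `name`; raise if it is missing or still the placeholder."""
    key = env.get(name, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


# Usage (assumes the variable was exported in the shell that launched the notebook):
# openai.api_key = load_api_key()
```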
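Load data
In this section we load and clean an e-commerce dataset. We generate embeddings with OpenAI, use this data to create an index in Redis, and then search for similar vectors.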
import pandas as pd
import numpy as np
from typing import List
from utils.embeddings_utils import (
    get_embeddings,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)
EMBEDDING_MODEL = "text-embedding-3-small"
# load the data, clean the data types and drop rows with null values
df = pd.read_csv("../../data/styles_2k.csv", on_bad_lines='skip')
df.dropna(inplace=True)
df["year"] = df["year"].astype(int)
df.info()
# print dataframe
n_examples = 5
df.head(n_examples)
<class 'pandas.core.frame.DataFrame'>
Index: 1978 entries, 0 to 1998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1978 non-null int64
1 gender 1978 non-null object
2 masterCategory 1978 non-null object
3 subCategory 1978 non-null object
4 articleType 1978 non-null object
5 baseColour 1978 non-null object
6 season 1978 non-null object
7 year 1978 non-null int64
8 usage 1978 non-null object
9 productDisplayName 1978 non-null object
dtypes: int64(2), object(8)
memory usage: 170.0+ KB
 | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
---|---|---|---|---|---|---|---|---|---|---|
0 | 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011 | Casual | Turtle Check Men Navy Blue Shirt |
1 | 39386 | Men | Apparel | Bottomwear | Jeans | Blue | Summer | 2012 | Casual | Peter England Men Party Blue Jeans |
2 | 59263 | Women | Accessories | Watches | Watches | Silver | Winter | 2016 | Casual | Titan Women Silver Watch |
3 | 21379 | Men | Apparel | Bottomwear | Track Pants | Black | Fall | 2011 | Casual | Manchester United Men Solid Black Track Pants |
4 | 53759 | Men | Apparel | Topwear | Tshirts | Grey | Summer | 2012 | Casual | Puma Men Grey T-shirt |
df["product_text"] = df.apply(lambda row: f"name {row['productDisplayName']} category {row['masterCategory']} subcategory {row['subCategory']} color {row['baseColour']} gender {row['gender']}".lower(), axis=1)
df.rename({"id":"product_id"}, inplace=True, axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1978 entries, 0 to 1998
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_id 1978 non-null int64
1 gender 1978 non-null object
2 masterCategory 1978 non-null object
3 subCategory 1978 non-null object
4 articleType 1978 non-null object
5 baseColour 1978 non-null object
6 season 1978 non-null object
7 year 1978 non-null int64
8 usage 1978 non-null object
9 productDisplayName 1978 non-null object
10 product_text 1978 non-null object
dtypes: int64(2), object(9)
memory usage: 185.4+ KB
# check out one of the texts we will use to create semantic embeddings
df["product_text"][0]
'name turtle check men navy blue shirt category apparel subcategory topwear color navy blue gender men'
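To make the text construction easier to inspect, here is a small standalone sketch of the same concatenation applied to a plain dictionary instead of a DataFrame row. The function name `build_product_text` is hypothetical; the column names and the sample values come from the first row of the dataset shown above.

```python
from typing import Mapping


def build_product_text(row: Mapping[str, str]) -> str:
    """Join the descriptive columns of one product into a single lowercase string."""
    parts = [
        ("name", row["productDisplayName"]),
        ("category", row["masterCategory"]),
        ("subcategory", row["subCategory"]),
        ("color", row["baseColour"]),
        ("gender", row["gender"]),
    ]
    return " ".join(f"{label} {value}" for label, value in parts).lower()


example_row = {
    "productDisplayName": "Turtle Check Men Navy Blue Shirt",
    "masterCategory": "Apparel",
    "subCategory": "Topwear",
    "baseColour": "Navy Blue",
    "gender": "Men",
}
print(build_product_text(example_row))
# name turtle check men navy blue shirt category apparel subcategory topwear color navy blue gender men
```

Prefixing each value with its label ("category", "color", and so on) gives the embedding model some context about what each value means, which is presumably why the notebook builds the text this way.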
Connect to Redis
Now that our Redis database is running, we can connect to it with the Redis-py client. We will use the default host and port for the Redis database, which is localhost:6379.
import redis
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TagField,
    NumericField,
    TextField,
    VectorField
)
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_PASSWORD = "" # default for passwordless Redis
# Connect to Redis
redis_client = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    password=REDIS_PASSWORD
)
redis_client.ping()
True
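Creating a Search Index in Redis
The cells below show how to specify and create a search index in Redis. We will:
- Set some constants for defining our index, such as the distance metric and the index name
- Define the index schema with RediSearch fields
- Create the index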
# Constants
INDEX_NAME = "product_embeddings" # name of the search index
PREFIX = "doc" # prefix for the document keys
DISTANCE_METRIC = "L2" # distance metric for the vectors (ex. COSINE, IP, L2)
NUMBER_OF_VECTORS = len(df)
# Define RediSearch fields for the dataset columns we want to index
name = TextField(name="productDisplayName")
category = TagField(name="masterCategory")
articleType = TagField(name="articleType")
gender = TagField(name="gender")
season = TagField(name="season")
year = NumericField(name="year")
text_embedding = VectorField("product_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": 1536, # output dimension of text-embedding-3-small
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": NUMBER_OF_VECTORS,
    }
)
fields = [name, category, articleType, gender, season, year, text_embedding]
# Check if the index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except redis.exceptions.ResponseError:
    # The server reports an error for an unknown index, so create the RediSearch index
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )
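For readers who want to see what the schema above looks like at the command level, the sketch below assembles an argument list for the `FT.CREATE` command with the same fields. It is only an approximation: the helper name `ft_create_args` is hypothetical, and redis-py may add default options (for example a tag separator) that are not reproduced here.

```python
from typing import List


def ft_create_args(index_name: str, prefix: str, dim: int, metric: str, initial_cap: int) -> List[str]:
    """Build an FT.CREATE argument list for a HASH index with one FLAT vector field."""
    vector_attrs = [
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", metric,
        "INITIAL_CAP", str(initial_cap),
    ]
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", prefix,  # "1" is the number of key prefixes that follow
        "SCHEMA",
        "productDisplayName", "TEXT",
        "masterCategory", "TAG",
        "articleType", "TAG",
        "gender", "TAG",
        "season", "TAG",
        "year", "NUMERIC",
        # VECTOR <algorithm> <number of attribute arguments> <attribute arguments...>
        "product_vector", "VECTOR", "FLAT", str(len(vector_attrs)), *vector_attrs,
    ]


print(" ".join(ft_create_args("product_embeddings", "doc", 1536, "L2", 1978)))
```

With a live connection, such a list could be sent with `redis_client.execute_command(*args)`, although the `create_index` call used in this notebook is the more convenient route.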
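Generate OpenAI Embeddings and Load Documents into the Index
Now that we have a search index, we can load documents into it. We will use the dataframe containing the styles dataset that we loaded earlier. In Redis, documents can be stored with either the HASH or the JSON data type (JSON requires RedisJSON in addition to RediSearch). We will use the HASH data type in this example. The cells below show how to get OpenAI embeddings for the products and load the documents into the index.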
# Use OpenAI get_embeddings batch requests to speed up embedding creation
def embeddings_batch_request(documents: pd.DataFrame):
    records = documents.to_dict("records")
    print("Records to process: ", len(records))
    product_vectors = []
    docs = []
    batchsize = 1000
    for idx, doc in enumerate(records, start=1):
        # collect the text to embed for the current batch
        docs.append(doc["product_text"])
        if idx % batchsize == 0:
            product_vectors += get_embeddings(docs, EMBEDDING_MODEL)
            docs.clear()
            print("Vectors processed ", len(product_vectors), end='\r')
    # embed the final partial batch; skip the request when nothing is left
    if docs:
        product_vectors += get_embeddings(docs, EMBEDDING_MODEL)
    print("Vectors processed ", len(product_vectors), end='\r')
    return product_vectors
def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    product_vectors = embeddings_batch_request(documents)
    records = documents.to_dict("records")
    batchsize = 500
    # Use a Redis pipeline to batch calls and save network round trips
    pipe = client.pipeline()
    for idx, doc in enumerate(records, start=1):
        key = f"{prefix}:{str(doc['product_id'])}"
        # convert the list of floats into a float32 byte vector
        text_embedding = np.array((product_vectors[idx-1]), dtype=np.float32).tobytes()
        # store the byte vector under the indexed vector field
        doc["product_vector"] = text_embedding
        pipe.hset(key, mapping = doc)
        if idx % batchsize == 0:
            pipe.execute()
    pipe.execute()
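Both functions above follow the same pattern: process the records in fixed-size batches and then handle whatever remains. As an illustration only, the pattern can be factored into a small generator. The name `chunked` is hypothetical and is not used elsewhere in this notebook.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# 1978 records with a batch size of 1000 give one full batch and one partial batch
batch_sizes = [len(batch) for batch in chunked(list(range(1978)), 1000)]
print(batch_sizes)  # [1000, 978]
```

An empty input yields no batches at all, so a loop built on this helper never sends an empty request.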
%%time
index_documents(redis_client, PREFIX, df)
print(f"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}")
Records to process: 1978
Loaded 1978 documents in Redis search index with name: product_embeddings
CPU times: user 619 ms, sys: 78.9 ms, total: 698 ms
Wall time: 3.34 s
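Each document stores its embedding as raw bytes in the `product_vector` hash field, which matches the `FLOAT32` type declared in the index. The sketch below shows the same round trip with the standard library `array` module instead of NumPy. The helper names are hypothetical, and it assumes a platform where a C `float` is 4 bytes, which is the common case.

```python
from array import array
from typing import List


def to_float32_bytes(vector: List[float]) -> bytes:
    """Pack floats as consecutive native-endian 32-bit values."""
    return array("f", vector).tobytes()


def from_float32_bytes(blob: bytes) -> List[float]:
    """Unpack a byte string produced by to_float32_bytes."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


blob = to_float32_bytes([0.5, -1.25, 2.0])
print(len(blob))                 # 12
print(from_float32_bytes(blob))  # [0.5, -1.25, 2.0]
```

Under the same assumption, a 1536-dimensional embedding occupies 1536 * 4 = 6144 bytes per document. Values that are not exactly representable in 32 bits lose some precision in the conversion.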
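Simple Vector Search Queries with OpenAI Query Embeddings
Now that we have a search index with documents loaded into it, we can run search queries. Below we provide a function that runs a search query and returns the results. Using this function, we run a few queries that show how Redis can be used as a vector database.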
def search_redis(
    redis_client: redis.Redis,
    user_query: str,
    index_name: str = "product_embeddings",
    vector_field: str = "product_vector",
    return_fields: list = ["productDisplayName", "masterCategory", "gender", "season", "year", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
    print_results: bool = True,
) -> List[dict]:
    # Use OpenAI to create an embedding vector from the user query
    embedded_query = openai.Embedding.create(input=user_query,
                                            model=EMBEDDING_MODEL,
                                            )["data"][0]['embedding']
    # Prepare the query
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
        .return_fields(*return_fields)
        .sort_by("vector_score")
        .paging(0, k)
        .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}
    # Run the vector search
    results = redis_client.ft(index_name).search(query, params_dict)
    if print_results:
        for i, product in enumerate(results.docs):
            # vector_score is a distance (lower is closer); 1 - distance is printed as a rough similarity score
            score = 1 - float(product.vector_score)
            print(f"{i}. {product.productDisplayName} (Score: {round(score ,3) })")
    return results.docs
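The query string is the part of `search_redis` that changes between a plain vector search and a hybrid one, so it is worth looking at on its own. The sketch below reproduces only that string formatting. The function name `knn_query` is hypothetical, and no Redis connection is needed to run it.

```python
def knn_query(
    filter_expr: str = "*",
    k: int = 20,
    vector_field: str = "product_vector",
    score_alias: str = "vector_score",
    param: str = "vector",
) -> str:
    """Format '<filter>=>[KNN k @field $param AS alias]' as used by search_redis."""
    if k <= 0:
        raise ValueError("k must be positive")
    return f"{filter_expr}=>[KNN {k} @{vector_field} ${param} AS {score_alias}]"


print(knn_query(k=10))
# *=>[KNN 10 @product_vector $vector AS vector_score]
print(knn_query('@productDisplayName:"blue jeans"', k=10))
# @productDisplayName:"blue jeans"=>[KNN 10 @product_vector $vector AS vector_score]
```

The part before `=>` selects the candidate documents (`*` means all of them), and the part in brackets asks for the `k` nearest neighbors among those candidates. `$vector` is a placeholder that is filled from the parameter dictionary passed to `search`.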
# Execute a simple vector search in Redis
results = search_redis(redis_client, 'man blue jeans', k=10)
0. John Players Men Blue Jeans (Score: 0.791)
1. Lee Men Tino Blue Jeans (Score: 0.775)
2. Peter England Men Party Blue Jeans (Score: 0.763)
3. Lee Men Blue Chicago Fit Jeans (Score: 0.761)
4. Lee Men Blue Chicago Fit Jeans (Score: 0.761)
5. French Connection Men Blue Jeans (Score: 0.74)
6. Locomotive Men Washed Blue Jeans (Score: 0.739)
7. Locomotive Men Washed Blue Jeans (Score: 0.739)
8. Do U Speak Green Men Blue Shorts (Score: 0.736)
9. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.732)
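The printed score is `1 - vector_score`, where `vector_score` is the distance Redis computes under the configured `DISTANCE_METRIC`, so it is best read as a relative ranking value and not as a calibrated similarity. The plain-Python sketch below is independent of Redis and illustrates one relevant identity: for unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so ranking by either gives the same order. Whether the embeddings used here are unit length, and how Redis scales its L2 value, are assumptions that this notebook does not confirm.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.0])
cosine = dot(a, b)    # cosine similarity, since both vectors have unit length
dist2 = squared_l2(a, b)

# For unit vectors: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b = 2 - 2 * cosine
print(math.isclose(dist2, 2 - 2 * cosine))  # True
```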
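Hybrid Queries with Redis
The previous example showed how to run vector search queries with RediSearch. In this section, we show how to combine vector search with other RediSearch fields for hybrid search. In the example below, we combine vector search with full text search.
# improve search quality by combining the vector query "man blue jeans" with a phrase search for "blue jeans"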
results = search_redis(redis_client,
"man blue jeans",
vector_field="product_vector",
k=10,
hybrid_fields='@productDisplayName:"blue jeans"'
)
0. John Players Men Blue Jeans (Score: 0.791)
1. Lee Men Tino Blue Jeans (Score: 0.775)
2. Peter England Men Party Blue Jeans (Score: 0.763)
3. French Connection Men Blue Jeans (Score: 0.74)
4. Locomotive Men Washed Blue Jeans (Score: 0.739)
5. Locomotive Men Washed Blue Jeans (Score: 0.739)
6. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.732)
7. Denizen Women Blue Jeans (Score: 0.725)
8. Jealous 21 Women Washed Blue Jeans (Score: 0.713)
9. Jealous 21 Women Washed Blue Jeans (Score: 0.713)
# hybrid query for "shirt" in the product vector, keeping only results with the phrase "slim fit" in the title
results = search_redis(redis_client,
"shirt",
vector_field="product_vector",
k=10,
hybrid_fields='@productDisplayName:"slim fit"'
)
0. Basics Men White Slim Fit Striped Shirt (Score: 0.633)
1. ADIDAS Men's Slim Fit White T-shirt (Score: 0.628)
2. Basics Men Blue Slim Fit Checked Shirt (Score: 0.627)
3. Basics Men Blue Slim Fit Checked Shirt (Score: 0.627)
4. Basics Men Red Slim Fit Checked Shirt (Score: 0.623)
5. Basics Men Navy Slim Fit Checked Shirt (Score: 0.613)
6. Lee Rinse Navy Blue Slim Fit Jeans (Score: 0.558)
7. Tokyo Talkies Women Navy Slim Fit Jeans (Score: 0.552)
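Note that this query returns 8 results even though `k=10`, and that the last two are jeans: the phrase filter limits the candidates, and the vector ranking can only order what the filter lets through. The toy sketch below mimics that behavior in plain Python with a brute-force search. It is a conceptual illustration with made-up two-dimensional vectors, not a description of how Redis executes hybrid queries internally, and the name `filtered_knn` is hypothetical.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Doc = Dict[str, Any]


def filtered_knn(
    docs: List[Doc],
    query: Sequence[float],
    keep: Callable[[Doc], bool],
    k: int,
) -> List[Tuple[float, str]]:
    """Keep documents passing `keep`, then return up to k (squared distance, name) pairs, closest first."""
    candidates = [d for d in docs if keep(d)]
    scored = [
        (sum((x - y) ** 2 for x, y in zip(d["vector"], query)), d["name"])
        for d in candidates
    ]
    return sorted(scored)[:k]


toy_docs = [
    {"name": "slim fit shirt", "vector": [0.9, 0.1]},
    {"name": "regular shirt", "vector": [1.0, 0.0]},
    {"name": "slim fit jeans", "vector": [0.1, 0.9]},
]
hits = filtered_knn(toy_docs, [1.0, 0.0], lambda d: "slim fit" in d["name"], k=10)
print([name for _, name in hits])  # ['slim fit shirt', 'slim fit jeans']
```

The closest document overall ("regular shirt") is excluded by the filter, and a distant one ("slim fit jeans") is still returned because fewer than `k` candidates remain.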
# hybrid query for "watch" in the product vector, keeping only results with the tag "Accessories" in the masterCategory field
results = search_redis(redis_client,
"watch",
vector_field="product_vector",
k=10,
hybrid_fields='@masterCategory:{Accessories}'
)
0. Titan Women Gold Watch (Score: 0.544)
1. Being Human Men Grey Dial Blue Strap Watch (Score: 0.544)
2. Police Men Black Dial Watch PL12170JSB (Score: 0.544)
3. Titan Men Black Watch (Score: 0.543)
4. Police Men Black Dial Chronograph Watch PL12778MSU-61 (Score: 0.542)
5. CASIO Youth Series Digital Men Black Small Dial Digital Watch W-210-1CVDF I065 (Score: 0.542)
6. Titan Women Silver Watch (Score: 0.542)
7. Police Men Black Dial Watch PL12778MSU-61 (Score: 0.541)
8. Titan Raga Women Gold Watch (Score: 0.539)
9. ADIDAS Original Men Black Dial Chronograph Watch ADH2641 (Score: 0.539)
# hybrid query for "sandals" in the product vector, keeping only results within the 2011-2012 year range
results = search_redis(redis_client,
"sandals",
vector_field="product_vector",
k=10,
hybrid_fields='@year:[2011 2012]'
)
0. Enroute Teens Orange Sandals (Score: 0.701)
1. Fila Men Camper Brown Sandals (Score: 0.692)
2. Clarks Men Black Leather Closed Sandals (Score: 0.691)
3. Coolers Men Black Sandals (Score: 0.69)
4. Coolers Men Black Sandals (Score: 0.69)
5. Enroute Teens Brown Sandals (Score: 0.69)
6. Crocs Dora Boots Pink Sandals (Score: 0.69)
7. Enroute Men Leather Black Sandals (Score: 0.685)
8. ADIDAS Men Navy Blue Benton Sandals (Score: 0.684)
9. Coolers Men Black Sports Sandals (Score: 0.684)
# hybrid query for "blue sandals" in the product vector, keeping only results within the 2011-2012 year range from the summer season
results = search_redis(redis_client,
"blue sandals",
vector_field="product_vector",
k=10,
hybrid_fields='(@year:[2011 2012] @season:{Summer})'
)
0. ADIDAS Men Navy Blue Benton Sandals (Score: 0.691)
1. Enroute Teens Brown Sandals (Score: 0.681)
2. ADIDAS Women's Adi Groove Blue Flip Flop (Score: 0.672)
3. Enroute Women Turquoise Blue Flats (Score: 0.671)
4. Red Tape Men Black Sandals (Score: 0.67)
5. Enroute Teens Orange Sandals (Score: 0.661)
6. Vans Men Blue Era Scilla Plaid Shoes (Score: 0.658)
7. FILA Men Aruba Navy Blue Sandal (Score: 0.657)
8. Quiksilver Men Blue Flip Flops (Score: 0.656)
9. Reebok Men Navy Twist Sandals (Score: 0.656)
# hybrid query for "brown belt", filtering results by year (NUMERIC), by specific article types (TAG) and by brand name (TEXT)
results = search_redis(redis_client,
"brown belt",
vector_field="product_vector",
k=10,
hybrid_fields='(@year:[2012 2012] @articleType:{Shirts | Belts} @productDisplayName:"Wrangler")'
)
0. Wrangler Men Leather Brown Belt (Score: 0.67)
1. Wrangler Women Black Belt (Score: 0.639)
2. Wrangler Men Green Striped Shirt (Score: 0.575)
3. Wrangler Men Purple Striped Shirt (Score: 0.549)
4. Wrangler Men Griffith White Shirt (Score: 0.543)
5. Wrangler Women Stella Green Shirt (Score: 0.542)
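The filter expressions used in this section follow a small number of patterns: braces for TAG values, square brackets for NUMERIC ranges, quotes for TEXT phrases, and a space between clauses to require all of them. As a sketch, they can be assembled with a few helper functions. The helper names below are hypothetical, and the helpers do not escape special characters, so they are only suitable for simple values like the ones used in this notebook.

```python
from typing import Iterable


def tag_filter(field: str, values: Iterable[str]) -> str:
    """Match documents whose TAG field equals any of the given values."""
    return f"@{field}:{{{' | '.join(values)}}}"


def numeric_range(field: str, low: int, high: int) -> str:
    """Match documents whose NUMERIC field lies in the inclusive range [low, high]."""
    return f"@{field}:[{low} {high}]"


def phrase_filter(field: str, phrase: str) -> str:
    """Match documents whose TEXT field contains the exact phrase."""
    return f'@{field}:"{phrase}"'


def all_of(*clauses: str) -> str:
    """Require every clause to match (clauses separated by spaces are intersected)."""
    return "(" + " ".join(clauses) + ")"


print(all_of(
    numeric_range("year", 2012, 2012),
    tag_filter("articleType", ["Shirts", "Belts"]),
    phrase_filter("productDisplayName", "Wrangler"),
))
# (@year:[2012 2012] @articleType:{Shirts | Belts} @productDisplayName:"Wrangler")
```

The resulting string can be passed as the `hybrid_fields` argument of `search_redis`.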