Running Hybrid VSS Queries with Redis and OpenAI

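This notebook introduces how to use Redis as a vector database with OpenAI embeddings, and how to run hybrid queries that combine vector similarity search (VSS) with lexical search using Redis Query and Search. Redis is a scalable, real-time database that can serve as a vector database when used with the RediSearch module. Redis Query and Search lets you index and search vectors stored in Redis. This notebook shows how to index and search vectors that were created with the OpenAI API and stored in Redis.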

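Hybrid queries combine vector similarity with traditional Redis Query and Search capabilities (filtering on GEO, NUMERIC, TAG, or TEXT data) in a single query, which simplifies application code. A common e-commerce example is finding items that are visually similar to a given query image, restricted to items available in a given geographic area and price range.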

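Prerequisites

Before starting this project, we need to set up the following: a running Redis database, the required Python libraries, and an OpenAI API key.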

===========================================================

Start Redis

For simplicity, we use the Redis Stack docker container, which can be started as follows:

$ docker-compose up -d

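This also includes the RedisInsight GUI for managing the Redis database. Once the docker container is running, you can open it at http://localhost:8001.

You're all set! Next, we install the client library and create a client to communicate with the Redis database we just started.

Install Requirements

Redis-Py is the Python client for communicating with Redis. We use it to talk to our Redis Stack database.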

! pip install redis pandas openai

===========================================================

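Prepare Your OpenAI API Key

The OpenAI API key is used to vectorize the query data.

If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys

Once you have your key, make it available as the OPENAI_API_KEY environment variable. The cell below sets it from within the notebook and then checks that it can be read:

# Test that your OpenAI API key is correctly set as an environment variable
# Note: if you run this notebook locally and set the variable in your shell instead, reload your terminal and the notebook for the change to take effect.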
import os
import openai

os.environ["OPENAI_API_KEY"] = '<YOUR_OPENAI_API_KEY>'

if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print ("OPENAI_API_KEY is ready")
else:
    print ("OPENAI_API_KEY environment variable not found")
OPENAI_API_KEY is ready

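Load Data

In this section we load and clean an e-commerce dataset. We then generate embeddings with OpenAI, use this data to create an index in Redis, and search for similar vectors.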

import pandas as pd
import numpy as np
from typing import List

from utils.embeddings_utils import (
    get_embeddings,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)

EMBEDDING_MODEL = "text-embedding-3-small"

# Load the data, drop rows with missing values, and fix data types
df = pd.read_csv("../../data/styles_2k.csv", on_bad_lines='skip')
df.dropna(inplace=True)
df["year"] = df["year"].astype(int)
df.info()

# Show the first rows of the dataframe
n_examples = 5
df.head(n_examples)
<class 'pandas.core.frame.DataFrame'>
Index: 1978 entries, 0 to 1998
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   id                  1978 non-null   int64
 1   gender              1978 non-null   object
 2   masterCategory      1978 non-null   object
 3   subCategory         1978 non-null   object
 4   articleType         1978 non-null   object
 5   baseColour          1978 non-null   object
 6   season              1978 non-null   object
 7   year                1978 non-null   int64
 8   usage               1978 non-null   object
 9   productDisplayName  1978 non-null   object
dtypes: int64(2), object(8)
memory usage: 170.0+ KB
   id     gender  masterCategory  subCategory  articleType  baseColour  season  year  usage   productDisplayName
0  15970  Men     Apparel         Topwear      Shirts       Navy Blue   Fall    2011  Casual  Turtle Check Men Navy Blue Shirt
1  39386  Men     Apparel         Bottomwear   Jeans        Blue        Summer  2012  Casual  Peter England Men Party Blue Jeans
2  59263  Women   Accessories     Watches      Watches      Silver      Winter  2016  Casual  Titan Women Silver Watch
3  21379  Men     Apparel         Bottomwear   Track Pants  Black       Fall    2011  Casual  Manchester United Men Solid Black Track Pants
4  53759  Men     Apparel         Topwear      Tshirts      Grey        Summer  2012  Casual  Puma Men Grey T-shirt

df["product_text"] = df.apply(lambda row: f"name {row['productDisplayName']} category {row['masterCategory']} subcategory {row['subCategory']} color {row['baseColour']} gender {row['gender']}".lower(), axis=1)
df.rename({"id":"product_id"}, inplace=True, axis=1)

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1978 entries, 0 to 1998
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   product_id          1978 non-null   int64
 1   gender              1978 non-null   object
 2   masterCategory      1978 non-null   object
 3   subCategory         1978 non-null   object
 4   articleType         1978 non-null   object
 5   baseColour          1978 non-null   object
 6   season              1978 non-null   object
 7   year                1978 non-null   int64
 8   usage               1978 non-null   object
 9   productDisplayName  1978 non-null   object
 10  product_text        1978 non-null   object
dtypes: int64(2), object(9)
memory usage: 185.4+ KB
# Inspect one of the texts we will use to create semantic embeddings
df["product_text"][0]
'name turtle check men navy blue shirt category apparel subcategory topwear color navy blue gender men'

Connect to Redis

Now that the Redis database is running, we can connect to it with the Redis-py client. We use the default host and port of the Redis database, localhost:6379.

import redis
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TagField,
    NumericField,
    TextField,
    VectorField
)

REDIS_HOST =  "localhost"
REDIS_PORT = 6379
REDIS_PASSWORD = "" # default for passwordless Redis

# Connect to Redis
redis_client = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    password=REDIS_PASSWORD
)
redis_client.ping()
True

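Create a Search Index in Redis

The cells below show how to specify and create a search index in Redis. We will:

  1. Set some constants that define the index, such as the distance metric and the index name
  2. Define the index schema with RediSearch fields
  3. Create the index
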
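# Constants
INDEX_NAME = "product_embeddings"   # name of the search index
PREFIX = "doc"                      # prefix for the document keys
DISTANCE_METRIC = "L2"              # distance metric for the vectors (e.g. COSINE, IP, L2)
NUMBER_OF_VECTORS = len(df)
# Define a RediSearch field for each dataset column we want to index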
name = TextField(name="productDisplayName")
category = TagField(name="masterCategory")
articleType = TagField(name="articleType")
gender = TagField(name="gender")
season = TagField(name="season")
year = NumericField(name="year")
text_embedding = VectorField("product_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": 1536,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": NUMBER_OF_VECTORS,
    }
)
fields = [name, category, articleType, gender, season, year, text_embedding]
# Check whether the index already exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except redis.exceptions.ResponseError:
    # The index does not exist yet, so create the RediSearch index
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )
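
The DISTANCE_METRIC constant selects how the index compares vectors. As a rough illustration of how the options relate, the sketch below computes cosine distance, inner product, and squared Euclidean distance in plain Python for two made-up 3-dimensional vectors (toy values, not real embeddings). For unit-length vectors the three carry the same ranking information, because the squared Euclidean distance equals twice the cosine distance. How Redis scales the score it reports for each metric is not covered here and is worth checking in the Redis documentation, since search_redis later in this notebook converts the score with 1 - vector_score, which reads as a cosine similarity only when the score is a cosine distance.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity."""
    na = math.sqrt(inner_product(a, a))
    nb = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (na * nb)

def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors, chosen only for illustration
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.0])

cos_dist = cosine_distance(a, b)
sq_l2 = squared_l2(a, b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine distance
print(math.isclose(sq_l2, 2.0 * cos_dist))
# For unit vectors the inner product is the cosine similarity
print(math.isclose(inner_product(a, b), 1.0 - cos_dist))
```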

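Generate OpenAI Embeddings and Load Documents into the Index

Now that we have a search index, we can load documents into it. We use the dataframe containing the styles dataset that we loaded earlier. In Redis, documents can be stored with either the HASH or the JSON data type (JSON requires RedisJSON in addition to RediSearch). This example uses the HASH data type. The cells below show how to get OpenAI embeddings for the products and load the documents into the index.

# Use OpenAI get_embeddings batch requests to speed up embedding creation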
def embeddings_batch_request(documents: pd.DataFrame):
    records = documents.to_dict("records")
    print("Records to process: ", len(records))
    product_vectors = []
    docs = []
    batchsize = 1000

    for idx,doc in enumerate(records,start=1):
        # collect the product text for the current batch
        docs.append(doc["product_text"])
        if idx % batchsize == 0:
            product_vectors += get_embeddings(docs, EMBEDDING_MODEL)
            docs.clear()
            print("Vectors processed ", len(product_vectors), end='\r')
    # embed the remaining partial batch, if there is one
    if docs:
        product_vectors += get_embeddings(docs, EMBEDDING_MODEL)
    print("Vectors processed ", len(product_vectors), end='\r')
    return product_vectors

def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    product_vectors = embeddings_batch_request(documents)
    records = documents.to_dict("records")
    batchsize = 500

    # Use a Redis pipeline to batch calls and save network round trips
    pipe = client.pipeline()
    for idx,doc in enumerate(records,start=1):
        key = f"{prefix}:{str(doc['product_id'])}"

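        # convert the embedding (a list of floats) to float32 bytes
        text_embedding = np.array((product_vectors[idx-1]), dtype=np.float32).tobytes()

        # store the byte vector on the document as the product_vector field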
        doc["product_vector"] = text_embedding

        pipe.hset(key, mapping = doc)
        if idx % batchsize == 0:
            pipe.execute()
    pipe.execute()
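
index_documents stores each embedding as raw bytes rather than as a list of floats. The sketch below shows the same float32 packing in isolation, using the standard-library array module instead of NumPy. That substitution is an assumption made only to keep the example dependency-free; on the same machine, array("f", ...).tobytes() and np.array(..., dtype=np.float32).tobytes() should both produce native-order 32-bit floats. The helper names and the toy vector are made up for illustration. A real vector for this index has 1536 values and therefore occupies 1536 * 4 bytes.

```python
from array import array

def to_float32_bytes(vector):
    """Pack a list of floats as native-order 32-bit floats."""
    return array("f", vector).tobytes()

def from_float32_bytes(blob):
    """Unpack bytes produced by to_float32_bytes back into a list of floats."""
    out = array("f")
    out.frombytes(blob)
    return out.tolist()

# Toy values that are exactly representable in float32
toy_vector = [0.5, -1.25, 2.0, 0.0]
blob = to_float32_bytes(toy_vector)

print(len(blob))                               # 4 values * 4 bytes each
print(from_float32_bytes(blob) == toy_vector)  # round trip is lossless for these values

DIM = 1536  # matches the DIM of the vector field defined above
print(len(to_float32_bytes([0.0] * DIM)))      # bytes stored per embedding
```

Values that are not exactly representable in float32 lose some precision in the round trip, which is normally acceptable for similarity search.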
%%time
index_documents(redis_client, PREFIX, df)
print(f"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}")
Records to process:  1978
Loaded 1978 documents in Redis search index with name: product_embeddings
CPU times: user 619 ms, sys: 78.9 ms, total: 698 ms
Wall time: 3.34 s
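
Both embeddings_batch_request and index_documents follow the same pattern: accumulate items, flush every batchsize items, then flush whatever is left over. A generic version of that pattern could look like the sketch below. batched_chunks is a hypothetical helper name, not part of this notebook or of any library used here.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched_chunks(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    chunk: List[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    # Yield the trailing partial batch only if it is non-empty
    if chunk:
        yield chunk

# 1978 records with a batch size of 1000, as in embeddings_batch_request
sizes = [len(c) for c in batched_chunks(range(1978), 1000)]
print(sizes)  # [1000, 978]
```

Skipping the empty trailing batch matters when the number of records is an exact multiple of the batch size, because there is then nothing left to send in the final request.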

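Simple Vector Search Queries with OpenAI Query Embeddings

Now that we have a search index with documents loaded into it, we can run search queries. Below we provide a function that runs a search query and returns the results. Using this function, we run a few queries that show how Redis can be used as a vector database.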

def search_redis(
    redis_client: redis.Redis,
    user_query: str,
    index_name: str = "product_embeddings",
    vector_field: str = "product_vector",
    return_fields: list = ["productDisplayName", "masterCategory", "gender", "season", "year", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
    print_results: bool = True,
) -> List[dict]:

    # Create an embedding vector from the user query with OpenAI
    embedded_query = openai.Embedding.create(input=user_query,
                                            model=EMBEDDING_MODEL,
                                            )["data"][0]['embedding']

    # Prepare the query
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
         .return_fields(*return_fields)
         .sort_by("vector_score")
         .paging(0, k)
         .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}

    # Run the vector search
    results = redis_client.ft(index_name).search(query, params_dict)
    if print_results:
        for i, product in enumerate(results.docs):
            score = 1 - float(product.vector_score)
            print(f"{i}. {product.productDisplayName} (Score: {round(score ,3) })")
    return results.docs

# Run a simple vector search in Redis
results = search_redis(redis_client, 'man blue jeans', k=10)
0. John Players Men Blue Jeans (Score: 0.791)
1. Lee Men Tino Blue Jeans (Score: 0.775)
2. Peter England Men Party Blue Jeans (Score: 0.763)
3. Lee Men Blue Chicago Fit Jeans (Score: 0.761)
4. Lee Men Blue Chicago Fit Jeans (Score: 0.761)
5. French Connection Men Blue Jeans (Score: 0.74)
6. Locomotive Men Washed Blue Jeans (Score: 0.739)
7. Locomotive Men Washed Blue Jeans (Score: 0.739)
8. Do U Speak Green Men Blue Shorts (Score: 0.736)
9. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.732)
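
To make the query syntax concrete, the sketch below rebuilds only the query string that search_redis passes to Query, without contacting Redis or OpenAI. build_knn_query is a hypothetical helper introduced for this illustration; the string format is copied from the base_query line in search_redis.

```python
def build_knn_query(
    hybrid_fields: str = "*",
    k: int = 20,
    vector_field: str = "product_vector",
    score_alias: str = "vector_score",
) -> str:
    """Return the query string: <filter>=>[KNN <k> @<field> $vector AS <alias>]."""
    return f"{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS {score_alias}]"

# "*" applies no filter, so every indexed document is a candidate
print(build_knn_query(k=10))

# A filter expression before "=>" restricts the candidates for the KNN search
print(build_knn_query('@productDisplayName:"blue jeans"', k=10))
```

The $vector placeholder is not filled in by string formatting. It is a query parameter, and search_redis supplies its value as float32 bytes through params_dict.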

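Hybrid Queries with Redis

The previous example showed how to run a vector search query with RediSearch. In this section we show how to combine vector search with other RediSearch fields for hybrid search. In the first example below, we combine vector search with full-text search.

# Improve search quality with a hybrid query: vector search for "man blue jeans" on the product vector, combined with a phrase search for "blue jeans"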
results = search_redis(redis_client,
                       "man blue jeans",
                       vector_field="product_vector",
                       k=10,
                       hybrid_fields='@productDisplayName:"blue jeans"'
                       )
0. John Players Men Blue Jeans (Score: 0.791)
1. Lee Men Tino Blue Jeans (Score: 0.775)
2. Peter England Men Party Blue Jeans (Score: 0.763)
3. French Connection Men Blue Jeans (Score: 0.74)
4. Locomotive Men Washed Blue Jeans (Score: 0.739)
5. Locomotive Men Washed Blue Jeans (Score: 0.739)
6. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.732)
7. Denizen Women Blue Jeans (Score: 0.725)
8. Jealous 21 Women Washed Blue Jeans (Score: 0.713)
9. Jealous 21 Women Washed Blue Jeans (Score: 0.713)

# Hybrid query for "shirt" on the product vector, keeping only results whose title contains the phrase "slim fit"
results = search_redis(redis_client,
                       "shirt",
                       vector_field="product_vector",
                       k=10,
                       hybrid_fields='@productDisplayName:"slim fit"'
                       )
0. Basics Men White Slim Fit Striped Shirt (Score: 0.633)
1. ADIDAS Men's Slim Fit White T-shirt (Score: 0.628)
2. Basics Men Blue Slim Fit Checked Shirt (Score: 0.627)
3. Basics Men Blue Slim Fit Checked Shirt (Score: 0.627)
4. Basics Men Red Slim Fit Checked Shirt (Score: 0.623)
5. Basics Men Navy Slim Fit Checked Shirt (Score: 0.613)
6. Lee Rinse Navy Blue Slim Fit Jeans (Score: 0.558)
7. Tokyo Talkies Women Navy Slim Fit Jeans (Score: 0.552)

# Hybrid query for "watch" on the product vector, keeping only results tagged "Accessories" in the masterCategory field
results = search_redis(redis_client,
                       "watch",
                       vector_field="product_vector",
                       k=10,
                       hybrid_fields='@masterCategory:{Accessories}'
                       )
0. Titan Women Gold Watch (Score: 0.544)
1. Being Human Men Grey Dial Blue Strap Watch (Score: 0.544)
2. Police Men Black Dial Watch PL12170JSB (Score: 0.544)
3. Titan Men Black Watch (Score: 0.543)
4. Police Men Black Dial Chronograph Watch PL12778MSU-61 (Score: 0.542)
5. CASIO Youth Series Digital Men Black Small Dial Digital Watch W-210-1CVDF I065 (Score: 0.542)
6. Titan Women Silver Watch (Score: 0.542)
7. Police Men Black Dial Watch PL12778MSU-61 (Score: 0.541)
8. Titan Raga Women Gold Watch (Score: 0.539)
9. ADIDAS Original Men Black Dial Chronograph Watch ADH2641 (Score: 0.539)

# Hybrid query for "sandals" on the product vector, keeping only results from the years 2011-2012
results = search_redis(redis_client,
                       "sandals",
                       vector_field="product_vector",
                       k=10,
                       hybrid_fields='@year:[2011 2012]'
                       )
0. Enroute Teens Orange Sandals (Score: 0.701)
1. Fila Men Camper Brown Sandals (Score: 0.692)
2. Clarks Men Black Leather Closed Sandals (Score: 0.691)
3. Coolers Men Black Sandals (Score: 0.69)
4. Coolers Men Black Sandals (Score: 0.69)
5. Enroute Teens Brown Sandals (Score: 0.69)
6. Crocs Dora Boots Pink Sandals (Score: 0.69)
7. Enroute Men Leather Black Sandals (Score: 0.685)
8. ADIDAS Men Navy Blue Benton Sandals (Score: 0.684)
9. Coolers Men Black Sports Sandals (Score: 0.684)

# Hybrid query for "blue sandals" on the product vector, keeping only results from the years 2011-2012 and the Summer season
results = search_redis(redis_client,
                       "blue sandals",
                       vector_field="product_vector",
                       k=10,
                       hybrid_fields='(@year:[2011 2012] @season:{Summer})'
                       )
0. ADIDAS Men Navy Blue Benton Sandals (Score: 0.691)
1. Enroute Teens Brown Sandals (Score: 0.681)
2. ADIDAS Women's Adi Groove Blue Flip Flop (Score: 0.672)
3. Enroute Women Turquoise Blue Flats (Score: 0.671)
4. Red Tape Men Black Sandals (Score: 0.67)
5. Enroute Teens Orange Sandals (Score: 0.661)
6. Vans Men Blue Era Scilla Plaid Shoes (Score: 0.658)
7. FILA Men Aruba Navy Blue Sandal (Score: 0.657)
8. Quiksilver Men Blue Flip Flops (Score: 0.656)
9. Reebok Men Navy Twist Sandals (Score: 0.656)

# Hybrid query for "brown belt", filtering results by year (NUMERIC), by specific article types (TAG), and by brand name (TEXT)
results = search_redis(redis_client,
                       "brown belt",
                       vector_field="product_vector",
                       k=10,
                       hybrid_fields='(@year:[2012 2012] @articleType:{Shirts | Belts} @productDisplayName:"Wrangler")'
                       )
0. Wrangler Men Leather Brown Belt (Score: 0.67)
1. Wrangler Women Black Belt (Score: 0.639)
2. Wrangler Men Green Striped Shirt (Score: 0.575)
3. Wrangler Men Purple Striped Shirt (Score: 0.549)
4. Wrangler Men Griffith White Shirt (Score: 0.543)
5. Wrangler Women Stella Green Shirt (Score: 0.542)
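
The hybrid_fields strings used above can also be composed programmatically. The helpers below are a sketch with hypothetical names (escape_tag, tag_filter, numeric_range, text_phrase, and_filters); they only build strings and do not talk to Redis. The backslash escaping of spaces and punctuation in TAG values is an assumption about the Redis query syntax and should be checked against the Redis documentation for your version.

```python
import re
from typing import Sequence

def escape_tag(value: str) -> str:
    """Backslash-escape every character that is not a letter, digit, or underscore."""
    return re.sub(r"([^0-9A-Za-z_])", r"\\\1", value)

def tag_filter(field: str, values: Sequence[str]) -> str:
    """Match documents whose TAG field equals any of the given values."""
    joined = " | ".join(escape_tag(v) for v in values)
    return f"@{field}:{{{joined}}}"

def numeric_range(field: str, low, high) -> str:
    """Match documents whose NUMERIC field lies in the inclusive range [low, high]."""
    return f"@{field}:[{low} {high}]"

def text_phrase(field: str, phrase: str) -> str:
    """Match documents whose TEXT field contains the exact phrase."""
    return f'@{field}:"{phrase}"'

def and_filters(*clauses: str) -> str:
    """Combine clauses so that all of them must match."""
    return "(" + " ".join(clauses) + ")"

# Rebuild the filter used in the "brown belt" example above
hybrid_fields = and_filters(
    numeric_range("year", 2012, 2012),
    tag_filter("articleType", ["Shirts", "Belts"]),
    text_phrase("productDisplayName", "Wrangler"),
)
print(hybrid_fields)
```

The resulting string can be passed to search_redis as the hybrid_fields argument, in the same way as the hand-written filters above.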