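Multimodal RAG with CLIP Embeddings and GPT-4 Vision

Multimodal RAG integrates additional modalities into traditional text-based RAG. It strengthens an LLM's question answering by supplying extra context and by grounding the textual data.

Following the approach from the clothing matchmaker cookbook, we embed images directly for similarity search. This bypasses the lossy step of captioning images as text and is intended to improve retrieval accuracy.

CLIP-based embeddings can also be fine-tuned on domain-specific data or extended with previously unseen images.

We showcase the technique by searching an enterprise knowledge base with a user-provided image of a piece of technology and returning the relevant information.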

Installations

First, let's install the relevant packages.

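# Installations
# CLIP is installed from the OpenAI GitHub repository; the PyPI package named "clip" is unrelated.
%pip install torch
%pip install pillow
%pip install faiss-cpu
%pip install numpy
%pip install git+https://github.com/openai/CLIP.git
%pip install openai
%pip install tqdm matplotlib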

Then, let's import all the packages we need.

# Model imports
import faiss
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import clip
from openai import OpenAI

# Helper imports
from tqdm import tqdm
import json
import os
import numpy as np
import pickle
from typing import List, Union, Tuple

# Visualisation imports
from PIL import Image
import matplotlib.pyplot as plt
import base64

# The OpenAI client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

Now let's load the CLIP model.

# Load the model on the device. The device you run inference/training on is either a CPU or a GPU, if you have one.
device = "cpu"
model, preprocess = clip.load("ViT-B/32",device=device)

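We will now:

Create the image embedding database
Set up a query to the vision model
Perform the semantic search
Pass a user query to the image

Create image embedding database

Next, we will create our image embedding knowledge base from a directory of images. This is the knowledge base of technology products that we search in order to give users information about an image they upload.

We pass in the directory where the images are stored (as JPEGs) and loop through each one to create its embedding.

We also have a description.json file. It has one entry for every image in the knowledge base, and each entry has two keys: 'image_path' and 'description'. It maps each image to a useful description that helps answer the user's question.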
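To make the expected file layout concrete, here is a minimal sketch. It assumes description.json is stored in JSON Lines form (one JSON object per line), which is how the loading code later in this notebook reads it. The file name sample_description.json, the paths, and the descriptions are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical entries: real paths and descriptions come from your own knowledge base.
sample_entries = [
    {"image_path": "image_database/example1.jpeg", "description": "Example description of the first image."},
    {"image_path": "image_database/example2.jpeg", "description": "Example description of the second image."},
]

# Write one JSON object per line (JSON Lines).
sample_file = os.path.join(tempfile.mkdtemp(), "sample_description.json")
with open(sample_file, "w") as f:
    for entry in sample_entries:
        f.write(json.dumps(entry) + "\n")

# Read the file back line by line, skipping blank lines.
loaded = []
with open(sample_file, "r") as f:
    for line in f:
        if line.strip():
            loaded.append(json.loads(line))

print(loaded[0]["image_path"])
```

Note that the 'image_path' values need to match the paths produced when listing the image directory, otherwise the lookup by path will not find an entry.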

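First, let's write a function that returns all the image paths in a given directory. We then use it to collect every .jpeg file from a directory called 'image_database'.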

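def get_image_paths(directory: str, number: int = None) -> List[str]:
    """Return all .jpeg paths in `directory`, or only the one at position `number` if given."""
    image_paths = []
    count = 0
    # Sort the listing so positions stay stable between calls
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.jpeg'):
            image_paths.append(os.path.join(directory, filename))
            # When a position is requested, return just that path as a one-element list
            if number is not None and count == number:
                return [image_paths[-1]]
            count += 1
    return image_paths

direc = 'image_database/'
image_paths = get_image_paths(direc)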
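Because FAISS returns integer positions rather than file names, the position of each path in the listing has to match the order in which the embeddings are added to the index. As an alternative sketch (not the notebook's own approach), the paths can be listed once in sorted order and then indexed directly. Here list_jpegs is a hypothetical helper, and the temporary directory stands in for image_database.

```python
import os
import tempfile
from typing import List

def list_jpegs(directory: str) -> List[str]:
    """Return all .jpeg paths in `directory`, sorted so the order is reproducible."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".jpeg")
    )

# Build a throwaway directory with empty placeholder files for the demo.
demo_dir = tempfile.mkdtemp()
for name in ["b.jpeg", "a.jpeg", "notes.txt", "c.jpeg"]:
    open(os.path.join(demo_dir, name), "w").close()

demo_paths = list_jpegs(demo_dir)

# A search result index of 1 maps to the second path in the sorted listing.
print(os.path.basename(demo_paths[1]))  # b.jpeg
```

Keeping one list in memory avoids re-reading the directory for every lookup, at the cost of having to rebuild the list whenever images are added.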

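Next, we will write a function that returns the CLIP image embeddings for a list of image paths.

We first preprocess each image with the preprocess function we obtained when loading the model. It performs several steps to make sure the input to CLIP has the right format and dimensions, including resizing, normalization, and colour channel adjustment.

We then stack the preprocessed images into a single batch so we can pass them to the model at once rather than in a loop. Finally, we return the model output, which is an array of embeddings.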

def get_features_from_image_path(image_paths):
  # Preprocess each image into a tensor of the shape CLIP expects
  images = [preprocess(Image.open(image_path).convert("RGB")) for image_path in image_paths]
  # Stack into a single batch tensor and move it to the model's device
  image_input = torch.stack(images).to(device)
  with torch.no_grad():
    image_features = model.encode_image(image_input).float()
  return image_features

image_features = get_features_from_image_path(image_paths)

We can now create our vector database.

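# Inner-product index sized to the embedding dimension
index = faiss.IndexFlatIP(image_features.shape[1])
# FAISS expects a float32 NumPy array on the CPU
index.add(image_features.cpu().numpy())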
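IndexFlatIP ranks by inner product, which equals cosine similarity only when the vectors have unit length. The notebook adds the raw CLIP features to the index. The dependency-free sketch below, using toy 2-D vectors rather than real embeddings, shows why one might L2-normalize first. The helpers l2_normalize and inner_product are illustrative and are not part of FAISS or CLIP.

```python
import math
from typing import List

def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way as the query
long_but_off = [3.0, 3.0]     # 45 degrees away, but with a larger magnitude

# Raw inner products reward magnitude: the off-direction vector scores higher.
raw_scores = (inner_product(query, same_direction), inner_product(query, long_but_off))

# After normalization the scores are cosine similarities, so direction decides.
q = l2_normalize(query)
cos_scores = (
    inner_product(q, l2_normalize(same_direction)),
    inner_product(q, l2_normalize(long_but_off)),
)
print(raw_scores, cos_scores)
```

With FAISS, the same effect is typically obtained by normalizing both the database features and the query features before adding and searching; faiss.normalize_L2 does this in place for float32 NumPy arrays.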

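We also load our JSON file for the image-to-description mapping and build a list of entries. In addition, we create a helper function that searches this list for a given image so that we can retrieve its description.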

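data = []
image_path = 'train1.jpeg'  # the example image the user uploads
# description.json holds one JSON object per line
with open('description.json', 'r') as file:
    for line in file:
        data.append(json.loads(line))

def find_entry(data, key, value):
    # Return the first entry whose `key` equals `value`, or None if there is no match
    for entry in data:
        if entry.get(key) == value:
            return entry
    return None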

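Let's display an example image, which will act as the image uploaded by the user. It shows a piece of tech that was unveiled at CES 2024: the DELTA Pro Ultra Whole House Battery Generator.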

im = Image.open(image_path)
plt.imshow(im)
plt.show()

Delta Pro

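Querying the vision model

Now let's look at how GPT-4 Vision, which likely has not seen this technology before, labels it.

First, we need a function that encodes our image as base64, since that is the format we pass to the vision model. Then we create a generic image_query function that lets us query the LLM with an image input.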

def encode_image(image_path):
    with open(image_path, 'rb') as image_file:
        encoded_image = base64.b64encode(image_file.read())
        return encoded_image.decode('utf-8')

def image_query(query, image_path):
    response = client.chat.completions.create(
        model='gpt-4-vision-preview',
        messages=[
            {
            "role": "user",
            "content": [
                {
                "type": "text",
                "text": query,
                },
                {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_path)}",
                },
                }
            ],
            }
        ],
        max_tokens=300,
    )
    # Extract relevant features from the response
    return response.choices[0].message.content
image_query('Write a short label of what is shown in this image.', image_path)

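'Autonomous Delivery Robot'

As we can see, the model does its best with what it learned during training, but it gets the label wrong because it has not seen anything similar. The image is ambiguous, which makes it hard to extrapolate from and reason about.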
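The base64 step used when building the image message can be checked in isolation. The sketch below is a minimal illustration, assuming the image is sent as a data URL of the form data:image/jpeg;base64,... as in the request above. The helper to_data_url and the placeholder bytes are hypothetical; no real image or API call is involved.

```python
import base64

def to_data_url(raw_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw bytes as a base64 data URL string."""
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Placeholder bytes standing in for the contents of a JPEG file.
fake_image_bytes = b"\xff\xd8\xff\xe0 not a real image"
data_url = to_data_url(fake_image_bytes)

# Decoding the payload after the comma recovers the original bytes exactly.
payload = data_url.split(",", 1)[1]
round_trip = base64.b64decode(payload)
print(data_url.split(",", 1)[0], round_trip == fake_image_bytes)
```

Base64 is a lossless text encoding, so the model receives exactly the bytes of the file; only the MIME type in the prefix has to match the actual image format.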

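Performing semantic search

Now let's run a similarity search to find the two most similar images in our knowledge base. We compute the embedding of the user-supplied image_path and retrieve the indices and scores of the closest images in the database. Although FAISS calls the returned values distances, for an IndexFlatIP index they are inner products, so a larger value means a more similar image. We therefore sort in descending order so that the best match comes first.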

image_search_embedding = get_features_from_image_path([image_path])
# FAISS expects a float32 NumPy array; 2 is the number of top similar images to return
query_vector = image_search_embedding.cpu().numpy().reshape(1, -1)
distances, indices = index.search(query_vector, 2)
distances = distances[0]
indices = indices[0]
indices_distances = list(zip(indices, distances))
# Highest inner product (most similar) first
indices_distances.sort(key=lambda x: x[1], reverse=True)
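For intuition about what the search call returns, here is a brute-force sketch in plain Python. It assumes that an exact inner-product index behaves like scoring every stored vector against the query and keeping the k highest scores. The function top_k_inner_product and the toy vectors are hypothetical, and this is not a description of how FAISS is implemented internally.

```python
from typing import List, Tuple

def top_k_inner_product(
    database: List[List[float]], query: List[float], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k highest inner products, best first."""
    scores = [
        (i, sum(x * y for x, y in zip(vec, query)))
        for i, vec in enumerate(database)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

toy_database = [
    [0.0, 1.0],   # index 0: orthogonal to the query
    [1.0, 0.0],   # index 1: identical to the query
    [0.6, 0.8],   # index 2: partially aligned
]
toy_query = [1.0, 0.0]

results = top_k_inner_product(toy_database, toy_query, k=2)
print(results)  # [(1, 1.0), (2, 0.6)]
```

The returned indices refer to positions in the database, which is why the order of the image paths must match the order in which the embeddings were added.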

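We need the indices because we use them to look up images in our image directory: the image at each returned position is the one we feed into the vision model for RAG.

Let's see what the search brought back (displayed in order of similarity):

# Display the similar images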
for idx, distance in indices_distances:
    print(idx)
    path = get_image_paths(direc, idx)[0]
    im = Image.open(path)
    plt.imshow(im)
    plt.show()

Delta Pro2

Delta Pro3

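We can see that the search returned two images containing the DELTA Pro Ultra Whole House Battery Generator. One of them also has a background that could be distracting, yet the search still found the right product.

User querying the most similar image

Now we want to pass the most similar image, together with its description and a user query, to GPT-4 Vision so that users can ask about technology they may have bought. This is where the vision model is most useful: you can ask it general questions it was not explicitly trained to answer, and with the retrieved description as context it can still answer accurately.

In the example below, we ask about the capacity of the item in question.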

similar_path = get_image_paths(direc, indices_distances[0][0])[0]
element = find_entry(data, 'image_path', similar_path)

user_query = 'What is the capacity of this item?'
prompt = f"""
Below is a user query, I want you to answer the query using the description and image provided.

user query:
{user_query}

description:
{element['description']}
"""
image_query(prompt, similar_path)

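'The portable home battery DELTA Pro has a base capacity of 3.6kWh. This capacity can be expanded up to 25kWh with additional batteries. The image showcases the DELTA Pro, which has an impressive 3600W power capacity for AC output as well.'

We see that the model is able to answer the question. This was only possible because we matched the images directly and then used the description of the matched image as context.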
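If the lookup by image path finds no entry, find_entry returns None and indexing it for the description raises an error. The sketch below shows one possible guard, assuming a plain fallback sentence is acceptable when no description is available. The function build_prompt, the fallback text, and the example entry are hypothetical and not part of the original notebook.

```python
from typing import Optional

def build_prompt(user_query: str, entry: Optional[dict]) -> str:
    """Combine the user query with the retrieved description, if one was found."""
    # Fall back to an explicit note when no matching entry exists.
    description = entry["description"] if entry else "No description available."
    return (
        "Below is a user query, I want you to answer the query "
        "using the description and image provided.\n\n"
        f"user query:\n{user_query}\n\n"
        f"description:\n{description}\n"
    )

# Hypothetical entry shaped like one line of description.json.
example_entry = {
    "image_path": "image_database/example.jpeg",
    "description": "An example product description.",
}
prompt_with_context = build_prompt("What is the capacity of this item?", example_entry)
prompt_without_context = build_prompt("What is the capacity of this item?", None)
print(prompt_with_context)
```

The resulting string can be passed as the query argument of image_query in the same way as the prompt built above.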

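Conclusion

In this notebook, we covered how to use the CLIP model, how to build an image embedding database with it, how to perform semantic search over that database, and finally how to pass a user query along with the retrieved image and description to answer a question.

This usage pattern applies across many domains and can readily be improved. For example, you can fine-tune CLIP, refine the retrieval process as you would in text-based RAG, and apply prompt engineering to GPT-4 Vision.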