Retrieval Augmented Generative Question Answering with Pinecone
Fixing LLMs that Hallucinate
In this notebook we will learn how to query Pinecone for contexts relevant to our query, and pass these contexts to an OpenAI generative model to generate an answer grounded in real data sources.
A common problem when using GPT-3 to factually answer questions is that GPT-3 can sometimes make things up. The GPT models have broad general knowledge, but this does not necessarily apply to more specific information. For that we use the Pinecone vector database as our "external knowledge base", like long-term memory for GPT-3.
The required installs for this notebook are:
!pip install -qU openai pinecone-client datasets
import openai

# get the API key from the top-right dropdown on the OpenAI website
openai.api_key = "OPENAI_API_KEY"
For many questions, state-of-the-art (SOTA) LLMs are more than capable of answering correctly.
query = "Who was the 12th person on the moon and when did they land?"
# now query `gpt-3.5-turbo-instruct` WITHOUT context
res = openai.Completion.create(
    engine='gpt-3.5-turbo-instruct',
    prompt=query,
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)
res['choices'][0]['text'].strip()
'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'
However, that isn't always the case. First, let's rewrite the above into a simple function so we're not rewriting it every time.
def complete(prompt):
    res = openai.Completion.create(
        engine='gpt-3.5-turbo-instruct',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()
Now let's ask a more specific question about training a type of transformer model called a sentence transformer. The ideal answer we'd be looking for is _"multiple negatives ranking (MNR) loss"_.
Don't worry if this is a new term to you; it isn't needed to understand what we're doing or demonstrating here.
query = (
    "Which training method should I use for sentence transformers when "
    "I only have pairs of related sentences?"
)
complete(query)
'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. This approach is often used for tasks such as natural language inference, semantic similarity, and paraphrase identification.'
One of the common answers we get to this is:
The best training method to use for fine-tuning a pre-trained model with sentence transformers is Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model learn the context of the sentence and better understand the relationships between words.
This answer seems pretty convincing, right? Yet, it's wrong. MLM is typically used in the pretraining step of transformer models, but cannot be used to fine-tune a sentence transformer, and it has nothing to do with having _"pairs of related sentences"_.
The other answer we receive (and the one returned above) is that a supervised learning approach is most suitable. This is completely true, but it's not specific and doesn't answer the question.
We have two options for enabling our LLM to understand and correctly answer this question:
- We fine-tune the LLM on text data covering the topics mentioned, likely articles and papers that talk about sentence transformers, semantic search training methods, and so on.
- We use Retrieval Augmented Generation (RAG), a technique that integrates an information retrieval component into the generation process. This allows us to retrieve relevant information and feed it to the generative model as a secondary source of information.
We will demonstrate option 2, sketched at a high level below.
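The sketch below is only an illustrative outline, not one of the notebook's own cells; `embed_fn`, `retrieve_fn`, and `generate_fn` are hypothetical stand-ins for the OpenAI embedding call, the Pinecone query, and the completion call we implement in the rest of this notebook.
```python
# Illustrative sketch of the RAG flow (hypothetical placeholders, not real API calls).
def rag_answer(question, embed_fn, retrieve_fn, generate_fn, top_k=3):
    xq = embed_fn(question)            # 1. encode the question as a dense vector
    contexts = retrieve_fn(xq, top_k)  # 2. fetch the most similar text chunks
    prompt = (                         # 3. ground the LLM with those chunks
        "Answer the question based on the context below.\n\nContext:\n"
        + "\n\n---\n\n".join(contexts)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return generate_fn(prompt)
```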
Building a Knowledge Base
With option 2, retrieving relevant information requires an external _"knowledge base"_, a place where we can store information and retrieve it efficiently. We can think of this as the external _long-term memory_ of our LLM.
We need to retrieve information that is semantically related to our queries, and to do this we use _"dense vector embeddings"_. These can be thought of as numerical representations of the meaning behind sentences.
To create these dense vectors we use the text-embedding-ada-002
model.
We have already authenticated our OpenAI connection; to create an embedding we just do:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)
In the response res
we will find a JSON-like object containing our new embeddings within the 'data'
field.
res.keys()
dict_keys(['object', 'data', 'model', 'usage'])
Inside 'data'
we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains 1536
dimensions (the output dimensionality of the text-embedding-ada-002
model).
len(res['data'])
2
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])
(1536, 1536)
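As a quick illustrative aside (not required for the rest of the notebook), these embeddings are just plain vectors of floats, so we can compare them directly, for example with cosine similarity via NumPy:
```python
import numpy as np

# pull the two embedding vectors out of the response above
a = np.array(res['data'][0]['embedding'])
b = np.array(res['data'][1]['embedding'])

# cosine similarity: dot product of the L2-normalized vectors
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos_sim)  # closer to 1.0 means the two sentences are more semantically similar
```
This is the same comparison that Pinecone will perform at scale when we query the index later on.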
We will apply the same embedding logic to a dataset containing information relevant to our query (and to many other queries on the topics of ML and AI).
Data Preparation
The dataset we will be using is jamescalam/youtube-transcriptions
from Hugging Face Datasets. It contains transcribed audio from several ML and tech YouTube channels. We download it with:
from datasets import load_dataset
data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data
Using custom data configuration jamescalam--youtube-transcriptions-6a482f3df0aedcdb
Reusing dataset json (/Users/jamesbriggs/.cache/huggingface/datasets/jamescalam___json/jamescalam--youtube-transcriptions-6a482f3df0aedcdb/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)
Dataset({
features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
num_rows: 208619
})
data[0]
{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
'published': '2021-07-06 13:00:03 UTC',
'url': 'https://youtu.be/35Pdoyi6ZoQ',
'video_id': '35Pdoyi6ZoQ',
'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
'id': '35Pdoyi6ZoQ-t0.0',
'text': 'Hi, welcome to the video.',
'start': 0.0,
'end': 9.36}
The dataset contains many small snippets of text. We need to merge many snippets from each video to create more substantial chunks of text that contain more information.
from tqdm.auto import tqdm

new_data = []

window = 20  # number of sentences to combine
stride = 4   # number of sentences to 'stride' over, used to create overlap

for i in tqdm(range(0, len(data), stride)):
    # find end of batch
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip the entry, as we have the start/end of two videos
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })
0%| | 0/52155 [00:00<?, ?it/s]
new_data[0]
{'start': 0.0,
'end': 74.12,
'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
'text': "Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.",
'id': '35Pdoyi6ZoQ-t0.0',
'url': 'https://youtu.be/35Pdoyi6ZoQ',
'published': '2021-07-06 13:00:03 UTC',
'channel_id': 'UCv83tO5cePwHMt1952IVVHw'}
Now we need somewhere to store these embeddings and enable an efficient _vector search_ across them all. To do that we use Pinecone
; we can get a free API key and enter it below, where we initialize our connection to Pinecone
and create a new index.
import pinecone

index_name = 'openai-youtube-transcriptions'

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="PINECONE_API_KEY",
    environment="us-east1-gcp"  # may be different, check at app.pinecone.io
)

# check if index already exists (it shouldn't if this is the first time)
if index_name not in pinecone.list_indexes():
    # if it does not exist, create the index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine',
        metadata_config={'indexed': ['channel_id', 'published']}
    )
# connect to the index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()
{'dimension': 1536,
'index_fullness': 0.0,
'namespaces': {},
'total_vector_count': 0}
We can see the index is currently empty with a total_vector_count
of 0
. We can begin populating it with embeddings built by OpenAI's text-embedding-ada-002
like so:
from tqdm.auto import tqdm
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(new_data), batch_size)):
    # find end of batch
    i_end = min(len(new_data), i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (try-except added to avoid RateLimitError)
    done = False
    while not done:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            done = True
        except:
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # clean up metadata
    meta_batch = [{
        'start': x['start'],
        'end': x['end'],
        'title': x['title'],
        'text': x['text'],
        'url': x['url'],
        'published': x['published'],
        'channel_id': x['channel_id']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
0%| | 0/487 [00:00<?, ?it/s]
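Once the upsert loop finishes, it can be worth re-running the stats call from earlier as a quick sanity check that the index is now populated (the exact count will depend on the dataset version at the time of running):
```python
# sanity check: total_vector_count should now roughly match len(new_data)
index.describe_index_stats()
```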
Now we search; for this we need to create a _query vector_ xq
:
res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# retrieve from Pinecone
xq = res['data'][0]['embedding']

# get relevant contexts (including the question)
res = index.query(xq, top_k=2, include_metadata=True)
res
{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',
'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
'end': 568.4,
'published': datetime.date(2021, 11, 24),
'start': 418.88,
'text': 'pairs of related sentences you can go '
'ahead and actually try training or '
'fine-tuning using NLI with multiple '
"negative ranking loss. If you don't have "
'that fine. Another option is that you have '
'a semantic textual similarity data set or '
'STS and what this is is you have so you '
'have sentence A here, sentence B here and '
'then you have a score from from 0 to 1 '
'that tells you the similarity between '
'those two scores and you would train this '
'using something like cosine similarity '
"loss. Now if that's not an option and your "
'focus or use case is on building a '
'sentence transformer for another language '
'where there is no current sentence '
'transformer you can use multilingual '
'parallel data. So what I mean by that is '
'so parallel data just means translation '
'pairs so if you have for example a English '
'sentence and then you have another '
'language here so it can it can be anything '
"I'm just going to put XX and that XX is "
'your target language you can fine-tune a '
'model using something called multilingual '
'knowledge distillation and what that does '
'is takes a monolingual model for example '
'in English and using those translation '
'pairs it distills the knowledge the '
'semantic similarity knowledge from that '
'monolingual English model into a '
'multilingual model which can handle both '
'English and your target language. So '
"they're three options quite popular very "
'common that you can go for and as a '
'supervised methods the chances are that '
'probably going to outperform anything you '
'do with unsupervised training at least for '
'now. So if none of those sound like '
'something',
'title': 'Today Unsupervised Sentence Transformers, '
'Tomorrow Skynet (how TSDAE works)',
'url': 'https://youtu.be/pNvujJ1XyeQ'},
'score': 0.865277052,
'sparseValues': {},
'values': []},
{'id': 'WS1uVMGhlWQ-t737.28',
'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
'end': 900.72,
'published': datetime.date(2021, 10, 20),
'start': 737.28,
'text': "were actually more accurate. So we can't "
"really do that. We can't use this what is "
'called a mean pooling approach. Or we '
"can't use it in its current form. Now the "
'solution to this problem was introduced by '
'two people in 2019 Nils Reimers and Irenia '
'Gurevich. They introduced what is the '
'first sentence transformer or sentence '
'BERT. And it was found that sentence BERT '
'or S BERT outformed all of the previous '
'Save the Art models on pretty much all '
'benchmarks. Not all of them but most of '
'them. And it did it in a very quick time. '
'So if we compare it to BERT, if we wanted '
'to find the most similar sentence pair '
'from 10,000 sentences in that 2019 paper '
'they found that with BERT that took 65 '
'hours. With S BERT embeddings they could '
'create all the embeddings in just around '
'five seconds. And then they could compare '
'all those with cosine similarity in 0.01 '
"seconds. So it's a lot faster. We go from "
'65 hours to just over five seconds which '
'is I think pretty incredible. Now I think '
"that's pretty much all the context we need "
'behind sentence transformers. And what we '
'do now is dive into a little bit of how '
'they actually work. Now we said before we '
'have the core transform models and what S '
'BERT does is fine tunes on sentence pairs '
'using what is called a Siamese '
'architecture or Siamese network. What we '
'mean by a Siamese network is that we have '
'what we can see, what can view as two BERT '
'models that are identical and the weights '
'between those two models are tied. Now in '
'reality when implementing this we just use '
'a single BERT model. And what we do is we '
'process one sentence, a sentence A through '
'the model and then we process another '
'sentence, sentence B through the model. '
"And that's the sentence pair. So with our "
'cross-linked we were processing the '
'sentence pair together. We were putting '
'them both together, processing them all at '
'once. This time we process them '
'separately. And during training what '
'happens is the weights',
'title': 'Intro to Sentence Embeddings with '
'Transformers',
'url': 'https://youtu.be/WS1uVMGhlWQ'},
'score': 0.85855335,
'sparseValues': {},
'values': []}],
'namespace': ''}
limit = 3750

def retrieve(query):
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

    # retrieve from Pinecone
    xq = res['data'][0]['embedding']

    # get relevant contexts
    res = index.query(xq, top_k=3, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n" +
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts until hitting the limit
    for i in range(1, len(contexts)):
        if len("\n\n---\n\n".join(contexts[:i])) >= limit:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(contexts[:i-1]) +
                prompt_end
            )
            break
        elif i == len(contexts)-1:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(contexts) +
                prompt_end
            )
    return prompt
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts
"根据下面的上下文回答问题。\n\n上下文:\npairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can fine-tune a model using something called multilingual knowledge distillation and what that does is takes a monolingual model for example in English and using those translation pairs it distills the knowledge the semantic similarity knowledge from that monolingual English model into a multilingual model which can handle both English and your target language. So they're three options quite popular very common that you can go for and as a supervised methods the chances are that probably going to outperform anything you do with unsupervised training at least for now. So if none of those sound like something\n\n---\n\nwere actually more accurate. So we can't really do that. We can't use this what is called a mean pooling approach. Or we can't use it in its current form. Now the solution to this problem was introduced by two people in 2019 Nils Reimers and Irenia Gurevich. They introduced what is the first sentence transformer or sentence BERT. And it was found that sentence BERT or S BERT outformed all of the previous Save the Art models on pretty much all benchmarks. Not all of them but most of them. And it did it in a very quick time. So if we compare it to BERT, if we wanted to find the most similar sentence pair from 10,000 sentences in that 2019 paper they found that with BERT that took 65 hours. With S BERT embeddings they could create all the embeddings in just around five seconds. And then they could compare all those with cosine similarity in 0.01 seconds. So it's a lot faster. We go from 65 hours to just over five seconds which is I think pretty incredible. Now I think that's pretty much all the context we need behind sentence transformers. And what we do now is dive into a little bit of how they actually work. Now we said before we have the core transform models and what S BERT does is fine tunes on sentence pairs using what is called a Siamese architecture or Siamese network. What we mean by a Siamese network is that we have what we can see, what can view as two BERT models that are identical and the weights between those two models are tied. Now in reality when implementing this we just use a single BERT model. And what we do is we process one sentence, a sentence A through the model and then we process another sentence, sentence B through the model. And that's the sentence pair. So with our cross-linked we were processing the sentence pair together. We were putting them both together, processing them all at once. This time we process them separately. 
And during training what happens is the weights\n\n---\n\nTransformer-based Sequential Denoising Autoencoder. So what we'll do is jump straight into it and take a look at where we might want to use this training approach and and how we can actually implement it. So the first question we need to ask is do we really need to resort to unsupervised training? Now what we're going to do here is just have a look at a few of the most popular training approaches and what sort of data we need for that. So the first one we're looking at here is Natural Language Inference or NLI and NLI requires that we have pairs of sentences that are labeled as either contradictory, neutral which means they're not necessarily related or as entailing or as inferring each other. So you have pairs that entail each other so they are both very similar pairs that are neutral and also pairs that are contradictory. And this is the traditional NLI data. Now using another version of fine-tuning with NLI called a multiple negatives ranking loss you can get by with only entailment pairs so pairs that are related to each other or positive pairs and it can also use contradictory pairs to improve the performance of training as well but you don't need it. So if you have positive pairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B\n\n问题:当只有相关句子对时,我应该为句子转换器使用哪种训练方法?\n答案:
# then we complete the context-infused query
complete(query_with_contexts)
'You should use Natural Language Inference (NLI) with multiple negative ranking loss.'
And we get a great answer straight away, specifying the use of _multiple-rankings loss_ (also known as _multiple negatives ranking loss_).