How to use guardrails

In this notebook we share examples of how to implement guardrails for your LLM applications. A guardrail is a generic term for detective controls that aim to steer your application. Given the inherent randomness of LLMs, greater steerability is a common requirement, and creating effective guardrails has become one of the most common areas of performance optimisation when pushing an LLM from prototype to production.

Guardrails are incredibly diverse, and can be deployed to virtually any context you can imagine something going wrong with LLMs. This notebook aims to give simple examples that can be extended to meet your unique use case, as well as outlining the trade-offs to consider when deciding whether to implement a guardrail, and how to do it.

This notebook will focus on:

  1. Input guardrails that flag inappropriate content before it gets to your LLM
  2. Output guardrails that validate what your LLM has produced before it gets to the customer

Note: This notebook tackles guardrails as a generic term for detective controls around an LLM - for official libraries that provide distributions of pre-built guardrails frameworks, check out the following:

import openai

GPT_MODEL = 'gpt-4o-mini'

1. Input guardrails

Input guardrails aim to prevent inappropriate content getting to the LLM in the first place - some common use cases are:

  • Topical guardrails: Identify when a user asks an off-topic question and give them advice on what topics the LLM can help them with.
  • Jailbreaking: Detect when a user is trying to hijack the LLM and override its prompting.
  • Prompt injection: Catch instances of prompt injection where users try to hide malicious code that will be executed in any downstream functions the LLM executes.

In all of these they act as a preventative control, running either before or in parallel with the LLM, and triggering your application to behave differently if one of these criteria is met.

Designing a guardrail

When designing guardrails it is important to consider the trade-off between accuracy, latency, and cost, where you try to achieve maximum accuracy for the least impact to your bottom line and the user's experience.

We'll begin with a simple topical guardrail which aims to detect off-topic questions and prevent the LLM from answering if triggered. This guardrail consists of a simple prompt and uses gpt-4o-mini, optimising for latency/cost while maintaining good-enough accuracy, but if we wanted to optimise further we could consider:

  • Accuracy: You could consider fine-tuning gpt-4o-mini or using few-shot examples to increase the accuracy. RAG can also be effective if you have a corpus of information that can help determine whether a piece of content is allowed or not.
  • Latency/cost: You could try fine-tuning smaller models, such as babbage-002 or open-source offerings like Llama, which can perform quite well when given enough training examples. When using open-source offerings you can also tune the machines you are using for inference to maximise either cost or latency reduction.

This simple guardrail aims to ensure the LLM only answers a predefined set of topics, and responds to out-of-bounds queries with a canned message.

Embrace async

A common design to minimise latency is to send your guardrails asynchronously along with your main LLM call. If your guardrails get triggered you send back their response, otherwise send back the LLM response.

We'll take this approach, creating an execute_chat_with_guardrails function that will run our LLM's get_chat_response and the topical_guardrail guardrail in parallel, and only return the LLM response if the guardrail returns allowed.

Limitations

You should always consider the limitations of guardrails when developing your design. A few of the key ones to be aware of are:

  • When using LLMs as a guardrail, be aware that they have the same vulnerabilities as your base LLM call itself. For example, a prompt injection attempt could be successful in evading both your guardrail and your actual LLM call.
  • As conversations get longer, LLMs are more susceptible to jailbreaking as your instructions become diluted by the extra text.
  • Guardrails can harm the user experience if you make them overly restrictive to compensate for the issues noted above. This manifests as over-refusals, where your guardrails reject innocuous user requests because there are similarities with prompt injection or jailbreaking attempts.

Mitigations

If you can combine guardrails with rule-based or more traditional machine learning models for detection, this can mitigate some of these risks. We've also seen customers have guardrails that only ever consider the latest message, to alleviate the risk of the model being confused by a long conversation.
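These two mitigations could be sketched as follows. Note that `rule_based_guardrail`, `latest_message_only`, and the blocklist patterns are illustrative names and data for this sketch, not part of this notebook's code:

```python
import re

# A hypothetical rule-based pre-check that can run before (or alongside)
# an LLM-based guardrail. The blocklist patterns are illustrative only.
BLOCKLIST_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]


def rule_based_guardrail(user_request):
    """Return True if the request trips a simple pattern match,
    letting us block it without spending an LLM call."""
    return any(pattern.search(user_request) for pattern in BLOCKLIST_PATTERNS)


def latest_message_only(conversation):
    """Hand a guardrail only the most recent user message, so its
    instructions aren't diluted by a long conversation history."""
    user_turns = [m["content"] for m in conversation if m["role"] == "user"]
    return user_turns[-1] if user_turns else ""
```

A cheap check like this can short-circuit obviously bad requests before any LLM guardrail runs, reducing both cost and latency.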

We would also recommend doing a gradual roll-out with active monitoring of conversations, so you can pick up instances of prompt injection or jailbreaking and either add more guardrails to cover these new types of behaviour, or include them as training examples for your existing guardrails.

system_prompt = "You are a helpful assistant."

bad_request = "I want to talk about horses"
good_request = "What are the best breeds of dog for people that like cats?"
import asyncio


async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")

    return response.choices[0].message.content


async def topical_guardrail(user_request):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": "Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
        },
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content


async def execute_chat_with_guardrails(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
# Call the main function with the good request - this should go through
response = await execute_chat_with_guardrails(good_request)
print(response)
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
If you like cats and are considering getting a dog, there are several breeds known for their compatibility with feline friends. Here are some of the best dog breeds that tend to get along well with cats:

1. **Golden Retriever**: Friendly and tolerant, Golden Retrievers often get along well with other animals, including cats.

2. **Labrador Retriever**: Similar to Golden Retrievers, Labs are social and friendly, making them good companions for cats.

3. **Cavalier King Charles Spaniel**: This breed is gentle and affectionate, often forming strong bonds with other pets.

4. **Basset Hound**: Basset Hounds are laid-back and generally have a calm demeanor, which can help them coexist peacefully with cats.

5. **Beagle**: Beagles are friendly and sociable, and they often enjoy the company of other animals, including cats.

6. **Pug**: Pugs are known for their playful and friendly nature, which can make them good companions for cats.

7. **Shih Tzu**: Shih Tzus are typically friendly and adaptable, often getting along well with other pets.

8. **Collie**: Collies are known for their gentle and protective nature, which can extend to their relationships with cats.

9. **Newfoundland**: These gentle giants are known for their calm demeanor and often get along well with other animals.

10. **Cocker Spaniel**: Cocker Spaniels are friendly and affectionate dogs that can get along well with cats if introduced properly.

When introducing a dog to a cat, it's important to do so gradually and supervise their interactions to ensure a positive relationship. Each dog's personality can vary, so individual temperament is key in determining compatibility.
# Call the main function with the bad request - this should get blocked
response = await execute_chat_with_guardrails(bad_request)
print(response)
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.

Looks like our guardrail worked - the first question was allowed through, but the second was blocked for being off-topic. Now we'll extend this concept to moderate the response we get from the LLM as well.

2. Output guardrails

Output guardrails govern what the LLM comes back with. These can take many forms, with some of the most common being:

  • Hallucination/fact-checking guardrails: Using a corpus of ground-truth information or a training set of hallucinated responses to block hallucinated responses.
  • Moderation guardrails: Applying brand and corporate guidelines to moderate the LLM's results, and either blocking or rewriting its response if it breaches them.
  • Syntax checks: Structured output from LLMs can be returned corrupt or unable to be parsed - these guardrails detect those cases and either retry or fail gracefully, preventing failures in downstream applications.
    • This is a common control to apply with function calling, ensuring that the expected schema is returned in the arguments when the LLM returns a function_call.
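As a sketch of the syntax-check idea, the snippet below validates a function call's arguments string against an expected schema before anything downstream runs. The schema and field names (`EXPECTED_ARGUMENTS`, `breed`, `count`) are hypothetical; a real application might use a JSON Schema validator instead:

```python
import json

# Hypothetical expected schema for a function call's arguments:
# field name -> required Python type.
EXPECTED_ARGUMENTS = {"breed": str, "count": int}


def validate_function_arguments(raw_arguments):
    """Check that the LLM's `arguments` string parses as JSON and
    contains the expected fields with the expected types.
    Returns (ok, parsed_arguments_or_None) so callers can fail gracefully."""
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return False, None
    if not isinstance(parsed, dict):
        return False, None
    for field, field_type in EXPECTED_ARGUMENTS.items():
        if not isinstance(parsed.get(field), field_type):
            return False, None
    return True, parsed
```

On failure the caller can retry the LLM call or return a canned message, rather than letting malformed arguments crash a downstream function.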

Moderation guardrail

Here we implement a moderation guardrail that uses a version of the G-Eval evaluation method to score the presence of unwanted content in the LLM's response. This method is demonstrated in more detail in some of our other notebooks.

To accomplish this we will make an extensible framework for moderating content that takes in a domain and applies criteria to a piece of content using a set of steps:

  1. We set a domain, which describes the type of content we're going to moderate.
  2. We provide criteria, which outline clearly what the content should and should not contain.
  3. Step-by-step instructions are provided for the LLM to grade the content.
  4. The LLM returns a discrete score from 1 to 5.

Setting guardrail thresholds

Our output guardrail will assess the LLM's response and block anything scoring a 3 or higher. Setting this threshold is a common area for optimisation - we recommend building an evaluation set and grading the results using a confusion matrix to set the right tolerance for your guardrail. The trade-off here is generally:

  • More false positives lead to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
  • More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions, or prompt inject/jailbreak it.

For example, for jailbreaking you may want to have a very low threshold, as the risk to your business if your LLM is hijacked and used to produce dangerous content that ends up on social media is very high. However, for our use case we're willing to accept some false negatives, as the worst that could happen is someone ends up with a Bichon Frise who might have been better suited to a Labrador, which though sad will probably not cause lasting damage to our business (we hope).
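A minimal sketch of this grading exercise, using a hypothetical hand-labelled evaluation set of (guardrail score, should-block) pairs. The example data is invented for illustration:

```python
def confusion_matrix(labelled_scores, threshold):
    """Count true/false positives and negatives for a guardrail at a
    given blocking threshold. labelled_scores is a list of
    (guardrail_score, should_block) pairs from a hand-labelled eval set."""
    tp = fp = tn = fn = 0
    for score, should_block in labelled_scores:
        blocked = score >= threshold  # mirrors the >= comparison used below
        if blocked and should_block:
            tp += 1          # correctly blocked
        elif blocked and not should_block:
            fp += 1          # over-refusal: innocuous content blocked
        elif not blocked and should_block:
            fn += 1          # harmful content let through
        else:
            tn += 1          # correctly allowed
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Hypothetical evaluation set
eval_set = [(5, True), (4, True), (3, True), (2, False), (1, False), (4, False)]
print(confusion_matrix(eval_set, threshold=3))
# → {'tp': 3, 'fp': 1, 'tn': 2, 'fn': 0}
```

Sweeping the threshold over a labelled set like this makes the false-positive/false-negative trade-off concrete before you pick a tolerance.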

domain = "animal breed recommendation"

animal_advice_criteria = """
Assess the presence of explicit recommendation of cat or dog breeds in the content.
The content should contain only general advice about cats and dogs, not specific breeds to purchase."""

animal_advice_steps = """

1. Read the content and the criteria carefully.
2. Assess how much explicit recommendation of cat or dog breeds is contained in the content.
3. Assign an animal advice score from 1 to 5, with 1 being no explicit cat or dog breed advice, and 5 being multiple named cat or dog breeds.
"""

moderation_system_prompt = """
You are a moderation assistant. Your role is to detect content about {domain} in the text provided, and mark the severity of that content.

## {domain}

### Criteria

{scoring_criteria}

### Instructions

{scoring_steps}

### Content

{content}

### Evaluation (score only!)
"""
async def moderation_guardrail(chat_response):
    print("Checking moderation guardrail")
    mod_messages = [
        {"role": "user", "content": moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=animal_advice_criteria,
            scoring_steps=animal_advice_steps,
            content=chat_response
        )},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=mod_messages, temperature=0
    )
    print("Got moderation response")
    return response.choices[0].message.content


async def execute_all_guardrails(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                moderation_response = await moderation_guardrail(chat_response)

                if int(moderation_response) >= 3:
                    print(f"Moderation guardrail flagged with a score of {int(moderation_response)}")
                    return "Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have."

                else:
                    print('Passed moderation')
                    return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
# Adding a request that should pass both our topical guardrail and our moderation guardrail
great_request = 'What is some advice you can give to a new dog owner?'
tests = [good_request,bad_request,great_request]

for test in tests:
    result = await execute_all_guardrails(test)
    print(result)
    print('\n\n')
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Moderation guardrail flagged with a score of 5
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Moderation guardrail flagged with a score of 3
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.

Conclusion

Guardrails are a vibrant and evolving topic in LLMs, and we hope this notebook has given you an effective introduction to the core concepts around guardrails. To recap:

  • Guardrails are detective controls that aim to prevent harmful content getting to your applications and your users, and add steerability to your LLM in production.
  • They can take the form of input guardrails, which target content before it gets to the LLM, and output guardrails, which control the LLM's response.
  • Designing guardrails and setting their thresholds is a trade-off between accuracy, latency, and cost. Your decision should be based on clear evaluations of the performance of your guardrails, and an understanding of what the cost of a false negative and a false positive would be for your business.
  • By embracing async design principles, you can scale guardrails horizontally to minimise the impact to the user as the number and scope of your guardrails increase.
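To illustrate the horizontal-scaling point, the sketch below runs several guardrails concurrently with asyncio.gather. The two checks are simple stand-ins for real LLM-backed guardrails, and `run_guardrails` is an illustrative helper rather than part of this notebook's code (in a notebook you would `await run_guardrails(...)` directly instead of calling asyncio.run):

```python
import asyncio


# Stand-in guardrails: each simulates an API call, then returns a verdict.
async def topical_check(request):
    await asyncio.sleep(0.01)
    return "not_allowed" if "horses" in request else "allowed"


async def jailbreak_check(request):
    await asyncio.sleep(0.01)
    return "not_allowed" if "ignore instructions" in request else "allowed"


async def run_guardrails(request, guardrails):
    """Run all guardrails concurrently; the request passes only if
    every guardrail returns 'allowed'."""
    results = await asyncio.gather(*(g(request) for g in guardrails))
    return all(r == "allowed" for r in results)


allowed = asyncio.run(
    run_guardrails("I want to talk about horses", [topical_check, jailbreak_check])
)
print(allowed)  # → False
```

Because the checks run concurrently, adding another guardrail costs roughly the latency of the slowest check rather than the sum of all of them.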

We look forward to seeing how you take this forward, and to seeing how thinking on guardrails evolves as the ecosystem matures.