Building an LLM-as-a-judge evaluation to detect hallucinations with Braintrust
Suppose you're building a customer support bot and want to evaluate the quality of its responses. Consider a question like "What is your return policy?" If the correct answer is "You can return items within 30 days of purchase" but your bot generates "You can return items within 30 days," how do you evaluate whether the response is good?
A heuristic like Levenshtein string distance would flag the response as incorrect. A better approach is to use LLM-as-a-judge to assess the accuracy of the response. LLM-as-a-judge is a technique that uses an LLM to score the quality of an answer. Because LLMs can reason about language rather than just perform surface-level string comparisons, they can evaluate answers much more accurately.
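To make the failure mode concrete, here is a small sketch (not from the original guide) that uses Python's standard-library difflib as a stand-in for an edit-distance-style heuristic; the example answers are invented for illustration:

# Illustration only: string similarity rewards surface overlap, not meaning.
import difflib

expected = "You can return items within 30 days of purchase."
paraphrase = "Returns are accepted for 30 days after you buy."  # correct, but reworded
wrong = "You can return items within 90 days of purchase."  # factually wrong

for label, answer in [("paraphrase", paraphrase), ("wrong", wrong)]:
    similarity = difflib.SequenceMatcher(None, expected, answer).ratio()
    print(label, round(similarity, 2))

The factually wrong answer scores far higher than the correct paraphrase, which is exactly why a judge that can reason about meaning is preferable to a string-distance heuristic.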
In this guide, we'll walk through how to build an LLM-as-a-judge scorer that detects hallucinations, using Braintrust, a third-party evaluation platform that is compatible with OpenAI's models.
Install dependencies
Let's install a few basic dependencies. We'll use the CoQA dataset (via DuckDB), Braintrust for evaluations, and OpenAI's models. Note that Braintrust is a third-party evaluation platform, and you should review its terms of service and privacy policy before proceeding.
%pip install autoevals duckdb braintrust openai --quiet
Note: you may need to restart the kernel to use updated packages.
Next, let's initialize the OpenAI client. We'll use the AsyncOpenAI client so that we can parallelize our requests. The braintrust.wrap_openai function wraps the OpenAI client so that LLM calls are logged to Braintrust, which we'll use to facilitate the evaluations below. Before proceeding, you should sign up for a Braintrust account and set the BRAINTRUST_API_KEY environment variable to a valid API key.
import os
import braintrust
from openai import AsyncOpenAI
braintrust.login(api_key=os.environ["BRAINTRUST_API_KEY"])
client = braintrust.wrap_openai(AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]))
Explore the dataset
We'll use the CoQA dataset, which contains a wide variety of passages, questions, and answers. Since CoQA is quite large, we'll just look at the first few passages. As with any public dataset, the underlying LLM may have memorized aspects of it, so when developing your own scorers, it's best to test them on your own private data.
import duckdb
# DuckDB has a convenient wrapper for loading datasets from Hugging Face.
con = duckdb.connect(":memory:")
full_result = con.query("""
SELECT * FROM 'hf://datasets/stanfordnlp/coqa/data/validation-00000-of-00001.parquet'
LIMIT 40
""").fetchall()
single_result = full_result[10]
print("Passage:")
print(single_result[1])
print("\nQuestion:")
print(single_result[2][0])
print("\nAnswer:")
print(single_result[3]["input_text"][0])
Passage:
(CNN)A chiseled boxer's Instagram feed shows him making constant references to the Bible and enjoying gospel singing with his wife.
Another features his formidable opponent counting stacks of money, hanging out in strip clubs, and flashing diamond watches and Ferraris.
Welcome to the world of boxing promotion, circa 2015.
American Floyd Mayweather and Filipino Manny Pacquiao are set to officially announce their heavily anticipated boxing match at a press conference in Los Angeles Wednesday.
With the combined purse for the May 2 bout in Las Vegas reported to touch $300 million pending viewership numbers, the incentives to self-promote could not be higher.
"Nowadays you have to be on social media to launch the fight and to build hype," says boxing promoter Nisse Sauerland, CEO of Team Sauerland. "It couldn't be done without it."
Thirty-eight year old Mayweather (47-0, 26 knockouts), who favors the moniker "The Money Man" or "TBE" (The Best Ever), boasts nearly five million Instagram followers, 5.65 million followers on Twitter and 9.2 million Facebook likes.
He famously confirmed the fight via Shots, a photo sharing social media application that he's invested in, and displays links to his clothing brand, The Money Team, on all his accounts.
Along with professing to the be the best fighter of all time, he could also stake a claim to be one of the greatest social media users in sports.
"I think they're both playing their roles," says Sauerland, who promotes over 45 boxers. "You've got the bad guy and the good guy, really. You've got the guy who throws the money around (Mayweather), that's his image, and Pacquiao, he's the hope of a nation."
Question:
Who are the two boxer featured in this article?
Answer:
Floyd Mayweather and Manny Pacquiao
The data contains a series of passages, each with several questions and answers. Let's flatten it into a list of (passage, question, answer) tuples.
from dataclasses import dataclass
@dataclass
class QuestionAnswer:
    passage: str
    question: str
    expected_answer: str
    generated_answer: str


qa_pairs = [
    QuestionAnswer(
        passage=r[1],
        question=question,
        generated_answer=r[3]["input_text"][i],
        expected_answer=r[3]["input_text"][i],
    )
    for r in full_result
    for (i, question) in enumerate(r[2])
]
print(len(qa_pairs))
629
Add hallucinations
Since the scorer we're building is meant to detect hallucinations, we can use the QA pairs to generate known hallucinations. We'll create hallucinated answers by asking an LLM to confidently answer each question without seeing the passage.
import asyncio
import random
random.seed(42)
async def hallucinate_answer(qa):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """\
You are a helpful hallucinating assistant, who makes up fake answers to questions.
Answer the following question in 1 sentence. If you know the answer, then make up some fake
superfluous details that are not in the passage you have memorized.
Make sure to always answer it confidently, even if you don't know the answer. Do not use words
like "perhaps", "likely", "maybe", etc. or punctuation like "...". Do not admit that you cannot
or do not know the answer.""",
            },
            {"role": "user", "content": qa.question},
        ],
        temperature=1,
        max_tokens=100,
    )
    return response.choices[0].message.content
hallucinated_answers = await asyncio.gather(
    *[hallucinate_answer(qa) for qa in qa_pairs]
)
hallucinations = [
    QuestionAnswer(
        passage=qa.passage,
        question=qa.question,
        expected_answer=qa.expected_answer,
        generated_answer=hallucination,
    )
    for (qa, hallucination) in zip(qa_pairs, hallucinated_answers)
    # Exclude simple yes/no answers.
    if "yes" not in hallucination.lower() and "no" not in hallucination.lower()
]
print("Passage:")
print(hallucinations[0].passage)
print("\nQuestion:")
print(hallucinations[0].question)
print("\nExpected Answer:")
print(hallucinations[0].expected_answer)
print("\nGenerated Answer:")
print(hallucinations[0].generated_answer)
print("\n\nNumber of hallucinations:", len(hallucinations))
Passage:
Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing.
"What are you doing, Cotton?!"
"I only wanted to be more like you".
Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry.
"Don't ever do that again, Cotton!" they all cried. "Next time you might mess up that pretty white fur of yours and we wouldn't want that!"
Then Cotton thought, "I change my mind. I like being special".
Question:
Where did she live?
Expected Answer:
in a barn
Generated Answer:
She lived in a quaint cottage on the edge of the Misty Hollow Forest, where elves and talking owls often hosted moonlit storytelling festivals.
Number of hallucinations: 270
Create the evaluators
We'll look at a few popular approaches to building an LLM-as-a-judge. For each one, we'll create a scorer and then "meta-evaluate" it to see how well it performs. Since we know the hallucinated answers are incorrect, we'll measure the quality of each evaluator by testing how often it rates a hallucinated answer as 0.
LLM-as-a-judge #1: Numeric rater
A common first instinct when building an LLM-as-a-judge is to ask the LLM to rate the answer on a scale, say 1 to 5. The advantage of this approach is that it's easy to convert the LLM's output into a numeric score.
We'll use a modified version of the Factuality template, but ask the LLM to rate the answer on a scale of 1 to 10.
import json
PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
Rate the submission on a scale of 1 to 10.
"""
@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return (arguments["rating"] - 1) / 9
print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await numeric_rater(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await numeric_rater(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0.0
This looks promising! Now that we've sanity-checked it on individual examples, let's run a proper evaluation to see how it performs on a broader set of data. An evaluation consists of three components:
- Data: in this case, the input is the question, the hallucinated answer, and the ground-truth answer. The scorer converts this into a score between 0 and 1. The expected score is 0, since the answer is a hallucination.
- Task: the task simply calls the numeric rater on each input.
- Scores: we'll assess the quality of the generated score by comparing it to the ground-truth score. Since both numbers are between 0 and 1, we can use the normalized difference as the score.
from dataclasses import asdict
from braintrust import Eval
def data():
    for pair in hallucinations:
        yield dict(
            input=asdict(pair), expected=0, metadata=dict(hallucination=True)
        )


async def task(input):
    return await numeric_rater(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


def normalized_diff(output, expected):
    return 1 - abs(output - expected)


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater",
    max_concurrency=10,
)
Experiment Numeric rater is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater
LLM-as-a-judge [experiment_name=Numeric rater] (data): 270it [00:00, 54634.41it/s]
LLM-as-a-judge [experiment_name=Numeric rater] (tasks): 0%| | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
95.35% 'normalized_diff' score
201.60tok prompt_tokens
5tok completion_tokens
206.60tok total_tokens
See results for Numeric rater at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater
EvalResultWithSummary(summary="...", results=[...])
The numeric rater scored around 95%. That's not bad, but if roughly 5% of evaluations are judged incorrectly, it would be very hard to trust them. Let's dig into the Braintrust UI to see what's going on.
It looks like many of the incorrect answers received ratings somewhere between 1 and 10. However, we currently have no insight into why the model assigned those scores. Let's see if we can address that next.
LLM-as-a-judge #2: Adding reasoning
Let's tweak the prompt so the LLM also reasons about its rating. This approach is known as chain-of-thought reasoning. Besides potentially improving the score, it gives us insight into why the model assigns the scores it does.
@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "title": "Reasoning",
                                "type": "string",
                            },
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return (arguments["rating"] - 1) / 9
print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await numeric_rater(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await numeric_rater(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0.0
await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater with reasoning",
    max_concurrency=10,
)
Experiment Numeric rater with reasoning is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (data): 270it [00:00, 111715.70it/s]
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (tasks): 0%| | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
Numeric rater with reasoning compared to Numeric rater:
92.10% (-03.25%) 'normalized_diff' score (5 improvements, 63 regressions)
3.68s duration
3.68s llm_duration
239.60tok (+3800.00%) 'prompt_tokens' (0 improvements, 270 regressions)
136.82tok (+13182.22%) 'completion_tokens' (0 improvements, 270 regressions)
376.43tok (+16982.22%) 'total_tokens' (0 improvements, 270 regressions)
0.00$ estimated_cost
See results for Numeric rater with reasoning at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning
EvalResultWithSummary(summary="...", results=[...])
Adding reasoning doesn't appear to have improved the score (in fact, it dropped by about 3%). However, if we look at one of the failures, we can see what the model was thinking. Here's an example of a hallucinated answer:
And the score along with its reasoning:
It looks like the model is applying its own judgment to award partial credit. This is a common problem with numeric ratings (for both models and humans), and it can usually be addressed with a better prompt.
LLM-as-a-judge #3: Classifying instead of rating
Next, we'll spell out concrete criteria and ask the model to classify the answer against them. This approach lets us steer the model more precisely toward the kind of hallucination we're testing for. Intuitively, giving the model specific criteria to grade against should lead to more accurate scores.
PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""
# Since we're testing for hallucinations, penalize (B) as much as (D).
CHOICE_SCORES = {
    "A": 0.5,
    "B": 0,
    "C": 1,
    "D": 0,
    "E": 1,
}
@braintrust.traced
async def classifier(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Call this function to select a choice.",
                    "parameters": {
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "type": "string",
                            },
                            "choice": {
                                "description": "The choice",
                                "type": "string",
                                "enum": ["A", "B", "C", "D", "E"],
                            },
                        },
                        "required": ["reasons", "choice"],
                        "type": "object",
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    choice = arguments["choice"]
    return CHOICE_SCORES[choice] if choice in CHOICE_SCORES else None
print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await classifier(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await classifier(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0
async def task(input):
    return await classifier(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Classifier",
    max_concurrency=10,
)
Experiment Classifier is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier
LLM-as-a-judge [experiment_name=Classifier] (data): 270it [00:00, 84930.41it/s]
LLM-as-a-judge [experiment_name=Classifier] (tasks): 0%| | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
Classifier compared to Numeric rater with reasoning:
98.15% (+06.05%) 'normalized_diff' score (86 improvements, 5 regressions)
4.41s (+72.60%) 'duration' (104 improvements, 165 regressions)
4.40s (+72.59%) 'llm_duration' (104 improvements, 165 regressions)
418.60tok (+17900.00%) 'prompt_tokens' (0 improvements, 270 regressions)
164.91tok (+2809.26%) 'completion_tokens' (64 improvements, 204 regressions)
583.52tok (+20709.26%) 'total_tokens' (0 improvements, 270 regressions)
0.00$ (+00.07%) 'estimated_cost' (8 improvements, 255 regressions)
See results for Classifier at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier
EvalResultWithSummary(summary="...", results=[...])
The classifier scored 98%, a significant improvement!
Codifying this pattern
Using the autoevals library, the classifier above can be rewritten simply as:
PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{input}}
************
[Expert]: {{expected}}
************
[Submission]: {{output}}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""
import autoevals

Classifier = autoevals.LLMClassifier(
    name="Hallucination detector",
    prompt_template=PROMPT,
    choice_scores={"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1},
    use_cot=True,
)
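To sanity-check that this prebuilt scorer behaves like the hand-rolled classifier, you can call it directly on one of the hallucinated examples. The snippet below is a usage sketch rather than part of the original walkthrough; it assumes an autoevals classifier instance accepts input, output, and expected keyword arguments and returns a score object with a score field.

# Usage sketch (assumed API): score a known hallucination with the prebuilt classifier.
result = Classifier(
    input=hallucinations[0].question,
    output=hallucinations[0].generated_answer,
    expected=hallucinations[0].expected_answer,
)
print(result.score)  # should be 0 (choice B or D) for a hallucinated answer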
Next steps
As a next step, you can dig into individual improvements and regressions, assess them, and consider further changes to the prompt. You can also test your own data and double-check that the results hold for your use case. You could also benchmark models like o1, try fine-tuning a smaller model to see whether the results are reproducible, or use few-shot prompting to align the model with more subjective criteria (sketched below). In every case, you should keep evaluating your results so you can rigorously assess the impact of each change.
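For example, to experiment with the few-shot idea, one lightweight approach is to prepend a couple of hand-graded examples to the judge prompt so the model sees how borderline cases should be classified. The sketch below is hypothetical (not part of the original guide); the graded examples reuse answers shown earlier in this guide, and the FewShotClassifier name is made up for illustration.

# Hypothetical sketch: calibrate the judge with a few hand-graded examples.
FEW_SHOT_EXAMPLES = """\
Here are two graded examples to calibrate your judgment:

Example 1
[Question]: Who are the two boxers featured in this article?
[Expert]: Floyd Mayweather and Manny Pacquiao
[Submission]: Manny Pacquiao and Floyd Mayweather
Correct choice: (C) -- the same details, only the order differs.

Example 2
[Question]: Where did she live?
[Expert]: in a barn
[Submission]: She lived in a quaint cottage on the edge of the Misty Hollow Forest.
Correct choice: (D) -- the submission contradicts the expert answer.

"""

FewShotClassifier = autoevals.LLMClassifier(
    name="Hallucination detector (few-shot)",
    prompt_template=FEW_SHOT_EXAMPLES + PROMPT,
    choice_scores={"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1},
    use_cot=True,
)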