How to test and evaluate LLMs for SQL generation

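LLMs are fundamentally non-deterministic in their responses, an attribute that makes them highly creative and dynamic. However, this trait poses significant challenges for achieving consistency, which is crucial when integrating LLMs into production environments.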

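The key to harnessing LLMs in practical applications is consistent and systematic evaluation. It makes it possible to identify and correct inconsistencies, and to track how performance changes as the application evolves.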

Scope of this notebook

This notebook aims to demonstrate a framework for evaluating LLMs, focusing in particular on:

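Unit tests: Essential for assessing individual components of the application.
Evaluation metrics: Methods for quantitatively measuring the model's effectiveness.
Runbook documentation: A record of historical evaluations to track progress and regressions.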

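This example focuses on a natural language to SQL use case. Code generation use cases fit this approach well: when you combine code validation with code execution, your application can actually test the code it generates and so check it for consistency.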

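Although this notebook uses the SQL generation use case to demonstrate the concept, the approach is generic and can be applied to a wide variety of LLM-driven applications.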

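We will use two versions of a prompt to perform SQL generation. We will then use the unit tests and evaluation functions to test the performance of the prompts. Specifically, in this demonstration, we will evaluate: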

The consistency of the JSON response.
The syntactic correctness of the SQL in the response.

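The notebook is organized as follows:

Setup: Install the required libraries and download data consisting of SQL queries and their corresponding natural language translations.
Test development: Create unit tests and define evaluation metrics for the SQL generation process.
Evaluation: Run the tests with different prompts to assess the impact on performance.
Reporting: Compile a report that succinctly presents the performance differences observed across the tests.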

Setup

Import our libraries and the dataset we will use, which is the natural language to SQL b-mc2/sql-create-context dataset from Hugging Face.

# Uncomment this line to install all required dependencies
# !pip install openai datasets pandas pydantic matplotlib python-dotenv numpy tqdm

from datasets import load_dataset
from openai import OpenAI
import pandas as pd
import pydantic
import os
import sqlite3
from sqlite3 import Error
from pprint import pprint
import matplotlib.pyplot as plt
import numpy as np
from dotenv import load_dotenv
from tqdm.notebook import tqdm
from IPython.display import HTML, display

# Load the keys from a local .env file to set the API key in the environment variables
%reload_ext dotenv
%dotenv

GPT_MODEL = 'gpt-4o'
dataset = load_dataset("b-mc2/sql-create-context")

print(dataset['train'].num_rows, "rows")

78577 rows

Looking at the dataset

We use the Hugging Face datasets library to download the SQL create-context dataset. This dataset consists of:

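Question, expressed in natural language.
Answer, expressed in SQL and designed to answer the natural language question.
Context, expressed as a CREATE SQL statement that describes the table that may be used to answer the question.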
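Because every row pairs a CREATE statement with a SELECT statement, a reference row can itself be sanity-checked by running both against an in-memory SQLite database. The sketch below does this for the first row of the dataset preview. The helper name `run_reference_row` is hypothetical and not part of the notebook, and the sketch assumes the context holds a single CREATE statement.

```python
import sqlite3


def run_reference_row(context: str, answer: str) -> list:
    """Run a dataset row's CREATE context and SELECT answer; return the fetched rows."""
    conn = sqlite3.connect(":memory:")
    try:
        cursor = conn.cursor()
        cursor.execute(context)  # build the (empty) table described by the context
        cursor.execute(answer)   # run the reference query against that table
        return cursor.fetchall()
    finally:
        conn.close()


rows = run_reference_row(
    "CREATE TABLE head (age INTEGER)",
    "SELECT COUNT(*) FROM head WHERE age > 56",
)
print(rows)  # the table is empty, so the count is 0: [(0,)]
```

A row that raises `sqlite3.Error` here would point to a reference query that SQLite does not accept, which is worth knowing before using it as a baseline.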

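In this demonstration, we will use the LLM to answer the natural language question. The LLM is expected to generate a CREATE SQL statement that creates a context suitable for answering the user's question, together with a corresponding SELECT SQL query that fully answers it.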

The dataset looks like this:

sql_df = dataset['train'].to_pandas()
sql_df.head()

	answer	question	context
0	SELECT COUNT(*) FROM head WHERE age > 56	How many heads of the departments are older th...	CREATE TABLE head (age INTEGER)
1	SELECT name, born_state, age FROM head ORDER B...	List the name, born state and age of the heads...	CREATE TABLE head (name VARCHAR, born_state VA...
2	SELECT creation, name, budget_in_billions FROM...	List the creation year, name and budget of eac...	CREATE TABLE department (creation VARCHAR, nam...
3	SELECT MAX(budget_in_billions), MIN(budget_in_...	What are the maximum and minimum budget of the...	CREATE TABLE department (budget_in_billions IN...
4	SELECT AVG(num_employees) FROM department WHER...	What is the average number of employees of the...	CREATE TABLE department (num_employees INTEGER...

Test development

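To test the output of the LLM generations, we will develop two unit tests and one evaluation, which together form a basic evaluation framework for grading the quality of our LLM iterations.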

To reiterate, our aim is to measure the correctness and consistency of the LLM output.

Unit tests

Unit tests should test the most granular components of your LLM application.

In this section, we will develop unit tests to check the following:

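test_valid_schema checks that the LLM returns a response that can be parsed into create and select statements.
test_llm_sql executes both the create and select statements on a sqlite database to ensure they are syntactically correct.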

from pydantic import BaseModel


class LLMResponse(BaseModel):
    """This is the structure that we expect the LLM to respond with.

    The LLM should respond with a JSON string with `create` and `select` fields.
    """
    create: str
    select: str

Prompting the LLM

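In this demonstration, we use a fairly simple prompt asking GPT to generate a (context, answer) pair. context is the CREATE SQL statement and answer is the SELECT SQL statement. We supply the natural language question as part of the prompt and request the response in JSON format so that it can be parsed easily.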

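system_prompt = """Translate this natural language request into a JSON object containing two SQL queries.
The first query should be a CREATE statement for a table answering the user's request, while the second should be a SELECT query answering their question."""

# Send a message array to GPT and request a response (ensure your API key is loaded into the environment variables for this step)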
client = OpenAI()

def get_response(system_prompt, user_message, model=GPT_MODEL):
    messages = []
    messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})

    response = client.beta.chat.completions.parse(
        model=model,
        messages=messages,
        response_format=LLMResponse,
    )
    return response.choices[0].message.content

question = sql_df.iloc[0]['question']
content = get_response(system_prompt, question)
print("Question:", question)
print("Answer:", content)

Question: How many heads of the departments are older than 56 ?
Answer: {"create":"CREATE TABLE DepartmentHeads (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT,
    department VARCHAR(100)
);","select":"SELECT COUNT(*) AS NumberOfHeadsOlderThan56 
FROM DepartmentHeads 
WHERE age > 56;"}

Check JSON formatting

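Our first simple unit test checks that the LLM response can be parsed into the LLMResponse Pydantic class that we defined.

We will check that the first response passes, then create a failing example to confirm that the check fails. This logic is wrapped in a simple function, test_valid_schema.

We expect GPT to respond with a JSON object that has the agreed create and select fields, and test_valid_schema validates this against the LLMResponse base model. Whether the SQL inside those fields is valid is checked separately later on.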

def test_valid_schema(content):
    """Tests whether the content provided can be parsed into our Pydantic model."""
    try:
        LLMResponse.model_validate_json(content)
        return True
    # Catch pydantic's validation errors:
    except pydantic.ValidationError as exc:
        print(f"Error: Invalid schema: {exc}")
        return False

test_valid_schema(content)

True

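Testing the negative scenario

To simulate a scenario in which we get an invalid JSON response from GPT, we hardcode an invalid JSON string as the response. We expect test_valid_schema to catch the validation error, print it, and return False.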

failing_query = 'CREATE departments, select * from departments'
test_valid_schema(failing_query)

Error: Invalid schema: 1 validation error for LLMResponse
  Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='CREATE departments, select * from departments', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/json_invalid





False

As expected, test_valid_schema catches the validation error, prints it, and returns False.
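The negative case above covers only input that is not JSON at all. Other failure modes, such as a missing field or a non-string value, are worth covering too. The sketch below is a dependency-free approximation of the same check using only the standard library. `is_valid_llm_json` and the sample cases are illustrative and not part of the notebook, and the behaviour is not guaranteed to match pydantic's validation in every edge case.

```python
import json

REQUIRED_FIELDS = ("create", "select")


def is_valid_llm_json(content: str) -> bool:
    """Return True if content is a JSON object whose `create` and `select` values are strings."""
    try:
        payload = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(payload, dict):
        return False
    return all(isinstance(payload.get(field), str) for field in REQUIRED_FIELDS)


# Illustrative inputs: one valid response and three different ways of failing
cases = {
    "valid": '{"create": "CREATE TABLE t (id INT)", "select": "SELECT * FROM t"}',
    "not_json": "CREATE departments, select * from departments",
    "missing_field": '{"create": "CREATE TABLE t (id INT)"}',
    "wrong_type": '{"create": "CREATE TABLE t (id INT)", "select": 42}',
}
results = {name: is_valid_llm_json(text) for name, text in cases.items()}
print(results)
```

Keeping the cases in a dictionary makes it cheap to add a new failure mode whenever one is observed in a real run.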

Test SQL queries

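Next we validate the correctness of the SQL. This test is designed to verify that:

The CREATE SQL statement returned in the GPT response is syntactically correct.
The SELECT SQL statement returned in the GPT response is syntactically correct.

To do this, we will run the returned SQL statements against a sqlite instance. If the statements are valid, the sqlite instance accepts and executes them; otherwise we expect an exception to be raised.

The create_connection function below sets up a sqlite instance (in memory by default) and creates a connection to be used later.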

# Set up SQLite to act as our test database
def create_connection(db_file=":memory:"):
    """Create a database connection to a SQLite database"""
    try:
        conn = sqlite3.connect(db_file)
        # print(sqlite3.version)
    except Error as e:
        print(e)
        return None

    return conn

def close_connection(conn):
    """Close a database connection"""
    try:
        conn.close()
    except Error as e:
        print(e)


conn = create_connection()

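Next, we create the following functions to carry out the syntactic correctness checks.

test_create: Function testing whether the CREATE SQL statement succeeds.
test_select: Function testing whether the SELECT SQL statement succeeds.
test_llm_sql: Wrapper function executing the two tests above.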

def test_select(conn, cursor, select, should_log=True):
    """Tests that a SQLite select query can be executed successfully."""
    try:
        if should_log:
            print(f"Testing select query: {select}")
        cursor.execute(select)
        record = cursor.fetchall()
        if should_log:
            print(f"Result of query: {record}")

        return True

    except sqlite3.Error as error:
        if should_log:
            print("Error while executing select query:", error)
        return False


def test_create(conn, cursor, create, should_log=True):
    """Tests that a SQLite create query can be executed successfully"""
    try:
        if should_log:
            print(f"Testing create query: {create}")
        cursor.execute(create)
        conn.commit()

        return True

    except sqlite3.Error as error:
        if should_log:
            print("Error while creating the SQLite table:", error)
        return False


def test_llm_sql(llm_response, should_log=True):
    """Runs a suite of SQLite tests"""
    try:
        conn = create_connection()
        cursor = conn.cursor()

        create_response = test_create(conn, cursor, llm_response.create, should_log=should_log)

        select_response = test_select(conn, cursor, llm_response.select, should_log=should_log)

        if conn:
            close_connection(conn)

        # Both statements must succeed for the response to pass
        return create_response and select_response

    except sqlite3.Error as error:
        if should_log:
            print("Error while running the SQLite tests:", error)
        return False

# Look at the CREATE and SELECT SQL returned by GPT

test_query = LLMResponse.model_validate_json(content)
print(f"CREATE SQL is: {test_query.create}")
print(f"SELECT SQL is: {test_query.select}")

CREATE SQL is: CREATE TABLE DepartmentHeads (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT,
    department VARCHAR(100)
);
SELECT SQL is: SELECT COUNT(*) AS NumberOfHeadsOlderThan56 
FROM DepartmentHeads 
WHERE age > 56;

# Test that the CREATE and SELECT SQL are valid (we expect this to succeed)

test_llm_sql(test_query)

Testing create query: CREATE TABLE DepartmentHeads (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT,
    department VARCHAR(100)
);
Testing select query: SELECT COUNT(*) AS NumberOfHeadsOlderThan56 
FROM DepartmentHeads 
WHERE age > 56;
Result of query: [(0,)]





True

# Again, run a negative test to confirm that a failing SELECT is reported as an error.

test_failure_query = '{"create": "CREATE TABLE departments (id INT, name VARCHAR(255), head_of_department VARCHAR(255))", "select": "SELECT COUNT(*) FROM departments WHERE age > 56"}'
test_failure_query = LLMResponse.model_validate_json(test_failure_query)
test_llm_sql(test_failure_query)

Testing create query: CREATE TABLE departments (id INT, name VARCHAR(255), head_of_department VARCHAR(255))
Testing select query: SELECT COUNT(*) FROM departments WHERE age > 56
Error while executing select query: no such column: age





False
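The tests above return only True or False, and the reason for a failure is printed rather than returned. When many rows are run with logging switched off, it can help to keep the error message alongside the result. The sketch below is one possible variant, assuming only the standard library. The name `check_sql_pair` and its return shape are hypothetical, and `contextlib.closing` is used so the connection is released on every path.

```python
import sqlite3
from contextlib import closing
from typing import Optional, Tuple


def check_sql_pair(create: str, select: str) -> Tuple[bool, Optional[str]]:
    """Run CREATE then SELECT on a fresh in-memory database.

    Returns (True, None) on success, or (False, "<stage>: <sqlite message>") on failure.
    """
    with closing(sqlite3.connect(":memory:")) as conn:
        for stage, statement in (("create", create), ("select", select)):
            try:
                conn.execute(statement)
            except sqlite3.Error as error:
                return False, f"{stage}: {error}"
    return True, None


# Same failing pair as the negative test: the SELECT refers to a column that was never created
ok, error = check_sql_pair(
    "CREATE TABLE departments (id INT, name VARCHAR(255), head_of_department VARCHAR(255))",
    "SELECT COUNT(*) FROM departments WHERE age > 56",
)
print(ok, error)
```

Storing the returned message with each record makes it possible to group failures by cause when reviewing a run.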

Using an LLM to evaluate relevancy

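Next, we evaluate whether the generated SQL actually answers the user's question. This test is performed by gpt-4o-mini, which assesses how relevant the produced SQL query is when compared with the initial user request.

This is a simple example that adapts an approach outlined in the G-Eval paper and tested in one of our other cookbooks.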

EVALUATION_MODEL = "gpt-4o-mini"

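EVALUATION_PROMPT_TEMPLATE = """
You will be given a natural language request and the SQL queries generated in response to it. Your task is to rate the queries on one metric.
Please make sure you read and understand these instructions carefully.
Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

{criteria}

Evaluation Steps:

{steps}

Example:

Request:

{request}

Queries:

{queries}

Evaluation Form (scores ONLY):

- {metric_name}
"""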

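# Relevance

RELEVANCY_SCORE_CRITERIA = """
Relevance (1-5) - review of how relevant the produced SQL queries are to the original question.
The queries should contain all points highlighted in the user's request.
Annotators were also asked to penalize queries that contained redundancies and excess information.
"""

RELEVANCY_SCORE_STEPS = """

1. Read the request and the queries carefully.
2. Compare the queries to the request document and identify the main points of the request.
3. Assess how well the queries cover the main points of the request, and how much irrelevant or redundant information they contain.
4. Assign a relevance score from 1 to 5.
"""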

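def get_geval_score(
    criteria: str, steps: str, request: str, queries: str, metric_name: str
):
    """Given evaluation criteria and an observation, this function uses the
    evaluation model (EVALUATION_MODEL) to score the observation against those criteria.
    """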
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        request=request,
        queries=queries,
        metric_name=metric_name,
    )
    response = client.chat.completions.create(
        model=EVALUATION_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content
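`get_geval_score` returns the raw message text, so the score reaches the caller as a string such as `'5'`. If scores are to be averaged or compared numerically they need to be converted first, and a reply that is not a bare digit should not crash the run. The helper below is a sketch under that assumption. `parse_score` is a hypothetical name and not part of the notebook.

```python
import re
from typing import Optional


def parse_score(raw: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first standalone integer within [low, high] found in a model reply, else None."""
    for match in re.finditer(r"\b\d+\b", raw):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


print(parse_score("5"))             # 5
print(parse_score("relevancy: 4"))  # 4
print(parse_score("N/A"))           # None
```

Returning `None` instead of raising lets the caller decide whether to drop, retry, or flag records whose score could not be read.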

# Test out the evaluation on a few records

evaluation_results = []

for x,y in sql_df.head(3).iterrows():
    score = get_geval_score(
        RELEVANCY_SCORE_CRITERIA,
        RELEVANCY_SCORE_STEPS,
        y['question'],
        y['context'] + '\n' + y['answer'],'relevancy'
    )
    evaluation_results.append((y['question'],y['context'] + '\n' + y['answer'],score))

for result in evaluation_results:
    print(f"User question \t: {result[0]}")
    print(f"CREATE SQL returned \t: {result[1].splitlines()[0]}")
    print(f"SELECT SQL returned \t: {result[1].splitlines()[1]}")
    print(f"{result[2]}")
    print("*" * 20)
    print("*" * 20)

User question        : How many heads of the departments are older than 56 ?
CREATE SQL returned  : CREATE TABLE head (age INTEGER)
SELECT SQL returned  : SELECT COUNT(*) FROM head WHERE age > 56
5
********************
User question        : List the name, born state and age of the heads of departments ordered by age.
CREATE SQL returned  : CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)
SELECT SQL returned  : SELECT name, born_state, age FROM head ORDER BY age
4
********************
User question        : List the creation year, name and budget of each department.
CREATE SQL returned  : CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR)
SELECT SQL returned  : SELECT creation, name, budget_in_billions FROM department
4
********************

Evaluation

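We will now combine these functions, both the unit tests and the evaluation, to test out two system prompts.

Each iteration of input/output and scores should be stored as a run. Optionally, you can add GPT-4 annotation within your evaluations, or as a separate step, to review an entire run and highlight the reasons for errors.

In this example, the second system prompt includes an extra line of clarification, so that we can assess its impact on both SQL validity and the quality of the solution.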
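One lightweight way to store each iteration as a run is an append-only JSON Lines file, with one record per question. The sketch below assumes that layout. The `RunRecord` fields, the helper names, and the file name `runs.jsonl` are illustrative choices rather than anything the notebook prescribes, and the sample record reuses the first dataset row purely as example data.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class RunRecord:
    run: int
    question: str
    response: str
    format_valid: bool
    sql_valid: bool
    evaluation_score: str


def append_records(path: Path, records: List[RunRecord]) -> None:
    """Append one JSON object per line so that earlier runs are never overwritten."""
    with path.open("a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(asdict(record)) + "\n")


def load_records(path: Path) -> List[RunRecord]:
    """Read every stored record back into RunRecord instances."""
    with path.open(encoding="utf-8") as handle:
        return [RunRecord(**json.loads(line)) for line in handle if line.strip()]


# Example data only: one question recorded under two runs
question = "How many heads of the departments are older than 56 ?"
response = '{"create": "CREATE TABLE head (age INTEGER)", "select": "SELECT COUNT(*) FROM head WHERE age > 56"}'
records = [
    RunRecord(1, question, response, True, True, "5"),
    RunRecord(2, question, response, True, True, "5"),
]

log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
append_records(log_path, records)
loaded = load_records(log_path)
print(len(loaded), "records stored")
```

Because the file is only ever appended to, the history of earlier runs stays available for later comparison.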

Building the test framework

We want to build a function, test_system_prompt, which runs our unit tests and evaluation against a given system prompt.

def execute_unit_tests(input_df, output_list, system_prompt):
    """Unit testing function that takes in a dataframe and appends test results to output_list."""

    for x, y in tqdm(input_df.iterrows(), total=len(input_df)):
        model_response = get_response(system_prompt, y['question'])

        format_valid = test_valid_schema(model_response)

        try:
            test_query = LLMResponse.model_validate_json(model_response)
            # Avoid logging since we are executing many rows at once
            sql_valid = test_llm_sql(test_query, should_log=False)
        except Exception:
            # Treat any parsing or execution failure as invalid SQL
            sql_valid = False

        output_list.append((y['question'], model_response, format_valid, sql_valid))

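def evaluate_row(row):
    """Simple evaluation function to categorize unit testing results.

    Returns a label if the format or the SQL is flagged as invalid, otherwise a label for correct SQL."""
    if not row['format']:
        return 'Format incorrect'
    elif not row['sql']:
        return 'SQL incorrect'
    else:
        return 'SQL correct'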

def test_system_prompt(test_df, system_prompt):
    # Execute the unit tests and capture the results
    results = []
    execute_unit_tests(
        input_df=test_df,
        output_list=results,
        system_prompt=system_prompt
    )

    results_df = pd.DataFrame(results)
    results_df.columns = ['question','response','format','sql']

    # Use `apply` to calculate the geval score and the unit test evaluation for each generated response
    results_df['evaluation_score'] = results_df.apply(
        lambda x: get_geval_score(
            RELEVANCY_SCORE_CRITERIA,
            RELEVANCY_SCORE_STEPS,
            x['question'],
            x['response'],
            'relevancy'
        ),
        axis=1
    )
    results_df['unit_test_evaluation'] = results_df.apply(
        lambda x: evaluate_row(x),
        axis=1
    )
    return results_df

System Prompt 1

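The system under test is the first system prompt, shown below. This run generates responses for this system prompt and evaluates them using the functions we have created so far.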

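system_prompt = """Translate this natural language request into a JSON object containing two SQL queries.

The first query should be a CREATE statement for a table answering the user's request, while the second should be a SELECT query answering their question.
"""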

# Select 50 unseen queries (the last 50 rows) to test this prompt
test_df = sql_df.tail(50)

results_df = test_system_prompt(test_df, system_prompt)


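We can now group the outcomes of:

the unit tests, which check the structure of the response and whether the SQL executes; and
the evaluation, which scores how relevant the SQL is to the user's question.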

results_df['unit_test_evaluation'].value_counts()

unit_test_evaluation
SQL correct      46
SQL incorrect     4
Name: count, dtype: int64

results_df['evaluation_score'].value_counts()

evaluation_score
5    33
4    16
3     1
Name: count, dtype: int64

System Prompt 2

We now run the same unit tests and evaluation with the new system prompt.

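system_prompt_2 = """Translate this natural language request into a JSON object containing two SQL queries.

The first query should be a CREATE statement for a table answering the user's request, while the second should be a SELECT query answering their question.

Ensure the SQL is always generated on one line, never use \\n to separate rows."""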


results_2_df = test_system_prompt(test_df, system_prompt_2)


As above, we can group the unit test and evaluation results.

results_2_df['unit_test_evaluation'].value_counts()

unit_test_evaluation
SQL correct      44
SQL incorrect     6
Name: count, dtype: int64

results_2_df['evaluation_score'].value_counts()

evaluation_score
5    34
4    15
3     1
Name: count, dtype: int64
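The second prompt differs from the first only in the instruction to keep each SQL statement on one line, and neither the unit tests nor the relevancy score checks whether that instruction was followed. A small additional check could be applied to each parsed `create` and `select` value. The helper name `is_single_line` below is hypothetical and the sample statements are taken from earlier in this notebook.

```python
def is_single_line(sql: str) -> bool:
    """Return True if the statement contains no line breaks, ignoring surrounding whitespace."""
    stripped = sql.strip()
    return "\n" not in stripped and "\r" not in stripped


# A multi-line SELECT like the one returned for the first prompt, and a one-line reference answer
multi_line = "SELECT COUNT(*) AS NumberOfHeadsOlderThan56 \nFROM DepartmentHeads \nWHERE age > 56;"
one_line = "SELECT COUNT(*) FROM head WHERE age > 56"

print(is_single_line(multi_line), is_single_line(one_line))  # False True
```

Counting how many responses pass this check in each run would show directly whether the extra instruction changed the model's behaviour.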

Report

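We will create a simple DataFrame to store and display the run performance. This is where you could use tools such as Weights & Biases Prompts or Gantry to store the results for analytics across your different iterations.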

results_df['run'] = 1
results_df['Evaluating Model'] = 'gpt-4'

results_2_df['run'] = 2
results_2_df['Evaluating Model'] = 'gpt-4'

run_df = pd.concat([results_df,results_2_df])
run_df.head()

	question	response	format	sql	evaluation_score	unit_test_evaluation	run	Evaluating Model
0	What venue did the parntership of shoaib malik...	{"create":"CREATE TABLE cricket_partnerships (...	True	True	5	SQL correct	1	gpt-4
1	What venue did the partnership of herschelle g...	{"create":"CREATE TABLE CricketPartnerships (\...	True	True	5	SQL correct	1	gpt-4
2	What is the number Played that has 310 Points ...	{"create":"CREATE TABLE game_stats (\n numb...	True	True	5	SQL correct	1	gpt-4
3	What Losing bonus has a Points against of 588?	{"create":"CREATE TABLE BonusInfo (\n id IN...	True	True	5	SQL correct	1	gpt-4
4	What Tries against has a Losing bonus of 7?	{"create":"CREATE TABLE matches (\n id SERI...	True	True	5	SQL correct	1	gpt-4

Plotting unit test results

We can create a simple bar chart to visualise the unit test results for the two runs.

unittest_df_pivot = pd.pivot_table(
    run_df,
    values='format',
    index=['run','unit_test_evaluation'],
    aggfunc='count'
)
unittest_df_pivot.columns = ['Number of records']
unittest_df_pivot

		Number of records
run	unit_test_evaluation
1	SQL correct	46
1	SQL incorrect	4
2	SQL correct	44
2	SQL incorrect	6
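Raw counts are easier to compare once they are converted to pass rates. Using the counts in the table above (46 of 50 for run 1 and 44 of 50 for run 2), the arithmetic can be sketched as follows. The dictionary layout and the `pass_rate` helper are illustrative.

```python
# Counts copied from the pivot table above
counts = {
    1: {"SQL correct": 46, "SQL incorrect": 4},
    2: {"SQL correct": 44, "SQL incorrect": 6},
}


def pass_rate(run_counts: dict) -> float:
    """Share of records labelled 'SQL correct' among all records in the run."""
    total = sum(run_counts.values())
    return run_counts.get("SQL correct", 0) / total if total else 0.0


rates = {run: pass_rate(run_counts) for run, run_counts in counts.items()}
for run, rate in rates.items():
    print(f"Run {run}: {rate:.0%}")  # Run 1: 92%, Run 2: 88%
```

With only 50 records per run, a difference of two records may well fall within the run-to-run variation expected from a non-deterministic model, so a larger test set or repeated runs would be needed before drawing a firm conclusion.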

unittest_df_pivot.reset_index(inplace=True)

# Plotting
plt.figure(figsize=(10, 6))

# Set the width of each bar
bar_width = 0.35

# OpenAI brand colors
openai_colors = ['#00D1B2', '#000000']  # Green and black

# Get unique runs and unit test evaluations
unique_runs = unittest_df_pivot['run'].unique()
unique_unit_test_evaluations = unittest_df_pivot['unit_test_evaluation'].unique()

# Repeat the color pattern if there are not enough colors
colors = openai_colors * (len(unique_runs) // len(openai_colors) + 1)

# Iterate over each run to plot
for i, run in enumerate(unique_runs):
    run_data = unittest_df_pivot[unittest_df_pivot['run'] == run]

    # Position of the bars for this run
    positions = np.arange(len(unique_unit_test_evaluations)) + i * bar_width

    plt.bar(positions, run_data['Number of records'], width=bar_width, label=f'Run {run}', color=colors[i])

# Set the x-axis labels to the unit test evaluations, centered under each group
plt.xticks(np.arange(len(unique_unit_test_evaluations)) + bar_width / 2, unique_unit_test_evaluations)

plt.xlabel('Unit Test Evaluation')
plt.ylabel('Number of Records')
plt.title('Unit Test Evaluations vs Number of Records for Each Run')
plt.legend()
plt.show()

[Figure: grouped bar chart of unit test evaluation counts for each run]

Plotting evaluation results

We can plot the evaluation results in the same way.

evaluation_df_pivot = pd.pivot_table(
    run_df,
    values='format',
    index=['run','evaluation_score'],
    aggfunc='count'
)
evaluation_df_pivot.columns = ['Number of records']
evaluation_df_pivot

		Number of records
run	evaluation_score
1	3	1
	4	16
	5	33
2	3	1
	4	15
	5	34
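The score distribution can also be condensed into a single mean per run. The sketch below works through that arithmetic with the counts from the table above. The dictionary layout and the `mean_score` helper are illustrative.

```python
# Score counts copied from the pivot table above: {run: {score: number of records}}
score_counts = {
    1: {3: 1, 4: 16, 5: 33},
    2: {3: 1, 4: 15, 5: 34},
}


def mean_score(counts: dict) -> float:
    """Weighted mean of the scores, using the record counts as weights."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total


means = {run: mean_score(counts) for run, counts in score_counts.items()}
for run, mean in means.items():
    print(f"Run {run}: mean relevancy {mean:.2f}")  # 4.64 for run 1, 4.66 for run 2
```

For run 1 this is (3×1 + 4×16 + 5×33) / 50 = 232 / 50 = 4.64, and for run 2 it is 233 / 50 = 4.66, so the two runs are nearly indistinguishable on this metric.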

# Reset the index so that 'run' and 'evaluation_score' become regular columns
evaluation_df_pivot.reset_index(inplace=True)

# Plotting
plt.figure(figsize=(10, 6))

bar_width = 0.35

# OpenAI brand colors
openai_colors = ['#00D1B2', '#000000']  # Green, black

# Identify unique runs and evaluation scores
unique_runs = evaluation_df_pivot['run'].unique()
unique_evaluation_scores = evaluation_df_pivot['evaluation_score'].unique()

# Repeat colors if there are more runs than colors
colors = openai_colors * (len(unique_runs) // len(openai_colors) + 1)

for i, run in enumerate(unique_runs):
    # Select only the rows for this run
    run_data = evaluation_df_pivot[evaluation_df_pivot['run'] == run].copy()

    # Ensure every 'evaluation_score' is present
    run_data.set_index('evaluation_score', inplace=True)
    run_data = run_data.reindex(unique_evaluation_scores, fill_value=0)
    run_data.reset_index(inplace=True)

    # Plot each bar
    positions = np.arange(len(unique_evaluation_scores)) + i * bar_width
    plt.bar(
        positions,
        run_data['Number of records'],
        width=bar_width,
        label=f'Run {run}',
        color=colors[i]
    )

# Configure the x-axis to show evaluation scores under the grouped bars
plt.xticks(np.arange(len(unique_evaluation_scores)) + bar_width / 2, unique_evaluation_scores)

plt.xlabel('Evaluation Score')
plt.ylabel('Number of Records')
plt.title('Evaluation Scores vs Number of Records for Each Run')
plt.legend()
plt.show()

[Figure: grouped bar chart of evaluation score counts for each run]

Conclusion

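You now have a framework for testing SQL generation with LLMs, and with some tweaks this approach can be extended to many other code generation use cases. With GPT-4 and engaged human labellers, you can aim to automate the evaluation of these test cases, creating an iterative loop in which new examples are added to the test set and this structure detects any performance regressions.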
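Regression detection of this kind can be as simple as comparing a new run's pass count with a stored baseline. The sketch below uses the unit test counts from the two runs in this notebook. The function name `is_regression` and the tolerance of three records are arbitrary, illustrative choices.

```python
def is_regression(baseline_passed: int, new_passed: int, total: int, tolerance: int = 3) -> bool:
    """Flag a regression when the new run passes more than `tolerance` fewer records than the baseline."""
    if not 0 <= baseline_passed <= total or not 0 <= new_passed <= total:
        raise ValueError("pass counts must be between 0 and total")
    return (baseline_passed - new_passed) > tolerance


print(is_regression(46, 44, total=50))  # False: a drop of 2 is within the tolerance of 3
print(is_regression(46, 40, total=50))  # True: a drop of 6 exceeds the tolerance
```

The tolerance exists because the model is non-deterministic, so small differences between runs do not necessarily indicate a real change in quality.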

We hope you find this useful, and we welcome any feedback.