Getting Started with OpenAI Evals
Note: OpenAI now offers a hosted Evals product, complete with an API! We recommend using it instead. See Evals.
The OpenAI Evals framework consists of:
- A framework to evaluate an LLM (large language model) or a system built on top of an LLM.
- An open-source registry of challenging evals.
This notebook will cover:
- Introduction to evals and the OpenAI Evals library
- Building an eval
- Running an eval
What are evals?
Evaluation is the process of validating and testing the outputs that your LLM applications produce. Having strong evaluations ("evals") means a more stable, reliable application that is resilient to changes in code and model. An eval is a task used to measure the quality of the output of an LLM or LLM system. Given an input prompt, an output is generated. We evaluate this output against a set of ideal answers to determine the quality of the LLM system.
Importance of evaluation
If you are building with foundation models like GPT-4, creating high-quality evals is one of the most impactful things you can do. Developing AI solutions involves an iterative design process. Without evals, it can be very difficult and time-intensive to understand how different model versions and prompts affect your use case.
With OpenAI's continuous model upgrades, evals allow you to efficiently test model performance for your use cases in a standardized way. Developing a suite of evals tailored to your objectives will help you quickly and effectively understand how new models may perform for your use cases. You can also make evals part of your CI/CD pipeline to ensure you achieve the desired accuracy before deploying.
Types of evals
There are two main ways we can evaluate or grade completions: writing some validation logic in code, or using the model itself to inspect the answer. We'll walk through each with some examples.
Writing logic for answer-checking
The simplest and most common type of eval has an input and an ideal response or answer. For example, we can have an eval sample where the input is "What year was Obama first elected president?" and the ideal answer is "2008". We feed the input to a model and get a completion. If the model says "2008", it is graded as correct. We can write a string match to check whether the completion includes the phrase "2008"; if it does, we consider it correct.
Consider another eval where the input asks for valid JSON to be generated: we can write some code that attempts to parse the completion as JSON and then consider the completion correct if it is parsable.
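A minimal sketch of both checks described above (the helper names and example strings are illustrative, not part of the Evals library):
import json

def grade_string_match(completion: str, ideal: str) -> bool:
    # Correct if the completion contains the ideal answer anywhere in the text.
    return ideal in completion

def grade_valid_json(completion: str) -> bool:
    # Correct if the completion parses as JSON.
    try:
        json.loads(completion)
        return True
    except json.JSONDecodeError:
        return False

print(grade_string_match("Obama was first elected president in 2008.", "2008"))  # True
print(grade_valid_json('{"year": 2008}'))  # True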
Model grading: a two-stage process where the model first answers the question, and we then ask a model to look at the response to check whether it is correct.
Consider an input that asks the model to write a funny joke. The model generates a completion. We then create a new input to the model asking: "Is the following joke funny? First reason step by step, then answer yes or no", which includes the completion. We finally consider the original completion correct if the new model completion ends with "yes".
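A rough sketch of this two-stage flow (the joke prompt, the grading prompt wording, and the model choices here are illustrative assumptions, not the library's implementation):
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Stage 1: the model answers the original task.
joke = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a funny joke about databases."}],
).choices[0].message.content

# Stage 2: a (preferably stronger) grading model reasons step by step, then gives a yes/no verdict.
grading_prompt = (
    "Is the following joke funny?\n\n"
    f"{joke}\n\n"
    "First reason step by step, then answer yes or no on the final line."
)
verdict = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": grading_prompt}],
).choices[0].message.content

# The original completion is graded correct if the verdict ends with "yes".
is_correct = verdict.strip().lower().rstrip(".").endswith("yes")
print(is_correct)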
Model grading works best with the latest, most powerful models like GPT-4, and when we give them the ability to reason before making a judgment. Model grading will have an error rate, so it is important to validate its performance with human evaluation before running evals at scale. For best results, it makes sense to use a different model for grading than the one that produced the completion, for example using GPT-4 to grade GPT-3.5 answers.
OpenAI Eval templates
In using evals, we have discovered several "templates" that accommodate many different benchmarks. We have implemented these templates in the OpenAI Evals library to simplify the development of new evals. For example, we define two types of eval templates that can be used out of the box:
- Basic eval templates: These contain deterministic functions to compare the output to the ideal answer. In cases where the desired model response has very little variation, such as answering multiple-choice questions or simple questions with a straightforward answer, we have found these templates useful (a registry sketch follows this list).
- Model-graded templates: These contain functions where an LLM compares the output to the ideal answer and attempts to judge its accuracy. In cases where the desired model response can vary significantly, such as answering an open-ended question, we have found that having the model grade itself is a viable strategy for automated evaluation.
Setup
First, go to github.com/openai/evals, clone the repository with git clone git@github.com:openai/evals.git, and follow the setup instructions.
To run evals later in this notebook, you will need to set up and specify your OpenAI API key. After obtaining an API key, specify it using the OPENAI_API_KEY environment variable.
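For example, you could add a quick sanity check before making any API calls (a minimal sketch; set the variable in your shell or however you normally manage secrets):
import os

# Fail early if the key is missing so you don't hit authentication errors mid-run.
assert os.environ.get("OPENAI_API_KEY"), "Please set the OPENAI_API_KEY environment variable"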
Please be aware of the costs associated with using the API when running evals.
from openai import OpenAI
import pandas as pd
client = OpenAI()
Building an evaluation for the OpenAI Evals framework
At its core, an eval is a dataset and an eval class defined in a YAML file. To start creating an eval, we need:
- A test dataset in the jsonl format.
- The eval template to be used.
Creating the eval dataset
Let's create a dataset for evaluating a model's ability to generate syntactically correct SQL. In this use case, we have a series of tables related to car manufacturing.
First, we need to create a system prompt that we'd like to evaluate. We'll pass in instructions for the model as well as an overview of the table structure:
"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"
For this prompt, we can ask a specific question:
"Q: how many car makers are their in germany?"
And we have an expected answer:
"A: SELECT count ( * ) FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country = T2.CountryId WHERE T2.CountryName = 'germany'"
The dataset needs to be in the following format:
"input": [{"role": "system", "content": "<input prompt>"}, {"role": "user", "content": <user input>}, "ideal": "correct answer"]
Putting it all together, we get:
{"input": [{"role": "system", "content": "TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\n"}, {"role": "system", "content": "Q: how many car makers are their in germany"}, "ideal": ["A: SELECT count ( * ) FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country = T2.CountryId WHERE T2.CountryName = 'germany'"]}
One way to speed up the process of building eval datasets is to use GPT-4 to generate synthetic data.
## Use GPT-4 to generate synthetic data
# Define the system prompt and user input (these should be filled in to match your specific use case)
system_prompt = """You are a helpful assistant that can ask questions about a database table and write SQL queries to answer the question.
A user will pass in a table schema and your job is to return a question answer pairing. The question should be relevant to the schema of the table,
and you can speculate on its contents. You will then have to generate a SQL query to answer the question. Below are some examples of what this should look like.
Example 1
```````````
User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
Assistant Response:
Q: How many visitors have visited the museum with the most staff?
A: SELECT count ( * ) FROM VISIT AS T1 JOIN MUSEUM AS T2 ON T1.Museum_ID = T2.Museum_ID WHERE T2.Num_of_Staff = ( SELECT max ( Num_of_Staff ) FROM MUSEUM )
```````````
Example 2
```````````
User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
Assistant Response:
Q: What are the names who have a membership level higher than 4?
A: SELECT Name FROM VISITOR AS T1 WHERE T1.Level_of_membership > 4
```````````
Example 3
```````````
User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
Assistant Response:
Q: How many tickets of customer id 5?
A: SELECT count ( * ) FROM VISIT AS T1 JOIN VISITOR AS T2 ON T1.visitor_ID = T2.ID WHERE T2.ID = 5
```````````
"""
user_input = "Table car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"
messages = [{
"role": "system",
"content": system_prompt
},
{
"role": "user",
"content": user_input
}
]
completion = client.chat.completions.create(
model="gpt-4-turbo-preview",
messages=messages,
temperature=0.7,
n=5
)
for choice in completion.choices:
print(choice.message.content + "\n")
Q: What is the average horsepower for cars made in Europe?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'
Q: What is the average horsepower for cars made in the USA?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'
Q: What is the average horsepower for cars produced in countries from the continent with the id '3'?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.ContId = '3'
Q: What is the average horsepower for cars made by makers from Europe?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'
Q: What is the average horsepower for cars made in the USA?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'
Once we have the synthetic data, we need to convert it to match the format of the eval dataset.
eval_data = []
input_prompt = "TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"
for choice in completion.choices:
    question = choice.message.content.split("Q: ")[1].split("\n")[0]  # Extract the question
    answer = choice.message.content.split("\nA: ")[1].split("\n")[0]  # Extract the answer
eval_data.append({
"input": [
{"role": "system", "content": input_prompt},
{"role": "user", "content": question},
],
"ideal": answer
})
for item in eval_data:
print(item)
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in Europe?'}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'"}
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in the USA?'}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'"}
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': "What is the average horsepower for cars produced in countries from the continent with the id '3'?"}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.ContId = '3'"}
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made by makers from Europe?'}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'"}
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in the USA?'}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'"}
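The framework expects these samples as a JSONL file on disk. A minimal sketch for writing them out (the output path below is an assumption; in practice it should match the samples_jsonl path referenced by the eval YAML in the next step, e.g. evals/registry/data/sql/spider_sql.jsonl):
import json

# Write one JSON object per line (JSONL), matching the dataset format shown above.
output_path = "spider_sql.jsonl"  # adjust to your registry's data directory
with open(output_path, "w") as f:
    for sample in eval_data:
        f.write(json.dumps(sample) + "\n")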
Next, we need to create the eval registry entry to run it in the framework.
The Evals framework requires a .yaml file structured with the following properties:
- id - an identifier for your eval
- description - a short description of your eval
- disclaimer - an additional note about your eval
- metrics - there are three types of eval metrics we can choose from: match, includes, fuzzyMatch
For our eval, we will configure the following:
"""
spider-sql:
id: spider-sql.dev.v0
metrics: [accuracy]
description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.
Yu, Tao, et al. \"Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.\" Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.
disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.
spider-sql.dev.v0:
class: evals.elsuite.modelgraded.classify:ModelBasedClassify
args:
samples_jsonl: sql/spider_sql.jsonl
eval_type: cot_classify
modelgraded_spec: sql
"""""
'\nspider-sql:\n id: spider-sql.dev.v0\n metrics: [accuracy]\n description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.\n Yu, Tao, et al. "Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.\n disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.\nspider-sql.dev.v0:\n class: evals.elsuite.modelgraded.classify:ModelBasedClassify\n args:\n samples_jsonl: sql/spider_sql.jsonl\n eval_type: cot_classify\n modelgraded_spec: sql\n '
Running an evaluation
We can run this eval using the oaieval CLI. To get set up, install the library: pip install . (if you are running the OpenAI Evals library locally) or pip install oaieval (if you are running an existing eval).
Then, run the eval using the CLI: oaieval gpt-3.5-turbo spider-sql
This command expects a model name and an eval set name. Note that we provide two command line interfaces (CLIs): oaieval for running a single eval and oaievalset for running a set of evals. The valid eval names are specified in the YAML files under evals/registry/evals, and their corresponding implementations can be found in evals/elsuite.
!pip install evals --quiet
The oaieval CLI accepts various flags to modify its default behavior. You can run oaieval --help to see the full list of CLI options.
oaieval will search for the spider-sql eval YAML file in the evals/registry/evals directory, following the format specified in cell 4 above. The path to the eval dataset is specified in the eval YAML file under the args parameter as samples_jsonl: sql/spider_sql.jsonl, with the file contents in the JSONL format (as generated in step 3 above).
After running that command, you will see the final report of accuracy printed to the console, as well as a file path to a temporary file that contains the full report.
!oaieval gpt-3.5-turbo spider-sql --max_samples 25
[2024-03-26 19:44:39,836] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/evals
[2024-03-26 19:44:43,623] [registry.py:257] Loading registry from /Users/shyamal/.evals/evals
[2024-03-26 19:44:43,635] [oaieval.py:189] [1;35mRun started: 240327024443FACXGMKA [0m
[2024-03-26 19:44:43,663] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/modelgraded
[2024-03-26 19:44:43,851] [registry.py:257] Loading registry from /Users/shyamal/.evals/modelgraded
[2024-03-26 19:44:43,853] [data.py:90] Fetching /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/data/sql/spider_sql.jsonl
[2024-03-26 19:44:43,878] [eval.py:36] Evaluating 25 samples
[2024-03-26 19:44:43,952] [eval.py:144] Running in threaded mode with 10 threads!
0%| | 0/25 [00:00<?, ?it/s][2024-03-26 19:44:44,810] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:44,829] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:44,991] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:45,090] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:45,145] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:45,971] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:46,040] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:46,069] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:46,378] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:46,587] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:47,412] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
4%|█▊ | 1/25 [00:03<01:23, 3.46s/it][2024-03-26 19:44:47,714] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
8%|███▌ | 2/25 [00:03<00:36, 1.60s/it][2024-03-26 19:44:47,947] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
12%|█████▎ | 3/25 [00:03<00:21, 1.02it/s][2024-03-26 19:44:48,413] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:48,643] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
16%|███████ | 4/25 [00:04<00:18, 1.15it/s][2024-03-26 19:44:48,909] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
20%|████████▊ | 5/25 [00:04<00:12, 1.54it/s][2024-03-26 19:44:49,131] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:49,500] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:49,530] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
24%|██████████▌ | 6/25 [00:05<00:12, 1.56it/s][2024-03-26 19:44:49,962] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:49,964] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:49,967] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
28%|████████████▎ | 7/25 [00:06<00:10, 1.73it/s][2024-03-26 19:44:50,577] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:50,602] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:50,634] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:50,862] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:51,503] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:51,608] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
40%|█████████████████▏ | 10/25 [00:07<00:08, 1.79it/s][2024-03-26 19:44:51,801] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
44%|██████████████████▉ | 11/25 [00:07<00:06, 2.09it/s][2024-03-26 19:44:51,856] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:51,969] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:52,227] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
52%|██████████████████████▎ | 13/25 [00:08<00:04, 2.65it/s][2024-03-26 19:44:52,450] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:52,526] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:52,615] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
56%|████████████████████████ | 14/25 [00:08<00:04, 2.64it/s][2024-03-26 19:44:52,625] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:52,777] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:53,653] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
60%|█████████████████████████▊ | 15/25 [00:09<00:05, 1.87it/s][2024-03-26 19:44:53,670] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:54,028] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
68%|█████████████████████████████▏ | 17/25 [00:10<00:03, 2.54it/s][2024-03-26 19:44:54,388] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:54,396] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
72%|██████████████████████████████▉ | 18/25 [00:10<00:02, 2.58it/s][2024-03-26 19:44:54,529] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:54,585] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
76%|████████████████████████████████▋ | 19/25 [00:10<00:02, 2.94it/s][2024-03-26 19:44:54,980] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
80%|██████████████████████████████████▍ | 20/25 [00:11<00:01, 2.82it/s][2024-03-26 19:44:55,152] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
84%|████████████████████████████████████ | 21/25 [00:11<00:01, 3.27it/s][2024-03-26 19:44:56,420] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
88%|█████████████████████████████████████▊ | 22/25 [00:12<00:01, 1.75it/s][2024-03-26 19:44:56,984] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
92%|███████████████████████████████████████▌ | 23/25 [00:13<00:01, 1.76it/s][2024-03-26 19:44:57,370] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
96%|█████████████████████████████████████████▎ | 24/25 [00:13<00:00, 1.94it/s][2024-03-26 19:44:59,589] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
100%|███████████████████████████████████████████| 25/25 [00:15<00:00, 1.60it/s]
[2024-03-26 19:44:59,607] [record.py:360] Final report: {'counts/Correct': 20, 'counts/Incorrect': 5, 'score': 0.8}. Logged to /tmp/evallogs/240327024443FACXGMKA_gpt-3.5-turbo_spider-sql.jsonl
[2024-03-26 19:44:59,608] [oaieval.py:229] Final report:
[2024-03-26 19:44:59,608] [oaieval.py:231] counts/Correct: 20
[2024-03-26 19:44:59,608] [oaieval.py:231] counts/Incorrect: 5
[2024-03-26 19:44:59,608] [oaieval.py:231] score: 0.8
[2024-03-26 19:44:59,640] [record.py:349] Logged 75 rows of events to /tmp/evallogs/240327024443FACXGMKA_gpt-3.5-turbo_spider-sql.jsonl: insert_time=27.915ms
oaievalset expects a model name and an eval set name, for which the valid options are specified in the YAML files under evals/registry/eval_sets.
Going through the eval logs
The eval logs are located at /tmp/evallogs, and a separate log file is created for each evaluation run.
log_name = '240327024443FACXGMKA_gpt-3.5-turbo_spider-sql.jsonl' # "EDIT THIS" - copy from above
events = f"/tmp/evallogs/{log_name}"
display(pd.read_json(events, lines=True).head(5))
spec | final_report | run_id | event_id | sample_id | type | data | created_by | created_at | |
---|---|---|---|---|---|---|---|---|---|
0 | {'completion_fns': ['gpt-3.5-turbo'], 'eval_name': 'spider-sql.dev.v0', 'base_eval': 'spider-sql', 'split': 'dev', 'run_config': {'completion_fns': ['gpt-3.5-turbo'], 'eval_spec': {'cls': 'evals.elsuite.modelgraded.classify:ModelBasedClassify', 'registry_path': '/Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry', 'args': {'samples_jsonl': 'sql/spider_sql.jsonl', 'eval_type': 'cot_classify', 'modelgraded_spec': 'sql'}, 'key': 'spider-sql.dev.v0', 'group': 'sql'}, 'seed': 20220722, 'max_samples': 25, 'command': '/Users/shyamal/.virtualenvs/openai/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 25', 'initial_settings': {'visible': False}}, 'created_by': '', 'run_id': '240327024443FACXGMKA', 'created_at': '2024-03-27 02:44:43.626043'} | NaN | NaN | NaN | NaN | NaN | NaN | NaT | |
1 | NaN | {'counts/Correct': 20, 'counts/Incorrect': 5, 'score': 0.8} | NaN | NaN | NaN | NaN | NaN | NaT | |
2 | NaN | NaN | 240327024443FACXGMKA | 0.0 | spider-sql.dev.88 | sampling | {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct. Table: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text) Table: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number) Table: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number) Question: Find the average rank of winners in all matches. ', 'role': 'system'}], 'sampled': ['SELECT AVG(winner_rank) AS average_rank_of_winners FROM matches;']} | 2024-03-27 02:44:44.821110+00:00 | |
3 | NaN | NaN | 240327024443FACXGMKA | 1.0 | spider-sql.dev.82 | sampling | {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct. Table: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text) Table: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number) Table: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number) Question: Find the total number of matches. ', 'role': 'system'}], 'sampled': ['SELECT COUNT(*) AS total_matches FROM matches;']} | 2024-03-27 02:44:44.831848+00:00 | |
4 | NaN | NaN | 240327024443FACXGMKA | 2.0 | spider-sql.dev.25 | sampling | {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct. Table: continents. Columns: ContId (number), Continent (text) Table: countries. Columns: CountryId (number), CountryName (text), Continent (number) Table: car_makers. Columns: Id (number), Maker (text), FullName (text), Country (text) Table: model_list. Columns: ModelId (number), Maker (number), Model (text) Table: car_names. Columns: MakeId (number), Model (text), Make (text) Table: cars_data. Columns: Id (number), MPG (text), Cylinders (number), Edispl (number), Horsepower (text), Weight (number), Accelerate (number), Year (number) Question: How many countries exist? ', 'role': 'system'}], 'sampled': ['SELECT COUNT(*) AS TotalCountries FROM countries;']} | 2024-03-27 02:44:44.996647+00:00 |
# Process the log events generated by oaieval
with open(events, "r") as f:
events_df = pd.read_json(f, lines=True)
This file will contain structured logs of the evaluation. The first entry provides a detailed specification of the evaluation, including the completion functions, eval name, run config, creator name, run ID, and creation timestamp.
display(events_df.iloc[0].spec)
{'completion_fns': ['gpt-3.5-turbo'],
'eval_name': 'spider-sql.dev.v0',
'base_eval': 'spider-sql',
'split': 'dev',
'run_config': {'completion_fns': ['gpt-3.5-turbo'],
'eval_spec': {'cls': 'evals.elsuite.modelgraded.classify:ModelBasedClassify',
'registry_path': '/Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry',
'args': {'samples_jsonl': 'sql/spider_sql.jsonl',
'eval_type': 'cot_classify',
'modelgraded_spec': 'sql'},
'key': 'spider-sql.dev.v0',
'group': 'sql'},
'seed': 20220722,
'max_samples': 25,
'command': '/Users/shyamal/.virtualenvs/openai/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 25',
'initial_settings': {'visible': False}},
'created_by': '',
'run_id': '240327024443FACXGMKA',
'created_at': '2024-03-27 02:44:43.626043'}
Let's also take a look at the entry which provides the final report of the evaluation.
display(events_df.dropna(subset=['final_report']).iloc[0]['final_report'])
{'counts/Correct': 20, 'counts/Incorrect': 5, 'score': 0.8}
We can also review individual evaluation events, which provide the specific sample (sample_id), result, event type, and metadata.
pd.set_option('display.max_colwidth', None)  # None means no truncation
display(events_df.iloc[2][['run_id', 'event_id', 'sample_id', 'type', 'data', 'created_at']])
run_id 240327024443FACXGMKA
event_id 0.0
sample_id spider-sql.dev.88
type sampling
data {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.
Use only the following tables and columns: Table: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text) Table: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number) Table: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number)
Question: Find the average rank of winners in all matches. ', 'role': 'system'}], 'sampled': ['SELECT AVG(winner_rank) AS average_rank_of_winners FROM matches;']}
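To dig further into a run, you can also filter the log by event type. A minimal sketch that relies only on the columns shown above (other event types and data fields may vary by eval):
# Print the sampled SQL for each prompt from the 'sampling' events.
sampling_events = events_df[events_df["type"] == "sampling"]
for _, row in sampling_events.iterrows():
    print(row["sample_id"], "->", row["data"]["sampled"][0])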