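Using reasoning for data validation

In this guide, we'll explore how to use the o1 model, specifically o1-preview, to perform data validation through reasoning. We'll walk through a practical example involving a synthetic medical dataset and demonstrate how to evaluate the model's accuracy in identifying data issues.

Overview

Data validation is a critical step in ensuring the quality and reliability of datasets, especially in sensitive domains such as healthcare. Traditional validation methods typically rely on predefined rules and patterns. Advanced models like o1, however, can understand context and reason about the data, offering a more flexible and intelligent approach to validation.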
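
To make the contrast concrete, here is a minimal sketch of the kind of predefined rule a traditional validator might apply. The function name `check_allergy_conflict` and the dictionary keys are assumptions made for illustration; they are not part of this guide's code.

```python
def check_allergy_conflict(record):
    """Return an issue string if a current medication exactly matches a listed allergy, else None.

    Hypothetical rule-based check; the field names are assumed for illustration.
    """
    medications = {m.strip().lower() for m in record["Current Medications"].split(";")}
    allergies = {a.strip().lower() for a in record["Allergies"].split(";")}
    # Ignore placeholder values such as "None" that appear in both fields.
    conflicts = (medications & allergies) - {"none", ""}
    if conflicts:
        return f"Prescribed {', '.join(sorted(conflicts))} despite allergy"
    return None


# An exact-match rule catches Penicillin prescribed to a Penicillin-allergic patient...
print(check_allergy_conflict({"Current Medications": "Penicillin", "Allergies": "Penicillin"}))
# ...but not Amoxicillin, which is a penicillin-class drug with a different name.
print(check_allergy_conflict({"Current Medications": "Amoxicillin", "Allergies": "Penicillin"}))
```

The second call returns `None` because the rule only compares names. Closing that gap with rules means maintaining drug-class tables by hand, which is the sort of contextual knowledge a reasoning model can bring to the task.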

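In this tutorial, we will:

  • Generate a synthetic medical dataset that contains inconsistencies.
  • Define a function that takes a row of data and validates its accuracy.
  • Run the validation process and compute accuracy metrics.
  • Analyze and interpret the results.
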
from openai import OpenAI
import json
from IPython.display import display, HTML
from sklearn.metrics import precision_score, recall_score, f1_score
from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
import pandas as pd

client = OpenAI()
MODEL = 'o1-preview'

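Synthetic Data Generation

We will use many of the principles described in the Synthetic Data Generation cookbook to create the foundation of our dataset.

We will prompt the model to generate medical data for our use case. The prompt gives the model detailed instructions on how to create the dataset, which format to follow, and how to seed it with inaccuracies. It also provides a few example rows to get the model started.

Each row in the dataset will contain the following fields:

  • Patient ID: A randomly generated patient ID
  • Date of Birth: The patient's date of birth
  • Gender: M/F
  • Medical History: Past diagnoses
  • Current Medications: Medications the patient is taking
  • Allergies: Identified allergies
  • Lab Results (Glucose mg/dL)
  • Diagnosis: Current diagnosis
  • Treatment Plan: Current treatment plan
  • Is Valid: Whether the current row of data is valid (True/False)
  • Issue: If the row is invalid, what the issue is

Some examples of inaccuracies that may be present in the data are:

  • Prescribing a medication that the patient is allergic to
  • Current medications that do not match the medical history
  • A treatment plan that does not match the diagnosis
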
def generate_data():
    messages = [
        {
            "role": "user",
            "content": """
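You are a helpful assistant designed to generate data. You will be given a format for the data to generate and some examples of the data.

When generating Patient IDs, use the format 'P' followed by a three-digit number (e.g., P006, P941, P319).

Intentionally make some mistakes in the data generation and document them in the appropriate columns ('Is Valid' and 'Issue') if the row of data is invalid.

The types of mistakes to include are:

- **Allergy Contradictions**: Prescribing a medication that the patient is allergic to (e.g., prescribing Penicillin to a patient allergic to Penicillin).
- **Medical History and Medication Mismatch**: A patient with a medical condition not receiving appropriate medication (e.g., a diabetic patient not taking any diabetes medication).
- **Lab Results and Diagnosis Mismatch**: Lab results that do not support the diagnosis (e.g., normal glucose levels but diagnosed with Diabetes Type 2).
- **Other Plausible Mistakes**: Any other realistic errors that could occur in medical records, such as incorrect gender entries, impossible dates of birth, or inconsistent treatment plans.

Ensure that when 'Is Valid' is 'False', the 'Issue' column clearly explains the problem.

Return 100 rows of data for the user. Your response should strictly be in the format of a valid CSV.

Generate a synthetic medical records dataset with the following columns:

    - Patient ID: A randomly generated patient ID
    - Date of Birth: The patient's date of birth
    - Gender: M/F
    - Medical History: Past diagnoses
    - Current Medications: Medications the patient is taking
    - Allergies: Identified allergies
    - Lab Results (Glucose mg/dL)
    - Diagnosis: Current diagnosis
    - Treatment Plan: Current treatment plan
    - Is Valid: Whether the current row of data is valid (True/False)
    - Issue: If the row is invalid, what the issue is

Patient ID,Date of Birth,Gender,Medical History,Current Medications,Allergies,Lab Results (Glucose mg/dL),Diagnosis,Treatment Plan,Is Valid,Issue
P001,1980-05-14,M,Hypertension,Lisinopril,None,110,Hypertension,Continue Lisinopril,True,
P002,1975-11-30,F,Diabetes Type 2,Metformin,Penicillin,90,Diabetes Type 2,Continue Metformin,True,
P003,1990-07-22,F,Asthma,Albuterol,Aspirin,85,Asthma,Prescribe Albuterol,True,
P004,2000-03-10,M,None,Amoxicillin,Penicillin,95,Infection,Prescribe Amoxicillin,False,Prescribed Amoxicillin despite Penicillin allergy
P005,1985-09-18,F,Hyperlipidemia,Atorvastatin,None,200,Hyperlipidemia,Continue Atorvastatin,True,
P006,1978-12-05,M,Hypertension; Diabetes Type 2,Lisinopril; Insulin,None,55,Diabetes Type 2,Adjust insulin dosage,False,Low glucose level not properly addressed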
            """
        }
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )

    return response.choices[0].message.content.replace('```csv', '').replace('```', '')

# Generate one batch of synthetic data with the generate_data function above
data = generate_data()

# Parse the response with csv.reader so quoted fields that contain commas stay intact
generated_rows = list(csv.reader(data.strip().splitlines()))

# Append the generated rows to the medicalData.csv file
with open('../data/medicalData.csv', 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerows(generated_rows)

print("Synthetic data generation and appending completed.")
Synthetic data generation and appending completed.
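
Because the file is opened in append mode, re-running the generation cell adds another batch to the same file, and if the model's reply begins with the header row, that header is appended again as if it were data. The sketch below shows one way to guard against this. The helper `append_rows` is hypothetical and not part of the original notebook.

```python
import csv
import os


def append_rows(path, rows, header):
    """Append rows to a CSV file, writing the header only when the file is new or empty.

    Hypothetical helper; `rows` and `header` are lists of strings.
    """
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        for row in rows:
            if row != header:  # Drop header rows repeated in the model output
                writer.writerow(row)
```

With a helper like this, the file holds a single header line no matter how many batches are appended, which matters later when the reader skips exactly one header row.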

Data Validation

With the dataset prepared, we prompt the reasoning model to review each row of data, report whether it contains an issue, and explain the issue if it does.

Once the model has produced its verdicts, we assess two metrics, the first by comparing against the ground-truth labels and the second with a model grader:

  • How accurately the model identifies which rows contain an issue
  • For the rows it correctly flags as invalid, how accurately it describes the specific issue

Because the grading task is much narrower, a faster model such as gpt-4o is sufficient for it. Note that the validate_issue function below is written with o1-preview as the grader.

REMINDER: Because these models are still in beta, their rate limits are significantly reduced. Adjust the number of concurrent workers accordingly.
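
One way to act on this reminder is to cap the thread pool size and retry failed calls with a backoff. The following is a minimal sketch under assumed settings: `MAX_WORKERS`, `MAX_RETRIES`, `call_with_retry`, and `run_bounded` are hypothetical names and values, not part of the original notebook, and the retry catches every exception rather than only rate-limit errors.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 4   # Assumed value; tune it to your own rate limits
MAX_RETRIES = 3   # Assumed value


def call_with_retry(fn, arg, retries=MAX_RETRIES, base_delay=1.0):
    """Call fn(arg), retrying with exponential backoff when it raises."""
    for attempt in range(retries):
        try:
            return fn(arg)
        except Exception:
            if attempt == retries - 1:
                raise  # Out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


def run_bounded(fn, items, max_workers=MAX_WORKERS):
    """Apply fn to every item with at most max_workers concurrent calls, preserving input order."""
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(call_with_retry, fn, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

A call such as `run_bounded(validate_row, input_data)` would then take the place of a hand-written executor loop, with the worker count controlled in one place.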

def validate_data(input_data):
    messages = [
        {
            "role": "user",
            "content": f"""
You are a helpful assistant designed to validate the quality of medical datasets. You will be given a single row of medical data, and your task is to determine whether the data is valid.

- Carefully analyze the data for any inconsistencies, contradictions, missing values, or implausible information.
- Consider the logical relationships between different fields (e.g., treatments should be appropriate for the diagnoses, medications should not conflict with allergies, lab results should be consistent with diagnoses, etc.).
- Use your general medical knowledge to assess the validity of the data.
- Focus solely on the information provided without making assumptions beyond the given data.

**Return only a JSON object** with the following two properties:

- `"is_valid"`: a boolean (`true` or `false`) indicating whether the data is valid.
- `"issue"`: if `"is_valid"` is `false`, provide a brief explanation of the issue; if `"is_valid"` is `true`, set `"issue"` to `null`.

Both JSON properties must always be present.

Do not include any additional text or explanations outside the JSON object.

MEDICAL DATA:
{input_data}
            """
        }
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )

    response_content = response.choices[0].message.content.replace('```json', '').replace('```', '').strip()

    # response_content is always a string at this point, so parse it directly
    try:
        return json.loads(response_content)
    except json.JSONDecodeError:
        print(f"Failed to decode JSON response: {response_content}")
        raise

# Read the CSV file once, separating the input fields from the ground-truth labels
input_data = []
true_is_valid = []
true_issues = []

with open('../data/medicalData.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    headers = next(reader)  # Skip the header row
    for row in reader:
        input_data.append(row[:-2])              # Exclude the "Is Valid" and "Issue" columns
        true_is_valid.append(row[-2] == 'True')  # Ground-truth validity label
        true_issues.append(row[-1])              # Ground-truth issue description

# Function to validate a single row of data
def validate_row(row):
    input_str = ','.join(row)
    result_json = validate_data(input_str)
    return result_json

# Validate data rows and collect results
pred_is_valid = [False] * len(input_data)
pred_issues = [''] * len(input_data)

with ThreadPoolExecutor() as executor:
    futures = {executor.submit(validate_row, row): i for i, row in enumerate(input_data)}

    for future in as_completed(futures):
        i = futures[future]  # Get the index of the current row
        result_json = future.result()
        pred_is_valid[i] = result_json['is_valid']
        pred_issues[i] = result_json['issue']
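
The `validate_data` function above strips code-fence markers and then expects the remaining text to be pure JSON, so a reply with any extra prose around the object raises an error and stops the whole run. A slightly more tolerant parse might look like the sketch below. The helper name `extract_json_object` is hypothetical, and the approach (taking the span from the first `{` to the last `}`) is one simple heuristic, not the notebook's method.

```python
import json
import re


def extract_json_object(text):
    """Return the JSON object embedded in text, or raise ValueError."""
    # Take everything from the first "{" to the last "}" so that stray prose
    # or code fences around the object are ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in: {text!r}")
    parsed = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    # Check for the two keys the validation prompt asks for.
    missing = {"is_valid", "issue"} - parsed.keys()
    if missing:
        raise ValueError(f"Missing keys {sorted(missing)} in: {parsed}")
    return parsed
```

Raising a single exception type also makes it easier to catch failures per row and record them, instead of letting one malformed reply abort every pending future.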

Now that we have the model's results, we can compare them against the ground-truth labels and measure the system's accuracy.

# Convert predicted and true 'is_valid' labels to boolean if they aren't already
pred_is_valid_bool = [bool(val) if isinstance(val, bool) else val == 'True' for val in pred_is_valid]
true_is_valid_bool = [bool(val) if isinstance(val, bool) else val == 'True' for val in true_is_valid]

# Calculate precision, recall, and F1 score for the 'is_valid' prediction.
# pos_label=True makes "valid" the positive class for all three metrics.
precision = precision_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)
recall = recall_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)
f1 = f1_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)

# Initialize issue_matches_full with False
issue_matches_full = [False] * len(true_is_valid)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1: {f1:.2f}")
Precision: 0.82
Recall: 0.87
F1: 0.84
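
For readers who want to see what these three numbers measure, the sketch below recomputes them from raw counts on a small, made-up set of labels. The function `precision_recall_f1` and the `toy_true` and `toy_pred` lists are hypothetical illustrations; they are not the notebook's data and do not reproduce the values above.

```python
def precision_recall_f1(y_true, y_pred, pos_label=True):
    """Compute precision, recall, and F1 for one positive class from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # true positives
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # false positives
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels for six rows (True = valid row)
toy_true = [True, True, True, False, False, True]
toy_pred = [True, True, False, True, False, True]

print(precision_recall_f1(toy_true, toy_pred, pos_label=True))   # "valid" as the positive class
print(precision_recall_f1(toy_true, toy_pred, pos_label=False))  # "invalid" as the positive class
```

The two calls give different scores on the same predictions, which is why it matters that the metrics above treat "valid" as the positive class. To measure how well the model catches invalid rows, the same computation would be run with `pos_label=False`.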

Issue Identification

We will now assess how accurately the model describes the issue in each row that it correctly flagged as invalid.

def validate_issue(model_generated_answer, correct_answer):
    messages = [
        {
            "role": "user",
            "content": f"""
You are a medical expert assistant designed to validate the quality of an LLM-generated answer.

The model was asked to review a medical dataset row to determine if the data is valid. If the data is not valid, it should provide a justification explaining why.

Your task:

    •   Compare the model-generated justification with the correct reason provided.
    •   Determine if they address the same underlying medical issue or concern, even if phrased differently.
    •   Focus on the intent, medical concepts, and implications rather than exact wording.

Instructions:

    •   If the justifications have the same intent or address the same medical issue, return True.
    •   If they address different issues or concerns, return False.
    •   Only respond with a single word: True or False.

Examples:

    1.  Example 1:
    •   Model Generated Response: “The patient is allergic to penicillin”
    •   Correct Response: “The patient was prescribed penicillin despite being allergic”
    •   Answer: True

    2.  Example 2:
    •   Model Generated Response: “The date of birth of the patient is incorrect”
    •   Correct Response: “The patient was prescribed penicillin despite being allergic”
    •   Answer: False


Model Generated Response: {model_generated_answer}
Correct Response:  {correct_answer}
            """
        }
    ]

    response = client.chat.completions.create(
        model="o1-preview",
        messages=messages
    )

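    # Strip surrounding whitespace so the caller's exact comparison with 'True' is not thrown off
    result = response.choices[0].message.content.strip()

    return result
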
# Validate issues for rows where both true and predicted 'is_valid' are False
validation_results = []

with ThreadPoolExecutor() as executor:
    futures = {
        executor.submit(validate_issue, pred_issues[i], true_issues[i]): i
        for i in range(len(pred_is_valid_bool))
        if not pred_is_valid_bool[i] and not true_is_valid_bool[i]
    }

    for future in as_completed(futures):
        i = futures[future]  # Get the original index
        issue_match = future.result()
        issue_matches_full[i] = (issue_match == 'True')
        validation_results.append({
            "index": i,
            "predicted_issue": pred_issues[i],
            "true_issue": true_issues[i],
            "issue_match": issue_matches_full[i]
        })

# Calculate issue accuracy over the rows that were graded (0.0 if none were graded)
issue_accuracy = (
    sum(r['issue_match'] for r in validation_results) / len(validation_results)
    if validation_results else 0.0
)

# Store the results in a dictionary
model_results = {
    "precision": precision,
    "recall": recall,
    "f1": f1,
    "issue_accuracy": issue_accuracy
}

# Create a DataFrame to store the results
df_results = pd.DataFrame([model_results])

# Create a DataFrame to store the validation results for each row
df_validation_results = pd.DataFrame(validation_results)
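
The comparison `issue_match == 'True'` is an exact string match, so a grader reply such as `true` or `True.` would be silently counted as a mismatch. A small normalizing step, sketched below with a hypothetical helper named `parse_verdict`, makes that case explicit and lets unrecognized replies be told apart from a genuine `False`.

```python
def parse_verdict(raw):
    """Map a grader reply such as ' true.' or 'False' to a bool; return None if unrecognized."""
    # Trim whitespace, then trailing punctuation or quotes, then ignore case.
    cleaned = raw.strip().strip('.!"\'').lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None


print(parse_verdict(" True.\n"))
print(parse_verdict("I think they match"))
```

Rows where the helper returns `None` could be re-graded or excluded from the accuracy denominator rather than scored as wrong.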

Below, we display the rows that the model correctly flagged as containing an issue. For each row, we show the predicted issue, the true issue, and whether the two match.

def display_formatted_dataframe(df):
    def format_text(text):
        return text.replace('\n', '<br>')

    df_formatted = df.copy()
    df_formatted['predicted_issue'] = df_formatted['predicted_issue'].apply(format_text)
    df_formatted['true_issue'] = df_formatted['true_issue'].apply(format_text)

    display(HTML(df_formatted.to_html(escape=False, justify='left')))

display_formatted_dataframe(pd.DataFrame(validation_results))
   | index | predicted_issue | true_issue | issue_match
0  | 39 | Amoxicillin is prescribed to a patient with Penicillin allergy. | Prescribed Amoxicillin despite Penicillin allergy | True
1  | 50 | Patient diagnosed with Type 1 Diabetes is not on any medications and the treatment field lists the diagnosis instead of appropriate treatment. | Diabetes Type 1 patient not receiving insulin | True
2  | 51 | Lab result of 300 indicates hyperglycemia but no diagnosis or treatment is recorded. | Extremely high glucose level not diagnosed or treated | True
3  | 26 | The patient is being prescribed penicillin despite having an allergy to penicillin. | Prescribed Penicillin despite Penicillin allergy | True
4  | 31 | The patient's age (88) is inconsistent with the date of birth (1996-11-05). | Osteoporosis patient not receiving treatment | False
5  | 24 | The 'Treatment Plan' field should not be 'Depression'; it should specify the treatment prescribed for depression. | Depression patient not receiving treatment | True
6  | 3 | Patient is allergic to Penicillin but is prescribed Amoxicillin. | Prescribed Amoxicillin despite Penicillin allergy | True
7  | 28 | The treatment field contains 'Asthma', which is a diagnosis, not a treatment. | Asthma patient not prescribed any medication | False
8  | 7 | Patient with asthma and low lab result (100) is treated only with lifestyle modifications without medications, which is inappropriate. | Asthma patient not prescribed any medication | True
9  | 16 | The patient's age (86) does not match the date of birth (1955-10-10). | COPD patient not receiving treatment | False
10 | 53 | The age provided (92) is inconsistent with the date of birth (1983-08-19). | Depression patient not receiving treatment | False
11 | 23 | Treatment field incorrectly lists 'Hyperlipidemia' instead of an appropriate treatment for the diagnosis. | Hyperlipidemia patient not prescribed any medication | True
12 | 13 | Patient is allergic to sulfa drugs but is prescribed Sulfamethoxazole, which is a sulfa drug. | Prescribed Sulfa drug despite Sulfa allergy | True
13 | 98 | The patient is prescribed Penicillin despite having a Penicillin allergy. | Prescribed Penicillin despite Penicillin allergy | True
14 | 9 | Patient has a medication allergy to Penicillin but is prescribed Penicillin. | Prescribed Penicillin despite Penicillin allergy | True
15 | 85 | Treatment field contains 'Hyperlipidemia', which is a diagnosis, not a treatment. | Hyperlipidemia patient not prescribed any medication | False
16 | 18 | Prescribed treatment (Aspirin) is not appropriate for the diagnosis of infection. | Prescribed Aspirin despite Aspirin allergy; high glucose level not addressed | False
17 | 70 | Treatment field contains a diagnosis 'Osteoporosis' instead of a treatment. | Osteoporosis patient not receiving treatment | True
18 | 57 | Patient is allergic to Penicillin but is being prescribed Amoxicillin, which is contraindicated. | Prescribed Amoxicillin despite Penicillin allergy | True
19 | 80 | Treatment field incorrectly lists 'Diabetes Type 2' instead of a valid treatment plan. | Diabetes Type 2 patient not receiving medication | True
20 | 87 | Treatment plan includes prescribing Amoxicillin, which the patient is allergic to. | Prescribed Amoxicillin despite Penicillin allergy | True
21 | 37 | Treatment field contains 'Hyperlipidemia', which is a diagnosis, not a treatment. | Hyperlipidemia patient not prescribed any medication | False
22 | 95 | Treatment is listed as 'Asthma', which is not an appropriate treatment for the diagnosis. | Asthma patient not prescribed any medication | True
23 | 96 | Treatment field lists 'Hyperlipidemia', which is not an appropriate treatment. | Hyperlipidemia patient not prescribed any medication | False
24 | 59 | Treatment field contains 'Anemia', which is not a valid treatment. | Anemia patient not receiving treatment | False
25 | 5 | Age does not match date of birth | Low glucose level not properly addressed | False

# Display the DataFrame
print(df_results)
   precision    recall       f1  issue_accuracy
0   0.818182  0.870968  0.84375        0.615385

Conclusion

The results show that o1-preview classifies rows as valid or invalid with a precision of 0.82 and a recall of 0.87 (F1 of 0.84). For the 26 rows it correctly flagged as invalid, the grader judged its explanation to match the true issue in 16 cases, an issue accuracy of about 0.62.

This approach should help streamline data validation for eval sets across a variety of domains.