本笔记本演示了一种自定义 OpenAI 嵌入以适应特定任务的方法。
输入是形式为 [text_1, text_2, label] 的训练数据,其中如果配对相似,则标签为 +1,如果配对不相似,则标签为 -1。
输出是一个矩阵,您可以将其用于乘以您的嵌入。此乘法的乘积是“自定义嵌入”,它将更好地强调与您的用例相关的文本方面。在二元分类用例中,我们已将错误率降低了多达 50%。
在以下示例中,我使用了从 SNLI 语料库中挑选的 1,000 个句子对。每对句子在逻辑上都是蕴含关系(即,一个暗示另一个)。这些对是我们的正例(标签 = 1)。我们通过组合来自不同对的句子来生成合成负例,这些句子被假定为不具有逻辑蕴含关系(标签 = -1)。
对于聚类用例,您可以创建同一聚类中文本的配对来生成正例,并创建不同聚类中文本的配对来生成负例。
使用其他数据集,我们发现即使只有约 100 个训练示例也能获得不错的改进。当然,更多的示例会带来更好的性能。
0. 导入
# imports
from typing import List, Tuple # for type hints
import numpy as np # for manipulating arrays
import pandas as pd # for manipulating data in dataframes
import pickle # for saving the embeddings cache
import plotly.express as px # for plots
import random # for generating run IDs
from sklearn.model_selection import train_test_split # for splitting train & test data
import torch # for matrix optimization
from utils.embeddings_utils import get_embedding, cosine_similarity # for embeddings
1. 输入
大多数输入都在这里。需要更改的关键是加载数据集的位置、保存嵌入缓存的位置以及您想使用的嵌入引擎。
根据您的数据格式,您可能需要重写 process_input_data
函数。
# input parameters
embedding_cache_path = "data/snli_embedding_cache.pkl" # embeddings will be saved/loaded here
default_embedding_engine = "text-embedding-3-small"
num_pairs_to_embed = 1000 # 1000 is arbitrary
local_dataset_path = "data/snli_1.0_train_2k.csv" # download from: https://nlp.stanford.edu/projects/snli/
def process_input_data(df: pd.DataFrame) -> pd.DataFrame:
# you can customize this to preprocess your own dataset
# output should be a dataframe with 3 columns: text_1, text_2, label (1 for similar, -1 for dissimilar)
df["label"] = df["gold_label"]
df = df[df["label"].isin(["entailment"])]
df["label"] = df["label"].apply(lambda x: {"entailment": 1, "contradiction": -1}[x])
df = df.rename(columns={"sentence1": "text_1", "sentence2": "text_2"})
df = df[["text_1", "text_2", "label"]]
df = df.head(num_pairs_to_embed)
return df
2. 加载和处理输入数据
# load data
df = pd.read_csv(local_dataset_path)
# process input data
df = process_input_data(df) # this demonstrates training data containing only positives
# view data
df.head()
/var/folders/r4/x3kdvs816995fnnph2gdpwp40000gn/T/ipykernel_17509/1977422881.py:13: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["label"] = df["label"].apply(lambda x: {"entailment": 1, "contradiction": -1}[x])
text_1 | text_2 | label | |
---|---|---|---|
2 | A person on a horse jumps over a broken down a... | A person is outdoors, on a horse. | 1 |
4 | Children smiling and waving at camera | There are children present | 1 |
7 | A boy is jumping on skateboard in the middle o... | The boy does a skateboarding trick. | 1 |
14 | Two blond women are hugging one another. | There are women showing affection. | 1 |
17 | A few people in a restaurant setting, one of t... | The diners are at a restaurant. | 1 |
3. 将数据拆分为训练和测试集
请注意,在生成合成负例或正例之前,将数据拆分为训练集和测试集非常重要。您不希望训练数据中的任何文本字符串出现在测试数据中。如果存在数据污染,测试指标将看起来比实际生产中的要好。
# split data into train and test sets
test_fraction = 0.5 # 0.5 is fairly arbitrary
random_seed = 123 # random seed is arbitrary, but is helpful in reproducibility
train_df, test_df = train_test_split(
df, test_size=test_fraction, stratify=df["label"], random_state=random_seed
)
train_df.loc[:, "dataset"] = "train"
test_df.loc[:, "dataset"] = "test"
4. 生成合成负例
这是您需要修改以匹配您的用例的另一部分代码。
如果您有包含正例和负例的数据,则可以跳过此部分。
如果您只有正例数据,则可以基本保持原样,它会生成负例。
如果您有多类数据,则需要生成正例和负例。正例可以是共享标签的文本对,负例可以是未共享标签的文本对。
最终输出应为一个包含文本对的数据框,其中每对都标记为 -1 或 1。
# generate negatives
def dataframe_of_negatives(dataframe_of_positives: pd.DataFrame) -> pd.DataFrame:
"""Return dataframe of negative pairs made by combining elements of positive pairs."""
texts = set(dataframe_of_positives["text_1"].values) | set(
dataframe_of_positives["text_2"].values
)
all_pairs = {(t1, t2) for t1 in texts for t2 in texts if t1 < t2}
positive_pairs = set(
tuple(text_pair)
for text_pair in dataframe_of_positives[["text_1", "text_2"]].values
)
negative_pairs = all_pairs - positive_pairs
df_of_negatives = pd.DataFrame(list(negative_pairs), columns=["text_1", "text_2"])
df_of_negatives["label"] = -1
return df_of_negatives
negatives_per_positive = (
1 # it will work at higher values too, but more data will be slower
)
# generate negatives for training dataset
train_df_negatives = dataframe_of_negatives(train_df)
train_df_negatives["dataset"] = "train"
# generate negatives for test dataset
test_df_negatives = dataframe_of_negatives(test_df)
test_df_negatives["dataset"] = "test"
# sample negatives and combine with positives
train_df = pd.concat(
[
train_df,
train_df_negatives.sample(
n=len(train_df) * negatives_per_positive, random_state=random_seed
),
]
)
test_df = pd.concat(
[
test_df,
test_df_negatives.sample(
n=len(test_df) * negatives_per_positive, random_state=random_seed
),
]
)
df = pd.concat([train_df, test_df])
5. 计算嵌入和余弦相似度
在这里,我创建了一个缓存来保存嵌入。这很方便,因为如果您想再次运行代码,就不必再次付费。
# establish a cache of embeddings to avoid recomputing
# cache is a dict of tuples (text, engine) -> embedding
try:
with open(embedding_cache_path, "rb") as f:
embedding_cache = pickle.load(f)
except FileNotFoundError:
precomputed_embedding_cache_path = "https://cdn.openai.com/API/examples/data/snli_embedding_cache.pkl"
embedding_cache = pd.read_pickle(precomputed_embedding_cache_path)
# this function will get embeddings from the cache and save them there afterward
def get_embedding_with_cache(
text: str,
engine: str = default_embedding_engine,
embedding_cache: dict = embedding_cache,
embedding_cache_path: str = embedding_cache_path,
) -> list:
if (text, engine) not in embedding_cache.keys():
# if not in cache, call API to get embedding
embedding_cache[(text, engine)] = get_embedding(text, engine)
# save embeddings cache to disk after each update
with open(embedding_cache_path, "wb") as embedding_cache_file:
pickle.dump(embedding_cache, embedding_cache_file)
return embedding_cache[(text, engine)]
# create column of embeddings
for column in ["text_1", "text_2"]:
df[f"{column}_embedding"] = df[column].apply(get_embedding_with_cache)
# create column of cosine similarity between embeddings
df["cosine_similarity"] = df.apply(
lambda row: cosine_similarity(row["text_1_embedding"], row["text_2_embedding"]),
axis=1,
)
6. 绘制余弦相似度分布图
在这里,我们使用余弦相似度来衡量文本的相似度。根据我们的经验,大多数距离函数(L1、L2、余弦相似度)的效果都差不多。请注意,我们的嵌入长度已标准化为 1,因此余弦相似度等同于点积。
这些图显示了相似和不相似配对的余弦相似度分布之间的重叠程度。如果重叠量很大,则意味着存在一些不相似的配对,其余弦相似度大于一些相似的配对。
我计算的准确率是一个简单规则的准确率,该规则预测“相似(1)”,如果余弦相似度高于某个阈值 X,否则预测“不相似(0)”。
# calculate accuracy (and its standard error) of predicting label=1 if similarity>x
# x is optimized by sweeping from -1 to 1 in steps of 0.01
def accuracy_and_se(cosine_similarity: float, labeled_similarity: int) -> Tuple[float]:
accuracies = []
for threshold_thousandths in range(-1000, 1000, 1):
threshold = threshold_thousandths / 1000
total = 0
correct = 0
for cs, ls in zip(cosine_similarity, labeled_similarity):
total += 1
if cs > threshold:
prediction = 1
else:
prediction = -1
if prediction == ls:
correct += 1
accuracy = correct / total
accuracies.append(accuracy)
a = max(accuracies)
n = len(cosine_similarity)
standard_error = (a * (1 - a) / n) ** 0.5 # standard error of binomial
return a, standard_error
# check that training and test sets are balanced
px.histogram(
df,
x="cosine_similarity",
color="label",
barmode="overlay",
width=500,
facet_row="dataset",
).show()
for dataset in ["train", "test"]:
data = df[df["dataset"] == dataset]
a, se = accuracy_and_se(data["cosine_similarity"], data["label"])
print(f"{dataset} accuracy: {a:0.1%} ± {1.96 * se:0.1%}")
train accuracy: 89.1% ± 2.4%
test accuracy: 88.8% ± 2.4%
7. 使用提供的训练数据优化矩阵
def embedding_multiplied_by_matrix(
embedding: List[float], matrix: torch.tensor
) -> np.array:
embedding_tensor = torch.tensor(embedding).float()
modified_embedding = embedding_tensor @ matrix
modified_embedding = modified_embedding.detach().numpy()
return modified_embedding
# compute custom embeddings and new cosine similarities
def apply_matrix_to_embeddings_dataframe(matrix: torch.tensor, df: pd.DataFrame):
for column in ["text_1_embedding", "text_2_embedding"]:
df[f"{column}_custom"] = df[column].apply(
lambda x: embedding_multiplied_by_matrix(x, matrix)
)
df["cosine_similarity_custom"] = df.apply(
lambda row: cosine_similarity(
row["text_1_embedding_custom"], row["text_2_embedding_custom"]
),
axis=1,
)
def optimize_matrix(
modified_embedding_length: int = 2048, # in my brief experimentation, bigger was better (2048 is length of babbage encoding)
batch_size: int = 100,
max_epochs: int = 100,
learning_rate: float = 100.0, # seemed to work best when similar to batch size - feel free to try a range of values
dropout_fraction: float = 0.0, # in my testing, dropout helped by a couple percentage points (definitely not necessary)
df: pd.DataFrame = df,
print_progress: bool = True,
save_results: bool = True,
) -> torch.tensor:
"""Return matrix optimized to minimize loss on training data."""
run_id = random.randint(0, 2 ** 31 - 1) # (range is arbitrary)
# convert from dataframe to torch tensors
# e is for embedding, s for similarity label
def tensors_from_dataframe(
df: pd.DataFrame,
embedding_column_1: str,
embedding_column_2: str,
similarity_label_column: str,
) -> Tuple[torch.tensor]:
e1 = np.stack(np.array(df[embedding_column_1].values))
e2 = np.stack(np.array(df[embedding_column_2].values))
s = np.stack(np.array(df[similarity_label_column].astype("float").values))
e1 = torch.from_numpy(e1).float()
e2 = torch.from_numpy(e2).float()
s = torch.from_numpy(s).float()
return e1, e2, s
e1_train, e2_train, s_train = tensors_from_dataframe(
df[df["dataset"] == "train"], "text_1_embedding", "text_2_embedding", "label"
)
e1_test, e2_test, s_test = tensors_from_dataframe(
df[df["dataset"] == "test"], "text_1_embedding", "text_2_embedding", "label"
)
# create dataset and loader
dataset = torch.utils.data.TensorDataset(e1_train, e2_train, s_train)
train_loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True
)
# define model (similarity of projected embeddings)
def model(embedding_1, embedding_2, matrix, dropout_fraction=dropout_fraction):
e1 = torch.nn.functional.dropout(embedding_1, p=dropout_fraction)
e2 = torch.nn.functional.dropout(embedding_2, p=dropout_fraction)
modified_embedding_1 = e1 @ matrix # @ is matrix multiplication
modified_embedding_2 = e2 @ matrix
similarity = torch.nn.functional.cosine_similarity(
modified_embedding_1, modified_embedding_2
)
return similarity
# define loss function to minimize
def mse_loss(predictions, targets):
difference = predictions - targets
return torch.sum(difference * difference) / difference.numel()
# initialize projection matrix
embedding_length = len(df["text_1_embedding"].values[0])
matrix = torch.randn(
embedding_length, modified_embedding_length, requires_grad=True
)
epochs, types, losses, accuracies, matrices = [], [], [], [], []
for epoch in range(1, 1 + max_epochs):
# iterate through training dataloader
for a, b, actual_similarity in train_loader:
# generate prediction
predicted_similarity = model(a, b, matrix)
# get loss and perform backpropagation
loss = mse_loss(predicted_similarity, actual_similarity)
loss.backward()
# update the weights
with torch.no_grad():
matrix -= matrix.grad * learning_rate
# set gradients to zero
matrix.grad.zero_()
# calculate test loss
test_predictions = model(e1_test, e2_test, matrix)
test_loss = mse_loss(test_predictions, s_test)
# compute custom embeddings and new cosine similarities
apply_matrix_to_embeddings_dataframe(matrix, df)
# calculate test accuracy
for dataset in ["train", "test"]:
data = df[df["dataset"] == dataset]
a, se = accuracy_and_se(data["cosine_similarity_custom"], data["label"])
# record results of each epoch
epochs.append(epoch)
types.append(dataset)
losses.append(loss.item() if dataset == "train" else test_loss.item())
accuracies.append(a)
matrices.append(matrix.detach().numpy())
# optionally print accuracies
if print_progress is True:
print(
f"Epoch {epoch}/{max_epochs}: {dataset} accuracy: {a:0.1%} ± {1.96 * se:0.1%}"
)
data = pd.DataFrame(
{"epoch": epochs, "type": types, "loss": losses, "accuracy": accuracies}
)
data["run_id"] = run_id
data["modified_embedding_length"] = modified_embedding_length
data["batch_size"] = batch_size
data["max_epochs"] = max_epochs
data["learning_rate"] = learning_rate
data["dropout_fraction"] = dropout_fraction
data[
"matrix"
] = matrices # saving every single matrix can get big; feel free to delete/change
if save_results is True:
data.to_csv(f"{run_id}_optimization_results.csv", index=False)
return data
# example hyperparameter search
# I recommend starting with max_epochs=10 while initially exploring
results = []
max_epochs = 30
dropout_fraction = 0.2
for batch_size, learning_rate in [(10, 10), (100, 100), (1000, 1000)]:
result = optimize_matrix(
batch_size=batch_size,
learning_rate=learning_rate,
max_epochs=max_epochs,
dropout_fraction=dropout_fraction,
save_results=False,
)
results.append(result)
Epoch 1/30: train accuracy: 89.1% ± 2.4%
Epoch 1/30: test accuracy: 88.4% ± 2.4%
Epoch 2/30: train accuracy: 89.5% ± 2.3%
Epoch 2/30: test accuracy: 88.8% ± 2.4%
Epoch 3/30: train accuracy: 90.6% ± 2.2%
Epoch 3/30: test accuracy: 89.3% ± 2.3%
Epoch 4/30: train accuracy: 91.2% ± 2.2%
Epoch 4/30: test accuracy: 89.7% ± 2.3%
Epoch 5/30: train accuracy: 91.5% ± 2.1%
Epoch 5/30: test accuracy: 90.0% ± 2.3%
Epoch 6/30: train accuracy: 91.9% ± 2.1%
Epoch 6/30: test accuracy: 90.4% ± 2.2%
Epoch 7/30: train accuracy: 92.2% ± 2.0%
Epoch 7/30: test accuracy: 90.7% ± 2.2%
Epoch 8/30: train accuracy: 92.7% ± 2.0%
Epoch 8/30: test accuracy: 90.9% ± 2.2%
Epoch 9/30: train accuracy: 92.7% ± 2.0%
Epoch 9/30: test accuracy: 91.0% ± 2.2%
Epoch 10/30: train accuracy: 93.0% ± 1.9%
Epoch 10/30: test accuracy: 91.6% ± 2.1%
Epoch 11/30: train accuracy: 93.1% ± 1.9%
Epoch 11/30: test accuracy: 91.8% ± 2.1%
Epoch 12/30: train accuracy: 93.4% ± 1.9%
Epoch 12/30: test accuracy: 92.1% ± 2.0%
Epoch 13/30: train accuracy: 93.6% ± 1.9%
Epoch 13/30: test accuracy: 92.4% ± 2.0%
Epoch 14/30: train accuracy: 93.7% ± 1.8%
Epoch 14/30: test accuracy: 92.7% ± 2.0%
Epoch 15/30: train accuracy: 93.7% ± 1.8%
Epoch 15/30: test accuracy: 92.7% ± 2.0%
Epoch 16/30: train accuracy: 94.0% ± 1.8%
Epoch 16/30: test accuracy: 93.0% ± 1.9%
Epoch 17/30: train accuracy: 94.0% ± 1.8%
Epoch 17/30: test accuracy: 93.0% ± 1.9%
Epoch 18/30: train accuracy: 94.2% ± 1.8%
Epoch 18/30: test accuracy: 93.1% ± 1.9%
Epoch 19/30: train accuracy: 94.2% ± 1.8%
Epoch 19/30: test accuracy: 93.1% ± 1.9%
Epoch 20/30: train accuracy: 94.3% ± 1.8%
Epoch 20/30: test accuracy: 93.0% ± 1.9%
Epoch 21/30: train accuracy: 94.5% ± 1.7%
Epoch 21/30: test accuracy: 93.1% ± 1.9%
Epoch 22/30: train accuracy: 94.5% ± 1.7%
Epoch 22/30: test accuracy: 93.3% ± 1.9%
Epoch 23/30: train accuracy: 94.6% ± 1.7%
Epoch 23/30: test accuracy: 93.3% ± 1.9%
Epoch 24/30: train accuracy: 94.6% ± 1.7%
Epoch 24/30: test accuracy: 93.3% ± 1.9%
Epoch 25/30: train accuracy: 94.8% ± 1.7%
Epoch 25/30: test accuracy: 93.3% ± 1.9%
Epoch 26/30: train accuracy: 94.8% ± 1.7%
Epoch 26/30: test accuracy: 93.4% ± 1.9%
Epoch 27/30: train accuracy: 94.8% ± 1.7%
Epoch 27/30: test accuracy: 93.4% ± 1.9%
Epoch 28/30: train accuracy: 94.9% ± 1.7%
Epoch 28/30: test accuracy: 93.4% ± 1.9%
Epoch 29/30: train accuracy: 94.9% ± 1.7%
Epoch 29/30: test accuracy: 93.4% ± 1.9%
Epoch 30/30: train accuracy: 94.9% ± 1.7%
Epoch 30/30: test accuracy: 93.3% ± 1.9%
Epoch 1/30: train accuracy: 89.7% ± 2.3%
Epoch 1/30: test accuracy: 89.1% ± 2.4%
Epoch 2/30: train accuracy: 89.8% ± 2.3%
Epoch 2/30: test accuracy: 89.9% ± 2.3%
Epoch 3/30: train accuracy: 90.3% ± 2.2%
Epoch 3/30: test accuracy: 90.0% ± 2.3%
Epoch 4/30: train accuracy: 91.0% ± 2.2%
Epoch 4/30: test accuracy: 90.3% ± 2.2%
Epoch 5/30: train accuracy: 91.3% ± 2.1%
Epoch 5/30: test accuracy: 90.3% ± 2.2%
Epoch 6/30: train accuracy: 91.8% ± 2.1%
Epoch 6/30: test accuracy: 90.4% ± 2.2%
Epoch 7/30: train accuracy: 92.4% ± 2.0%
Epoch 7/30: test accuracy: 91.0% ± 2.2%
Epoch 8/30: train accuracy: 92.8% ± 2.0%
Epoch 8/30: test accuracy: 91.3% ± 2.1%
Epoch 9/30: train accuracy: 93.1% ± 1.9%
Epoch 9/30: test accuracy: 91.6% ± 2.1%
Epoch 10/30: train accuracy: 93.4% ± 1.9%
Epoch 10/30: test accuracy: 91.9% ± 2.1%
Epoch 11/30: train accuracy: 93.4% ± 1.9%
Epoch 11/30: test accuracy: 91.8% ± 2.1%
Epoch 12/30: train accuracy: 93.6% ± 1.9%
Epoch 12/30: test accuracy: 92.1% ± 2.0%
Epoch 13/30: train accuracy: 93.7% ± 1.8%
Epoch 13/30: test accuracy: 92.4% ± 2.0%
Epoch 14/30: train accuracy: 93.7% ± 1.8%
Epoch 14/30: test accuracy: 92.5% ± 2.0%
Epoch 15/30: train accuracy: 93.9% ± 1.8%
Epoch 15/30: test accuracy: 92.8% ± 2.0%
Epoch 16/30: train accuracy: 94.0% ± 1.8%
Epoch 16/30: test accuracy: 92.8% ± 2.0%
Epoch 17/30: train accuracy: 94.0% ± 1.8%
Epoch 17/30: test accuracy: 92.8% ± 2.0%
Epoch 18/30: train accuracy: 94.2% ± 1.8%
Epoch 18/30: test accuracy: 92.8% ± 2.0%
Epoch 19/30: train accuracy: 94.2% ± 1.8%
Epoch 19/30: test accuracy: 92.8% ± 2.0%
Epoch 20/30: train accuracy: 94.2% ± 1.8%
Epoch 20/30: test accuracy: 93.1% ± 1.9%
Epoch 21/30: train accuracy: 94.3% ± 1.8%
Epoch 21/30: test accuracy: 93.3% ± 1.9%
Epoch 22/30: train accuracy: 94.3% ± 1.8%
Epoch 22/30: test accuracy: 93.3% ± 1.9%
Epoch 23/30: train accuracy: 94.5% ± 1.7%
Epoch 23/30: test accuracy: 93.3% ± 1.9%
Epoch 24/30: train accuracy: 94.5% ± 1.7%
Epoch 24/30: test accuracy: 93.3% ± 1.9%
Epoch 25/30: train accuracy: 94.6% ± 1.7%
Epoch 25/30: test accuracy: 93.4% ± 1.9%
Epoch 26/30: train accuracy: 94.6% ± 1.7%
Epoch 26/30: test accuracy: 93.3% ± 1.9%
Epoch 27/30: train accuracy: 94.6% ± 1.7%
Epoch 27/30: test accuracy: 93.4% ± 1.9%
Epoch 28/30: train accuracy: 94.8% ± 1.7%
Epoch 28/30: test accuracy: 93.4% ± 1.9%
Epoch 29/30: train accuracy: 94.8% ± 1.7%
Epoch 29/30: test accuracy: 93.3% ± 1.9%
Epoch 30/30: train accuracy: 94.8% ± 1.7%
Epoch 30/30: test accuracy: 93.4% ± 1.9%
Epoch 1/30: train accuracy: 90.7% ± 2.2%
Epoch 1/30: test accuracy: 89.9% ± 2.3%
Epoch 2/30: train accuracy: 90.9% ± 2.2%
Epoch 2/30: test accuracy: 90.3% ± 2.2%
Epoch 3/30: train accuracy: 91.6% ± 2.1%
Epoch 3/30: test accuracy: 90.3% ± 2.2%
Epoch 4/30: train accuracy: 92.2% ± 2.0%
Epoch 4/30: test accuracy: 90.7% ± 2.2%
Epoch 5/30: train accuracy: 92.4% ± 2.0%
Epoch 5/30: test accuracy: 91.3% ± 2.1%
Epoch 6/30: train accuracy: 92.5% ± 2.0%
Epoch 6/30: test accuracy: 91.8% ± 2.1%
Epoch 7/30: train accuracy: 93.0% ± 1.9%
Epoch 7/30: test accuracy: 92.2% ± 2.0%
Epoch 8/30: train accuracy: 93.1% ± 1.9%
Epoch 8/30: test accuracy: 92.7% ± 2.0%
Epoch 9/30: train accuracy: 93.3% ± 1.9%
Epoch 9/30: test accuracy: 92.5% ± 2.0%
Epoch 10/30: train accuracy: 93.4% ± 1.9%
Epoch 10/30: test accuracy: 92.7% ± 2.0%
Epoch 11/30: train accuracy: 93.6% ± 1.9%
Epoch 11/30: test accuracy: 92.8% ± 2.0%
Epoch 12/30: train accuracy: 93.7% ± 1.8%
Epoch 12/30: test accuracy: 92.8% ± 2.0%
Epoch 13/30: train accuracy: 94.0% ± 1.8%
Epoch 13/30: test accuracy: 93.0% ± 1.9%
Epoch 14/30: train accuracy: 93.9% ± 1.8%
Epoch 14/30: test accuracy: 93.0% ± 1.9%
Epoch 15/30: train accuracy: 94.2% ± 1.8%
Epoch 15/30: test accuracy: 93.0% ± 1.9%
Epoch 16/30: train accuracy: 94.2% ± 1.8%
Epoch 16/30: test accuracy: 93.0% ± 1.9%
Epoch 17/30: train accuracy: 94.3% ± 1.8%
Epoch 17/30: test accuracy: 93.0% ± 1.9%
Epoch 18/30: train accuracy: 94.5% ± 1.7%
Epoch 18/30: test accuracy: 93.1% ± 1.9%
Epoch 19/30: train accuracy: 94.5% ± 1.7%
Epoch 19/30: test accuracy: 93.1% ± 1.9%
Epoch 20/30: train accuracy: 94.6% ± 1.7%
Epoch 20/30: test accuracy: 93.3% ± 1.9%
Epoch 21/30: train accuracy: 94.8% ± 1.7%
Epoch 21/30: test accuracy: 93.3% ± 1.9%
Epoch 22/30: train accuracy: 94.8% ± 1.7%
Epoch 22/30: test accuracy: 93.4% ± 1.9%
Epoch 23/30: train accuracy: 94.8% ± 1.7%
Epoch 23/30: test accuracy: 93.4% ± 1.9%
Epoch 24/30: train accuracy: 94.8% ± 1.7%
Epoch 24/30: test accuracy: 93.4% ± 1.9%
Epoch 25/30: train accuracy: 94.8% ± 1.7%
Epoch 25/30: test accuracy: 93.4% ± 1.9%
Epoch 26/30: train accuracy: 94.9% ± 1.7%
Epoch 26/30: test accuracy: 93.6% ± 1.9%
Epoch 27/30: train accuracy: 94.9% ± 1.7%
Epoch 27/30: test accuracy: 93.6% ± 1.9%
Epoch 28/30: train accuracy: 94.9% ± 1.7%
Epoch 28/30: test accuracy: 93.6% ± 1.9%
Epoch 29/30: train accuracy: 95.1% ± 1.6%
Epoch 29/30: test accuracy: 93.6% ± 1.9%
Epoch 30/30: train accuracy: 95.1% ± 1.6%
Epoch 30/30: test accuracy: 93.6% ± 1.9%
runs_df = pd.concat(results)
# plot training loss and test loss over time
px.line(
runs_df,
line_group="run_id",
x="epoch",
y="loss",
color="type",
hover_data=["batch_size", "learning_rate", "dropout_fraction"],
facet_row="learning_rate",
facet_col="batch_size",
width=500,
).show()
# plot accuracy over time
px.line(
runs_df,
line_group="run_id",
x="epoch",
y="accuracy",
color="type",
hover_data=["batch_size", "learning_rate", "dropout_fraction"],
facet_row="learning_rate",
facet_col="batch_size",
width=500,
).show()
8. 绘制前后对比图,显示训练过程中找到的最佳矩阵的结果
矩阵越好,相似和不相似配对的分离就越清晰。
# apply result of best run to original data
best_run = runs_df.sort_values(by="accuracy", ascending=False).iloc[0]
best_matrix = best_run["matrix"]
apply_matrix_to_embeddings_dataframe(best_matrix, df)
# plot similarity distribution BEFORE customization
px.histogram(
df,
x="cosine_similarity",
color="label",
barmode="overlay",
width=500,
facet_row="dataset",
).show()
test_df = df[df["dataset"] == "test"]
a, se = accuracy_and_se(test_df["cosine_similarity"], test_df["label"])
print(f"Test accuracy: {a:0.1%} ± {1.96 * se:0.1%}")
# plot similarity distribution AFTER customization
px.histogram(
df,
x="cosine_similarity_custom",
color="label",
barmode="overlay",
width=500,
facet_row="dataset",
).show()
a, se = accuracy_and_se(test_df["cosine_similarity_custom"], test_df["label"])
print(f"Test accuracy after customization: {a:0.1%} ± {1.96 * se:0.1%}")
Test accuracy: 88.8% ± 2.4%
Test accuracy after customization: 93.6% ± 1.9%
best_matrix # this is what you can multiply your embeddings by
array([[-1.2566795e+00, -1.5297449e+00, -1.3271648e-01, ...,
-1.2859761e+00, -5.3254390e-01, 4.8364732e-01],
[-1.4826347e+00, 9.2656955e-02, -4.2437232e-01, ...,
1.1872858e+00, -1.0831847e+00, -1.0683593e+00],
[-2.2029283e+00, -1.9703420e+00, 3.1125939e-01, ...,
2.2947595e+00, 5.5780332e-03, -6.0171342e-01],
...,
[-1.1019799e-01, 1.3599515e+00, -4.7677776e-01, ...,
6.5626711e-01, 7.2359240e-01, 3.0733588e+00],
[ 1.6624762e-03, 4.2648423e-01, -1.1380885e+00, ...,
8.7202555e-01, 9.3173909e-01, -1.6760436e+00],
[ 7.7449006e-01, 4.9213606e-01, 3.5407653e-01, ...,
1.3460466e+00, -1.9509128e-01, 7.7514690e-01]], dtype=float32)
```