Clustering for Transaction Classification

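This notebook covers use cases where your data is unlabelled but has features that can be used to cluster it into meaningful categories. The challenge with clustering is making the features that distinguish each cluster human-readable, and that is where we'll use GPT-3 to generate meaningful cluster descriptions for us. We can then use these descriptions to apply labels to a previously unlabelled dataset.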

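To feed the model we use embeddings created with the approach shown in the Multiclass classification for transactions notebook, applied to all 359 transactions in the dataset to give us a bigger pool for learning.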
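Each transaction is embedded from a single `combined` string rather than from separate columns. The sketch below shows how such a string might be assembled. It assumes the `Supplier: ...; Description: ...; Value: ...` layout implied by the prefixes that are stripped later in this notebook; the helper name `combine_fields` is hypothetical, and the exact formatting used when the embeddings were created may differ.

```python
def combine_fields(supplier, description, value):
    """Join one transaction's fields into a single string for embedding.

    Assumed layout; the notebook that produced the embeddings may format it differently.
    """
    return (
        f"Supplier: {supplier.strip()}; "
        f"Description: {description.strip()}; "
        f"Value: {value}"
    )


# Made-up call using values that appear in the dataset preview
text = combine_fields("Private Sale", "Literary & Archival Items ", 30000.0)
print(text)
```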

Setup

# optional env import
from dotenv import load_dotenv
load_dotenv()
True

Clustering

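We'll reuse the approach from the Clustering notebook, using K-Means to cluster our dataset on the feature embeddings we created previously. We'll then use the Chat Completions endpoint to generate cluster descriptions for us and judge how effective they are.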

import pandas as pd

# embedding_path is assumed to hold the path to the CSV of transactions with precomputed embeddings
df = pd.read_csv(embedding_path)
df.head()

|   | Date | Supplier | Description | Transaction value (£) | combined | n_tokens | embedding |
|---|------|----------|-------------|-----------------------|----------|----------|-----------|
| 0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 | Supplier: M & J Ballantyne Ltd; Description: G... | 118 | [-0.013169967569410801, -0.004833734128624201,... |
| 1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 | Supplier: Private Sale; Description: Literary ... | 114 | [-0.019571533426642418, -0.010801066644489765,... |
| 2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 | Supplier: City Of Edinburgh Council; Descripti... | 114 | [-0.0054041435942053795, -6.548957026097924e-0... |
| 3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 | Supplier: Computacenter Uk; Description: Kelvi... | 113 | [-0.004776035435497761, -0.005533686839044094,... |
| 4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 | Supplier: John Graham Construction Ltd; Descri... | 117 | [0.003290407592430711, -0.0073441751301288605,... |
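
from ast import literal_eval

import numpy as np

embedding_df = pd.read_csv(embedding_path)
# Embeddings are stored as strings in the CSV; parse each one back into a NumPy array
embedding_df["embedding"] = embedding_df.embedding.apply(literal_eval).apply(np.array)
# Stack into a matrix with one row per transaction and one column per embedding dimension
matrix = np.vstack(embedding_df.embedding.values)
matrix.shape

(359, 1536)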
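The embeddings come back from the CSV as strings, which is why `literal_eval` is applied before stacking. A small standard-library sketch of that parsing step, using made-up three-dimensional values in place of the real 1536-dimensional vectors:

```python
from ast import literal_eval

# Made-up stand-ins for the "embedding" column as read from CSV: each cell is a string
raw_cells = ["[0.1, -0.2, 0.3]", "[0.0, 0.5, -0.5]"]

# literal_eval parses each string into a Python list of floats without executing code
vectors = [literal_eval(cell) for cell in raw_cells]

# One row per transaction, one column per embedding dimension
shape = (len(vectors), len(vectors[0]))
print(shape)
```

Using `literal_eval` rather than `eval` means only Python literals are accepted, so a malformed or malicious cell raises an error instead of running.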
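
from sklearn.cluster import KMeans

n_clusters = 5

# Fit K-Means on the embedding matrix and store each transaction's cluster id
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
kmeans.fit(matrix)
labels = kmeans.labels_
embedding_df["Cluster"] = labels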
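For intuition about what the K-Means fit is doing, here is a toy sketch of Lloyd's algorithm, the assign-then-update loop underlying K-Means. It is not scikit-learn's implementation: it uses plain Python, fixed starting centroids instead of `k-means++`, and made-up 2-D points. The function name `lloyd_kmeans` is hypothetical.

```python
from math import dist


def lloyd_kmeans(points, centroids, n_iter=10):
    """Toy K-Means: alternate nearest-centroid assignment and centroid update."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point gets the index of its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids (and therefore assignments) stopped changing
        centroids = new_centroids
    return labels, centroids


# Made-up points forming two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = lloyd_kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids)
```

The result depends on the starting centroids, which is why the scikit-learn call sets `random_state` and runs several initialisations through `n_init`.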

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the 1536-dimensional embeddings down to 2 dimensions for plotting
tsne = TSNE(
    n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200
)
vis_dims2 = tsne.fit_transform(matrix)

x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]

for category, color in enumerate(["purple", "green", "red", "blue", "yellow"]):
    xs = np.array(x)[embedding_df.Cluster == category]
    ys = np.array(y)[embedding_df.Cluster == category]
    plt.scatter(xs, ys, color=color, alpha=0.3)

    # Mark the mean position of each cluster with an "x"
    avg_x = xs.mean()
    avg_y = ys.mean()

    plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
plt.title("Clusters identified visualized in language 2d using t-SNE")

Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')

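[Figure: t-SNE scatter plot of the transaction embeddings, one color per cluster, with each cluster's mean marked by an x]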
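The sampling step that follows draws `transactions_per_cluster` rows from every cluster without replacement, which pandas rejects when a cluster has fewer rows than the sample size. A quick size check can catch that first. This sketch uses a made-up label list in place of `kmeans.labels_`:

```python
from collections import Counter

transactions_per_cluster = 10

# Made-up labels standing in for kmeans.labels_ (one cluster id per transaction)
labels = [0] * 12 + [1] * 25 + [2] * 7

sizes = Counter(labels)
# Clusters with fewer rows than the sample size would need a smaller sample
too_small = sorted(c for c, n in sizes.items() if n < transactions_per_cluster)
print(dict(sizes), too_small)
```

If any cluster shows up in `too_small`, one option is to sample `min(transactions_per_cluster, cluster_size)` rows for that cluster instead.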

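from openai import OpenAI

# Assumption: the API key is available in the OPENAI_API_KEY environment variable
# (for example via the .env file loaded in Setup), and COMPLETIONS_MODEL names a chat model.
client = OpenAI()

# We'll read 10 transactions per cluster as we're expecting some variation
transactions_per_cluster = 10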

for i in range(n_clusters):
    print(f"Cluster {i} Theme:\n")

    transactions = "\n".join(
        embedding_df[embedding_df.Cluster == i]
        .combined.str.replace("Supplier: ", "")
        .str.replace("Description: ", ":  ")
        .str.replace("Value: ", ":  ")
        .sample(transactions_per_cluster, random_state=42)
        .values
    )
    response = client.chat.completions.create(
        model=COMPLETIONS_MODEL,
        # We'll include a prompt to instruct the model what sort of description we're looking for
        messages=[
            {"role": "user",
             "content": f'''We want to group these transactions into meaningful clusters so we can target the areas we are spending the most money. 
                What do the following transactions have in common?\n\nTransactions:\n"""\n{transactions}\n"""\n\nTheme:'''}
        ],
        temperature=0,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    print(response.choices[0].message.content.replace("\n", ""))
    print("\n")

    sample_cluster_rows = embedding_df[embedding_df.Cluster == i].sample(transactions_per_cluster, random_state=42)
    for j in range(transactions_per_cluster):
        print(sample_cluster_rows.Supplier.values[j], end=", ")
        print(sample_cluster_rows.Description.values[j], end="\n")

    print("-" * 100)
    print("\n")
Cluster 0 Theme:

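The common theme among these transactions is that they all involve spending money on various expenses such as electricity, non-domestic rates, IT equipment, computer equipment, and the purchase of an electric van.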


EDF ENERGY, Electricity Oct 2019 3 buildings
City Of Edinburgh Council, Non Domestic Rates 
EDF, Electricity
EX LIBRIS, IT equipment
City Of Edinburgh Council, Non Domestic Rates 
CITY OF EDINBURGH COUNCIL, Rates for 33 Salisbury Place
EDF Energy, Electricity
XMA Scotland Ltd, IT equipment
Computer Centre UK Ltd, Computer equipment
ARNOLD CLARK, Purchase of an electric van
----------------------------------------------------------------------------------------------------


Cluster 1 Theme:

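The common theme among these transactions is that they all involve payments for various goods and services. Some specific examples include student bursary costs, collection of papers, architectural works, legal deposit services, papers related to Alisdair Gray, resources on slavery abolition and social justice, collection items, online/print subscriptions, ALDL charges, and literary/archival items.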


Institute of Conservation, This payment covers 2 invoices for student bursary costs
PRIVATE SALE, Collection of papers of an individual
LEE BOYD LIMITED, Architectural Works
ALDL, Legal Deposit Services
RICK GEKOSKI, Papers 1970's to 2019 Alisdair Gray
ADAM MATTHEW DIGITAL LTD, Resource -  slavery abolution and social justice
PROQUEST INFORMATION AND LEARN, This payment covers multiple invoices for collection items
LM Information Delivery UK LTD, Payment of 18 separate invoice for Online/Print subscriptions Jan 20-Dec 20
ALDL, ALDL Charges
Private Sale, Literary & Archival Items
----------------------------------------------------------------------------------------------------


Cluster 2 Theme:

The common theme among these transactions is that they all involve spending money at Kelvin Hall.


CBRE, Kelvin Hall
GLASGOW CITY COUNCIL, Kelvin Hall
University Of Glasgow, Kelvin Hall
GLASGOW LIFE, Oct 20 to Dec 20 service charge - Kelvin Hall
Computacenter Uk, Kelvin Hall
XMA Scotland Ltd, Kelvin Hall
GLASGOW LIFE, Service Charges Kelvin Hall 01/07/19-30/09/19
Glasgow Life, Kelvin Hall Service Charges
Glasgow City Council, Kelvin Hall
GLASGOW LIFE, Quarterly service charge KH
----------------------------------------------------------------------------------------------------


Cluster 3 Theme:

The common theme among these transactions is that they all involve payments for facility management fees and services provided by ECG Facilities Service.


ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, Facilities Management Charge
ECG FACILITIES SERVICE, Inspection and Maintenance of all Library properties
ECG FACILITIES SERVICE, Facilities Management Charge
ECG FACILITIES SERVICE, Maintenance contract - October
ECG FACILITIES SERVICE, Electrical and mechanical works
ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, CB Bolier Replacement (1),USP Batteries,Gutter Works & Cleaning of pigeon fouling
ECG FACILITIES SERVICE, Facilities Management Charge
ECG FACILITIES SERVICE, Facilities Management Charge
----------------------------------------------------------------------------------------------------


Cluster 4 Theme:

The common theme among these transactions is that they all involve construction or refurbishment work.


M & J Ballantyne Ltd, George IV Bridge Work
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
ARTHUR MCKAY BUILDING SERVICES, Causewayside Work
John Graham Construction Ltd, Causewayside Refurbishment
Morris & Spottiswood Ltd, George IV Bridge Work
ECG FACILITIES SERVICE, Causewayside IT Work
John Graham Construction Ltd, Causewayside Refurbishment
----------------------------------------------------------------------------------------------------
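The prompt sent for each cluster is assembled inline in the loop above. Pulling it into a small function makes it easier to inspect or unit-test without calling the API. This is a sketch: `build_theme_prompt` is a hypothetical helper, the wording follows the loop minus its incidental indentation, and the sample transactions are made up.

```python
def build_theme_prompt(transactions):
    """Return the cluster-theme prompt for a list of transaction strings."""
    joined = "\n".join(transactions)
    return (
        "We want to group these transactions into meaningful clusters "
        "so we can target the areas we are spending the most money.\n"
        "What do the following transactions have in common?\n\n"
        f'Transactions:\n"""\n{joined}\n"""\n\nTheme:'
    )


# Made-up sample; in the notebook this would be the sampled rows of one cluster
prompt = build_theme_prompt(["EDF, Electricity", "XMA Scotland Ltd, IT equipment"])
print(prompt)
```

Ending the prompt with `Theme:` nudges the model to answer with the theme itself rather than restating the question.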

Conclusion

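We now have five new clusters that we can use to describe our data. The visualisation shows that some of the clusters overlap, so some tuning is still needed, but we can already see that GPT-3 has made some effective inferences. In particular, it picked up that items involving legal deposits are related to literary archives, which is true even though the model was given no hint of it. Very cool, and with some tuning we can create a base set of clusters to use with a multiclass classifier, which would let us generalise to other transactional datasets we might use.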
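To turn the generated themes into labels, each cluster id can be mapped to a short, human-chosen name. The sketch below uses made-up label names loosely based on the themes above and a made-up list of cluster assignments standing in for the `Cluster` column.

```python
# Hypothetical short names, written by hand after reading the generated themes
cluster_names = {
    0: "Utilities and equipment",
    1: "Collections and subscriptions",
    2: "Kelvin Hall",
    3: "Facilities management",
    4: "Building and refurbishment",
}

# Made-up cluster ids standing in for embedding_df["Cluster"]
clusters = [4, 1, 0, 2, 4]

labelled = [cluster_names[c] for c in clusters]
print(labelled)
```

With pandas, the same lookup can be written as `embedding_df["Cluster"].map(cluster_names)`, and the resulting column could then serve as training labels for a classifier.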