
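Visualizing embeddings in Kangas

In this Jupyter Notebook, we build a Kangas DataGrid that contains the embedding data and its 2D projections.

What is Kangas?

Kangas is an open-source, mixed-media, dataframe-like tool for data scientists. It was developed by Comet, a company that helps users reduce the friction of moving models into production.

1. Setup

First, we pip install kangas and then import it.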

%pip install kangas --quiet

import kangas as kg

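2. Build a Kangas DataGrid

We create a Kangas DataGrid from the original data and the embeddings. The data consists of rows of reviews, and each embedding consists of 1536 floating-point values. In this example, we fetch the data directly from GitHub, so the notebook works even if you are not running it from inside the repository that hosts the data.

We use Kangas to read the CSV file into a DataGrid for further processing.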

data = kg.read_csv("https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv")

Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...


1001it [00:00, 2412.90it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]

We can review the fields of the CSV file:

data.info()

DataGrid (in memory)
    Name   : fine_food_reviews_with_embeddings_1k
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type       
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER             
2   ProductId                      1,000 TEXT                
3   UserId                         1,000 TEXT                
4   Score                          1,000 INTEGER             
5   Summary                        1,000 TEXT                
6   Text                           1,000 TEXT                
7   combined                       1,000 TEXT                
8   n_tokens                       1,000 INTEGER             
9   embedding                      1,000 TEXT
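
Note that the embedding column is listed as TEXT: each cell holds the string form of a list rather than a list of numbers. As a minimal sketch of what that implies, the snippet below parses such a string and checks it against the 1536 values mentioned earlier. The helper `parse_embedding` and the synthetic `sample_cell` are assumptions made for illustration; they are not part of Kangas or of the dataset.

```python
import ast

EXPECTED_DIM = 1536  # embedding size stated in this notebook


def parse_embedding(cell, expected_dim=EXPECTED_DIM):
    """Hypothetical helper: parse a stringified list of numbers and validate its length."""
    values = ast.literal_eval(cell)  # evaluates Python literals only, never arbitrary code
    if not isinstance(values, list):
        raise ValueError("expected a list literal")
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(values)}")
    return [float(v) for v in values]


# Synthetic stand-in for one CSV cell; this is not real data from the file.
sample_cell = str([0.001 * i for i in range(EXPECTED_DIM)])
vector = parse_embedding(sample_cell)
print(len(vector), type(vector[0]).__name__)
```

A check like this can catch truncated or malformed cells before they are turned into embeddings.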

And take a look at the first and last rows:

data

row-id	Column 1	ProductId	UserId	Score	Summary	Text	combined	n_tokens	embedding
1	0	B003XPF9BO	A3R7JR3FMEBXQB	5	where does one	Wanted to save	Title: where do	52	[0.007018072064
2	297	B003VXHGPK	A21VWSCGW7UUAR	4	Good, but not W	Honestly, I hav	Title: Good, bu	178	[-0.00314055196
3	296	B008JKTTUA	A34XBAIFT02B60	1	Should advertis	First, these sh	Title: Should a	78	[-0.01757248118
4	295	B000LKTTTW	A14MQ40CCU8B13	5	Best tomato sou	I have a hard t	Title: Best tom	111	[-0.00139322795
5	294	B001D09KAM	A34XBAIFT02B60	1	Should advertis	First, these sh	Title: Should a	78	[-0.01757248118
...
996	623	B0000CFXYA	A3GS4GWPIBV0NT	1	Strange inflamm	Truthfully wasn	Title: Strange	110	[0.000110913533
997	624	B0001BH5YM	A1BZ3HMAKK0NC	5	My favorite and	You've just got	Title: My favor	80	[-0.02086931467
998	625	B0009ET7TC	A2FSDQY5AI6TNX	5	My furbabies LO	Shake the conta	Title: My furba	47	[-0.00974910240
999	619	B007PA32L2	A15FF2P7RPKH6G	5	got this for th	all i have hear	Title: got this	50	[-0.00521062919
1000	999	B001EQ5GEO	A3VYU0VO6DYV6I	5	I love Maui Cof	My first experi	Title: I love M	118	[-0.00605782261
[1000 rows x 9 columns]

* Use DataGrid.save() to save to disk
** Use DataGrid.show() to start user interface

Now, we create a new DataGrid, converting the numbers into an Embedding:

import ast  # to convert the string form of a list of numbers into a list of numbers

dg = kg.DataGrid(
    name="openai_embeddings",
    columns=data.get_columns(),
    converters={"Score": str},
)
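for row in data:
    # Column index 8 holds the embedding as the string form of a list; parse it into numbers.
    embedding = ast.literal_eval(row[8])
    row[8] = kg.Embedding(
        embedding,
        name=str(row[3]),  # Score (column index 3)
        text="%s - %.10s" % (row[3], row[4]),  # Score plus the first 10 characters of Summary
        projection="umap",
    )
    dg.append(row)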
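
The text= argument above relies on printf-style formatting, where `%.10s` keeps at most the first 10 characters of a string. The sketch below isolates that format string so its effect is easy to see. The helper name `make_label` and the sample summary are hypothetical and used only for illustration.

```python
def make_label(score, summary):
    """Hypothetical helper mirroring the text= format string used in the loop."""
    # %s renders the score as-is; %.10s keeps at most the first 10 characters of the summary.
    return "%s - %.10s" % (score, summary)


label = make_label(5, "Best tomato soup ever")  # illustrative summary text
print(label)  # 5 - Best tomat
```

Keeping the label short presumably helps readability when many points are shown in the projection.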

The new DataGrid now has the correct data type for the embedding column.

dg.info()

DataGrid (in memory)
    Name   : openai_embeddings
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type       
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER             
2   ProductId                      1,000 TEXT                
3   UserId                         1,000 TEXT                
4   Score                          1,000 TEXT                
5   Summary                        1,000 TEXT                
6   Text                           1,000 TEXT                
7   combined                       1,000 TEXT                
8   n_tokens                       1,000 INTEGER             
9   embedding                      1,000 EMBEDDING-ASSET

We simply save the datagrid, and we're done.

dg.save()

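3. Render 2D Projections

To render the data directly in the notebook, simply show it. Note that each row contains an embedding projection.

Scroll to the far right to see the embedding projection for each row.

The color of the points in the projection space represents the Score.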

dg.show()

Group by "Score" to see the rows in each group.

dg.show(group="Score", sort="Score", rows=5, select="Score,embedding")
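
To make the idea of grouping concrete, here is a plain-Python sketch that buckets the first five rows displayed earlier by their Score. It only illustrates the concept; it is not how Kangas implements grouping, and the `rows` list is a hand-copied subset of the table output above.

```python
from collections import defaultdict

# (row-id, Score, Summary prefix) copied from the first five rows shown earlier
rows = [
    (1, 5, "where does one"),
    (2, 4, "Good, but not W"),
    (3, 1, "Should advertis"),
    (4, 5, "Best tomato sou"),
    (5, 1, "Should advertis"),
]

# Collect the row-ids that share each Score value.
groups = defaultdict(list)
for row_id, score, _summary in rows:
    groups[score].append(row_id)

# Print the groups in ascending Score order, similar in spirit to sort="Score".
for score in sorted(groups):
    print(score, groups[score])
```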

An example of this datagrid is hosted here: https://kangas.comet.com/?datagrid=/data/openai_embeddings.datagrid