Financial Document Analysis with LlamaIndex

In this example Notebook, we show how to use the LlamaIndex framework to run financial analysis on 10-K documents with only a few lines of code.

Notebook Overview

Introduction
Setup
Data Loading and Indexing
Simple QA
Advanced QA - Compare and Contrast

Introduction

LlamaIndex

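LlamaIndex is a data framework for LLM applications. You can get started in a few lines of code and build a retrieval-augmented generation (RAG) system in minutes. For more advanced users, LlamaIndex offers a rich toolkit for ingesting and indexing data, modules for retrieval and re-ranking, and composable components for building custom query engines.

See the full documentation for more details.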

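Financial Analysis of 10-K Documents

A key part of a financial analyst's job is extracting information and synthesizing insights from long financial documents. A good example is the Form 10-K, an annual report required by the U.S. Securities and Exchange Commission (SEC) that gives a comprehensive summary of a company's financial performance. These documents typically run to hundreds of pages and contain domain-specific terminology, which makes them hard for a layperson to digest quickly.

We show how LlamaIndex can help a financial analyst quickly extract information and synthesize insights across multiple documents with very little code.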

Setup

First, we need to install the llama-index library, together with pypdf for reading the PDF files.

!pip install llama-index pypdf

Now we import all the modules used in this tutorial.

from langchain import OpenAI

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

Before we start, we can configure the LLM provider and model that will power our RAG system. Here, we pick OpenAI's gpt-3.5-turbo-instruct.

llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct", max_tokens=-1)

We construct a ServiceContext and set it as the global default, so that all subsequent operations that depend on LLM calls use the model we configured here.

service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)

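Data Loading and Indexing

Now we load and parse 2 PDF files: the 2021 10-K filings for Uber and for Lyft. Under the hood, each PDF is converted into plain-text Document objects, one per page.

Note: this operation might take a while to run, since each document is more than 100 pages long.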

lyft_docs = SimpleDirectoryReader(input_files=["../data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["../data/10k/uber_2021.pdf"]).load_data()

print(f'Loaded lyft 10-K with {len(lyft_docs)} pages')
print(f'Loaded Uber 10-K with {len(uber_docs)} pages')

Loaded lyft 10-K with 238 pages
Loaded Uber 10-K with 307 pages

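Now we can build an (in-memory) VectorStoreIndex over the documents we loaded.

Note: this operation might take a while to run, since it calls the OpenAI API to compute vector embeddings for the document chunks.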

lyft_index = VectorStoreIndex.from_documents(lyft_docs)
uber_index = VectorStoreIndex.from_documents(uber_docs)
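
The "document chunks" mentioned above can be pictured with a small, library-free sketch. This is only an illustration of overlapping chunking in general, not LlamaIndex's own splitter: the function name `split_into_chunks`, the word-based splitting, and the `chunk_size` and `overlap` values are all assumptions made for this example.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    (Hypothetical helper; real splitters usually count tokens, not words.)
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Toy stand-in for the text of one 10-K page (120 made-up words).
page_text = " ".join(f"word{i}" for i in range(120))
chunks = split_into_chunks(page_text, chunk_size=50, overlap=10)
print(f"{len(chunks)} chunks, first chunk has {len(chunks[0].split())} words")
```

Each chunk would then be embedded separately, which is why index construction issues many embedding calls for a document of several hundred pages.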

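Simple QA

Now we are ready to run some queries against our indices! To do so, we first configure a QueryEngine, which simply captures a set of configurations for how we want to query the underlying index.

For a VectorStoreIndex, the most common setting to configure is similarity_top_k, which controls how many document chunks (which we call Node objects) are retrieved as context for answering our question.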

lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)

uber_engine = uber_index.as_query_engine(similarity_top_k=3)
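
To make the role of similarity_top_k concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It assumes toy 3-dimensional vectors and hypothetical chunk names; real embeddings have far more dimensions, and this is not the code LlamaIndex runs internally.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec: List[float],
                   chunk_vecs: Dict[str, List[float]],
                   k: int = 3) -> List[str]:
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Made-up embeddings for four hypothetical chunks of a 10-K.
toy_chunks = {
    "revenue_table": [0.9, 0.1, 0.0],
    "risk_factors": [0.1, 0.9, 0.1],
    "revenue_discussion": [0.8, 0.2, 0.1],
    "legal_proceedings": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # made-up embedding of a revenue question

top = retrieve_top_k(query, toy_chunks, k=2)
print(top)
```

A larger k gives the LLM more context to draw on, at the cost of a longer prompt and a higher chance of including loosely related chunks.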

Let's see some queries in action!

response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in millions with page reference')

print(response)

$3,208.3 million (page 63)

response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')

print(response)

$17,455 (page 53)

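Advanced QA - Compare and Contrast

More complex financial analysis often requires referencing multiple documents.

For example, let's look at how to run compare-and-contrast queries over Lyft's and Uber's financials. To do this, we build a SubQuestionQueryEngine, which breaks a complex compare-and-contrast query down into simpler sub-questions, each executed on the sub query engine backed by the corresponding index.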

query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine, 
        metadata=ToolMetadata(name='lyft_10k', description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine, 
        metadata=ToolMetadata(name='uber_10k', description='Provides information about Uber financials for year 2021')
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
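
The decompose, route, and combine flow described above can be sketched without any LLM calls. Everything in this block is a hypothetical stand-in: `decompose` simply asks the same question of every tool (the real engine uses an LLM to write the sub-questions), the tools are stub functions rather than query engines, and the final step concatenates answers instead of synthesizing them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuestion:
    tool_name: str   # which tool (index) should answer
    question: str    # the simpler question sent to that tool


def decompose(query: str, tool_names: List[str]) -> List[SubQuestion]:
    """Stand-in for the LLM step: one sub-question per available tool."""
    return [SubQuestion(name, f"{query} (source: {name})") for name in tool_names]


def answer_with_sub_questions(query: str,
                              tools: Dict[str, Callable[[str], str]]) -> str:
    """Route each sub-question to its tool, then combine the answers."""
    sub_questions = decompose(query, list(tools))
    answers = [f"[{sq.tool_name}] {tools[sq.tool_name](sq.question)}"
               for sq in sub_questions]
    # Stand-in for the synthesis step: join the per-tool answers.
    return "\n".join(answers)


# Stub tools in place of the Lyft and Uber query engines.
toy_tools = {
    "lyft_10k": lambda question: "stub answer from the Lyft index",
    "uber_10k": lambda question: "stub answer from the Uber index",
}

summary = answer_with_sub_questions("What was revenue growth?", toy_tools)
print(summary)
```

The tool names and descriptions matter in the real engine because they are what the LLM sees when deciding which index each sub-question should go to.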

Let's see these queries in action!

response = await s_engine.aquery('Compare and contrast the customer segments and geographies that grew the fastest')

Generated 4 sub questions.
[uber_10k] Q: What customer segments grew the fastest for Uber
[uber_10k] A: in 2021?

The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth.
[uber_10k] Q: What geographies grew the fastest for Uber
[uber_10k] A: 
Based on the context information, it appears that Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.
[lyft_10k] Q: What customer segments grew the fastest for Lyft
[lyft_10k] A: 
The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them.
[lyft_10k] Q: What geographies grew the fastest for Lyft
[lyft_10k] A: 
It is not possible to answer this question with the given context information.

print(response)

The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth. Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.

The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

In summary, Uber and Lyft both experienced growth in customer segments related to mobility, couriers, riders, and eaters. Uber experienced the most growth in large metropolitan areas, as well as in suburban and rural areas, and in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain. Lyft experienced the most growth in ridesharing, light vehicles, and public transit. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

response = await s_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')

Generated 2 sub questions.
[uber_10k] Q: What is the revenue growth of Uber from 2020 to 2021
[uber_10k] A: 
The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis.
[lyft_10k] Q: What is the revenue growth of Lyft from 2020 to 2021
[lyft_10k] A: 
The revenue growth of Lyft from 2020 to 2021 is 36%, increasing from $2,364,681 thousand to $3,208,323 thousand.

print(response)

The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis, while the revenue growth of Lyft from 2020 to 2021 was 36%. This means that Uber had a higher revenue growth than Lyft from 2020 to 2021.
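
As a quick sanity check on the generated answer, the Lyft growth rate can be recomputed from the two revenue figures quoted in the sub-answer above. This is plain arithmetic on those numbers; the variable names are chosen for this example.

```python
# Revenue figures quoted in the Lyft sub-answer, in thousands of USD.
lyft_revenue_2020 = 2_364_681
lyft_revenue_2021 = 3_208_323

# Year-over-year growth = (new - old) / old
lyft_growth = (lyft_revenue_2021 - lyft_revenue_2020) / lyft_revenue_2020

print(f"Lyft revenue growth 2020 -> 2021: {lyft_growth:.1%}")
```

The result is roughly 35.7%, which rounds to the 36% reported by the engine. Recomputing figures like this is a cheap way to catch cases where the LLM misquotes a number or gets the arithmetic wrong.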