Building a Self-Evaluating Financial Chatbot: A Journey Through Data, Code, and Struggle

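Update (2026): The financial chatbot project described below later evolved into the S&P 500 agent (the MVP launched in September 2024). After that release I shifted my focus back to Sydney (blog content) and two new products: STRATUM (marketing intelligence) and DIALOGUE (podcast generation). The SEC data pipeline and the self-evaluation pattern described here went on to shape those products.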


The original post from April 2024 is kept below for reference.

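In this post I want to share a fun hobby project: building a financial chatbot/agent that can evaluate its own answers. The goal is for the agent to answer questions about the financial condition and trends of US S&P 500 companies using official SEC filings data, staying as accurate as possible and avoiding hallucination. The core benefits of this agent are (or, let's say, should eventually be :P):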

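It lets you ask questions about the financial condition and trends of US S&P 500 companies.
It works directly from the official financial data these companies file with the SEC, to avoid hallucination.
It provides detailed citations under every answer, so you can fact-check for yourself.
The database covers the past 5-10 years of filings, so the agent can reason about trends over time.
Before any answer is returned, a second agent critically reviews the LLM's draft answer and suggests improvements.
Those suggestions, along with the context, are fed back to the original agent. The original agent can then decide either to restructure the retrieval query to get better information, or to simply incorporate the suggestions into a rewrite.
Only then is the final answer returned to the user.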
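To make the draft, review, and revise loop concrete, here is a minimal sketch. It is not the project's actual code: `answer_with_review`, `Critique`, `draft_fn`, and `critique_fn` are hypothetical names, and the two callables stand in for whatever LLM calls do the drafting and the reviewing.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    approved: bool    # True when the reviewer has no further changes
    suggestions: str  # free-text feedback for the drafting agent


def answer_with_review(
    question: str,
    draft_fn: Callable[[str, List[str]], str],
    critique_fn: Callable[[str, str], Critique],
    max_rounds: int = 2,
) -> str:
    """Draft an answer, have a reviewer critique it, and revise until approved.

    draft_fn(question, feedback) is assumed to wrap the answering agent; it may
    re-run retrieval or just rewrite, based on the accumulated feedback.
    critique_fn(question, draft) is assumed to wrap the reviewing agent.
    """
    feedback: List[str] = []
    draft = draft_fn(question, feedback)
    for _ in range(max_rounds):
        critique = critique_fn(question, draft)
        if critique.approved:
            break
        # Feed the reviewer's suggestions back to the original agent.
        feedback.append(critique.suggestions)
        draft = draft_fn(question, feedback)
    return draft
```

Capping the loop with `max_rounds` is a design choice in this sketch: it keeps cost and latency bounded even if the reviewer never approves.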

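Of course, I don't have a Bloomberg Terminal, so I don't know whether Bloomberg's chatbot already covers all of the above. (My guess is "to some extent, yes." But I'm not sure how well it handles self-critique and revision, and I'd love to know :P)

Anyway, I wanted to try building this because it is complex enough for where I am in my learning, and it might actually be useful.

The kinds of questions I want to answer are:

How has Apple's marketing spend trended over the past 5-10 years?
What were a given company's major mergers and acquisitions over the past 5 years?
Compare Nvidia's and Microsoft's R&D spending over the past 5 years.
And so on.

So where am I on this project right now? What have I learned? And where am I stuck?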

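How do I get the SEC data?

One of the earliest challenges was simply getting the data. The SEC does publish guidance on accessing EDGAR data (along with developer resources), but working out how to download each company's filings at scale still took me quite a while.

Is there a way to skip the "download and process the filings myself" step for now and keep moving?

While browsing the LangChain docs I found a financial retriever: Kay.ai. I used it first as a proof of concept, to see whether the rest of the pipeline would run. Basic queries worked fine, but it does not support async calls and lacks advanced metadata filtering. So in the end I decided to build this part myself.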

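Sharing the actual Python scripts

A Python script for downloading filings from SEC EDGAR

After a lot of trial and error (and help from ChatGPT), this script automatically downloads XBRL and TXT filings from the US SEC. It pulls the most recent filings for a company based on its CIK.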
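As a rough illustration of the CIK-based lookup (not the script itself), the sketch below only builds the URLs a downloader would request. The helper names are hypothetical, and the URL patterns are my understanding of EDGAR's public layout, so treat them as assumptions and check them against the SEC's own documentation before relying on them.

```python
def pad_cik(cik) -> str:
    """Zero-pad a CIK to 10 digits, the form used by EDGAR's JSON endpoints."""
    return str(int(cik)).zfill(10)


def submissions_url(cik) -> str:
    """URL of the JSON index listing a company's recent filings (assumed pattern)."""
    return f"https://data.sec.gov/submissions/CIK{pad_cik(cik)}.json"


def filing_urls(cik, accession_number: str) -> dict:
    """Build the .txt and XBRL .zip URLs for one filing (assumed patterns).

    accession_number is the dashed form, e.g. "0000320193-19-000119".
    """
    folder = accession_number.replace("-", "")  # archive folder name has no dashes
    base = f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{folder}"
    return {
        "txt": f"{base}/{accession_number}.txt",
        "zip": f"{base}/{accession_number}-xbrl.zip",
    }


# The SEC asks automated clients to identify themselves; this value is a placeholder.
HEADERS = {"User-Agent": "Your Name your.email@example.com"}
```

A real downloader would fetch these URLs with `HEADERS` attached and throttle its requests to stay within the SEC's fair-access limits.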

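Why do I download both the .zip and the .txt?

Each company's .txt file is very complete. It contains a lot of valuable metadata, such as form type (10-K/10-Q), reporting period, filed date, company CIK and name, and so on. I need all of this metadata later when building the agent. It is also neatly structured at the top of the file, for example: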

<SEC-DOCUMENT>0000320193-19-000119.txt : 20191031
<SEC-HEADER>0000320193-19-000119.hdr.sgml : 20191031
<ACCEPTANCE-DATETIME>20191030181236
ACCESSION NUMBER:		0000320193-19-000119
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		96
CONFORMED PERIOD OF REPORT:	20190928
FILED AS OF DATE:		20191031
DATE AS OF CHANGE:		20191030

FILER:

	COMPANY DATA:
		COMPANY CONFORMED NAME:			Apple Inc.
		CENTRAL INDEX KEY:			0000320193
		STANDARD INDUSTRIAL CLASSIFICATION:	ELECTRONIC COMPUTERS [3571]
		IRS NUMBER:				942404110
		STATE OF INCORPORATION:			CA
		FISCAL YEAR END:			0928

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-36743
		FILM NUMBER:		191181423

	BUSINESS ADDRESS:
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014
		BUSINESS PHONE:		(408) 996-1010

	MAIL ADDRESS:
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014

	FORMER COMPANY:
		FORMER CONFORMED NAME:	APPLE INC
		DATE OF NAME CHANGE:	20070109

	FORMER COMPANY:
		FORMER CONFORMED NAME:	APPLE COMPUTER INC
		DATE OF NAME CHANGE:	19970808
</SEC-HEADER>
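Because the header is a series of "KEY: value" lines, a small regular expression is enough to pull it into a dictionary. This is a minimal sketch with a hypothetical function name, assuming the header layout shown above; section labels such as FILER: carry no value on the same line and are skipped.

```python
import re

# Matches lines like "CONFORMED SUBMISSION TYPE:<tab>10-K" (key, colon, whitespace, value).
HEADER_LINE = re.compile(r"^\s*([A-Z][A-Z0-9 \-]*?):[ \t]+(\S.*?)\s*$")


def parse_sec_header(text: str) -> dict:
    """Extract key/value metadata from the <SEC-HEADER> block of a filing .txt."""
    start = text.find("<SEC-HEADER>")
    end = text.find("</SEC-HEADER>")
    header = text[start:end] if start != -1 and end != -1 else text
    metadata = {}
    for line in header.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            key = match.group(1).strip().lower().replace(" ", "_").replace("-", "_")
            # Keep the first occurrence, so the current company name and business
            # address win over later repeats (mail address, former names).
            metadata.setdefault(key, match.group(2))
    return metadata
```

The normalized keys (for example `conformed_submission_type` and `central_index_key`) are chosen so they can be reused directly as metadata field names later on.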

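The .txt file also holds a great deal beyond the core financial report, so a single file is usually large (10+ MB, sometimes 40+ MB). Downloading every 10-K and 10-Q for one company over the past 5-10 years easily comes to 20+ files. Cleaning and chunking these large files directly is slow, and the embedding cost would be excessive. So I needed another route.

This is where the .zip comes in. Each .zip contains the core financial report as an .htm file, plus other content. The catch is that the core .htm report is not named consistently across years or companies, and the .htm does not carry the tidy metadata header that the .txt does.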

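Combining the metadata with the main report content

This script automates extracting the financial reports and their associated metadata from the filings of multiple companies. Its input is the .zip and .txt files downloaded from SEC EDGAR.

I made a couple of assumptions here:

Both the .txt and the .zip are downloaded for every filing.
The largest .htm file in each .zip is the core financial report. In my experience this usually holds, because the other content is derived from the main .htm.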
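The "largest .htm wins" heuristic can be sketched with the standard library alone. The function names here are hypothetical, and also accepting .html is my own addition rather than something the pipeline is known to need.

```python
import zipfile
from typing import Optional


def largest_htm_member(zip_path) -> Optional[str]:
    """Return the name of the largest .htm/.html file in the archive, if any."""
    with zipfile.ZipFile(zip_path) as archive:
        candidates = [
            info for info in archive.infolist()
            if info.filename.lower().endswith((".htm", ".html"))
        ]
        if not candidates:
            return None
        # file_size is the uncompressed size, which is what the heuristic compares.
        return max(candidates, key=lambda info: info.file_size).filename


def read_core_report(zip_path) -> Optional[str]:
    """Read the presumed core report as text, replacing undecodable bytes."""
    name = largest_htm_member(zip_path)
    if name is None:
        return None
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(name).decode("utf-8", errors="replace")
```

Pairing the text returned by `read_core_report` with the header metadata parsed from the matching .txt gives one record per filing.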

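Cleaning out useless content and characters before chunking

The good news is that with the approach above, each report (10-K or 10-Q) usually comes in under 3 MB. But it is still long and contains plenty of information we don't need, so it needs further cleaning before chunking. Otherwise embedding will be slow and expensive. Imagine embedding costs just $0.10 per report: 5 years of 10-Ks and 10-Qs alone (20 or more filings) would then be roughly $2 to $2.5 per company. Cover most of the S&P 500, or stretch the window to 10 years, and the cost adds up fast.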
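To see how quickly that adds up, here is a back-of-the-envelope calculator. It assumes exactly one 10-K and three 10-Qs per company per year (amendments and other filings would push the count higher), and the $0.10 per report is the hypothetical figure from the paragraph above, not a measured price.

```python
def embedding_cost(
    companies: int,
    years: int,
    cost_per_report: float = 0.10,
    filings_per_year: int = 4,  # assumption: one 10-K plus three 10-Qs
) -> float:
    """Estimate the total embedding cost in dollars, rounded to cents."""
    filings = companies * years * filings_per_year
    return round(filings * cost_per_report, 2)


one_company_5y = embedding_cost(companies=1, years=5)     # 20 filings
sp500_10y = embedding_cost(companies=500, years=10)       # 20,000 filings
print(f"1 company, 5 years: ${one_company_5y:,.2f}")
print(f"500 companies, 10 years: ${sp500_10y:,.2f}")
```

Under these assumptions the full index costs about a thousand times what a single company's 5-year history does, which is why shrinking each report before embedding matters.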

So I wrote this cleaning script. After processing, every output file is under 0.2 MB, roughly a 10x reduction. It still keeps the key metadata described earlier.
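The general idea of such a cleaning pass can be sketched with the standard library: drop tags, scripts, and styles, then collapse whitespace. This is an illustration under my own assumptions, not the actual cleaning script, and the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &nbsp;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_report_html(html: str) -> str:
    """Strip markup from a report and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent table cells apart, at the cost of
    # occasionally splitting a word that was styled mid-word.
    text = " ".join(parser.parts).replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()
```

Much of the size reduction in a pass like this comes from discarding inline styling, which tends to dominate SEC .htm reports.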

At this point we are ready for the chunking/embedding stage.
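For the chunking step, the important detail is that every chunk carries the filing's metadata with it, so filters can be applied at query time. Below is a minimal character-based sketch; the function name, chunk size, and overlap are illustrative assumptions, and a production pipeline would more likely use a token-aware or structure-aware splitter.

```python
from typing import Dict, List


def chunk_with_metadata(
    text: str,
    metadata: Dict[str, str],
    chunk_size: int = 1000,
    overlap: int = 100,
) -> List[dict]:
    """Split text into overlapping chunks, each tagged with the filing metadata."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        # Copy the metadata so each chunk can be filtered independently later.
        chunks.append({"text": piece, "metadata": {**metadata, "chunk_index": index}})
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.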

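Which vector store should I choose?

There are plenty of options (50+ of them). But since my next step needs complex metadata filtering, a self-querying retriever looks like the right route. I say "looks like" because it has trade-offs too. So far I have tried Chroma, ElasticSearch, and FAISS.

Chroma and ElasticSearch are both very capable, but their indexes are on the large side (550+ MB for Chroma, 800+ MB for ElasticSearch), and those indexes only cover embeddings for 5 test companies. Scaling at that ratio to the full S&P 500, the final index could be 100 times larger. Clearly not friendly to a local laptop :D

FAISS, on the same 5-company test set, has an index of only 200+ MB. The problem is that FAISS falls short of the other two on advanced filtering and native LangChain integration, so I need to keep researching. If you have suggestions, please let me know. (I may also need to test alternatives such as Weaviate and Pinecone.)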
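To pin down what "metadata filtering" means here, independent of any particular store, the sketch below filters records by exact metadata match first and only then ranks the survivors by cosine similarity. It is a toy in pure Python with hypothetical names and record layout, not the API of Chroma, ElasticSearch, or FAISS.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_vec: Sequence[float],
    records: List[dict],
    filters: Dict[str, str],
    k: int = 3,
) -> List[dict]:
    """Pre-filter on metadata, then return the k most similar records.

    Each record is assumed to look like {"vector": [...], "metadata": {...}}.
    """
    matching = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    return sorted(
        matching, key=lambda r: cosine(query_vec, r["vector"]), reverse=True
    )[:k]
```

Stores with native filtering do this step inside the index; with a store that lacks it, a wrapper has to do something like the above or over-fetch and filter afterwards.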

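Query structuring

Even with metadata added to every filing/chunk, embedded, and stored in the vector store, the next question remains: how do I tell the machine which filters to use? Lance from LangChain covers "Query structuring for metadata filters" in a video.

Based on that, I built the pydantic object below. Its fields correspond to the actual metadata in the filings, such as form type (10-K/10-Q), reporting period, filed date, company identifiers, and so on.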

import datetime
from typing import Optional
from pydantic import BaseModel, Field

class FinancialFilingsSearch(BaseModel):
    """Search over a database of financial filings for a company, using the actual metadata tags from the filings."""

    content_search: str = Field(
        ...,
        description="Similarity search query applied to the content of the financial filings with the SEC.",
    )
    conformed_submission_type: Optional[str] = Field(
        None,
        description="Filter for the type of the SEC filing, such as 10-K (annual report) or 10-Q (quarterly report).",
    )
    conformed_period_of_report: Optional[datetime.date] = Field(
        None,
        description="Filter for the end date (format: YYYYMMDD) of the reporting period for the filing. For a 10-Q, it's the quarter-end date, and for a 10-K, it's the fiscal year-end date.",
    )
    filed_as_of_date: Optional[datetime.date] = Field(
        None,
        description="Filter for the date (YYYYMMDD) on which the filing was officially submitted to the SEC. Only use if explicitly specified.",
    )
    # date_as_of_change: Optional[datetime.date] = Field(
    #     None,
    #     description="If any information in the filing was updated or amended after the initial filing date, this date reflects when those changes were made.",
    # )
    company_conformed_name: Optional[str] = Field(
        None,
        description="Filter for official name of the company as registered with the SEC",
    )
    central_index_key: Optional[str] = Field(
        None,
        description="Central Index Key (CIK): A unique identifier assigned by the SEC to all entities (companies, individuals, etc.) who file with the SEC.",
    )
    standard_industrial_classification: Optional[str] = Field(
        None,
        description="The Standard Industrial Classification Codes that appear in a company's disseminated EDGAR filings indicate the company's type of business. Only use if explicitly specified.",
    )
    # irs_number: Optional[str] = Field(
    #     None,
    #     description="IRS number to filter by.",
    # )
    # state_of_incorporation: Optional[str] = Field(
    #     None,
    #     description="State of incorporation to filter by.",
    # )
    # fiscal_year_end: Optional[str] = Field(
    #     None,
    #     description="The end date of the company's fiscal year, which is used for financial reporting and taxation purposes, like Dec 31 or Sep30",
    # )
    form_type: Optional[str] = Field(
        None,
        description="Form type to filter by, such as 10-K or 10-Q.",
    )
    # sec_file_number: Optional[str] = Field(
    #     None,
    #     description="SEC file number to filter by.",
    # )
    # film_number: Optional[str] = Field(
    #     None,
    #     description="Film number to filter by.",
    # )
    # former_company: Optional[str] = Field(
    #     None,
    #     description="Former company name to filter by.",
    # )
    # former_conformed_name: Optional[str] = Field(
    #     None,
    #     description="Former conformed name to filter by.",
    # )
    # date_of_name_change: Optional[datetime.date] = Field(
    #     None,
    #     description="Date of name change to consider.",
    # )

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Set up language models (temperature=0 to keep the structured output as stable as possible)
llm_35 = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)  # GPT-3.5 model
llm_4 = ChatOpenAI(model="gpt-4-turbo-2024-04-09", temperature=0)  # GPT-4 model for more complex tasks

system = """You are an expert at converting user questions into database queries. \
You have access to a vector store of financial filings submitted by public companies to the SEC, used to build LLM-powered applications. \
Given a question, return a detailed database query optimized to retrieve the most relevant results. \
Be as detailed as possible with your returned query, including all relevant fields and filters. \
Always include conformed_period_of_report. \

If there are acronyms or words you are not familiar with, do not try to rephrase them."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)

# Bind the schema to the GPT-3.5 model so its output is parsed into FinancialFilingsSearch
structured_llm = llm_35.with_structured_output(FinancialFilingsSearch)
query_analyzer = prompt | structured_llm

For example, in the questions and LLM responses below, you can see that it omits conformed_period_of_report in some cases:

Question: What was Google's advertising and marketing spending in the 10-K report for the year 2018?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'central_index_key': '0001652044'}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2019?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'central_index_key': '0001652044'}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2020?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'conformed_period_of_report': '2020'}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2021?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'conformed_period_of_report': '2021'}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2022?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'form_type': '10-K'}
Question: How has Google's advertising and marketing spending trended from 2018 to 2022 according to 10-K filings?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'form_type': '10-K'}
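One possible mitigation, which I have not validated on this pipeline, is a deterministic fallback: when the structured query comes back without a period, pull the years out of the question text and fan the query out, one copy per year. The function names and the `report_year` key are hypothetical. Note that a fiscal year does not always match the calendar year (Apple's ends in September), so the year would still need mapping to an actual period-end date.

```python
import re
from typing import List

YEAR = r"(?:19|20)\d{2}"
YEAR_PATTERN = re.compile(rf"\b{YEAR}\b")
RANGE_PATTERN = re.compile(rf"\b({YEAR})\s*(?:to|through|-)\s*({YEAR})\b")


def years_in_question(question: str) -> List[int]:
    """Return every year mentioned in the question, expanding 'X to Y' ranges."""
    years = [int(y) for y in YEAR_PATTERN.findall(question)]
    range_match = RANGE_PATTERN.search(question)
    if range_match:
        first, last = int(range_match.group(1)), int(range_match.group(2))
        if first <= last:
            years.extend(range(first, last + 1))
    return sorted(set(years))


def fill_missing_period(query: dict, question: str) -> List[dict]:
    """Return one query per year when the LLM left the reporting period empty."""
    if query.get("conformed_period_of_report"):
        return [query]  # the LLM supplied a period; leave the query untouched
    years = years_in_question(question)
    if not years:
        return [query]  # nothing to infer from the question
    # "report_year" is a hypothetical extra key for a downstream filter.
    return [{**query, "report_year": year} for year in years]
```

For the trend question above, this would turn one period-less query into five, one for each year from 2018 to 2022.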

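So what am I actually hoping to achieve with this very long (really long) post?

1. That the Python code I've shared helps you somewhere.

2. That you might give me some ideas or suggestions, for example:

How can I improve the query structuring I'm doing now? (FYI: I'm also using an LLM to generate sub-questions for each input question.)
How can I optimize my use of embeddings and the vector store, especially the filtering part?
Or anything else I haven't thought of :)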

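July 2024 update

The chatbot is not fully working yet, so if you go to the site and ask it a financial question right now, it won't be able to answer :D

Let's keep the curiosity gap open :)

Have you worked on a project involving SEC filings, or built a self-evaluating agent? I'd love to hear how you handled challenges like vector stores and metadata filtering.

Cheers,

Chandler

P.S. There's a new course I'm planning to take: the "Generative AI for Software Development Skill Certificate" on Coursera.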
