Skip to content
··閱讀時間5分鐘

整一個自我評估嘅金融Chatbot:一段穿越數據、Code同掙扎嘅旅程

我整咗一個金融chatbot,用SEC數據回答S&P 500問題——仲有一個self-critique agent喺你見到答案之前先improve佢。

更新(2026年): 下面嘅金融chatbot項目最終成為咗S&P 500 agent(MVP於2024年9月launch)。Ship咗之後,我將focus轉向Sydney做blog內容,同兩個新產品:STRATUM(marketing intelligence)同**DIALOGUE**(podcast generation)。呢度描述嘅SEC data pipeline同self-evaluation patterns影響咗佢哋全部。

問Sydney →


2024年4月嘅原文保留如下作為context。

喺呢篇blog post入面,我想分享我做一個fascinating hobby project嘅經驗——build一個有self-evaluation能力嘅金融chatbot/agent。呢個項目嘅目標係create一個agent,可以回答關於美國S&P 500公司嘅財務狀況同趨勢嘅問題,直接用官方SEC filing嘅數據嚟確保準確度同避免hallucination。呢個agent嘅主要好處係(或者將會係 :P):

  • 俾你問關於美國S&P 500公司嘅財務狀況同趨勢嘅問題。
  • 佢直接用呢啲公司向SEC提交嘅官方財務filing數據,嚟避免hallucination
  • 佢喺每個答案下面提供詳細嘅references,等你可以fact-check
  • Database有過去5-10年嘅financial filing數據,所以你可以問agent reason about 隨時間變化嘅趨勢
  • 喺每個答案返回俾你之前,有另一個agent,佢嘅工作係critique LLM嘅draft答案同propose點樣improve
  • 呢啲suggestions同成個context然後分享俾original agent。Agent可以決定formulate不同嘅queriesretrieve更好嘅information,或者simply incorporate啲suggestions入最終答案
  • 呢個最終答案然後提供俾用戶。

當然,我冇access到Bloomberg terminal,所以我唔知Bloomberg chatbot係咪已經可以做到以上所有。(我嘅educated guess係可以——某程度上。Self-critique同revision部分有幾好,我唔太肯定但好想知 :P)

Anyway,我想試吓因為佢feel complex enough for my learning at this stage同佢potentially有用。

我想回答嘅問題例如:

  • Apple過去5-10年嘅marketing spend趨勢係點?
  • XYZ公司過去5年所有major acquisitions係乜?
  • 比較Nvidia同Microsoft過去5年嘅R&D spend
  • 等等。

Ok咁我呢個project而家去到邊?學到咗乜?有咩struggles?

我點樣從SEC拎到數據?

我面對嘅初始challenges之一係從SEC獲取必要嘅數據。雖然SEC提供咗access EDGAR數據嘅guides同documentation(同呢度),但要理解大規模download每間公司嘅financial filings嘅過程需要啲時間。

有冇其他方法讓我唔使自己download同process SEC嘅filings就proceed到下一步?

我睇咗Langchain嘅documentation,搵到一個叫Kay.ai嘅financial retriever。我test咗呢個retriever睇吓rest of the workflow可能點work。Retriever對basic queries work as expected。但佢唔support asynchronous calls或advanced metadata filtering。所以我決定繼續explore自己做呢部分。

分享實際嘅Python scripts

從SEC EDGAR download financial filings嘅Python script

經過大量trials and errors,加上chatGPT嘅幫助,呢個係自動由美國證券交易委員會(SEC)database download XBRL同TXT filings嘅code。佢設計嚟用Central Index Key(CIK)fetch指定公司嘅recent filings。

點解我想download .zip文件同.txt文件?

每間公司嘅.txt文件超comprehensive。佢有好多valuable metadata關於每個filing,例如form type(10K或10Q)、reporting period、filed date、company CIK/name等等。呢啲係我之後build agent需要嘅metadata類型。所有呢啲都整齊咁capture喺文件頂部:

<SEC-DOCUMENT>0000320193-19-000119.txt : 20191031
<SEC-HEADER>0000320193-19-000119.hdr.sgml : 20191031
<ACCEPTANCE-DATETIME>20191030181236
ACCESSION NUMBER:		0000320193-19-000119
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		96
CONFORMED PERIOD OF REPORT:	20190928
FILED AS OF DATE:		20191031
DATE AS OF CHANGE:		20191030

FILER:

	COMPANY DATA:
		COMPANY CONFORMED NAME:			Apple Inc.
		CENTRAL INDEX KEY:			0000320193
		STANDARD INDUSTRIAL CLASSIFICATION:	ELECTRONIC COMPUTERS [3571]
		IRS NUMBER:				942404110
		STATE OF INCORPORATION:			CA
		FISCAL YEAR END:			0928

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-36743
		FILM NUMBER:		191181423

	BUSINESS ADDRESS:
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014
		BUSINESS PHONE:		(408) 996-1010

	MAIL ADDRESS:
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014

	FORMER COMPANY:
		FORMER CONFORMED NAME:	APPLE INC
		DATE OF NAME CHANGE:	20070109

	FORMER COMPANY:
		FORMER CONFORMED NAME:	APPLE COMPUTER INC
		DATE OF NAME CHANGE:	19970808
</SEC-HEADER>

.txt文件仲有好多其他數據,除咗核心financial report。因為咁,每個文件都好大(好似10+ MB甚至40+ MB)。而一間公司,我想download過去5-10年嘅所有10K同10Q,所以每間公司輕易就20+個文件。Try to process/clean up不必要嘅characters/chunk呢啲大文件係極度低效嘅,因為用普通laptop會花好耐時間,而且embedding嘅cost會係astronomical。所以我要搵另一個方法。

呢度就係.zip文件嘅用處。喺每個.zip文件入面,你會搵到.htm格式嘅核心financial report同其他內容。問題係呢啲核心financial reports喺唔同年份、唔同公司之間冇任何consistent嘅命名方式。而且.htm report冇像.txt文件咁整齊嘅valuable metadata。

將metadata同主要financial report結合

呢個係自動化由多間公司嘅filing中extract financial statements同associated metadata嘅script。佢設計嚟handle由SEC EDGAR database download嘅文件,包括.zip同.txt文件。

你可以見到我做咗幾個assumptions:

  • 你download咗每個financial report嘅.txt同.zip版本
  • 核心financial report係每個.zip文件入面最大嘅.htm文件。呢個「should」係true因為其他content係由主要「.htm」文件extract出嚟嘅

Clean up不必要嘅內容/字元before chunking

好消息係用以上方法,每個report(10K或10Q)而家size不到3 MB。但仲係太長同包含太多我哋唔需要嘅資訊,所以我哋需要喺chunking前再clean up。否則embedding process會run好耐同cost好多錢。想像一下,如果你每個report花$0.1嚟embedding,嗰已經係每間公司約$2.5(過去5年嘅10K同10Q reports)。如果你想cover大部分S&P 500或者extend到過去10年,好快就會加起嚟。

所以呢個係做cleaning嘅script。Process完之後,每個output不到0.2 Mb,細咗10倍。每個文件仲有我哋之前講嘅所有valuable metadata。

而家,我哋準備好做chunking/embedding步驟喇。

我應該用邊個vector store?

呢度有好多options(超過50個vector store選擇)。但因為我喺下一步需要用metadata做advanced filtering,self-querying retriever似乎fit the bill。我講「seem」因為佢哋有disadvantages。我到目前為止試過幾個options,包括Chroma、ElasticSearch同FAISS。

雖然Chroma同ElasticSearch提供robust功能,佢哋嘅index sizes相對較大(Chroma 550+ MB同ElasticSearch 800+ MB)。呢啲indexes只包括5間測試公司嘅embeddings。呢個唔好因為當我scale out到其餘嘅S&P 500,最終index size可以大100倍。同樣,唔適合我嘅local laptop :D

FAISS index,另一方面,同樣5間測試公司只有約200+ MB。但FAISS缺少好多advanced filtering/native integration with Langchain,所以我需要investigate更多。如果你有任何建議,話俾我知。(進一步explore alternative vector stores好似Weaviate或Pinecone可能有幫助?)

Query structuring

雖然include好多metadata入每個filing/chunk同將佢哋作為embeddings嘅一部分儲存喺vector store係好,但我哋點話俾機器知用邊啲filters?Lance from Langchain喺呢個video入面解釋咗「Query structuring for metadata filters」。

我跟住呢個approach create咗下面嘅pydantic object。呢個object包含同financial filings入面實際metadata tags對應嘅fields,例如form types(10-K, 10-Q)、reporting periods、filing dates同company identifiers。

import datetime
from typing import Optional
from pydantic import BaseModel, Field

class FinancialFilingsSearch(BaseModel):
    """Search over a database of financial filings for a company, using the actual metadata tags from the filings."""

    content_search: str = Field(
        ...,
        description="Similarity search query applied to the content of the financial filings with the SEC.",
    )
    conformed_submission_type: str = Field(
        None,
        description="Filter for the type of the SEC filing, such as 10-K (annual report) or 10-Q (quarterly report). ",
    )
    conformed_period_of_report: Optional[datetime.date] = Field(
        None,
        description= "Filter for the end date (format: YYYYMMDD) of the reporting period for the filing. For a 10-Q, it's the quarter-end date, and for a 10-K, it's the fiscal year-end date. ",
    )
    filed_as_of_date: Optional[datetime.date] = Field(
        None,
        description="Filter for the date (YYYYMMDD) on which the filing was officially submitted to the SEC. Only use if explicitly specified.",
    )
    # date_as_of_change: Optional[datetime.date] = Field(
    #     None,
    #     description="If any information in the filing was updated or amended after the initial filing date, this date reflects when those changes were made.",
    # )
    company_conformed_name: str = Field(
        None,
        description="Filter for official name of the company as registered with the SEC",
    )
    central_index_key: str = Field(
        None,
        description="Central Index Key (CIK): A unique identifier assigned by the SEC to all entities (companies, individuals, etc.) who file with the SEC.",
    )
    standard_industrial_classification: Optional[str] = Field(
        None,
        description="he Standard Industrial Classification Codes that appear in a company's disseminated EDGAR filings indicate the company's type of business. Only use if explicitly specified.",
    )
    # irs_number: Optional[str] = Field(
    #     None,
    #     description="IRS number to filter by.",
    # )
    # state_of_incorporation: Optional[str] = Field(
    #     None,
    #     description="State of incorporation to filter by.",
    # )
    # fiscal_year_end: Optional[str] = Field(
    #     None,
    #     description="The end date of the company's fiscal year, which is used for financial reporting and taxation purposes, like Dec 31 or Sep30",
    # )
    form_type: str = Field(
        None,
        description="Form type to filter by, such as 10-K or 10-Q.",
    )
    # sec_file_number: Optional[str] = Field(
    #     None,
    #     description="SEC file number to filter by.",
    # )
    # film_number: Optional[str] = Field(
    #     None,
    #     description="Film number to filter by.",
    # )
    # former_company: Optional[str] = Field(
    #     None,
    #     description="Former company name to filter by.",
    # )
    # former_conformed_name: Optional[str] = Field(
    #     None,
    #     description="Former conformed name to filter by.",
    # )
    # date_of_name_change: Optional[datetime.date] = Field(
    #     None,
    #     description="Date of name change to consider.",
    # )
# Set up language models
llm_35 = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)  # GPT-3.5 model
llm_4 = ChatOpenAI(model="gpt-4-turbo-2024-04-09", temperature=0)  # GPT-4 model for more complex tasks

from langchain_core.prompts import ChatPromptTemplate

system = """You are an expert at converting user questions into database queries. \
You have access to a vector store of financial filings from public companies to the SEC, for building LLM-powered application. \
Given a question, return a detailed database query optimized to retrieve the most relevant results. \
Be as detailed as  possible with your returned query, including all relevant fields and filters. \
Always include conformed_period_of_report. \

If there are acronyms or words you are not familiar with, do not try to rephrase them."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "\{question\}"),
    ]
)

# Assuming `llm` is your already initialized LLM instance
structured_llm = llm_35.with_structured_output(FinancialFilingsSearch)
query_analyzer = prompt | structured_llm

例如,呢個係questions list同LLM嘅responses。你可以見到LLM喺某啲case miss咗「conformed period of reporting」:

Question: What was Google's advertising and marketing spending in the 10-K report for the year 2018?
\{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'central_index_key': '0001652044'\}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2019?
\{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'central_index_key': '0001652044'\}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2020?
\{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'conformed_period_of_report': '2020'\}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2021?
\{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'conformed_period_of_report': '2021'\}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2022?
\{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'form_type': '10-K'\}
Question: How has Google's advertising and marketing spending trended from 2018 to 2022 according to 10-K filings?
\{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'form_type': '10-K'\}

Ok咁我希望用呢篇長(好長)嘅文章achieve到乜?

1. 分享嘅Python codes可以幫到你。

2. 你可以分享返你嘅thoughts/comments或者advise我:

  • 點樣improve我做緊嘅query structuring?FYI,我仲用緊LLM嚟generate同input question相關嘅sub-questions。
  • 點樣improve embedding/vector store嘅使用,特別係filtering部分?
  • 或者任何我冇諗到嘅其他suggestions :)

2024年7月更新

呢個chatbot仲未WORK到,所以如果你試用我網站上嘅chatbot問financial questions,佢唔識 :D

Mind the curiosity gap :)

你有冇做過SEC filings或者試過build self-evaluating agent?我好想聽你點approach vector store同metadata filtering嘅challenges。

祝好,

Chandler

P.S:有個新出嘅course我plan去讀「Generative AI for Software Development Skill Certificate」on Coursera。

繼續閱讀

我嘅旅程
聯繫
語言
偏好設定