In this blog post, I want to share my experience working on a fascinating hobby project—building a financial chatbot/agent with self-evaluation capabilities. This project aims to create an agent that can answer questions about financial conditions and trends across companies in the S&P 500 in the US, using data directly from official SEC filings to ensure accuracy and avoid hallucination. The main benefits of this agent are (or will be :P):
- Allow you to ask questions about financial conditions, and trends across companies in the S&P 500 in the US.
- It uses data directly from the official financial filings of these companies with the SEC, to avoid hallucination
- It provides detailed references below each answer so that you can fact-check the answer if you would like
- The database has the last 5-10 years of financial filing data so you can ask the agent to reason about trends over time
- Before each answer is returned to you, there is another agent, whose job is to critique the draft answer by the LLM and propose how to improve it
- These suggestions and the entire context are then shared with the original agent. The agent can then decide to formulate different queries to retrieve better information from the database or simply incorporate the suggestions into the final answer
- This final answer is then, provided to the user.
Obviously, I do not have access to Bloomberg terminal so I have no idea if the Bloomberg chatbot can already do all of the above. (My educated guess is that it can – to a certain extent. How good are the self-critique part and the revision, I am not too sure but would love to know :P)
Anyway, I wanted to give this a try since it feels complex enough for my learning at this stage and it can be potentially useful.
I want to answer questions like:
- What is Apple’s marketing spend trend over the past 5-10 years?
- What are all of the major acquisitions done by XYZ company over the past 5 years
- Compare the R&D spend of Nvidia and Microsoft over the past 5 years
- etc.
Ok so where I am with this project? What have I learned? What are the struggles?
Update Sep 2024
The MVP version of the S&P500 chatbot is ready. Check out the blog post “S&P500 Agent MVP Launched: Answering Financial Questions Grounded with SEC Data” for more details.
Table of Contents
- How can I get the data from the SEC?
- Sharing actual Python scripts
- Which vector store should I use?
- Ok so what I am hoping to achieve with this long (LONG) post?
- Update Jul 2024
- Share this with a friend
How can I get the data from the SEC?
One of the initial challenges I faced was obtaining the necessary data from the SEC. While the SEC provides guides and documentation on accessing EDGAR data (and here), it took some time to comprehend the process of downloading financial filings for each company at scale.
Is there another way for me to proceed to the next step without having to download and process the filings from the SEC myself?
I looked through Langchain documentation and found that there was a financial retriever called Kay.ai. I tested the retriever to see how the rest of the workflow may work. The retriever works as expected for basic queries. However, it doesn’t support asynchronous calls or advanced metadata filtering. So I decided to continue to explore doing this part by myself.
Sharing actual Python scripts
Python script to download financial filings from SEC EDGAR
After lots of trials and errors, with the help of chatGPT, this is the code that automatically downloads XBRL and TXT filings from the U.S. Securities and Exchange Commission (SEC) database. It is designed to fetch recent filings for specified companies using their Central Index Key (CIK).
Why do I want to download both the .zip file and the .txt file?
The .txt file for each company is super comprehensive. It has a lot of valuable metadata about each filing like the form type (10K or 10Q), the reporting period, filed date, company CIK/name, etc… These are the types of metadata I will need later to build the agent. All of these are neatly captured at the top of the file like:
<SEC-DOCUMENT>0000320193-19-000119.txt : 20191031
<SEC-HEADER>0000320193-19-000119.hdr.sgml : 20191031
<ACCEPTANCE-DATETIME>20191030181236
ACCESSION NUMBER: 0000320193-19-000119
CONFORMED SUBMISSION TYPE: 10-K
PUBLIC DOCUMENT COUNT: 96
CONFORMED PERIOD OF REPORT: 20190928
FILED AS OF DATE: 20191031
DATE AS OF CHANGE: 20191030
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: Apple Inc.
CENTRAL INDEX KEY: 0000320193
STANDARD INDUSTRIAL CLASSIFICATION: ELECTRONIC COMPUTERS [3571]
IRS NUMBER: 942404110
STATE OF INCORPORATION: CA
FISCAL YEAR END: 0928
FILING VALUES:
FORM TYPE: 10-K
SEC ACT: 1934 Act
SEC FILE NUMBER: 001-36743
FILM NUMBER: 191181423
BUSINESS ADDRESS:
STREET 1: ONE APPLE PARK WAY
CITY: CUPERTINO
STATE: CA
ZIP: 95014
BUSINESS PHONE: (408) 996-1010
MAIL ADDRESS:
STREET 1: ONE APPLE PARK WAY
CITY: CUPERTINO
STATE: CA
ZIP: 95014
FORMER COMPANY:
FORMER CONFORMED NAME: APPLE INC
DATE OF NAME CHANGE: 20070109
FORMER COMPANY:
FORMER CONFORMED NAME: APPLE COMPUTER INC
DATE OF NAME CHANGE: 19970808
</SEC-HEADER>
The .txt file has a ton of other data too, besides the core financial report. Because of that, each file is very big (like 10+ MB or even 40+ MB). And for one company, I want to download all of the 10K and 10Q over the past 5 – 10 years so we are talking about easily 20+ files per company. Trying to process/clean up unnecessary characters/chunk these big files is highly unproductive because it will take a long time to do with a normal laptop and also the cost of embedding will be astronomical. So I needed to find another way.
This is where the .zip file comes in. In each of the .zip files, you will find the core financial report in .htm format and other contents. The issue is that these core financial reports are not named in any consistent manner over the years, across companies. And the .htm report doesn’t have all of the valuable meta in a neat format like the .txt file.
Combining the metadata with the main financial report
This is the script that automates the process of extracting financial statements and their associated metadata from the filings of multiple companies. It is designed to handle the files downloaded from the SEC EDGAR database, which includes both .zip and .txt files.
You can see that I made a few assumptions:
- You download both the .txt and the .zip versions for each financial report
- The core financial report is the largest .htm file inside each .zip file. This “should” be true since the other content is extracted from the main “.htm” file
Clean up unnecessary content/characters before chunking
The good news is that using the above approach, each report (either 10K or 10Q) now has a size of less than 3 MB only. But it is still way too long and contains so much information that we do not need so we need to clean it up even more before chunking. Otherwise, the embedding process will run very long and cost a lot of money. Imagine, if you just spend $0.1 per report for embedding, that is already about $2.5/company for 10K and 10Q reports over the past 5 years. If you want to cover most of the S&P 500 or extend the period to the last 10 years, it adds up very quickly.
So here is the script to do the cleaning. After the process, each output is less than 0.2 Mb, a 10x smaller. Each file still has all of the valuable metadata that we talked about earlier.
Now, we are ready for the chunking/embedding step.
Which vector store should I use?
There are many options here (with more than 50 vector store options). But because I need to do advanced filtering in the next step using metadata, a self-querying retriever seems to fit the bill. I say “seem” because they come with disadvantages. I experimented with several options, including Chroma, ElasticSearch and FAISS so far.
While Chroma and ElasticSearch provided robust functionality, their index sizes were relatively large (550+ MB for Chroma and 800+ Mb for ElasticSearch). These indexes only include the embeddings for 5 test companies ONLY. This is not good because as I scale it out to the rest of the S&P 500, the final index size can be 100x larger. Again, not suitable for my local laptop 😀
FAISS index, on the other hand, is only about 200+ MB for the same 5 test companies. FAISS lacks a lot of the advanced filtering/native integration with Langchain though, so I need to investigate more. If you have any suggestions, please let me know via the comment below. (Further exploration of alternative vector stores like Weaviate or Pinecone might be beneficial?)
Query structuring
While it is good that I include a lot of metadata into each filing/chunk and have them as part of the embeddings, stored in the vector store, how do we tell the machine which filters to use? Lance from Langchain explained “Query structuring for metadata filters” in this video.
I followed the approach and created the pydantic object below. This object includes fields that correspond to actual metadata tags found in financial filings, such as form types (10-K, 10-Q), reporting periods, filing dates, and company identifiers.
import datetime
from typing import Optional
from pydantic import BaseModel, Field
class FinancialFilingsSearch(BaseModel):
"""Search over a database of financial filings for a company, using the actual metadata tags from the filings."""
content_search: str = Field(
...,
description="Similarity search query applied to the content of the financial filings with the SEC.",
)
conformed_submission_type: str = Field(
None,
description="Filter for the type of the SEC filing, such as 10-K (annual report) or 10-Q (quarterly report). ",
)
conformed_period_of_report: Optional[datetime.date] = Field(
None,
description= "Filter for the end date (format: YYYYMMDD) of the reporting period for the filing. For a 10-Q, it's the quarter-end date, and for a 10-K, it's the fiscal year-end date. ",
)
filed_as_of_date: Optional[datetime.date] = Field(
None,
description="Filter for the date (YYYYMMDD) on which the filing was officially submitted to the SEC. Only use if explicitly specified.",
)
# date_as_of_change: Optional[datetime.date] = Field(
# None,
# description="If any information in the filing was updated or amended after the initial filing date, this date reflects when those changes were made.",
# )
company_conformed_name: str = Field(
None,
description="Filter for official name of the company as registered with the SEC",
)
central_index_key: str = Field(
None,
description="Central Index Key (CIK): A unique identifier assigned by the SEC to all entities (companies, individuals, etc.) who file with the SEC.",
)
standard_industrial_classification: Optional[str] = Field(
None,
description="he Standard Industrial Classification Codes that appear in a company's disseminated EDGAR filings indicate the company's type of business. Only use if explicitly specified.",
)
# irs_number: Optional[str] = Field(
# None,
# description="IRS number to filter by.",
# )
# state_of_incorporation: Optional[str] = Field(
# None,
# description="State of incorporation to filter by.",
# )
# fiscal_year_end: Optional[str] = Field(
# None,
# description="The end date of the company's fiscal year, which is used for financial reporting and taxation purposes, like Dec 31 or Sep30",
# )
form_type: str = Field(
None,
description="Form type to filter by, such as 10-K or 10-Q.",
)
# sec_file_number: Optional[str] = Field(
# None,
# description="SEC file number to filter by.",
# )
# film_number: Optional[str] = Field(
# None,
# description="Film number to filter by.",
# )
# former_company: Optional[str] = Field(
# None,
# description="Former company name to filter by.",
# )
# former_conformed_name: Optional[str] = Field(
# None,
# description="Former conformed name to filter by.",
# )
# date_of_name_change: Optional[datetime.date] = Field(
# None,
# description="Date of name change to consider.",
# )
However, the output from the LLM has not been consistent, using “with_structured_output” from Langchain. Below is how I set up the query structuring. For the same query, with the change in year, the output sometimes includes the reporting period, sometimes it doesn’t.
# Set up language models
llm_35 = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0) # GPT-3.5 model
llm_4 = ChatOpenAI(model="gpt-4-turbo-2024-04-09", temperature=0) # GPT-4 model for more complex tasks
from langchain_core.prompts import ChatPromptTemplate
system = """You are an expert at converting user questions into database queries. \
You have access to a vector store of financial filings from public companies to the SEC, for building LLM-powered application. \
Given a question, return a detailed database query optimized to retrieve the most relevant results. \
Be as detailed as possible with your returned query, including all relevant fields and filters. \
Always include conformed_period_of_report. \
If there are acronyms or words you are not familiar with, do not try to rephrase them."""
prompt = ChatPromptTemplate.from_messages(
[
("system", system),
("human", "{question}"),
]
)
# Assuming `llm` is your already initialized LLM instance
structured_llm = llm_35.with_structured_output(FinancialFilingsSearch)
query_analyzer = prompt | structured_llm
For example, this is the list of questions and the responses from the LLM. You can see that the LLM misses the “conformed period of reporting” in some cases:
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2018?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'central_index_key': '0001652044'}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2019?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'central_index_key': '0001652044'}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2020?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'conformed_period_of_report': '2020'}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2021?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'conformed_submission_type': '10-K', 'form_type': '10-K', 'conformed_period_of_report': '2021'}
Question: What was Google's advertising and marketing spending in the 10-K report for the year 2022?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'form_type': '10-K'}
Question: How has Google's advertising and marketing spending trended from 2018 to 2022 according to 10-K filings?
{'content_search': 'advertising and marketing spending', 'company_conformed_name': 'Alphabet Inc.', 'form_type': '10-K'}
Ok so what I am hoping to achieve with this long (LONG) post?
1. The shared Python codes help you in any way.
2. You can share back your thoughts/comments or advise me on:
- How to improve the query structuring that I am doing? FYI, I am also using the LLM to generate sub-questions related to an input question.
- An ask to the Langchain team is to make this function “with_structured_output” output to be more consistent. (I know it is a beta feature 😛 for now)
- How can I improve the embedding/usage of the vector store, especially the filtering part?
- or any other suggestions that I have thought of 🙂
Update Jul 2024
This chatbot is NOT working yet so if you try to use the current chatbot on my site and ask financial questions, it doesn’t know 🙂
Mind the curiosity gap
Chandler
P.S: there is a newly release course that I am planning to take “Generative AI for Software Development Skill Certificate” on Coursera.
Share this with a friend
If you enjoyed this article and found it valuable, I’d greatly appreciate it if you could share it with your friends or anyone else who might be interested in this topic. Simply send them the link to this post, or share it on your favorite social media platforms. Your support helps me reach more readers and continue providing valuable content.
[jetpack_subscription_form]
Hi,
Thanks a lot for your work. I’m new to this area. After following your steps below, can you elaborate on “Run the desired script using Python 3.11.”?
Usage
Clone this repository to your local machine.
Modify the BASE_DIR in the scripts to the path where you want to save the filings.
Optionally, update the COMPANY_CIKS dictionary in each script to include the CIKs of the companies you’re interested in.
Run the desired script using Python 3.11.
Quick Start
python sec-filings-downloader.py # Download filings python sec-metadata-processor.py # Process and extract data python financial_filings_htm_cleaner.py # Clean HTML filings
Hi Walter,
It simply means the code in the git repo is compatible with Python 3.11+ version.
Hi Chandler,
I’m a package developer trying to make my package better for RAG. The package converts SEC 10k filings into XML trees where each node is a section (e.g. part, item, or company designated header).
Would this package be useful for your chatbot?
Hi John, thanks for reaching out. I am not sure what you mean by “Would this package be useful for your chatbot?”. I will need to chunk the 10K and 10Q and then generate embeddings for them, store them in a vector store. The issue I am facing is the size of the vector store, given the 500 companies with the last 10 years of 10K and 10Q.
hey thank you for your work, but when you want to go deeper withe the older filings Not all of them available in both XBRL and TXT formats, Missing formats may limit the completeness of your data.
hey Sarra, yeah, I already completed the data cleaning and extraction for all older filings with all of the meta data that I want so I am good there.
hey i’m working on the same project and i need some help can i aske you some question please
sure Sarra. Feel free to ask your questions here. If I can’t answer them, perhaps someone in the community may be able to.