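An upgrade to my current chatbot
I upgraded my chatbot, Sydney, to test Weaviate's hybrid search and query structuring, two capabilities I had to get working before scaling my financial chatbot to 500+ companies.
Update (2026): Sydney has evolved a lot since this Weaviate experiment. After testing FAISS, Weaviate, and other options, I ultimately chose Supabase pgvector for the production version. Sydney now runs on Claude and supports streaming responses and tool calling. The hybrid search and query structuring ideas described below directly shaped how Sydney works today.
The original post from May 2024 is preserved below for reference.
As I mentioned in my previous post, I am building a self-evaluating Financial Chatbot. So why take a detour to upgrade my current chatbot, Sydney? There are a few main reasons: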
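- Test the Weaviate vector store end to end:
  - I want to use Weaviate as the main vector store for the financial chatbot, but I am running into scaling and memory-size issues. With the last 10 years of 10-K and 10-Q filings for only about 20 companies, the Weaviate collection is already 2 GB. Scaling this approach to the whole S&P 500 would push the collection past 50 GB. That is too large, and it would be expensive to run and maintain. So I am testing different Product Quantization (PQ) parameters.
  - After trying several vector stores, I still lean toward Weaviate because it is fast at "metadata filters + hybrid search".
  - While working on PQ in Weaviate, I also ran into deployment questions: in production, should I use Weaviate Cloud or self-host on AWS/Google Cloud? How complex is the deployment? What will it cost?
  - Because of these questions, I decided to first deploy my current chatbot, Sydney, on Weaviate. In other words, I replaced FAISS with Weaviate.
    - The current version uses Weaviate Cloud.
- Implement query translation and query structuring, and make sure the output is usable JSON.
  - The goal of query translation: break the input question into sub-questions/sub-tasks that can be answered independently.
  - The goal of query structuring: I care not only about generating the search terms/phrases for hybrid search, but also about generating the correct metadata filters.
    - This is critical because a financial chatbot must filter correctly by year, industry, and so on when needed.
- Run hybrid search with multiple filters and return both the content and the metadata.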
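To make the "filters + hybrid search" idea concrete, here is a minimal in-memory sketch. It is not Weaviate's API and not Sydney's actual code: the `Post` class, the sample posts, the example.com URLs, and both score columns are made up for illustration. It filters on a date range first, then ranks by a blend of a normalized keyword score and a normalized vector score, where `alpha=1.0` means vector-only and `alpha=0.0` means keyword-only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    title: str
    url: str
    published: date
    keyword_score: float  # stand-in for a BM25-style score (made up)
    vector_score: float   # stand-in for a cosine similarity (made up)


def _normalize(values):
    """Min-max scale scores to [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_search(posts, start, end, alpha=0.5, limit=3):
    """Apply the metadata filter first, then rank by the blended score."""
    candidates = [p for p in posts if start <= p.published <= end]
    if not candidates:
        return []
    kw = _normalize([p.keyword_score for p in candidates])
    vec = _normalize([p.vector_score for p in candidates])
    scored = [
        (alpha * v + (1 - alpha) * k, p)
        for k, v, p in zip(kw, vec, candidates)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Return content plus metadata, so answers can link back to the source.
    return [
        {
            "title": p.title,
            "url": p.url,
            "published": p.published.isoformat(),
            "score": round(s, 3),
        }
        for s, p in scored[:limit]
    ]


# Hypothetical sample data, not real search scores.
POSTS = [
    Post("Notes on Ray Dalio", "https://example.com/dalio", date(2020, 6, 1), 4.0, 0.90),
    Post("Health Savings Accounts", "https://example.com/hsa", date(2022, 3, 15), 1.0, 0.40),
    Post("Ray Dalio revisited", "https://example.com/dalio-2", date(2021, 2, 10), 3.0, 0.95),
    Post("Old post", "https://example.com/old", date(2015, 1, 1), 5.0, 0.99),
]

results = hybrid_search(POSTS, date(2020, 1, 1), date(2022, 12, 31), alpha=0.5)
for r in results:
    print(r["published"], r["title"], r["score"])
```

Note that "Old post" has the highest raw scores but never appears, because the date filter removes it before ranking. That is the behavior I care about for questions like "in 2020" or "between 2020 and now".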
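As you can see, all of these capabilities are must-haves for the financial chatbot, so it makes perfect sense to validate them first on the much smaller Sydney :)
I am happy to say you can try Sydney right now. It already has the capabilities above. You can ask questions like the ones below, and the chatbot should return relevant answers with links to the source blog posts.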
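For the scaling concern above, the back-of-the-envelope arithmetic looks roughly like this. The linear extrapolation matches the numbers in this post (2 GB for about 20 companies). The embedding dimension (1536), float32 storage, and the PQ segment count (256) are assumptions I picked for illustration, not my actual settings, and PQ only shrinks the vector portion of a collection, so the overall saving depends on how much of the 2 GB is vectors.

```python
def projected_size_gb(sample_size_gb, sample_companies, target_companies):
    """Linear extrapolation: assume size grows in proportion to company count."""
    return sample_size_gb * target_companies / sample_companies


def raw_vector_bytes(dimensions, bytes_per_float=4):
    """Bytes for one uncompressed vector (float32 by default)."""
    return dimensions * bytes_per_float


def pq_vector_bytes(segments, bits_per_code=8):
    """Bytes for one PQ-compressed vector: one code per segment."""
    return segments * bits_per_code // 8


full_size = projected_size_gb(2.0, 20, 500)
print(f"Projected S&P 500 collection: {full_size} GB")

# Assumed values for illustration only.
DIMENSIONS = 1536
SEGMENTS = 256

ratio = raw_vector_bytes(DIMENSIONS) / pq_vector_bytes(SEGMENTS)
print(f"Per-vector compression with these assumptions: {ratio}x")
```

Fewer segments compress harder but usually cost recall, which is why I am testing several PQ parameter combinations rather than picking one up front.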
- What did chandler write about Kevin Rudd in 2020?
- Tell me everything that Chandler wrote about Ray Dalio between 2020 until now
- what did chandler write about Health Savings Accounts in 2022?
- What did chandler do in 2015?
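That's all for now. I need to get back to working on the financial chatbot :P
If you have also worked with Weaviate, or with hybrid search using metadata filters, I would love to hear which practices have worked for you.
Cheers,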
Chandler
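Update, September 2024
Sydney is now a multi-capability agent that can:
- Answer questions about current S&P 500 companies, including what they have filed with the SEC over the past 10 years.
- Provide insights from my blog content over the past 15 years.
Check it out here.
P.S.: Below are sample prompts I use for query translation and query structuring.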
"You are a helpful assistant that generates multiple sub-questions related to an input question. "
"The current year is 2024."
"The goal is to break down the input into a set of specific sub-problems / sub-questions that can be answered in isolation. "
"Each specific sub-question will be used to retrieve relevant content from a vector store, using similarity search with score. "
"Phrase the wording of the questions appropriately for this purpose.\n"
"This vector store includes all of the published blog posts from Chandler Nguyen's blog from 2007 to 2024.\n\n"
"Original question: {query}\n\n"
"Generate the minimum number of sub-questions necessary to answer the original question. "
"Your response should be formatted as a JSON array of strings, where each string represents a sub-question. "
"Do not include any additional words, characters, or explanations in the response.\n\n"
"Example response:\n"
'[\n'
' "sub-question 1",\n'
' "sub-question 2"\n'
']'
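Since the prompt above asks for a JSON array of strings, the reply still has to be checked before it is used. The following is a minimal sketch of that check, not Sydney's actual code; the function name `parse_sub_questions` is hypothetical. It tolerates stray text around the array by cutting from the first "[" to the last "]", which is a design choice of this sketch.

```python
import json


def parse_sub_questions(raw: str) -> list[str]:
    """Return the sub-questions from a model reply.

    Raises ValueError if the reply does not contain a JSON array of strings.
    """
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not all(isinstance(q, str) for q in data):
        raise ValueError("every sub-question must be a string")
    # Drop empty strings so they never reach the vector store.
    return [q.strip() for q in data if q.strip()]


reply = '[\n "sub-question 1",\n "sub-question 2"\n]'
print(parse_sub_questions(reply))
```

When parsing fails, one option is to retry the model call once and otherwise fall back to using the original question as the only sub-question.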
"""You are a helpful assistant that generates a structured query related to an input question.
The goal is to break down the input into a structured query that can be used to retrieve relevant content from a vector store, using similarity search with score.
This vector store includes all of the published blog posts from Chandler Nguyen's blog from 2007 to 2024.
Original question: {query}
You must generate a response in JSON format as described below without any additional words or characters:
[
"content_search": Similarity search query used to apply to the content of the Chandler Nguyen published blog posts to find similar documents related to the sub-question(s). Ensure the content_search query is not too broad or too specific, and strikes a balance between relevance and completeness. \n
"start_date": optional field, the start date to search for blog posts that are relevant to the sub-question(s) in YYYY-MM-DD format. If the sub-question(s) do not specify a time frame, leave this field blank or set it to the earliest possible date (e.g., 2007-01-01) to cover a broader range. \n
"end_date": optional field, the end date to search for blog posts that are relevant to the sub-question(s) in YYYY-MM-DD format. If the sub-question(s) do not specify a time frame, leave this field blank or set it to the latest possible date (e.g., 2024-12-31) to cover a broader range. \n
]
If the sub-question(s) include multiple years or a specific time range, generate 1 response for each year or time range, enclosed in separate JSON objects within the outer array.
Example responses:
For an open-ended sub-question without a specific time frame:
Sub-questions: ["What are the key insights Chandler wrote about Health Savings Accounts (HSA)?"]
[
"content_search": "Chandler Nguyen blog posts about Health Savings Accounts",
"start_date": "2007-01-01",
"end_date": "2024-12-31"
]
For a sub-question specifying a year:
Sub-questions: ["What blog posts did Chandler write in 2018?", "Which blog posts written by Chandler in 2018 mention Kevin Rudd?"]
[
"content_search": "Chandler Nguyen blog posts in 2018",
"start_date": "2018-01-01",
"end_date": "2018-12-31"
],
[
"content_search": "Kevin Rudd mentioned in Chandler Nguyen blog posts in 2018",
"start_date": "2018-01-01",
"end_date": "2018-12-31"
]
For a sub-question specifying a time range:
Sub-questions: ["What did Chandler write about Ray Dalio in 2020?", "What did Chandler write about Ray Dalio in 2021?", "What did Chandler write about Ray Dalio in 2022?"]
[
"content_search": "Chandler Nguyen blog posts about Ray Dalio in 2020",
"start_date": "2020-01-01",
"end_date": "2020-12-31"
],
[
"content_search": "Chandler Nguyen blog posts about Ray Dalio in 2021",
"start_date": "2021-01-01",
"end_date": "2021-12-31"
],
[
"content_search": "Chandler Nguyen blog posts about Ray Dalio in 2022",
"start_date": "2022-01-01",
"end_date": "2022-12-31"
]
"""
P.P.S.: I know the front end is still slow, so I probably need to keep brushing up on front-end development.





