How I Climbed Out of the Coding Quicksand with AI Agents
I dove into building a chatbot with zero coding skills and plenty of enthusiasm, only to discover that my v0.1 was a disaster of a CSV database and primitive chunking, until AI agents pulled me out.
Update (2026): this chatbot evolved into Sydney! After many iterations, Sydney now lives at /ask/, focused on blog content and products.
In early November last year, I released version 0.1 of my DIY chatbot. At the time I wrote: "While v0.1 was an important first step for a coding newbie, it has significant limitations." That turned out to be an understatement. I didn't appreciate back then just how clunky the whole process, and my build, really were. In reality, my first attempt, however sincere, was more of a prototype cobbled together with enthusiasm but limited know-how.
This post is more than a follow-up; it's a deep dive into the journey that unfolded from that point, a journey full of trial, error, and invaluable lessons. I'm laying out all the nuts and bolts, not just for transparency, but in the hope that my experience, however detailed, resonates with, or even helps, someone walking a similar path. (For context, I'm a middle-aged advertising professional with zero coding experience.)
As mentioned before, to launch chatbot v0.1 I mainly followed the short course "Building Systems with the ChatGPT API" and two OpenAI cookbooks: Question answering using embeddings-based search and How to count tokens with tiktoken.
Here's why my chatbot v0.1 was so terrible:
- Chunking: I simply split long blog posts into smaller pieces based on token length, i.e. the most primitive static chunking there is :D (see the sketch after this list). If you want to understand why that's a terrible idea, check out Greg Kamradt's 5 levels of text splitting.
- Embedding: I struggled with embeddings. I used OpenAI's embedding model but kept hitting API request limits, which made the embedding process fail midway. I later learned to batch the requests and add a timeout between batches to stay under the limits. Finally, I stored the generated embeddings in a plain .csv file as my makeshift "database".
- Database: I knew a CSV wasn't an optimal database, but I lacked the skills to use better alternatives.
- Metadata: I didn't initially know how important it is to include metadata like publish dates and post URLs for the chatbot to answer user questions accurately. I had to redo the embedding and saving to incorporate the relevant metadata.
- Retriever: I didn't know there were different retriever types and algorithms. I simply used OpenAI's relevance search to retrieve a hardcoded number of results.
- Memory: to hold a conversation, the chatbot needs to remember what the user said earlier. And this is where gpt-3.5's limited context window length (at the time) forced an obvious trade-off between chunk size and how many results you retrieve.
- For example, if your chunk size is 800 tokens and the retriever returns the top 8 results, that's 6,400 tokens, or more than 50% of the old model's limit.
- And that's just one question, so you can imagine how quickly memory fills up in a multi-turn conversation.
- One way around this is to use a smaller chunk size and have the retriever return fewer results, but with a basic retriever (see above) that means the model doesn't have comprehensive enough information to answer the question.
- I didn't even use an IDE. All the code was edited in Mac's TextEdit :D (did I mention I'm a noob? :P)
- I could go on, but I think you get the picture.
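For the curious, the naive chunking from the first bullet boiled down to something like this. It's a minimal reconstruction for illustration (not my actual v0.1 code), using tiktoken to cut fixed-size token windows:

import tiktoken

def naive_chunk(text: str, max_tokens: int = 800) -> list[str]:
    # Fixed-size token windows that ignore sentence and paragraph
    # boundaries entirely -- the "static chunking" v0.1 relied on
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]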
My "Valley of Death"
Eager to fix v0.1's limitations, I tried several online courses, hoping they would fill in the missing pieces and level up my skills. But progress only led to more dead ends.
I struggled through exercises on vector databases (Vector Databases: from Embeddings to Applications with Weaviate), evaluating RAG methods (Building and Evaluating Advanced RAG Applications with LlamaIndex and TruEra), and advanced retrieval techniques (Advanced Retrieval for AI with Chroma). No matter how hard I tried, I couldn't connect the theory to a practical application on my own blog data.
Were the courses badly designed? No. The shortcoming was my own lack of underlying knowledge. But failing again and again was massively frustrating, not to mention demoralizing. I found myself in a figurative valley, with no idea how to move forward.
Incremental Small Wins
Still, the repeated failures did yield some nuggets of value:
- Switching from TextEdit to VS Code
- Making good use of GitHub Copilot extensions
- Coming to appreciate Jupyter Notebook as a development environment
The last of those courses mentioned LangChain, a popular new framework for building chatbots. I had actually tried LangChain's tutorials months earlier ("LangChain: Chat with Your Data" and "Functions, Tools and Agents with LangChain") without much luck. But now, armed with that hard-won knowledge, rereading its docs was genuinely illuminating. The concepts clicked into place, and its modular architecture made intuitive sense.
I could envision how to adapt LangChain's robust capabilities to my passion project. Finally, a path forward emerged! Step by step, I oriented myself around its pipelines for data ingestion, embedding, storage, and retrieval.
My confidence grew with every piece I implemented. V2 began to take shape...
Rebuilding the Foundation
With LangChain as my guide, I set about rebuilding my chatbot from the ground up:
Ingesting the WordPress Export as JSON
After some experimenting and tweaking of the ingestion parameters, LangChain's JSONLoader parsed my exported posts correctly. The validated input could now feed the downstream pipelines. (The code for this step is in the "DataIngestionAndIndexing.ipynb" file in this public GitHub repo.)
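For reference, the loader call looks roughly like this. It's a minimal sketch: the file path, jq_schema, and field names below are illustrative and depend on how you convert your WordPress export to JSON (JSONLoader also needs the jq package installed):

from langchain_community.document_loaders import JSONLoader

# Hypothetical metadata extractor: copy the publish date and URL from each
# record into the Document's metadata (field names depend on your export)
def extract_metadata(record: dict, metadata: dict) -> dict:
    metadata["date"] = record.get("date")
    metadata["source"] = record.get("link")
    return metadata

loader = JSONLoader(
    file_path="blog_posts.json",  # assumed path to the converted export
    jq_schema=".posts[]",         # assumed structure: a top-level "posts" array
    content_key="content",        # assumed key holding the post body
    metadata_func=extract_metadata,
)
documents = loader.load()
print(f"Loaded {len(documents)} posts")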
Automating Text Splitting
My naive token-length chunking gave way to LangChain's SentenceTransformersTokenTextSplitter, which uses more advanced NLP to split along semantic units. No more sentences chopped off in the middle! The configuration keeps chunks appropriately sized for memory-constrained models.
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Define the token splitter with specific configurations
token_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=0,       # Overlap between chunks
    tokens_per_chunk=256   # Number of tokens per chunk
)

# Split the documents into chunks based on tokens
all_splits = token_splitter.split_documents(documents)
print(f"Total document splits: {len(all_splits)}")
Generating Embeddings and Indexes
My earlier struggles with OpenAI API limits vanished entirely with LangChain's wrapper for OpenAI embeddings. With two lines of code, the embeddings cleanly captured the salient features of the split text.
For the vector store, I chose FAISS (Facebook AI Similarity Search) over Weaviate or Chroma. Industry-tested FAISS struck the right balance between capability and complexity for my needs. Its CPU version indexed the blog chunks quickly, producing a compact, searchable database. I no longer had to worry about batching or hammering OpenAI's API with requests.
LangChain supports many vector stores, which you can check out here.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Initialize embeddings and FAISS vector store
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(all_splits, embeddings)

# Save the vector store locally
db.save_local("path/to/save/faiss_index")  # Placeholder for save path / index name
Two lines of code! That's it.
I also noticed that if you have more than a little content (I have 400+ blog posts), you shouldn't query the retriever immediately after embedding, because the indexes need some time to finish building and settle.
Evaluating Retrievers
Setting up the retriever takes just one line of code :D, with FAISS as the vector store.
retriever = db.as_retriever(search_type="mmr")  # "mmr" = maximal marginal relevance, balancing relevance with diversity
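To sanity-check it, you can fire a sample question at the retriever. The query and the "source" metadata key below are illustrative placeholders:

# Quick sanity check with an illustrative query
docs = retriever.get_relevant_documents("What is this blog about?")
for doc in docs:
    # "source" assumes the post URL was stored in the metadata
    print(doc.metadata.get("source"), "->", doc.page_content[:100])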
With my index and retriever in place, the data pipelines were ready to fuel an intelligent chatbot!
Architecting Conversational Agents
I decided to build the chatbot on LangChain's agent framework. Was that overkill at this stage? Yes. But I wanted to be able to evolve the chatbot over time and give it more "tools", a.k.a. capabilities. LangChain makes setting up an agent and handing it tools remarkably easy.
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad.openai_tools import format_to_openai_tool_messages
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser
from langchain.tools.retriever import create_retriever_tool
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load the saved FAISS index and expose it as a retriever
embeddings = OpenAIEmbeddings()
db = FAISS.load_local("path/to/your/faiss_index_file", embeddings)  # Replace with your actual FAISS index path
retriever = db.as_retriever(search_type="mmr")

# Wrap the retriever as a tool the agent can call
tool = create_retriever_tool(
    retriever,
    "search_your_blog",           # Replace with a descriptive name for your tool
    "Your tool description here"  # Provide a brief description of what your tool does
)
tools = [tool]

prompt_template = ChatPromptTemplate.from_messages([
    # Customize the prompt template for your chatbot's persona and requirements;
    # it needs placeholders for "chat_history", "input", and "agent_scratchpad"
])

llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0)
llm_with_tools = llm.bind_tools(tools)

# Wire input, scratchpad, and history into the prompt, then parse tool calls
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_tool_messages(x["intermediate_steps"]),
        "chat_history": lambda x: x["chat_history"],
    }
    | prompt_template
    | llm_with_tools
    | OpenAIToolsAgentOutputParser()
)

agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
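Chatting with the agent is then a matter of invoking the executor with the current input and the chat history so far. A minimal single-turn sketch, assuming the prompt template above includes the "input", "chat_history", and "agent_scratchpad" placeholders (the question is just an example):

# Single-turn invocation; in a real app you would append each
# user/assistant exchange to chat_history between calls
result = agent_executor.invoke({
    "input": "What topics does this blog cover?",  # example question
    "chat_history": [],
})
print(result["output"])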
The complete, final Python code for the agent
Finally, if you'd like to try chatbot v2, it's here.
Isn't it a bit odd that the chatbot doesn't know who you are, Chandler?
P.S.: Thanks to those of you who reached out to say the chatbot doesn't know me. You're right! I forgot to export the "About" page and only exported the "published posts". This is the second time I've forgotten to do that, so I'll add basic questions about me to my list of eval questions. Lesson learned!
Thank you all for the constructive feedback. Please keep it coming. I also know the chatbot is slow to start up, and I'm working on that too. :| (Did I mention I'm a noob? :P)
A Quick Update
The chatbot not knowing me is now fixed. Here's what I did and what I learned:
- Exported the "About me" page from WordPress as .XML, as above.
- Did the text splitting and generated the embeddings with FAISS as above, then saved the vector store locally under a different name to test the retriever:
# Save the vector store to the local machine
db.save_local("faiss_index_about")

# Set up a retriever to test the new vector store about Chandler
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=db.as_retriever(), llm=llm
)

# Test the retriever
question = "Who is Chandler Nguyen?"
results = retriever_from_llm.get_relevant_documents(query=question, top_k=8)
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
- It turns out that merging two FAISS vector stores is surprisingly simple, following the documentation here:
# Merge two FAISS vector stores into one
# Load the vector stores from the local machine
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
db_1 = FAISS.load_local("faiss_index_about", embeddings)
db_2 = FAISS.load_local("faiss_index", embeddings)
db_2.merge_from(db_1)

# Save the merged vector store to the local machine
db_2.save_local("faiss_index_v2")

# Test the new vector store to confirm the correct documents are retrieved
retriever = db_2.as_retriever(search_type="mmr")
results = retriever.get_relevant_documents("Who is Chandler Nguyen?")
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
- After that, the process was essentially the same as above, just using the new vector store.
Update, February 14: Chatbot v2.10 Released
Two weeks after deploying the chatbot, I rolled out version 2.10, improving the user experience with better speed, scalability, and simplicity. You can read more here.
Update, March 25: From Frontend Upgrades to Docker Struggles and Breakthroughs
You can read more here.