How I Climbed Out of the Coding Quicksand with AI Agents
I dove into building a chatbot with zero coding skills and plenty of enthusiasm, only to discover that my v0.1 was a disaster of a CSV database and primitive chunking, until AI agents pulled me out.
Update (2026): this chatbot evolved into Sydney! After many iterations, Sydney now lives at /ask/, focused on blog content and products.
In early November last year, I released version 0.1 of my DIY chatbot. At the time I wrote: "While v0.1 was an important first step for a coding newbie, it has significant limitations." That turned out to be an understatement. I didn't appreciate back then just how clunky the whole process, and my build, really were. In reality, my first attempt, however sincere, was more of a prototype cobbled together with enthusiasm but limited know-how.
This post is more than a follow-up; it's a deep dive into the journey that unfolded from that point, a journey full of trial, error, and invaluable lessons. I'm laying out all the nuts and bolts, not just for transparency, but in the hope that my experience, however detailed, resonates with, or even helps, someone walking a similar path. (For context, I'm a middle-aged advertising professional with zero coding experience.)
As mentioned before, to launch chatbot v0.1 I mainly followed the short course "Building Systems with the ChatGPT API" and two OpenAI cookbooks: Question answering using embeddings-based search and How to count tokens with tiktoken.
Here's why my chatbot v0.1 was so terrible:
- Chunking: I simply split long blog posts into smaller pieces based on token length, i.e. the most primitive static chunking there is :D (see the sketch after this list). If you want to understand why that's a terrible idea, check out Greg Kamradt's 5 levels of text splitting.
- Embedding: I struggled with embeddings. I used OpenAI's embedding model but kept hitting API request limits, which made the embedding process fail midway. I later learned to batch the requests and add a timeout between batches to stay under the limits. Finally, I stored the generated embeddings in a plain .csv file as my makeshift "database".
- Database: I knew a CSV wasn't an optimal database, but I lacked the skills to use better alternatives.
- Metadata: I didn't initially know how important it is to include metadata like publish dates and post URLs for the chatbot to answer user questions accurately. I had to redo the embedding and saving to incorporate the relevant metadata.
- Retriever: I didn't know there were different retriever types and algorithms. I simply used OpenAI's relevance search to retrieve a hardcoded number of results.
- Memory: to hold a conversation, the chatbot needs to remember what the user said earlier. And this is where gpt-3.5's limited context window length (at the time) forced an obvious trade-off between chunk size and how many results you retrieve.
- For example, if your chunk size is 800 tokens and the retriever returns the top 8 results, that's 6,400 tokens, or more than 50% of the old model's limit.
- And that's just one question, so you can imagine how quickly memory fills up in a multi-turn conversation.
- One way around this is to use a smaller chunk size and have the retriever return fewer results, but with a basic retriever (see above) that means the model doesn't have comprehensive enough information to answer the question.
- I didn't even use an IDE. All the code was edited in Mac's TextEdit :D (did I mention I'm a noob? :P)
- I could go on, but I think you get the picture.
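For the curious, the naive chunking from the first bullet boiled down to something like this. It's a minimal reconstruction for illustration (not my actual v0.1 code), using tiktoken to cut fixed-size token windows:

import tiktoken

def naive_chunk(text: str, max_tokens: int = 800) -> list[str]:
    # Fixed-size token windows that ignore sentence and paragraph
    # boundaries entirely -- the "static chunking" v0.1 relied on
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]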
My "Valley of Death"
Eager to fix v0.1's limitations, I tried several online courses, hoping they would fill in the missing pieces and level up my skills. But progress only led to more dead ends.
I struggled through exercises on vector databases (Vector Databases: from Embeddings to Applications with Weaviate), evaluating RAG methods (Building and Evaluating Advanced RAG Applications with LlamaIndex and TruEra), and advanced retrieval techniques (Advanced Retrieval for AI with Chroma). No matter how hard I tried, I couldn't connect the theory to a practical application on my own blog data.
Were the courses badly designed? No. The shortcoming was my own lack of underlying knowledge. But failing again and again was massively frustrating, not to mention demoralizing. I found myself in a figurative valley, with no idea how to move forward.
Incremental Small Wins
Still, the repeated failures did yield some nuggets of value:
- Switching from TextEdit to VS Code
- Making good use of GitHub Copilot extensions
- Coming to appreciate Jupyter Notebook as a development environment
The last of those courses mentioned LangChain, a popular new framework for building chatbots. I had actually tried LangChain's tutorials months earlier ("LangChain: Chat with Your Data" and "Functions, Tools and Agents with LangChain") without much luck. But now, armed with that hard-won knowledge, rereading its docs was genuinely illuminating. The concepts clicked into place, and its modular architecture made intuitive sense.
I could envision how to adapt LangChain's robust capabilities to my passion project. Finally, a path forward emerged! Step by step, I oriented myself around its pipelines for data ingestion, embedding, storage, and retrieval.
My confidence grew with every piece I implemented. V2 began to take shape...
Rebuilding the Foundation
With LangChain as my guide, I set about rebuilding my chatbot from the ground up:
Ingesting the WordPress Export as JSON
After some experimenting and tweaking of the ingestion parameters, LangChain's JSONLoader parsed my exported posts correctly. The validated input could now feed the downstream pipelines. (The code for this step is in the "DataIngestionAndIndexing.ipynb" file in this public GitHub repo.)
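For reference, the loader call looks roughly like this. It's a minimal sketch: the file path, jq_schema, and field names below are illustrative and depend on how you convert your WordPress export to JSON (JSONLoader also needs the jq package installed):

from langchain_community.document_loaders import JSONLoader

# Hypothetical metadata extractor: copy the publish date and URL from each
# record into the Document's metadata (field names depend on your export)
def extract_metadata(record: dict, metadata: dict) -> dict:
    metadata["date"] = record.get("date")
    metadata["source"] = record.get("link")
    return metadata

loader = JSONLoader(
    file_path="blog_posts.json",  # assumed path to the converted export
    jq_schema=".posts[]",         # assumed structure: a top-level "posts" array
    content_key="content",        # assumed key holding the post body
    metadata_func=extract_metadata,
)
documents = loader.load()
print(f"Loaded {len(documents)} posts")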
Automating Text Splitting
My naive token-length chunking gave way to LangChain's SentenceTransformersTokenTextSplitter, which uses more advanced NLP to split along semantic units. No more sentences chopped off in the middle! The configuration keeps chunks appropriately sized for memory-constrained models.
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Define the token splitter with specific configurations
token_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=0,       # Overlap between chunks
    tokens_per_chunk=256   # Number of tokens per chunk
)

# Split the documents into chunks based on tokens
all_splits = token_splitter.split_documents(documents)
print(f"Total document splits: {len(all_splits)}")
Generating Embeddings and Indexes
My earlier struggles with OpenAI API limits vanished entirely with LangChain's wrapper for OpenAI embeddings. With two lines of code, the embeddings cleanly captured the salient features of the split text.
For the vector store, I chose FAISS (Facebook AI Similarity Search) over Weaviate or Chroma. Industry-tested FAISS struck the right balance between capability and complexity for my needs. Its CPU version indexed the blog chunks quickly, producing a compact, searchable database. I no longer had to worry about batching or hammering OpenAI's API with requests.
LangChain supports many vector stores, which you can check out here.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Initialize embeddings and FAISS vector store
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(all_splits, embeddings)

# Save the vector store locally
db.save_local("path/to/save/faiss_index")  # Placeholder for save path / index name
Two lines of code! That's it.
I also noticed that if you have more than a little content (I have 400+ blog posts), you shouldn't query the retriever immediately after embedding, because the indexes need some time to finish building and settle.
Evaluating Retrievers
Setting up the retriever takes just one line of code :D, with FAISS as the vector store.
retriever = db.as_retriever(search_type="mmr")  # "mmr" = maximal marginal relevance, balancing relevance with diversity
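To sanity-check it, you can fire a sample question at the retriever. The query and the "source" metadata key below are illustrative placeholders:

# Quick sanity check with an illustrative query
docs = retriever.get_relevant_documents("What is this blog about?")
for doc in docs:
    # "source" assumes the post URL was stored in the metadata
    print(doc.metadata.get("source"), "->", doc.page_content[:100])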
With my index and retriever in place, the data pipelines were ready to fuel an intelligent chatbot!
Architecting Conversational Agents
I decided to build the chatbot on LangChain's agent framework. Was that overkill at this stage? Yes. But I wanted to be able to evolve the chatbot over time and give it more "tools", a.k.a. capabilities. LangChain makes setting up an agent and handing it tools remarkably easy.
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad.openai_tools import format_to_openai_tool_messages
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser
from langchain.tools.retriever import create_retriever_tool
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load the saved FAISS index and expose it as a retriever
embeddings = OpenAIEmbeddings()
db = FAISS.load_local("path/to/your/faiss_index_file", embeddings)  # Replace with your actual FAISS index path
retriever = db.as_retriever(search_type="mmr")

# Wrap the retriever as a tool the agent can call
tool = create_retriever_tool(
    retriever,
    "search_your_blog",           # Replace with a descriptive name for your tool
    "Your tool description here"  # Provide a brief description of what your tool does
)
tools = [tool]

prompt_template = ChatPromptTemplate.from_messages([
    # Customize the prompt template for your chatbot's persona and requirements;
    # it needs placeholders for "chat_history", "input", and "agent_scratchpad"
])

llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0)
llm_with_tools = llm.bind_tools(tools)

# Wire input, scratchpad, and history into the prompt, then parse tool calls
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_tool_messages(x["intermediate_steps"]),
        "chat_history": lambda x: x["chat_history"],
    }
    | prompt_template
    | llm_with_tools
    | OpenAIToolsAgentOutputParser()
)

agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
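Chatting with the agent is then a matter of invoking the executor with the current input and the chat history so far. A minimal single-turn sketch, assuming the prompt template above includes the "input", "chat_history", and "agent_scratchpad" placeholders (the question is just an example):

# Single-turn invocation; in a real app you would append each
# user/assistant exchange to chat_history between calls
result = agent_executor.invoke({
    "input": "What topics does this blog cover?",  # example question
    "chat_history": [],
})
print(result["output"])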
The complete, final Python code for the agent
Finally, if you'd like to try chatbot v2, it's here.
Isn't it a bit odd that the chatbot doesn't know who you are, Chandler?
P.S.: Thanks to those of you who reached out to say the chatbot doesn't know me. You're right! I forgot to export the "About" page and only exported the "published posts". This is the second time I've forgotten to do that, so I'll add basic questions about me to my list of eval questions. Lesson learned!
Thank you all for the constructive feedback. Please keep it coming. I also know the chatbot is slow to start up, and I'm working on that too. :| (Did I mention I'm a noob? :P)
A Quick Update
The chatbot not knowing me is now fixed. Here's what I did and what I learned:
- Exported the "About me" page from WordPress as .XML, as above.
- Did the text splitting and generated the embeddings with FAISS as above, then saved the vector store locally under a different name to test the retriever:
# Save the vector store to the local machine
db.save_local("faiss_index_about")

# Set up a retriever to test the new vector store about Chandler
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=db.as_retriever(), llm=llm
)

# Test the retriever
question = "Who is Chandler Nguyen?"
results = retriever_from_llm.get_relevant_documents(query=question, top_k=8)
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
- It turns out that merging two FAISS vector stores is surprisingly simple, following the documentation here:
# Merge two FAISS vector stores into one
# Load the vector stores from the local machine
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
db_1 = FAISS.load_local("faiss_index_about", embeddings)
db_2 = FAISS.load_local("faiss_index", embeddings)
db_2.merge_from(db_1)

# Save the merged vector store to the local machine
db_2.save_local("faiss_index_v2")

# Test the new vector store to confirm the correct documents are retrieved
retriever = db_2.as_retriever(search_type="mmr")
results = retriever.get_relevant_documents("Who is Chandler Nguyen?")
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
- After that, the process was essentially the same as above, just using the new vector store.
Update, February 14: Chatbot v2.10 Released
Two weeks after deploying the chatbot, I rolled out version 2.10, improving the user experience with better speed, scalability, and simplicity. You can read more here.
Update, March 25: From Frontend Upgrades to Docker Struggles and Breakthroughs
You can read more here.