I Rolled Back 2,724 Lines After One AI Audio Change Broke Production
I pushed what should have been an upgrade to my podcast platform's voice system. Six days and several commits later, I deleted 2,724 lines of code and rolled back to what worked. Here is what happened and what it taught me about testing production AI changes.
I meant to write this post sooner. But the whole thing took longer to untangle than I expected, and I wanted to make sure I had the story right before sharing it.
On April 28, 2026, I deployed what I thought was an improvement to DIALØGUE's text-to-speech system. A different voice model looked promising on paper. I had been hearing — and could hear for myself — that the current model's voices drifted on longer podcasts. So I switched over.
By April 30, I had deleted 2,724 lines of code and rolled back to the April 22 stable path.
A real user's podcast was stuck. It could not generate audio. The system had marked it AUDIO_FAILED and then thrown a 409 Conflict on retry, which at first looked like the actual bug. It turned out to be a symptom — the podcast was already in a failed state when the retry happened. The real problem was simpler and worse: a segment timed out on the new configuration. The fallback also timed out. And then the system just gave up.
To make matters worse, this was a script heading toward 65 minutes of spoken audio. The median podcast on the platform is 26 minutes. I accepted a script more than twice that size through a system that had never successfully generated anything close to it. That was the real risk — and the model change is what exposed it.
I have to admit, this was not my best moment as a builder. I pushed a configuration change without the kind of stress testing that a production system deserves. Here is what happened, what broke, and what I learned.
What DIALØGUE does
If you haven't seen it, DIALØGUE is an AI podcast generator. You give it a topic, a PDF, or a show episode, and it produces a two-host conversational podcast. The whole thing runs on Google's Gemini text-to-speech synthesis.
The audio pipeline works like this (a rough sketch in code follows the list):
- Generate an outline from the source material
- Expand the outline into a full script
- Split the script into segments
- Synthesize audio for each segment using Gemini TTS
- Stitch the segments together, normalize the loudness, and upload the final MP3
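In code, that flow has roughly the shape below. This is a minimal sketch only: the stage callables are placeholders for whatever the pipeline actually wires in, not DIALØGUE's real module API.

```python
from typing import Callable, Iterable

def run_audio_pipeline(
    source: str,
    make_outline: Callable[[str], str],
    expand_script: Callable[[str], str],
    split_segments: Callable[[str], list[str]],
    synthesize: Callable[[str], bytes],
    assemble: Callable[[Iterable[bytes]], bytes],
) -> bytes:
    """Mirror of the five stages listed above; each stage is injected as a callable."""
    outline = make_outline(source)                  # 1. outline from the source material
    script = expand_script(outline)                 # 2. expand into a full two-host script
    segments = split_segments(script)               # 3. split the script into segments
    parts = [synthesize(seg) for seg in segments]   # 4. Gemini TTS call per segment
    return assemble(parts)                          # 5. stitch, normalize loudness, produce the MP3
```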
The production default was a Gemini TTS model that worked well for most use cases. But the longer the script, the more you could hear the voice drift between segments. Host A in segment one did not sound exactly like Host A in segment six. For a 25-minute podcast, it is noticeable but you get used to it. For anything approaching 40 minutes, it gets rough.
I decided to try a different model configuration. The idea was to improve voice consistency across longer podcasts.
What I built
Over about a week in late April, I pushed a series of commits to try a different model configuration. This was not a casual one-line change. I built a whole architecture shift:
- Switched the default TTS model to the new configuration
- Added a fallback chain — if the primary model timed out, fall back to the stable production model
- Built a chunk-level QA system: split transcripts into smaller units, synthesize each one, validate audio quality with ffmpeg analysis
- Added workflow progress tracking so the UI could show per-segment synthesis state
- Hardened retry logic — three attempts per chunk, exponential backoff
- Added a long-form audio quality gate that would block `COMPLETE` unless the final assembled MP3 passed audio QA
The idea was to move from "generate audio per segment and hope it sounds consistent" to "chunk transcripts, QA each chunk, retry failures, validate the final file."
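To make the chunk-level QA and retry idea concrete, here is a minimal sketch of the loop I had in mind, assuming a synthesize callable that writes audio to a path and using ffprobe for the validation step. The helper names, tolerance band, and backoff values are illustrative, not the code that shipped.

```python
import subprocess
import time
from typing import Callable

def probe_duration_seconds(path: str) -> float:
    """Read the decoded audio duration with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def synthesize_chunk_with_qa(
    chunk_text: str,
    synthesize: Callable[[str, str], None],   # (text, output_path) -> writes audio
    output_path: str,
    expected_seconds: float,
    max_attempts: int = 3,
) -> None:
    """Synthesize one chunk, validate it, and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        synthesize(chunk_text, output_path)
        duration = probe_duration_seconds(output_path)
        # Accept the chunk if its duration lands inside a loose tolerance band.
        if 0.5 * expected_seconds <= duration <= 1.5 * expected_seconds:
            return
        if attempt < max_attempts:
            time.sleep(2 ** (attempt - 1))    # 1s, 2s backoff between attempts
    raise RuntimeError(f"chunk failed audio QA after {max_attempts} attempts")
```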
By April 28, the code was deployed to production. The unit tests passed. The integration tests passed. What I did not have — and should have — was a load test running a full 50+ minute export against the new configuration. I skipped the one test that actually mattered.
What broke
Almost immediately, a production podcast failed.
It was a long one — an analysis of Big Tech capex that was heading toward 65 minutes of spoken audio. The system hit the audio step and returned AUDIO_FAILED.
At first, I saw the 409 Conflict and thought the orchestrator had a bug. It turned out the 409 was a secondary symptom — a retry attempt after the podcast was already marked as failed. The first failure was much simpler.
Segment 0 timed out on the new model configuration. The timeout was set to 60 seconds. The fallback model — which was the same model the stable path uses — also timed out. The system marked the podcast as failed and moved on.
The fallback should have worked. But the fallback was invoked with the same 60-second timeout budget, on the same large segment, in the same request context that had already consumed time on the failed primary call. The stable path, by contrast, calls that same model fresh with the full timeout available and processes segments individually. Same model, different conditions — which is why the stable version handles it fine but the fallback did not.
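The difference is easier to see in code. Here is a rough sketch of the two shapes, with call_with_timeout standing in for however the real service bounds a TTS request: the broken path hands the fallback whatever is left of a shared deadline, while the stable path gives every call its own full budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

SEGMENT_TIMEOUT_S = 60.0

def call_with_timeout(fn, timeout_s: float):
    """Run fn in a worker thread; raise CallTimeout if it exceeds timeout_s.
    (The worker keeps running; real code would also cancel the request.)"""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)

# Broken shape: primary and fallback share one per-segment deadline,
# so a primary timeout leaves the fallback with almost nothing.
def synthesize_with_shared_deadline(primary, fallback):
    deadline = time.monotonic() + SEGMENT_TIMEOUT_S
    try:
        return call_with_timeout(primary, deadline - time.monotonic())
    except CallTimeout:
        remaining = max(0.0, deadline - time.monotonic())  # ~0 after a full timeout
        return call_with_timeout(fallback, remaining)

# Stable shape: every synthesis call starts fresh with the full budget.
def synthesize_fresh(segment_call):
    return call_with_timeout(segment_call, SEGMENT_TIMEOUT_S)
```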
I disabled the chunk QA system in production with a one-line change. That did not fix the timeout. The new configuration was still timing out on segments that the stable model handled fine.
Well, this is the kind of moment where you have to choose: keep debugging the new path, or go back to what works.
The rollback
On April 30, I made the call.
The rollback commit (4a5bfc8) deleted 2,724 lines of code:
- The entire TTS chunker module
- Audio quality analysis gates
- Workflow progress tracking
- The chunk QA retry test suites
- The fallback model chain configuration
And added 321 lines to preserve the parts that mattered — localized voice prompt phrasing, frontend compatibility, and regression coverage for the stable path.
That is a lot of lines to delete. It felt like admitting defeat, honestly. But it was not. It was choosing the correct long-term path over something that looked like progress but was not ready.
After the rollback, I:
- Rebuilt the shared base Docker image
- Redeployed the `generate-speech` service
- Reset the failed podcast back to script approval
- Clicked "Generate Audio" through the normal UI
The result:
- Model: Gemini TTS (stable production config)
- Segments: 6 out of 6 completed
- Duration: 1,527 seconds — about 25 minutes
- Final MP3: 30.5 MB
- Status: `COMPLETE`
The podcast that had been stuck for two days finished in about 11 minutes after the rollback. The user's episode shipped — two days later than planned, but complete.
One thing about the 65-minute number: that was the estimated duration based on the raw script length, not the final output. After the reset, the script went through the normal shortening pipeline, and the final assembled audio came out to about 25 minutes. The original, unshortened script would have produced far more than that, which is part of why it was failing.
The data
Before deciding what to do next, I wanted actual numbers. Not vibes — data. I queried the production database for podcast duration information.
| Cohort | Count | Median Duration | p90 | Max |
|---|---|---|---|---|
| Last 30 days | 8 | 26 min | 31 min | 34 min |
| Last 90 days | 72 | 27 min | 36 min | 45 min |
| All time | 120 | 26 min | 30 min | 45 min |
So the median podcast is about 26 minutes. The p90 ranges from roughly 30 to 36 minutes depending on the cohort. The longest completed podcast ever was 45 minutes.
And here is the harder data — podcasts that reached the audio stage, broken down by estimated duration:
| Estimated Duration | Reached Audio | Completed | Failed | In Progress |
|---|---|---|---|---|
| Under 15 min | 18 | 17 | 0 | 1 |
| 15-30 min | 36 | 31 | 2 | 3 |
| 30-40 min | 24 | 23 | 0 | 1 |
| 40-50 min | 3 | 1 | 2 | 0 |
| 50+ min | 2 | 0 | 2 | 0 |
"In Progress" covers podcasts that were started but never completed — abandoned by the user, cancelled mid-generation, or left in a pending state. Two podcasts attempted at 50+ minutes. Both failed at the audio stage. Zero completed.
No completed production podcast has ever reached 50 minutes. The system works well for the 15-40 minute range. Above 40 minutes, the risk goes up sharply. And above 50 minutes, it is essentially untested territory.
What I learned
1. Configuration changes need production-level stress testing
I have to be honest about this one. The new model configuration might have worked fine in short test prompts. But in production, with real scripts and real timeout constraints, it failed on segments that the stable configuration handles without issue.
I did not benchmark this properly before deploying. A short test prompt will succeed. A 65-minute podcast with multiple segments under a 60-second timeout is a completely different context. The gap between those two scenarios is where production incidents live.
I might be wrong about the configuration itself — maybe it just needed more tuning or a longer timeout. But the point stands: I changed a core production dependency without the kind of stress testing that a system people actually rely on deserves.
2. Voice inconsistency is the real problem, not timeouts
The timeout was the visible failure. But the reason I was trying a different configuration in the first place was voice inconsistency. Even on the stable model, voices shift between synthesis calls. Host A in segment one does not sound exactly like Host A in segment six.
For short podcasts, this is barely noticeable. For longer ones, it accumulates. And for 50+ minute podcasts — which, again, nobody has completed in production — it would probably be really obvious.
The chunked approach was trying to solve this by making each chunk smaller and more controlled. That is the right direction, I think. The implementation was just not ready for production yet.
3. I needed telemetry before the incident, not after
When the failure happened, I could not map TTS cost or performance to specific podcast IDs. The logs did not have useful entries. The workflow events had no TTS generation records.
I had to diagnose the failure from the podcast's status field, a confusing 409 error, and local reproduction of the timeout behavior. It worked eventually, but it was not the kind of debugging experience I want to have again.
I added TTS cost telemetry afterward — model used, fallback model, retry count, per-attempt status, transcript characters, output audio bytes, audio duration. This should have existed before the incident. In my experience, that is always how it goes: you build the observability after the fire, not before.
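Concretely, the per-attempt record I have in mind looks roughly like this. It is a sketch only: the field names mirror the list above, and the values shown are placeholders, not the real schema or model identifiers.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class TtsAttemptRecord:
    """One record per synthesis attempt, so cost and failures map back to a podcast."""
    podcast_id: str
    segment_index: int
    attempt: int                    # 1-based retry counter
    model: str                      # model used for this attempt
    fallback_model: Optional[str]   # configured fallback, if any
    status: str                     # "ok", "timeout", "error", ...
    timeout_s: float
    transcript_chars: int
    output_audio_bytes: int
    audio_duration_s: float

# Emit each attempt as one structured log line so it can be queried later.
record = TtsAttemptRecord(
    podcast_id="example-podcast", segment_index=0, attempt=1,
    model="tts-model-candidate", fallback_model="tts-model-stable",
    status="timeout", timeout_s=60.0, transcript_chars=4200,
    output_audio_bytes=0, audio_duration_s=0.0,
)
print(json.dumps(asdict(record)))
```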
4. Rolling back is not failure
Deleting 2,724 lines of code felt bad. I will not pretend otherwise. You spend a week building something, you are proud of the architecture, and then you tear it all down because it is not ready.
But it was the right call. The chunk QA system was good design. It will come back — as a smaller, isolated change with proper canary validation. Just not as part of a model configuration change. And not while the new setup still times out on segments that the stable path handles without issue.
5. The 50-minute episode is a different product
This one surprised me. I assumed the system would handle any length of script. The data says otherwise.
If someone genuinely wants a 50-minute podcast, that is a different generation profile. It probably needs a script that gets tightened down to 45 minutes or less before audio generation. A manual review gate before TTS. Stronger per-segment timeout budgets. Maybe even a different synthesis strategy entirely.
Optimizing the default path for the 15-40 minute range is the right call. The 50+ minute episode should be treated as an exceptional path, not the norm. I think that is a product decision, not just an engineering one.
Where things are now
Production is back on the stable TTS configuration. It is not perfect. The voice inconsistency on longer podcasts is still there, and I can hear it. But it is stable enough that the platform works for the vast majority of use cases.
If you want to hear what DIALØGUE produces, you can try it yourself at https://podcast.chandlernguyen.com.
The incident report is committed to the repository. The rollback is documented with the exact SQL shape for resetting a failed podcast. The telemetry is in place now for the next attempt.
And the lesson is clearer than I would have liked:
In production AI, the cost of changing core dependencies without proper stress testing is not a failed experiment. It is a failed user experience.
I will try again with different model configurations. But next time, with better telemetry, a canary deployment path, and a stress test that runs a full 50+ minute export before anything touches production.
Before changing production AI dependencies, here is the checklist I should have followed:
- Stress test the worst-case job. Run the longest, heaviest workload you expect — not just a short test prompt. If your system handles 40-minute podcasts, test with a 50-minute one (see the sketch after this list).
- Log per-attempt telemetry from day one. Model name, fallback model, retry count, timeout, transcript characters, output audio bytes, audio duration. This should have existed before the incident. It always does.
- Canary one real job. Before flipping the switch for everyone, run a single real production job end to end on the new config and verify the output.
- Verify fallback under the same timeout budget. A fallback that shares the same timeout as the primary is not a fallback — it is a second chance to fail under the same conditions.
- Define the rollback SQL and runbook before you deploy. Know exactly how to reset a stuck job. Document the commands. If you do not have this, you will be improvising during the incident.
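Putting the first and third items together, here is the kind of pre-deploy gate I mean, as a rough sketch: run the heaviest realistic job end to end on the candidate configuration and refuse to promote it unless the job completes and produces roughly the expected amount of audio. The generate callable and the COMPLETE status string are stand-ins for whatever your own pipeline exposes.

```python
from typing import Callable, Tuple

def preflight_worst_case(
    generate: Callable[[str], Tuple[str, float]],  # source -> (status, audio seconds)
    worst_case_source: str,
    target_minutes: float,
) -> bool:
    """Run the heaviest realistic job end to end on the candidate config.

    Only promote the config if the job completes and yields roughly
    the expected amount of audio.
    """
    status, duration_s = generate(worst_case_source)
    completed = status == "COMPLETE"
    long_enough = duration_s >= target_minutes * 60 * 0.8  # loose lower bound
    verdict = "PASS" if completed and long_enough else "FAIL"
    print(f"preflight: status={status}, audio={duration_s / 60:.1f} min -> {verdict}")
    return completed and long_enough
```

Run against a genuine 50+ minute script before the config flip, a check like this would probably have surfaced the segment timeout before any user ever saw AUDIO_FAILED.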
If you build production AI systems, what is your pattern for testing model configuration changes without breaking users? I would genuinely like to know. Have you had a similar "changed one thing and everything broke" moment?
Cheers,
Chandler





