12 min read

The Code Was the Easy Part of Shipping Prova

Prova started feeling real to me when the product began making real promises about progression, billing, auth, and state. The hard part was not generating code. It was building the contracts around it.

Prova started feeling real to me the moment the product stopped letting me lie to myself.

A homepage can stay aspirational for a long time.

A live product can't.

Once people can sign up, confirm their email, land in the wrong place, hit billing state, submit work, and expect the system to remember what happened, the bottleneck changes.

The product starts making real promises, and the invisible work gets very visible.

Update on April 16, 2026: since I drafted the first version of this post, Prova has already shifted toward a clearer Operator/Builder split. I have kept the main thesis intact and updated the product-specific details below so this stays aligned with what is live now.

If you saw the three posts I wrote in March and early April about switching from Claude Max to Codex, that experiment is still running. I already promised a proper follow-up on May 2, 2026. This is not that post.

If you have not seen Prova before: it is my coaching product for marketers and advertising professionals, with an Operator path for workflow redesign and a Builder path for shipping a first useful slice.

The hard part, from my experience in this push, was not getting an LLM to generate code. It was everything around the code.

If you are building with AI right now — something with real users, real money, or real product state — this is probably the part nobody warns you about.

Naming Was The Easy Part

I have to admit, naming Prova was the easy decision. Prova means proof or test, which matches the product's actual job: making people prove their work, not flattering them into feeling like builders.

The harder part began right after. The moment Prova stopped being a name and a landing page, it started demanding all the unglamorous things real products demand:

  • assessment logic
  • onboarding state
  • authored progression
  • review visibility
  • billing flows
  • auth hardening
  • content publishing
  • release gates

That changed how I think about both AI tools and product building.

What Prova Actually Had To Become

Prova is now a more opinionated product than it was when I first started drafting this post. The surface has split into two clearer entry paths: an Operator path for people redesigning workflows and a Builder path for people trying to ship a first useful slice. On the Builder side, that now means a Build Reality Check, a Build Brief, a Build Plan, an execution lane while the build is in motion, and a launch gate before showing it to a real user.

Underneath that cleaner surface, the harder system work is the same as it was when I first drafted this post. It is still assessment and onboarding, a sprint-first product rather than a chat-first one, review-driven progression, durable product state, and a mentor layer that behaves more like office hours around the current sprint than a generic AI chat box.

To be clear, I am not claiming Prova is "done." I wish I could say that and move on, but that would be a bit too convenient, and probably untrue. The right framing today is still controlled launch: billing portal and recovery flows are live, but the open work has shifted from basic product plumbing toward making the Builder path, execution lane, and launch gate more honest under real usage. I actually think that makes the lesson more useful, not less. This is what product building looks like in the real world. Not finished. Not fake. Just real enough that the next mistakes will cost something.

The Work I Couldn't Generate

The easiest version of an AI product is the demo version.

You can make it look alive quickly:

  • a homepage
  • a chat box
  • a few slick screenshots
  • a persuasive product sentence

That version is useful for a launch teaser.

It is not useful for understanding the real work.

For me, the real work started showing up in six places.

1. The sprint system had to stop pretending

I did not want Prova to be another product that looks structured on the landing page and becomes generic the moment a user gets inside.

Fixed authored sprints were useful until they weren't.

The real problem appeared when a learner passed a sprint, but the next real gap was not the next authored sprint in the catalog.

The lazy answer would have been:

"Let the model invent the next sprint: title, rubric, review criteria, everything."

I did not trust that.

So the system had to become stricter than that.

Now the reviewer can still recommend the canonical next sprint when the authored path is correct. It can assign an adaptive detour, a temporary sprint that helps the learner fix a foundation issue before returning to the main path, when a known foundation gap appears. And when the next real gap sits outside the authored catalog, it can request a composed sprint instead.

That sounds small in a blog post.

It is not small in a product.

Because a composed sprint is not just "dynamic text."

It has to create a placeholder sprint, persist the assignment, preserve the return path to the canonical roadmap, fill the sprint packet, and then let the rest of the system treat that sprint like real product state.

The constraint that made this trustworthy was simple:

the model does not get to invent the standard.

The composed sprint only fills the packet: title, summary, context, assignment, example.

The submission schema and review rubric still come from an authored template sprint.

That was one of the biggest lessons in the whole build.

If you want something dynamic to stay trustworthy, keep the contract fixed and let the model operate inside it.
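To make that concrete, here is a rough sketch of the contract shape I mean. All the type and field names below are illustrative, not Prova's actual code; the point is only that the authored template owns the schema and rubric, and the composer is confined to the packet.

```typescript
// Illustrative sketch — names are hypothetical, not Prova's real schema.

// Fixed by the authored template sprint. The model never writes these.
interface AuthoredTemplate {
  submissionSchema: Record<string, "text" | "url" | "file">;
  reviewRubric: { criterion: string; passBar: string }[];
}

// The only surface the model is allowed to fill.
interface ComposedPacket {
  title: string;
  summary: string;
  context: string;
  assignment: string;
  example: string;
}

// What the reviewer can recommend after a passed sprint.
type NextStep =
  | { kind: "canonical"; sprintId: string }                           // next authored sprint
  | { kind: "adaptive_detour"; sprintId: string; returnTo: string }   // fix a foundation gap, then come back
  | { kind: "composed"; templateId: string; packet: ComposedPacket }; // real gap outside the catalog

// A composed sprint becomes durable product state: the packet is dynamic,
// but the schema and rubric are inherited from the authored template.
interface ComposedSprint {
  template: AuthoredTemplate;
  packet: ComposedPacket;
  returnToCanonical: string; // preserve the path back to the authored roadmap
}
```

The design choice that matters is the inheritance: anything the reviewer later judges against comes from the authored side, never from the model's own output.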

This was also a big place where Codex helped me. Not in a flashy "look what it generated" way. In a more useful way: migrations, router logic, placeholder lifecycle, request state, integration tests, and all the boring contract work that makes a dynamic product less fake.

I also have to admit I was tempted, at least briefly, by the lazier version. Let the model be clever. Let the system feel magical. Then clean it up later. That idea sounds great right up until you imagine explaining a bad sprint assignment to a paying user.

2. Retrieval had to become curricular, not just semantic

Once I let the system compose sprints, retrieval quality stopped being a backend detail.

It became curriculum quality.

If the product was going to compose a sprint on tooling, measurement, or competitive intelligence, it could not just grab "kind of related" chunks from the knowledge base and hope the packet felt coherent.

So the knowledge base had to get smarter.

We rebuilt it from the real corpus: blog posts, course modules, transcripts, companions, templates, and deep-dive resources.

Then we chunked those sources differently depending on what they were.

Narrative blog posts could stay larger.

Instructional modules needed heading-first instructional chunks.

Templates and reference assets needed structured chunking that preserved tables, checklists, slide references, and quick-reference sections instead of flattening everything into generic paragraphs.
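As a rough illustration of what "chunked differently depending on what they were" looks like in practice, here is a toy config. The source types match what I described above; the split rules and token sizes are assumptions, not the real pipeline.

```typescript
// Hypothetical chunking config — the source types are real, the numbers are not.
type SourceType = "blog_post" | "course_module" | "transcript" | "template";

interface ChunkRule {
  splitOn: "paragraph" | "heading" | "speaker_turn" | "section";
  maxTokens: number;
  preserve?: string[]; // structures that must survive chunking intact
}

const chunkingRules: Record<SourceType, ChunkRule> = {
  blog_post:     { splitOn: "paragraph",    maxTokens: 800 }, // narrative can stay larger
  course_module: { splitOn: "heading",      maxTokens: 400 }, // heading-first instructional chunks
  transcript:    { splitOn: "speaker_turn", maxTokens: 500 },
  template:      { splitOn: "section",      maxTokens: 300,
                   preserve: ["tables", "checklists", "slide_references"] }, // don't flatten structure
};
```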

Then tagging mattered.

Not just "what is this about?" but:

  • what kind of chunk is this
  • which topic tags fit it
  • which audience it is for
  • what difficulty level it belongs to

That changed retrieval from "nearest vector wins" into something closer to curricular matching.

The product could ask for something like:

"Give me grounded material for this topic, for this audience, around this level."

That is a much more serious system than just bolting RAG onto a demo.
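Under the hood, that request is less "nearest vector wins" and more "filter by curriculum, then rank by similarity." This is a hand-wavy sketch with made-up names, not Prova's real retrieval code:

```typescript
// Hypothetical curricular retrieval — field and function names are illustrative.
interface Chunk {
  embedding: number[];
  chunkKind: "narrative" | "instructional" | "reference";
  topics: string[];
  audience: "operator" | "builder";
  difficulty: number; // 1–5
  text: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// "Give me grounded material for this topic, for this audience, around this level."
function retrieveCurricular(
  chunks: Chunk[],
  queryEmbedding: number[],
  topic: string,
  audience: Chunk["audience"],
  level: number,
  k = 8,
): Chunk[] {
  return chunks
    .filter(c => c.topics.includes(topic))            // curricular filters first
    .filter(c => c.audience === audience)
    .filter(c => Math.abs(c.difficulty - level) <= 1) // stay near the target level
    .sort((a, b) =>
      cosine(b.embedding, queryEmbedding) - cosine(a.embedding, queryEmbedding))
    .slice(0, k);                                     // semantic ranking only inside the filtered set
}
```

In production this would live in the database rather than in application code, but the ordering is the point: curriculum constraints narrow the field before similarity gets a vote.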

3. Evals had to test the pipeline, not just the model

I did not want to declare victory because one composed sprint looked decent in the UI. So the eval system had to grow with the product.

First, a sprint review eval harness. Then composition added a retrieval gate — a pass/fail check on whether tagged retrieval was actually improving the grounded material coming back versus the untagged baseline. Then a full end-to-end harness: compose the sprint packet, generate schema-valid synthetic submissions, run the real reviewer against those submissions, and judge results across grounding, audience differentiation, difficulty gradient, curriculum coherence, review quality, and comparison against authored work.

The point was to stop "personalized progression" from turning into fake personalization. If the system was going to say, "this next sprint is actually for you," it needed more than a decent-looking packet. It needed evidence:

  • did the packet stay grounded
  • did track framing actually change the work product
  • did level 2 versus level 5 feel materially different
  • did the reviewer still make the right pass/revise call
  • did the composed sprint still feel like it belonged inside the curriculum instead of floating next to it

That loop exposed real problems. Weak synthetic revise submissions that were still too strong. Audience framing that collapsed across tracks. Measurement packets that sounded fine but did not actually answer different trust questions for different learners.

One of the more humbling moments for me was realizing how often something could look "pretty good" in the UI and still fail the moment I asked a stricter question. A packet could read smoothly and still be too generic. A revise case could feel plausible and still be too easy. That was useful for my ego, I think.

The system did not get better because I had one good prompt. It got better because the product could fail in public-to-me, inside an eval harness, before it failed in public-to-users. It was not one miracle. It was a controlled loop.
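If it helps to see the shape of that controlled loop, here is a minimal sketch. Every name in it is an assumption about structure, not the real harness; the pipeline is injected so the loop itself stays dumb and repeatable.

```typescript
// Hypothetical end-to-end composition eval — all names are illustrative.
interface EvalCase { topic: string; audience: "operator" | "builder"; level: number }

interface EvalVerdict {
  grounded: boolean;          // packet stayed anchored in retrieved material
  audienceDiffers: boolean;   // track framing actually changed the work product
  difficultyGraded: boolean;  // level 2 vs level 5 feel materially different
  coherent: boolean;          // composed sprint belongs inside the curriculum
  reviewCallCorrect: boolean; // reviewer still made the right pass/revise call
}

interface Pipeline {
  composePacket(c: EvalCase): Promise<unknown>;
  syntheticSubmissions(packet: unknown): Promise<unknown[]>; // schema-valid pass and revise cases
  review(packet: unknown, submission: unknown): Promise<unknown>;
  judge(c: EvalCase, packet: unknown, reviews: unknown[]): Promise<EvalVerdict>;
}

async function runCompositionEval(pipeline: Pipeline, cases: EvalCase[]): Promise<EvalVerdict[]> {
  const verdicts: EvalVerdict[] = [];
  for (const c of cases) {
    const packet = await pipeline.composePacket(c);                  // 1. compose the sprint packet
    const submissions = await pipeline.syntheticSubmissions(packet); // 2. generate synthetic submissions
    const reviews = await Promise.all(
      submissions.map(s => pipeline.review(packet, s)),              // 3. run the real reviewer
    );
    verdicts.push(await pipeline.judge(c, packet, reviews));         // 4. grade the whole run
  }
  return verdicts;
}
```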

4. Trust surfaces had to be real

This is another place where real products separate themselves from demos.

Nobody cares how elegant your AI layer is if:

  • email confirmation breaks
  • Google sign-in is flaky
  • the user gets trapped in an auth loop
  • abuse controls are missing
  • account hardening feels like an afterthought

For Prova, the credibility layer ended up including:

  • email confirmation that lands people back in the right product flow
  • Google SSO
  • optional app-based two-factor auth (TOTP MFA)
  • Turnstile protection on auth and assessment

That is the kind of work people rarely brag about because it sounds operational and boring.

I think that is a mistake.

Operational and boring is exactly where products either earn trust or quietly lose it.

5. Billing had to be testable, not just imaginable

I am increasingly skeptical of product builders who talk about monetization as if it is one Stripe screenshot away from being solved.

Billing gets real very quickly.

It is not just:

"Can someone technically pay?"

It is:

  • what happens during trial state
  • what happens when webhook state arrives late
  • how do you test safely without performing theater on your own revenue surface
  • how do you verify the whole flow without inventing one-off exceptions that make the system less reliable later

Prova now has a real subscription flow plus an internal QA checkout path specifically for safer billing verification. That may sound minor. It isn't. The more serious the product gets, the more you need ways to test commercial logic without lying to yourself about what has and has not been verified.
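The webhook point above is worth one concrete illustration, because it is where billing quietly goes wrong. Stripe events can arrive late, duplicated, or out of order, so the handler has to be safe to replay and has to refuse to roll state backwards. This is a generic sketch of that idea, not Prova's billing code; the BillingStore interface and its methods are assumptions.

```typescript
import Stripe from "stripe";

// Generic idempotent, out-of-order-safe webhook handler (sketch).
interface BillingStore {
  alreadyProcessed(eventId: string): Promise<boolean>;
  markProcessed(eventId: string): Promise<void>;
  // Apply the update only if this event is newer than the last recorded state
  // for the subscription, so late-arriving events cannot roll state backwards.
  applyIfNewer(update: { subscriptionId: string; status: string; eventCreatedAt: number }): Promise<void>;
}

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function handleStripeWebhook(store: BillingStore, rawBody: string, signature: string) {
  // Verify the payload really came from Stripe before trusting it.
  const event = stripe.webhooks.constructEvent(
    rawBody,
    signature,
    process.env.STRIPE_WEBHOOK_SECRET!,
  );

  // Idempotency: a replay of the same event id is a no-op.
  if (await store.alreadyProcessed(event.id)) return;

  if (
    event.type === "customer.subscription.updated" ||
    event.type === "customer.subscription.deleted"
  ) {
    const sub = event.data.object as Stripe.Subscription;
    await store.applyIfNewer({
      subscriptionId: sub.id,
      status: sub.status, // "trialing", "active", "past_due", "canceled", ...
      eventCreatedAt: event.created,
    });
  }

  await store.markProcessed(event.id);
}
```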

6. Operations had to become part of the product

One thing that made Prova feel real to me was having to treat it as a different operational system, not a feature tucked inside my existing site. Separate production system, separate Supabase project, separate Vercel project, separate auth configuration, separate billing setup, separate release gates.

That sentence is not sexy. It is also the sentence that probably matters most.

Because products do not break only at the UI layer. They break where systems touch each other, where assumptions leak, where environments drift, where somebody says "we will fix that after launch" and then the after-launch version never really comes.

The more I worked on Prova, the more I respected boring operational sentences.

If you want a fast pressure test for your own AI product, ask:

  • what standard stays fixed when the model gets dynamic
  • what retrieval keeps the output grounded
  • what eval catches fake personalization before a user does
  • which auth, billing, or state edge would embarrass you tomorrow

If those answers are fuzzy, the product is probably still more demo than system.

The Proof That Actually Matters

I do not want to talk about productization only in abstractions, because that makes it sound more philosophical than it really is.

The proof that matters most is not "how many migrations did I write?"

It is what now works for a real user.

As of this soft-launch phase, the product can now credibly say:

  • email signup, confirmation, and password recovery work
  • Google SSO works
  • optional TOTP MFA works
  • downloadable resources and signed downloads work
  • sprint submission and AI review work in production
  • Stripe checkout handoff and billing portal access work

That is the buyer-facing layer of proof.

Underneath that, there is system proof from the repo and rollout work too:

  • 4,390 production knowledge-base rows published and tagged for composition retrieval
  • a tagged retrieval gate beating the untagged baseline
  • a full 18-case composition eval suite passing its gate
  • a safer internal QA checkout path for billing verification

What I do not have yet is broad external proof from a meaningful batch of users saying, "this changed outcomes for me." It is too early for that, and I do not want to fake it.

I am listing them because it is very easy to underestimate what "turning an idea into a product" actually means when you only talk in launch language.

The important part is not the counts by themselves.

It is what the counts represent:

  • the sprint system stopped being decorative
  • retrieval stopped being fuzzy
  • evaluation stopped being performative
  • product state became durable enough to rely on

The invisible work is still work.

And usually it is the harder work.

What This Changed In My Head

I still think AI tools matter. Of course they do. Codex was useful in this push because it felt competent, careful, and methodical — good at helping me move through engineering work without a lot of emotional theater. Claude still feels better to me when the work gets more taste-heavy and creative. The full 30-day Codex verdict is still coming on May 2, 2026, and that post will be the right place for the cost timeline, the limits story, and what changed in my day-to-day working model.

But Prova pushed me toward a more important conclusion than any tool comparison.

AI can accelerate implementation. It does not remove the need for judgment about product boundaries, sequencing, credibility, and operational truth. That judgment is still the work.

A landing page is not a product. A named concept is not a product. A generated component is definitely not a product.

What starts making something feel like a product is the invisible architecture around the visible experience. The moment real users can enter, pay, progress, get blocked, recover, submit work, and trust the state they see, you are no longer playing with a demo. You are making promises.

Promises are where product building becomes serious. That is the part I respect more now, and I think that is the part more builders should talk about.

Frequently Asked Questions

What is Prova?

Prova is now a structured coaching product for marketers and advertising professionals who want a more serious path into AI work. It currently has two clearer entry paths. The Operator path is for workflow redesign, measurement, and rollout judgment. The Builder path is for shipping a first useful slice, and now moves through a Build Reality Check, a Build Brief, a Build Plan, an execution lane, and a launch gate. Underneath both paths, the real product is still assessment, onboarding, sprint progression, review logic, resources, billing, auth, and product state working together rather than one AI chat box pretending to be the whole system.

What is a composed sprint in Prova?

A composed sprint is a sprint packet generated for a real learner gap that sits outside the current authored catalog. The important constraint is that the model does not invent the standard by itself. The packet is dynamic, but the submission schema and review rubric still come from an authored template sprint.

Why does AI product building get harder after the demo?

Because the moment real users can sign up, pay, recover from errors, submit work, and trust the state they see, the product starts making promises. At that point the invisible systems around the AI layer, such as auth, retrieval quality, evals, billing, and operations, matter more than whether the first output looked impressive in a demo.

If you are building with AI right now, I would be genuinely curious:

what was the first thing that made your project feel uncomfortably real, the moment a demo turned into a promise?

That's it from me for now.

Cheers, Chandler
