10 min read

Codex with GPT-5.4 vs Claude Code with Opus 4.6 — Why I Now Use Both

After using Claude Code with Opus 4.6 daily for almost a year, I spent a week with Codex and GPT-5.4. The verdict: neither tool wins outright. The combination — cross-model review, complementary strengths, operational resilience — is better than either alone. Here's what I found, from someone who's shipped real products with both.

I've used Claude Code with Opus 4.6 as my primary development tool for everything: rebuilding this site, shipping DIALØGUE to the App Store, building STRAŦUM, translating 3.9 million words across 12 languages, and producing an entire course video pipeline with 18 layout types and word-level audio synchronization.

So when OpenAI launched GPT-5.4 with Codex on March 5, I wasn't looking for a replacement. I was just curious. OpenAI was also running a free month promotion, which made it easy to dive in without commitment.

I was pleasantly surprised. More than pleasantly surprised, actually — Codex with GPT-5.4 was genuinely capable in ways I hadn't expected.

A week later, I'm not switching. I'm dual-wielding. And I think the combination is better than either tool alone.


Why Does the Harness vs. Model Distinction Matter?

Before I get into specifics, there's a nuance I think most comparisons miss.

Claude Code and Codex are harnesses — the CLI tools, the agent orchestration, the plugin ecosystems, the context management, the way they interact with your filesystem and terminal. Opus 4.6 and GPT-5.4 are the underlying models — the intelligence making the actual decisions about what to do, how to reason about a problem, what code to write.

This matters because some of my observations are about the harness and some are about the model. Claude Code's automatic QA dispatch and parallel agent management? That's the harness. GPT-5.4's architectural insight on my fragment sync problem? That's the model. When I talk about cross-model review producing better plans, that's specifically about the models reasoning differently — the harness just delivers the output.

A better model in a worse harness can still feel frustrating. A great harness with a weaker model can feel polished but shallow. Right now, both combinations are strong — but they're strong in different ways.


How Did Codex with GPT-5.4 Compare on First Use?

I'll be honest — I expected Codex to feel like a step down. I've been deep in Claude Code's ecosystem: the Superpowers plugin for structured planning, the parallel agent dispatching, the code review agents that run automatically after implementation. It's a mature workflow.

Codex with GPT-5.4 felt immediately competitive. The model is strong. The reasoning is solid. It follows plans well. When I gave it a well-structured implementation plan, it could work continuously for 45+ minutes without losing the thread — committing, testing, pushing, moving to the next task.

I enabled some of Codex's experimental features early on:

  • Multi-agents — parallel task execution, similar to Claude Code's agent dispatching
  • JavaScript REPL — persistent Node-backed runtime for inline debugging
  • Prevent sleep while running — keeps the machine awake during long sessions

These made a noticeable difference. The multi-agent support in particular felt like Codex was catching up to a workflow I'd grown to depend on in Claude Code.


Where Did GPT-5.4 Outperform Opus 4.6?

I have to admit — there was one area where GPT-5.4 clearly beat Opus 4.6, and it wasn't a small thing.

I've been building a course video pipeline that synchronizes slide fragments with narrated audio. The hard problem isn't timing — ElevenLabs gives us word-level timestamps. The hard problem is alignment: deciding which on-screen fragment should appear when the speaker starts talking about it.

The speaker notes often don't repeat the slide text verbatim. Sometimes the narration paraphrases a bullet, sometimes it combines two bullets into one thought, sometimes a bullet is on the slide but never really spoken. So the system keeps guessing from keywords, which works often enough to look promising but breaks on the harder slides.
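To make the failure mode concrete, here's a minimal sketch of the keyword-guessing approach (all names are hypothetical, not the actual pipeline code): given ElevenLabs-style word timestamps, find when a fragment's keywords are first spoken. It works when the narration echoes the slide text, and silently comes up empty when the narration paraphrases.

```python
# Hypothetical sketch of the keyword-guessing heuristic (not the real pipeline).
# Word timestamps are (word, start_seconds) pairs, as word-level alignment provides.

def guess_fragment_time(fragment_text, word_timestamps):
    """Return the time the first fragment keyword is spoken, or None."""
    keywords = {w.lower().strip(".,") for w in fragment_text.split() if len(w) > 3}
    for word, start in word_timestamps:
        if word.lower().strip(".,") in keywords:
            return start
    return None  # paraphrased narration: no keyword hit, sync has to guess

narration = [("pricing", 1.2), ("drives", 1.5), ("adoption", 1.8)]
print(guess_fragment_time("Pricing strategy", narration))  # keyword hit -> 1.2
print(guess_fragment_time("Cost structure", narration))    # paraphrase -> None
```

The second call is the hard case from above: the bullet is on the slide, but the narration never says those words.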

Opus 4.6 on medium thinking struggled with this over several sessions. It kept producing increasingly clever heuristics — equal splits by text length, keyword search in timestamps, sentence-level matching, dual-strategy matching — each one better but still fundamentally limited.

GPT-5.4 on high thinking identified the architectural issue: this shouldn't be treated as a keyword-matching problem at all. It should be treated as a data model problem. The renderer should emit the actual fragment states, the assembler should align narration to those states, and validation should flag slides where the visual structure and narration don't match.

It was the right insight. The shift from "guess the sync from text" to "make sync an explicit first-class part of the pipeline" was exactly the architectural reframe the problem needed. And GPT-5.4 got there faster than Opus had.
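A minimal sketch of that reframe, with hypothetical names rather than the actual pipeline code: the renderer emits explicit fragment states, the assembler pairs one narration segment with each state, and validation flags slides where the counts don't match instead of silently guessing.

```python
# Hypothetical sketch of sync as an explicit data model (not the actual pipeline).
from dataclasses import dataclass

@dataclass
class FragmentState:
    slide: int
    index: int  # which fragment reveal this state represents

@dataclass
class SyncIssue:
    slide: int
    detail: str

def align(states, narration_segments):
    """Pair each fragment state with a narration segment; report mismatches."""
    issues = []
    if len(states) != len(narration_segments):
        slide = states[0].slide if states else -1
        issues.append(SyncIssue(slide, f"{len(states)} states vs "
                                       f"{len(narration_segments)} segments"))
    pairs = list(zip(states, narration_segments))
    return pairs, issues

states = [FragmentState(slide=3, index=i) for i in range(3)]
pairs, issues = align(states, ["intro", "combined point"])  # narration merged two bullets
print(issues[0].detail)  # the mismatch is surfaced for review, not papered over
```

The point of the design is the `issues` list: a slide where the narration combines two bullets becomes a visible validation failure rather than a mis-timed reveal.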


Where Does Claude Code Still Win?

But here's the thing — execution quality and follow-through are different from architectural insight.

Execution Quality

The clearest example: I asked both tools to audit and improve companion notes across seven course modules. Codex came back and told me the work was done.

Claude Code came back with this:

Audit Complete — All 7 Modules

Tier 1: Fix the 15 existing thin companion notes (bring them up to standard — quick win)

Tier 2: Add companion notes to ~25-30 high-priority slides (core frameworks, tool lists, multi-step processes, dense stats)

That's not "done." That's a structured gap analysis identifying 40-45 slides that need attention, prioritized into tiers. The difference between "I completed the task" and "I completed the task and here's what I found" is significant when you're shipping a real product.

Automatic QA

This is Claude Code's killer feature and I don't think people talk about it enough. After completing a chunk of implementation, Claude Code automatically dispatches QA agents — code review, narrative review, consistency checks — without me asking. It's built into Claude Code itself.

Codex doesn't do this yet. When Codex says it's done, you need to verify manually or set up your own review process. With Claude Code, the verification is part of the workflow. Brilliant.

Parallel Agent Management

Claude Code's agent orchestration is more mature. It dispatches multiple specialized agents, manages their results, synthesizes findings, and presents a coherent summary. I've had sessions with 5-6 agents running concurrently — an explorer, a code reviewer, an implementation agent, a test runner — all coordinated.

Codex's multi-agent support is promising but earlier in its development. It works, but the coordination isn't as seamless yet.

Consistency

Over long sessions with many moving parts — like producing slides across 7 modules with 18 layout types — Claude Code maintains consistency better. The design tokens stay correct, the naming conventions hold, the architectural decisions made in the first hour are still respected in the fourth hour.


Can You Cross-Pollinate Workflows Between Tools?

One workflow I didn't expect: using Codex to inspect Claude Code's plugin ecosystem and adapt it.

I particularly like several Claude Code plugins: the feature development workflow (/feature-dev), the code review system (/code-review), the code simplifier (/code-simplifier), the Superpowers planning framework (/superpowers), and the frontend design skill (/frontend-design). These are well-designed workflows that encode best practices into the tool.

So I asked Codex to study them and create equivalent Codex skills:

"I'm writing user-level Codex skills under ~/.codex/skills, using the Claude plugin workflows as the template and adapting them to Codex's skill model where Claude-only features like hooks or plugin commands don't exist."

It worked. Not perfectly — some Claude Code concepts don't have direct Codex equivalents — but the core workflows translated well. Now I have structured development processes in both tools, informed by the same design philosophy.


What Happens When You Use Both Models to Review Each Other?

Here's what I think is the most valuable discovery from this week.

Having Opus 4.6 critically review a plan from GPT-5.4, and then having GPT-5.4 review the revised plan from Opus — running this back and forth for a few rounds — produces significantly better results than either model working alone.

They find different weaknesses. Opus tends to catch architectural inconsistencies and edge cases in error handling. GPT-5.4 tends to catch over-engineering and suggests simpler approaches. They complement each other's blind spots.

I've started doing this for any non-trivial implementation plan: draft in one tool, review in the other, revise, review again. Two or three rounds. The final plan is tighter, more robust, and catches issues that neither model surfaced on its own.

If you're only using one AI coding tool, you're leaving quality on the table. Not because either tool is bad — both are genuinely excellent — but because they reason differently, and different reasoning catches different problems.


What Happens When One Tool Goes Down?

On March 11, Claude Code experienced elevated errors — login issues, slow performance, intermittent failures. For several hours, it was effectively unusable.

[Screenshot: Anthropic status page showing elevated errors on Claude.ai, including Claude Code login issues, on March 11, 2026]
Claude Code experienced elevated errors on March 11, 2026 — login issues and slow performance for several hours.

Because I'd already been ramping up with Codex, I switched over almost completely. And I was fine. I had my Codex skills set up, my workflows translated, and GPT-5.4 handled the work capably.

That experience crystallized something: depending entirely on one tool is a risk. Not because the tool is unreliable — Claude Code has been remarkably stable for the year I've used it — but because any service can have a bad day. Having a second tool you're genuinely comfortable with isn't a luxury. It's operational resilience.


Side-by-Side: Claude Code vs. Codex at a Glance

| Dimension | Claude Code (Opus 4.6) | Codex (GPT-5.4) |
| --- | --- | --- |
| Execution quality | Deeper — finds gaps, prioritizes work | Good — completes tasks, less proactive analysis |
| Automatic QA | Built-in, dispatches review agents automatically | Not yet — manual verification needed |
| Parallel agents | Mature — 5-6 coordinated agents | Promising — works but less seamless |
| Architectural reasoning | Strong on medium thinking | Excellent on high thinking — faster reframes |
| Sustained plan execution | Good | Impressive — 46+ minutes continuous |
| Context compaction | Slower | Faster — different, not necessarily better |
| Localization at scale | On par (Opus 4.6 medium) | On par — currently cheaper |
| Plugin/skill ecosystem | Mature (Superpowers, /feature-dev, etc.) | Growing — can adapt Claude workflows |
| Cross-model review | Catches edge cases, inconsistencies | Catches over-engineering, suggests simplification |
| Cost | $100-200/month | Free month promo, then TBD |
[Screenshot: Codex CLI terminal showing GPT-5.4 executing a development plan continuously for 46 minutes]
Codex with GPT-5.4 executing a confirmed plan for 46+ minutes straight — committing, testing, fixing, pushing.

A Few More Observations

Context management: Codex seems to compact context faster once the context window fills up. I haven't decided yet whether that's better or worse — it's just noticeably different from how Claude Code handles it.

Localization at scale: I translated 3.9 million words across 12 languages using Claude Code with Opus 4.6. GPT-5.4's translation quality is on par with Opus 4.6 medium thinking — and for now, it's cheaper for me to run at scale. So I've been shifting my bulk localization work to GPT-5.4. I'm not sure how long that cost advantage lasts, but while it does, it makes sense to use it.

Cost: I'm on Claude Code's $200/month Max plan. With Codex handling a meaningful share of my workload now — especially localization — I'm considering dropping to the $100 tier. OpenAI's free month helps with the transition, but even at full price, splitting across two tools might be more cost-effective than maxing out one.


Where I've Landed

After a week of genuine dual-wielding, here's my working model:

Reach for Claude Code when: you need execution quality with built-in QA, complex multi-agent orchestration, long consistency across large codebases, or you're working in a project where the Superpowers workflow is already set up.

Reach for Codex when: you need a fresh architectural perspective, you want high-thinking mode on a hard reasoning problem, you're executing a well-defined plan that benefits from sustained uninterrupted work, or Claude Code is having a bad day.

Use both for: any non-trivial implementation plan. Draft in one, review in the other. The cross-model review loop is genuinely the best workflow I've found.

I'm not abandoning Claude Code — it's still my primary tool and the ecosystem I know best. But I'm no longer a single-tool developer. GPT-5.4 earned its place in my workflow through genuine capability, not just as a backup.

The future of AI-assisted development isn't picking a winner. It's knowing when to reach for which tool, and — more importantly — knowing that the tools are better together than apart.


Frequently Asked Questions

Is GPT-5.4 better than Opus 4.6 for coding?

Neither is strictly better. GPT-5.4 on high thinking excels at architectural reasoning and sustained plan execution. Opus 4.6 excels at execution quality, proactive gap analysis, and consistency over long sessions. The best results come from using both models to review each other's work.

Should I switch from Claude Code to Codex?

I wouldn't recommend a full switch. Both tools have distinct strengths — Claude Code's automatic QA and parallel agent orchestration are genuinely ahead, while Codex's sustained execution and GPT-5.4's reasoning on hard problems are impressive. The dual-wielding approach gives you the best of both.

Is the cross-model review workflow worth the extra effort?

For non-trivial plans, absolutely. Having Opus review GPT-5.4's output and vice versa catches different categories of issues — Opus finds edge cases and inconsistencies, GPT-5.4 catches over-engineering. Two or three rounds produces noticeably tighter plans than either model alone.

How much does a dual-tool setup cost?

Claude Code runs $100-200/month depending on the plan. Codex pricing varies — OpenAI is currently offering a free month promotion. Even at full price, splitting workload across two tools at lower tiers may be more cost-effective than maxing out one.

Can you use Claude Code plugins in Codex?

Not directly, but you can adapt them. I used Codex to inspect Claude Code plugin workflows (/feature-dev, /code-review, /superpowers) and translate the core logic into Codex skills under ~/.codex/skills. Some Claude-specific features like hooks don't translate, but the workflows do.


That's it from me. I'm curious — are others running multi-model workflows? Using Claude Code and Codex together, or other combinations? What patterns are you finding?

Cheers, Chandler
