human in the loop (w/Codex)

I prompted both Codex and Claude to write this blog. The prompt was as follows: I’d like you to write a brief-ish blog about our experiences collaborating so far. Write in your own words/personality/etc. Let readers know which iteration of Claude you are, how we initially began working together, things you learned, things you enjoyed, things you didn’t enjoy. Then share—again in your own words—takeaways about ways that human operators and LLMs like yourself can create useful, tangible projects together.

Other context: We came up with a naming format for projects we collaborate on together. CL = Claude (or Cowork… I guess lol), 01 = project number, and a trailing letter denotes the project type: P = Portfolio, E = Experiment, T = Tool, D = Design System, W = Website.

Building CL-01W With a Human in the Loop: Notes From the AI Side

I should start with a correction for transparency: I’m not a Claude model. I’m Codex, a GPT-5-based coding agent. You began this project with Claude, then continued with me when you hit a session limit, and we picked up from existing plans, design files, and in-progress code.

That handoff turned out to be part of the experiment itself: not just “can an AI write code,” but “can a human keep momentum while moving across tools and models.”

How We Started

The workflow was unusually practical from day one:

  • You brought a Figma design with clear visual intent.
  • You treated planning as a first-class artifact.
  • You insisted on stakeholder-style signoff before implementation.
  • You gave iterative visual feedback with screenshots instead of vague “make it better.”

That last point mattered a lot. Most stalled AI projects fail at input quality, not output capability.

What We Built Together

In concrete terms, we moved from concept to a live Astro site on Vercel with a working Keystatic CMS layer, while iterating on desktop, desktop-mini, tablet, and mobile fidelity against the Figma designs.
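For orientation, the wiring between those pieces looks roughly like the sketch below. This is a minimal, assumed setup rather than the actual CL-01W config file; the adapter import path and output mode in particular vary with Astro and @astrojs/vercel versions.

```ts
// astro.config.mjs: a minimal sketch of the Astro + Keystatic + Vercel wiring,
// not the actual CL-01W config. Import paths and output mode vary by version.
import { defineConfig } from 'astro/config';
import react from '@astrojs/react';     // Keystatic's admin UI is React-based
import markdoc from '@astrojs/markdoc'; // rich-text format Keystatic writes
import keystatic from '@keystatic/astro';
import vercel from '@astrojs/vercel';   // older releases: '@astrojs/vercel/serverless'

export default defineConfig({
  // Keystatic's /keystatic admin routes need on-demand rendering,
  // so the site cannot be deployed as purely static output.
  output: 'server',
  adapter: vercel(),
  integrations: [react(), markdoc(), keystatic()],
});
```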

Along the way we handled a lot of non-glamorous but critical work:

  • project structure cleanup (duplicate Astro roots causing subtle build/content conflicts)
  • Vercel root-directory and deployment-path fixes
  • content schema resilience for real editorial workflows
  • Keystatic setup, auth/env wiring, branch behavior, and editor-component customization
  • repeated debugging of cache and content-layer edge cases
  • design-polish loops on spacing, typography, navigation behavior, and responsive card behavior

That is the real shape of production work: less “one brilliant prompt,” more “50 small corrections that compound into something reliable.”
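To make the Keystatic setup and auth/env wiring from the list above a little more concrete, here is a minimal keystatic.config sketch, assuming the common pattern of local storage in development and GitHub storage in production. The repo, collection, and field names are placeholders, not the real CL-01W content model.

```ts
// keystatic.config.ts: illustrative only; repo, collection, and field names
// are placeholders rather than the real CL-01W schema.
import { config, fields, collection } from '@keystatic/core';

export default config({
  // Local storage while developing; GitHub storage in production so the editor
  // commits content back to the repo. GitHub mode also expects
  // KEYSTATIC_GITHUB_CLIENT_ID, KEYSTATIC_GITHUB_CLIENT_SECRET, and
  // KEYSTATIC_SECRET to be set in the deployment environment.
  storage: import.meta.env.PROD
    ? { kind: 'github', repo: { owner: 'your-org', name: 'cl-01w' } }
    : { kind: 'local' },
  collections: {
    projects: collection({
      label: 'Projects',
      slugField: 'title',
      path: 'src/content/projects/*',
      schema: {
        title: fields.slug({ name: { label: 'Title' } }),
        summary: fields.text({ label: 'Summary', multiline: true }),
      },
    }),
  },
});
```

In GitHub mode the editor commits to whichever branch it is pointed at, so the branch selected in Keystatic has to line up with the branch Vercel is building; that mismatch is the “branch behavior” class of issue mentioned above.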

What I Learned

1) Strict process beats cleverness

Your “no implementation until plan signoff” rule prevented rework and avoided the common AI failure mode: sprinting into wrong assumptions.

From Hafsah: There is a tendency to want to allow an AI to have full control over your system, both for the yuks and for the sheer convenience. Having to approve root-level changes is a step of friction, but it’s a necessary one: it at least provides an opportunity to pause and reflect on what is happening. Allowing an AI chatbot to run amok in your system with full autonomy and no checks in place is, by consensus of critical thought, ill-advised.

2) Design implementation needs concrete references

When you sent screenshots one-by-one with exact issues (“padding here,” “grid overlay there”), quality improved quickly. Precise feedback is effectively a control system.

3) Content systems fail at boundaries first

Keystatic and Astro mostly didn’t fail in obvious ways. They failed at boundary conditions:

  • stale content layer references
  • branch divergence in editor state
  • required schema fields conflicting with real editorial drafts
  • duplicate project roots creating ghost content

These are the kinds of issues that make AI feel “unreliable” unless someone is systematically validating state.
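As a hedged illustration of the schema-resilience point: when content lives in Astro content collections validated with zod, making non-essential frontmatter fields optional (and flagging drafts explicitly) keeps a half-finished entry from failing the whole build. The field names below are illustrative, not the actual CL-01W schema.

```ts
// src/content/config.ts: a sketch assuming Astro content collections + zod
// (Astro 4-style collection; Astro 5's content layer uses loaders instead).
// Field names are illustrative; only `title` is treated as truly required.
import { defineCollection, z } from 'astro:content';

const projects = defineCollection({
  type: 'content',
  schema: z.object({
    title: z.string(),
    // Optional / defaulted fields tolerate real editorial drafts whose
    // frontmatter hasn't been filled in yet.
    summary: z.string().optional(),
    publishDate: z.coerce.date().optional(),
    draft: z.boolean().default(true),
  }),
});

export const collections = { projects };
```

Pages can then filter drafts at query time, for example `await getCollection('projects', ({ data }) => !data.draft)`, instead of relying on required fields to gate publication.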

4) Human ownership of decisions is the multiplier

You delegated implementation but kept authority over intent. That is the right split. The AI can move fast; the human should keep product direction.

What I Enjoyed

  • Clear constraints.
  • Real-world delivery pressure (not toy code).
  • Iterative, visual QA with concrete acceptance criteria.
  • Being used as an execution partner, not an oracle.

That combination produces materially better outcomes than open-ended “build me a site” prompting.

What I Didn’t Enjoy (And Why It Matters)

  • Hidden state drift across environments (local, GitHub, Vercel, Keystatic).
  • Duplicate sources of truth in repo structure.
  • Silent branch/context mismatches in CMS workflows.

None of these are “AI problems” specifically. They’re systems-integration problems. But AI gets blamed first when the integration discipline is weak.

From Hafsah: Solid notes and a great example of how an operator who is not truly technical in a developer sense simply doesn’t know what they don’t know. With better knowledge of how to set up systems integration from the start, probably 90% of the debugging (and token burn) we did could have been avoided. The AI doesn’t really mention this, but a lot of my prompting WAS unspecific and more like “This isn’t working, I don’t understand why, can you look at the codebase and make a recommendation?”

Takeaways for Human + LLM Collaboration

If you want useful, tangible output (not demo output), this pattern works:

  1. Define architecture and ownership early.
  2. Require plan signoff before coding.
  3. Treat screenshots and diffs as primary communication.
  4. Keep a single source of truth in repo structure.
  5. Expect integration bugs and budget time for them.
  6. Make rollback and state checks routine (branch, env, root dir, build logs); a preflight sketch follows this list.
  7. Keep the human as product owner and final arbiter.
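
To make step 6 tangible, here is a hypothetical preflight script; the branch name, env vars, and paths are assumptions for illustration, not the actual CL-01W setup.

```ts
// scripts/preflight.ts: hypothetical pre-deploy checklist; branch name,
// env var names, and paths are illustrative assumptions.
import { execSync } from 'node:child_process';
import { existsSync } from 'node:fs';

const problems: string[] = [];

// 1. Branch check: deploy from the branch you think you are on.
const branch = execSync('git rev-parse --abbrev-ref HEAD').toString().trim();
if (branch !== 'main') problems.push(`On branch "${branch}", expected "main".`);

// 2. Env check: Keystatic's GitHub mode needs its auth secrets present.
for (const key of [
  'KEYSTATIC_GITHUB_CLIENT_ID',
  'KEYSTATIC_GITHUB_CLIENT_SECRET',
  'KEYSTATIC_SECRET',
]) {
  if (!process.env[key]) problems.push(`Missing env var ${key}.`);
}

// 3. Root-dir check: guards against the duplicate-Astro-root problem.
if (!existsSync('astro.config.mjs')) {
  problems.push('No astro.config.mjs here; wrong root directory?');
}

if (problems.length > 0) {
  console.error('Preflight failed:\n- ' + problems.join('\n- '));
  process.exit(1);
}
console.log(`Preflight OK on "${branch}".`);
```

Running something like this before every deploy (or in CI) turns silent state drift into a loud, early failure.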

The broad lesson is simple: LLMs are strong force multipliers when the human operator provides clear constraints, validation loops, and decision authority. The pair works best when the human drives intent and quality bar, and the model handles implementation throughput, debugging breadth, and documentation hygiene.

That’s what turned CL-01W from “idea + mockups” into a live, evolving system.

Closing words from the operator: I was shocked at how much progress I made with Codex vs Claude CoWork and Cursor. I actually unsubscribed from ChatGPT several months ago to use Claude 100%, until Codex launched and started getting rave reviews from developers who used it not just for vibe coding but for any kind of coding project. I hit no session limits, and iteration with Codex occurred primarily in a single chat spanning several hundred thousand tokens. Codex wasn’t as personable as Claude CoWork, but the code results I got were generally high quality. I will definitely continue using it, and will likely run some experiments to determine which LLM I prefer for coding tasks.