Multi-Agent Development Workflow

Case Study

A controlled system for using ChatGPT, Claude Chat, Claude Code, Codex, and Playwright to ship production software without agent drift.


At a Glance

Role: Operator · AI orchestrator
Status: Documented methodology — active across Cadence + Puppy Program OS
Domain: AI-native software development
Stack: Claude Chat · Claude Code · ChatGPT · Codex · Playwright · GitHub · Vercel · Supabase
What I did: Designed a 6-step loop coordinating multiple AI models across planning, implementation, QA, and deployment
Links: Cadence case study · Puppy Program OS case study

The Interchangeability Trap

Most builders treat AI coding tools as interchangeable. That creates context loss, tool drift, and compounding bugs. The real unlock is routing work to the right agent with clear lane separation and handoff rules.

The Orchestration Tax

Running more agents does not remove the human bottleneck. It usually moves the bottleneck to review, judgment, merge control, and architectural consistency.

In AI-assisted development, the scarce resource is not AI output. The scarce resource is operator attention.

That is why this workflow is intentionally sequential. Each task moves through one executor, one repo state, one QA gate, and one commit boundary. This reduces orchestration tax: fewer conflicting diffs, fewer cold context reloads, and less risk of merging code I no longer understand.

The bottleneck in AI-assisted development is no longer code generation. It is review throughput, judgment quality, and keeping a live mental model of the system.

The Problem

Without workflow discipline

Vague mega-prompts to AI
Agent drifts from product model
Unclear who owns each decision
Uncommitted changes pile up
QA is an afterthought
4–5 hours lost to dirty worktree recovery

With structured workflow

Scoped, single-task prompts
One executor per task
Human owns product, agent owns execution
Clean commit after every task
QA gate before next task
Predictable, repeatable shipping

Workflow Map

Agent Role Matrix

Tool	Role	Must NOT do
Claude Chat	Planning, architecture, prompt design, scope decisions	Code without a plan
ChatGPT	Debate, critique, alternative perspectives	Own the implementation
Claude Code	Main executor: features, fixes, RLS, git operations	Make broad product decisions without human direction
Codex	Narrow patches, refactor, cleanup, secondary review	Own the full repo or rewrite working code
Playwright	Real-browser QA, regression testing, flow verification	Replace human judgment on UX
Git	Checkpoint, handoff boundary, clean state enforcement	Allow messy parallel work on the same codebase
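
The matrix above can be made machine-checkable. The following is a minimal sketch, not the setup actually used in these projects: the lane names, the `LANES` table, and the `canTake` and `route` helpers are all hypothetical, chosen only to mirror the Role column. Git is left out because it acts as a checkpoint mechanism rather than an agent that takes tasks.

```typescript
// Hypothetical encoding of the role matrix; every name here is illustrative.
type Tool = "Claude Chat" | "ChatGPT" | "Claude Code" | "Codex" | "Playwright";
type Lane = "planning" | "critique" | "implementation" | "patch" | "browser-qa";

// One narrow lane per tool, mirroring the Role column of the matrix.
const LANES: Record<Tool, readonly Lane[]> = {
  "Claude Chat": ["planning"],
  "ChatGPT": ["critique"],
  "Claude Code": ["implementation"],
  "Codex": ["patch"],
  "Playwright": ["browser-qa"],
};

// True only when the task's lane belongs to the tool.
function canTake(tool: Tool, lane: Lane): boolean {
  return LANES[tool].indexOf(lane) !== -1;
}

// Lists every tool allowed to take a lane (exactly one in this table).
function route(lane: Lane): Tool[] {
  return (Object.keys(LANES) as Tool[]).filter((tool) => canTake(tool, lane));
}

console.log(route("implementation"));
console.log(canTake("ChatGPT", "implementation"));
```

Keeping the table as data means a lane crossing shows up as a failed lookup before the prompt is sent, instead of as handoff debt afterwards.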

Lane-crossing debt

When a tool works outside its lane, the short-term speed gain creates downstream handoff debt: a prototyping tool touching database or security logic, a code executor making product-scope decisions, or a chat assistant producing implementation instructions without repo context.

Common Anti-Patterns

Anti-pattern	What breaks	Better operating rule
Treating all AI tools as interchangeable	Context gets smeared across tools and each agent repeats or contradicts prior decisions.	Assign each tool a narrow lane with explicit authority boundaries.
Sending vague mega-prompts	The agent optimizes for breadth, guesses at product intent, and produces hard-to-review diffs.	Send one scoped task with clear inputs, constraints, and expected output.
Letting executor agents make product decisions	Implementation speed starts overriding product model, permission, and UX tradeoffs.	Humans own scope and product judgment; executor agents own the assigned implementation.
Switching tools without a committed checkpoint	The next tool inherits partial context, dirty diffs, and unclear ownership.	Hand off only after a reviewed checkpoint with current repo state captured.
Debugging with repeated speculative prompts	Fixes stack on guesses, masking the root cause and creating new regressions.	Inspect the failing state, isolate the cause, then prompt for the smallest corrective change.
Letting context run too long without reset	Old assumptions linger and the agent starts optimizing against stale constraints.	Reset with a fresh summary, current files, and the latest accepted decisions.
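
The checkpoint rule in the table can be enforced with a small guard. The sketch below assumes the caller has already captured the text printed by `git status --porcelain` (version 1 format: two status characters, a space, then the path). The function names `dirtyPaths` and `canHandOff` are hypothetical.

```typescript
// Extracts the paths listed in `git status --porcelain` (v1) output.
// Each non-empty line is two status characters, one space, then the path.
// Renames appear as "old -> new" and are returned unchanged.
function dirtyPaths(porcelain: string): string[] {
  return porcelain
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => line.slice(3));
}

// A handoff is allowed only when the worktree has nothing uncommitted.
function canHandOff(porcelain: string): boolean {
  return dirtyPaths(porcelain).length === 0;
}

const sample = " M src/TeacherDashboard.tsx\n?? notes.md\n";
console.log(dirtyPaths(sample));
console.log(canHandOff(sample));
```

The guard takes text rather than running git itself, so it can be checked without a repository and the git invocation stays in one place.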

The Operating Loop

1. Plan

Scope the task in chat. Write the prompt.

2. Execute

One agent, one task, one repo state.

3. Build

npm run build + lint must pass.

4. Browser

Playwright or manual mobile check.

5. Commit

Clean git commit with descriptive message.

6. Next

Load updated project state. Repeat.

↺ repeat from Plan

No parallel agents on the same repo. No vague mega-prompts. No uncommitted handoff mess. One prompt, one task, one commit.
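
The gating part of the loop (build, lint, browser check, then commit) can be sketched as a sequence that stops at the first failure. This is an illustration under stated assumptions, not the tooling used here: the `Gate` type and the `runGates` and `readyToCommit` helpers are hypothetical. In a real setup each `run` function would wrap a command such as `npm run build`; stubs stand in for them below.

```typescript
type Gate = { name: string; run: () => boolean };
type GateResult = { name: string; ok: boolean };

// Runs gates in order and stops at the first failure,
// so later gates never run against a broken state.
function runGates(gates: Gate[]): GateResult[] {
  const results: GateResult[] = [];
  for (const gate of gates) {
    const ok = gate.run();
    results.push({ name: gate.name, ok });
    if (!ok) break;
  }
  return results;
}

// A commit is allowed only when at least one gate exists and all of them passed.
function readyToCommit(gates: Gate[]): boolean {
  if (gates.length === 0) return false;
  const results = runGates(gates);
  return results.length === gates.length && results.every((r) => r.ok);
}

// Stubbed gates standing in for the real build, lint, and browser checks.
const gates: Gate[] = [
  { name: "build", run: () => true },
  { name: "lint", run: () => true },
  { name: "browser", run: () => true },
];
console.log(readyToCommit(gates));
```

Stopping early is deliberate: a failed build sends the task back to the executor instead of spending review attention on browser results that may not hold.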

I use tools like Lovable, Stitch, Canva, and Gemini for fast layout exploration, visual direction, and rapid idea testing, not as final authority over the product. Production-facing work is tightened through local development, Claude Code/Codex refactors, Supabase review, Vercel deployment checks, and manual QA.

Real Example — Cadence Task 8

Teacher UI Polish: one task moving through the full workflow.

Claude Chat — Planning

Redesign teacher dashboard student cards from simple list to 4-column × 2-row layout with book-level color coding, parent status, practice frequency, and next lesson time.

→ Output: Design lock document (cadence_task_8_design_lock.md)


Claude Code — Execution

Implement BookChip component, refactor TeacherDashboard, add bookColors.ts constants, wire up v_student_weekly_log_count view.

→ Output: 8 files changed, new component, constants file, view integration

BookChip.tsx · bookColors.ts · TeacherDashboard refactor

Playwright + Manual — QA

Verify card rendering at a 380px viewport width on iPhone Safari. Check that book colors match the spec. Confirm parent status shows the correct one of its three states.

→ Output: All checks pass. One spacing fix caught on mobile.

✓ 4/4 Playwright tests passed

Git — Commit

feat: teacher dashboard student card 4×2 layout with BookChip component

→ Output: Clean commit. Project state updated. Ready for next task.
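
A commit subject like the one above can be checked before it lands. The sketch assumes a `<type>: <summary>` convention inferred from the `feat:` prefix in this example. The list of allowed types and the `parseCommitSubject` helper are hypothetical, not a documented project rule.

```typescript
// Illustrative list; only "feat" appears in the example above.
const ALLOWED_TYPES = ["feat", "fix", "refactor", "chore", "docs", "test"];

// Returns the parsed parts of a "<type>: <summary>" subject line,
// or null when the subject does not follow the assumed convention.
function parseCommitSubject(
  subject: string
): { type: string; summary: string } | null {
  const match = /^([a-z]+): (\S.*)$/.exec(subject);
  if (match === null) return null;
  const type = match[1];
  const summary = match[2];
  if (ALLOWED_TYPES.indexOf(type) === -1) return null;
  return { type, summary };
}

console.log(
  parseCommitSubject(
    "feat: teacher dashboard student card 4×2 layout with BookChip component"
  )
);
console.log(parseCommitSubject("wip"));
```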


Four Lessons

1. Planning artifacts are shared memory

Eight Markdown files (~40 pages) served as context for every AI agent. Each session started by loading the current project state, not by re-explaining the product. Without these files, every conversation starts from zero.
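
Loading current state at the start of a session can be as simple as concatenating the planning files into one preamble. The sketch below is an assumption about how that might look, not the actual process. The `PlanningDoc` type, the `buildSessionContext` helper, and the section layout are hypothetical. File contents are passed in as strings so the example carries no filesystem dependency.

```typescript
type PlanningDoc = { path: string; body: string };

// Builds the preamble for a new session:
// one labeled section per planning doc, then the task being handed over.
function buildSessionContext(docs: PlanningDoc[], task: string): string {
  const sections = docs.map((doc) => `## ${doc.path}\n${doc.body.trim()}`);
  return sections.concat([`## Current task\n${task.trim()}`]).join("\n\n");
}

const context = buildSessionContext(
  [{ path: "cadence_task_8_design_lock.md", body: "Student cards: 4 columns x 2 rows.\n" }],
  "Implement BookChip component"
);
console.log(context);
```

Labeling each section with its file path lets the agent cite which document a constraint came from, which makes stale assumptions easier to trace.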

2. Sequential over parallel

Running multiple agents on the same codebase at the same time creates merge conflicts, duplicated work, cold context reloads, and compounding bugs. Sequential execution with clean handoffs is slower per-task but faster per-project because it protects the operator's review bandwidth.

3. Context degrades at model boundaries

When work crosses from Claude Code to Codex or back, context is partially lost. The planning docs bridge that gap. Without them, each agent reinvents decisions the previous one already made.

4. Selective debate improves decisions

Using ChatGPT to challenge Claude's architecture suggestions (or vice versa) caught assumptions that a single model wouldn't question. But debate must be bounded — unlimited back-and-forth wastes time.
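
One way to keep debate bounded is a hard round limit. The sketch below is hypothetical: the `Critic` and `Reviser` function types stand in for calls to two different models, and `boundedDebate` is an illustrative name. When the limit is reached without agreement, the decision goes back to the operator.

```typescript
// Returns an objection, or null when the critic accepts the proposal.
type Critic = (proposal: string) => string | null;
// Produces a revised proposal in response to an objection.
type Reviser = (proposal: string, objection: string) => string;

type DebateOutcome = { proposal: string; rounds: number; settled: boolean };

// Alternates critique and revision for at most maxRounds critiques.
function boundedDebate(
  initial: string,
  critic: Critic,
  revise: Reviser,
  maxRounds: number
): DebateOutcome {
  let proposal = initial;
  for (let round = 1; round <= maxRounds; round++) {
    const objection = critic(proposal);
    if (objection === null) {
      return { proposal, rounds: round, settled: true };
    }
    proposal = revise(proposal, objection);
  }
  // Limit reached: hand the latest proposal to the operator to decide.
  return { proposal, rounds: maxRounds, settled: false };
}

// Stub critic that never agrees, to show the limit taking effect.
const outcome = boundedDebate("v0", () => "too broad", (p) => p + "+", 2);
console.log(outcome);
```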

What This Proves

Product judgment

Scope decisions, feature boundaries, and deliberate “no” choices that protected the core product model across hundreds of commits.

Technical coordination

Multiple AI agents (ChatGPT, Claude Chat, Claude Code, Codex) assigned distinct roles with no overlapping authority and no unscoped execution.

QA discipline

Every task verified through lint, build, Playwright automation, and manual browser/mobile testing before commit.

AI-native execution

AI generated the code. The operator controlled the architecture, sequencing, permissions, and quality gates.
