
Multi-Agent Development Workflow
Case Study
A controlled system for using ChatGPT, Claude Chat, Claude Code, Codex, and Playwright to ship production software without agent drift.
At a Glance
- Role: Operator · AI orchestrator
- Status: Documented methodology, active across Cadence and Puppy Program OS
- Domain: AI-native software development
- Stack: Claude Chat · Claude Code · ChatGPT · Codex · Playwright · GitHub · Vercel · Supabase
- What I did: Designed a 7-step loop coordinating multiple AI models across planning, implementation, QA, and deployment
The Interchangeability Trap
Most builders treat AI coding tools as interchangeable. The result is context loss, tool drift, and compounding bugs. The real unlock is routing each task to the right agent, with clear lane separation and handoff rules.
The Orchestration Tax
Running more agents does not remove the human bottleneck. It usually moves the bottleneck to review, judgment, merge control, and architectural consistency.
In AI-assisted development, the scarce resource is not AI output. The scarce resource is operator attention.
That is why this workflow is intentionally sequential. Each task moves through one executor, one repo state, one QA gate, and one commit boundary. This reduces orchestration tax: fewer conflicting diffs, fewer cold context reloads, and less risk of merging code I no longer understand.
The bottleneck in AI-assisted development is no longer code generation. It is review throughput, judgment quality, and keeping a live mental model of the system.
The Problem
Without workflow discipline
- Vague mega-prompts to AI
- Agent drifts from product model
- Unclear who owns each decision
- Uncommitted changes pile up
- QA is an afterthought
- 4–5 hours lost to dirty worktree recovery
With structured workflow
- Scoped, single-task prompts
- One executor per task
- Human owns product, agent owns execution
- Clean commit after every task
- QA gate before next task
- Predictable, repeatable shipping
Workflow Map
Agent Role Matrix
| Tool | Role | Must NOT do |
|---|---|---|
| Claude Chat | Planning, architecture, prompt design, scope decisions | Random coding without a plan |
| ChatGPT | Debate, critique, alternative perspectives | Own the implementation |
| Claude Code | Main executor: features, fixes, RLS (row-level security) policies, git operations | Broad product decisions without human direction |
| Codex | Narrow patches, refactors, cleanup, secondary review | Owning the full repo or rewriting working code |
| Playwright | Real-browser QA, regression testing, flow verification | Replacing human judgment on UX |
| Git | Checkpoint, handoff boundary, clean state enforcement | Messy parallel work on the same codebase |
Lane-crossing debt
When a tool works outside its lane, the short-term speed gain creates downstream handoff debt: a prototyping tool touching database or security logic, a code executor making product-scope decisions, or a chat assistant producing implementation instructions without repo context.
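The matrix and the lane rule can be encoded as a small lookup, so that a task is rejected before it reaches the wrong tool. This is a minimal sketch, not the tooling used on these projects: the `TaskKind` labels, the `LANES` table, `routeTask`, and `isLaneCrossing` are hypothetical names that paraphrase the Role column. Git is left out because the matrix lists it as a checkpoint rather than an executor.

```typescript
// Hypothetical task categories paraphrasing the Role column of the matrix.
type TaskKind =
  | "planning"
  | "critique"
  | "feature"
  | "narrow-patch"
  | "browser-qa";

type Tool = "Claude Chat" | "ChatGPT" | "Claude Code" | "Codex" | "Playwright";

// One owning tool per task kind, so no two tools share authority.
const LANES: Record<TaskKind, Tool> = {
  planning: "Claude Chat",
  critique: "ChatGPT",
  feature: "Claude Code",
  "narrow-patch": "Codex",
  "browser-qa": "Playwright",
};

// Returns the tool that owns a task kind.
function routeTask(kind: TaskKind): Tool {
  return LANES[kind];
}

// Flags a lane crossing: a tool asked to do work that another tool owns.
function isLaneCrossing(tool: Tool, kind: TaskKind): boolean {
  return LANES[kind] !== tool;
}
```

Keeping the table total over `TaskKind` means a new kind of work cannot be added without deciding which tool owns it.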
Common Anti-Patterns
| Anti-pattern | What breaks | Better operating rule |
|---|---|---|
| Treating all AI tools as interchangeable | Context gets smeared across tools and each agent repeats or contradicts prior decisions. | Assign each tool a narrow lane with explicit authority boundaries. |
| Sending vague mega-prompts | The agent optimizes for breadth, guesses at product intent, and produces hard-to-review diffs. | Send one scoped task with clear inputs, constraints, and expected output. |
| Letting executor agents make product decisions | Implementation speed starts overriding product model, permission, and UX tradeoffs. | Humans own scope and product judgment; executor agents own the assigned implementation. |
| Switching tools without a committed checkpoint | The next tool inherits partial context, dirty diffs, and unclear ownership. | Handoff only after a reviewed checkpoint with current repo state captured. |
| Debugging with repeated speculative prompts | Fixes stack on guesses, masking the root cause and creating new regressions. | Inspect the failing state, isolate the cause, then prompt for the smallest corrective change. |
| Letting context run too long without reset | Old assumptions linger and the agent starts optimizing against stale constraints. | Reset with a fresh summary, current files, and the latest accepted decisions. |
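
The checkpoint rule in the table (hand off only after a reviewed checkpoint) can be expressed as a guard that runs before switching tools. The sketch below is an assumption about how such a check could look: `RepoState`, its fields, and `handoffBlockers` are hypothetical, and in practice the fields would be filled from commands such as `git status --porcelain` and the build output.

```typescript
// Hypothetical snapshot of the repo at the moment of a handoff.
interface RepoState {
  uncommittedFiles: string[]; // e.g. paths reported by `git status --porcelain`
  buildPassed: boolean;
  lintPassed: boolean;
  reviewedByHuman: boolean;
}

// Returns the reasons a handoff should be blocked.
// An empty list means the next tool inherits a clean, reviewed state.
function handoffBlockers(state: RepoState): string[] {
  const blockers: string[] = [];
  if (state.uncommittedFiles.length > 0) {
    blockers.push(`dirty worktree: ${state.uncommittedFiles.length} uncommitted file(s)`);
  }
  if (!state.buildPassed) blockers.push("build is failing");
  if (!state.lintPassed) blockers.push("lint is failing");
  if (!state.reviewedByHuman) blockers.push("checkpoint has not been reviewed");
  return blockers;
}
```

Returning every blocker, rather than failing on the first one, gives the operator the whole cleanup list in a single pass.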
The Operating Loop
- Scope the task in chat. Write the prompt.
- One agent, one task, one repo state.
- npm run build + lint must pass.
- Playwright or manual mobile check.
- Clean git commit with descriptive message.
- Load updated project state. Repeat.
No parallel agents on the same repo. No vague mega-prompts. No uncommitted handoff mess. One prompt, one task, one commit.
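The loop can be read as a sequence of gates in which a failure stops the task before it reaches the commit. A minimal sketch, assuming each gate is reduced to a boolean result: the stage labels, `Gates`, and `runLoop` are illustrative names, not the author's tooling.

```typescript
// Illustrative labels, one per stage of the loop described above.
const STEPS = ["scope", "execute", "build-and-lint", "qa", "commit", "reload-state"] as const;
type Step = (typeof STEPS)[number];

// Each gate reports whether its stage passed. Real gates would wrap
// commands such as `npm run build` or a Playwright run.
type Gates = Record<Step, () => boolean>;

interface LoopResult {
  completed: Step[];
  stoppedAt: Step | null; // null means the task went all the way through
}

// Runs the stages strictly in order and stops at the first failing gate,
// so nothing is committed unless build, lint, and QA have passed.
function runLoop(gates: Gates): LoopResult {
  const completed: Step[] = [];
  for (const step of STEPS) {
    if (!gates[step]()) {
      return { completed, stoppedAt: step };
    }
    completed.push(step);
  }
  return { completed, stoppedAt: null };
}
```

Because the stages run in a fixed order, a failed QA gate leaves `commit` out of `completed`, which is the property the workflow depends on.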
I use tools like Lovable, Stitch, Canva, and Gemini for fast layout exploration, visual direction, and rapid idea testing, not as final authority over the product. Production-facing work is tightened through local development, Claude Code/Codex refactors, Supabase review, Vercel deployment checks, and manual QA.
Real Example — Cadence Task 8
Teacher UI Polish: one task moving through the full workflow.
Redesign the teacher dashboard student cards from a simple list to a 4-column × 2-row layout with book-level color coding, parent status, practice frequency, and next lesson time.
→ Output: Design lock document (cadence_task_8_design_lock.md)
Implement BookChip component, refactor TeacherDashboard, add bookColors.ts constants, wire up v_student_weekly_log_count view.
→ Output: 8 files changed, new component, constants file, view integration
BookChip.tsx · bookColors.ts · TeacherDashboard refactor
Verify card rendering at 380px width on iPhone Safari. Check that book colors match the spec. Confirm that parent status shows the correct one of its three states.
→ Output: All checks pass. One spacing fix caught on mobile.
✓ 4/4 Playwright tests passed
feat: teacher dashboard student card 4×2 layout with BookChip component
→ Output: Clean commit. Project state updated. Ready for next task.
Four Lessons
1. Planning artifacts are shared memory
Eight markdown files (~40 pages) served as context for every AI agent. Each session started by loading the current state, not by re-explaining the product. Without that shared context, every conversation would start from zero.
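One way to make "load current state" mechanical is to assemble the planning documents into a single preamble at the start of each session. This is a minimal sketch under stated assumptions: `PlanningDoc`, `buildSessionContext`, and the placeholder document body are hypothetical, and only the design lock file name comes from this case study.

```typescript
// Hypothetical shape for a planning artifact kept alongside the code.
interface PlanningDoc {
  filename: string;
  body: string;
}

// Joins the planning docs into one preamble, each under its own file header,
// followed by the single scoped task for this session.
function buildSessionContext(docs: PlanningDoc[], task: string): string {
  const sections = docs.map((doc) => `## ${doc.filename}\n${doc.body.trim()}`);
  return sections.concat(`## Task\n${task.trim()}`).join("\n\n");
}

// Placeholder content for illustration only.
const context = buildSessionContext(
  [{ filename: "cadence_task_8_design_lock.md", body: "Student cards use a 4x2 grid.\n" }],
  "Implement the BookChip component.",
);
```

Putting the task last keeps the prompt scoped to one task while every agent reads the same accepted decisions first.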
2. Sequential over parallel
Running multiple agents on the same codebase at the same time creates merge conflicts, duplicated work, cold context reloads, and compounding bugs. Sequential execution with clean handoffs is slower per-task but faster per-project because it protects the operator's review bandwidth.
3. Context degrades at model boundaries
When work crosses from Claude Code to Codex or back, context is partially lost. The planning docs bridge that gap. Without them, each agent reinvents decisions the previous one already made.
4. Selective debate improves decisions
Using ChatGPT to challenge Claude's architecture suggestions (or vice versa) caught assumptions that a single model would not question on its own. But debate must be bounded: unlimited back-and-forth wastes time.
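The bound can be made explicit as a round limit. This is a sketch, assuming the critic is a function that returns an objection, or `null` when it has none. `boundedDebate`, `Critic`, `Reviser`, and the default of three rounds are illustrative choices, not figures from this workflow.

```typescript
// A critic returns an objection, or null when it accepts the proposal.
type Critic = (proposal: string) => string | null;
// A reviser produces a new proposal that responds to the objection.
type Reviser = (proposal: string, objection: string) => string;

interface DebateOutcome {
  proposal: string;
  rounds: number; // number of revisions made
  settled: boolean; // false means the limit was hit and the human decides
}

// Alternates critique and revision, but never for more than maxRounds.
function boundedDebate(
  initial: string,
  critic: Critic,
  revise: Reviser,
  maxRounds = 3,
): DebateOutcome {
  let proposal = initial;
  for (let round = 0; round < maxRounds; round++) {
    const objection = critic(proposal);
    if (objection === null) {
      return { proposal, rounds: round, settled: true };
    }
    proposal = revise(proposal, objection);
  }
  // Limit reached: the last revision is returned without another critique.
  return { proposal, rounds: maxRounds, settled: false };
}
```

When `settled` is false, the decision goes back to the operator instead of triggering another round.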
What This Proves
Product judgment
Scope decisions, feature boundaries, and deliberate “no” choices that protected the core product model across hundreds of commits.
Technical coordination
Multiple AI agents (ChatGPT, Claude Chat, Claude Code, Codex) assigned distinct roles with no overlapping authority and no unscoped execution.
QA discipline
Every task verified through lint, build, Playwright automation, and manual browser/mobile testing before commit.
AI-native execution
AI generated the code. The operator controlled the architecture, sequencing, permissions, and quality gates.