
Multi-Agent Development Workflow

Case Study

A controlled system for using ChatGPT, Claude Chat, Claude Code, Codex, and Playwright to ship production software without agent drift.

At a Glance

Role: Operator · AI orchestrator
Status: Documented methodology, active across Cadence + Puppy Program OS
Domain: AI-native software development
Stack: Claude Chat · Claude Code · ChatGPT · Codex · Playwright · GitHub · Vercel · Supabase
What I did: Designed a 7-step loop coordinating multiple AI models across planning, implementation, QA, and deployment

The Interchangeability Trap

Most builders treat AI coding tools as interchangeable. That creates context loss, tool drift, and compounding bugs. The real unlock is routing work to the right agent with clear lane separation and handoff rules.


The Orchestration Tax

Running more agents does not remove the human bottleneck. It usually moves the bottleneck to review, judgment, merge control, and architectural consistency.

In AI-assisted development, the scarce resource is not AI output. The scarce resource is operator attention.

That is why this workflow is intentionally sequential. Each task moves through one executor, one repo state, one QA gate, and one commit boundary. This reduces orchestration tax: fewer conflicting diffs, fewer cold context reloads, and less risk of merging code I no longer understand.

The bottleneck in AI-assisted development is no longer code generation. It is review throughput, judgment quality, and keeping a live mental model of the system.


The Problem

Without workflow discipline

  • Vague mega-prompts to AI
  • Agent drifts from product model
  • Unclear who owns each decision
  • Uncommitted changes pile up
  • QA is an afterthought
  • 4–5 hours lost to dirty worktree recovery

With structured workflow

  • Scoped, single-task prompts
  • One executor per task
  • Human owns product, agent owns execution
  • Clean commit after every task
  • QA gate before next task
  • Predictable, repeatable shipping

Workflow Map

1. Load project context (PRD · architecture · design locks · repo state) → project state
2. Plan task (ChatGPT + Claude Chat) → task scope
3. Write scoped handoff (Operator scoping) → scoped prompt
4. Execute (Claude Code) → code diff
5. Patch / refactor (Codex · optional) → reviewed diff
6. QA gate (lint · build · Playwright · browser + mobile QA) → pass: continue to commit · fail: loop back for a fix
7. Commit + verify (Git checkpoint + deploy check) → next scoped task

Legend: Operator · AI planning · AI execution · Gate · Checkpoint
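
The map can also be read as a small state machine. The sketch below is illustrative only: the step names are shorthand for the nodes above, and sending a failed QA gate back to the Execute step is an assumption, since the map labels that edge only as "fail".

```typescript
// Workflow steps, named after the nodes in the map (shorthand, not the author's identifiers).
type Step =
  | "load-context"
  | "plan"
  | "handoff"
  | "execute"
  | "patch"
  | "qa-gate"
  | "commit";

type Outcome = "ok" | "fail";

// Returns the step that follows `current`.
// Assumption: a failed QA gate returns the task to "execute".
// `usePatch` models the optional Codex patch / refactor step.
function nextStep(current: Step, outcome: Outcome = "ok", usePatch = false): Step {
  switch (current) {
    case "load-context":
      return "plan";
    case "plan":
      return "handoff";
    case "handoff":
      return "execute";
    case "execute":
      return usePatch ? "patch" : "qa-gate";
    case "patch":
      return "qa-gate";
    case "qa-gate":
      return outcome === "ok" ? "commit" : "execute";
    case "commit":
      // The next scoped task starts by reloading project state.
      return "load-context";
  }
}
```

Because every step has exactly one successor for a given outcome, there is never more than one task in flight, which is the property the sequential design depends on.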

Agent Role Matrix

Tool | Role | Must NOT do
Claude Chat | Planning, architecture, prompt design, scope decisions | Random coding without a plan
ChatGPT | Debate, critique, alternative perspectives | Own the implementation
Claude Code | Main executor: features, fixes, RLS, git operations | Broad product decisions without human direction
Codex | Narrow patches, refactor, cleanup, secondary review | Owning the full repo or rewriting working code
Playwright | Real-browser QA, regression testing, flow verification | Replacing human judgment on UX
Git | Checkpoint, handoff boundary, clean state enforcement | Messy parallel work on the same codebase
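
One way to make the matrix enforceable is to encode it as data and check every assignment against it. This is a minimal sketch: the work categories are a simplification introduced here, not part of the original matrix, and Git is left out because it is a checkpoint mechanism and not an agent.

```typescript
type Tool = "Claude Chat" | "ChatGPT" | "Claude Code" | "Codex" | "Playwright";

// Hypothetical work categories, condensed from the Role column.
type Work =
  | "planning"
  | "critique"
  | "implementation"
  | "patch"
  | "browser-qa"
  | "product-judgment";

// Each tool gets exactly one lane in this simplified model.
const lanes: Record<Tool, readonly Work[]> = {
  "Claude Chat": ["planning"],
  "ChatGPT": ["critique"],
  "Claude Code": ["implementation"],
  "Codex": ["patch"],
  "Playwright": ["browser-qa"],
};

// True when the tool is allowed to take this kind of work.
function inLane(tool: Tool, work: Work): boolean {
  return lanes[tool].some((w) => w === work);
}

// Tools allowed to take a kind of work. An empty list means no tool owns it,
// so the decision stays with the human operator.
function route(work: Work): Tool[] {
  return (Object.keys(lanes) as Tool[]).filter((tool) => inLane(tool, work));
}
```

Here `route("product-judgment")` returns an empty list, which mirrors the rule that the human owns product and agents own execution.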

Lane crossing debt

When a tool works outside its lane, the short-term speed gain creates downstream handoff debt: a prototyping tool touching database or security logic, a code executor making product-scope decisions, or a chat assistant producing implementation instructions without repo context.

Common Anti-Patterns

Anti-pattern | What breaks | Better operating rule
Treating all AI tools as interchangeable | Context gets smeared across tools and each agent repeats or contradicts prior decisions. | Assign each tool a narrow lane with explicit authority boundaries.
Sending vague mega-prompts | The agent optimizes for breadth, guesses at product intent, and produces hard-to-review diffs. | Send one scoped task with clear inputs, constraints, and expected output.
Letting executor agents make product decisions | Implementation speed starts overriding product model, permission, and UX tradeoffs. | Humans own scope and product judgment; executor agents own the assigned implementation.
Switching tools without a committed checkpoint | The next tool inherits partial context, dirty diffs, and unclear ownership. | Handoff only after a reviewed checkpoint with current repo state captured.
Debugging with repeated speculative prompts | Fixes stack on guesses, masking the root cause and creating new regressions. | Inspect the failing state, isolate the cause, then prompt for the smallest corrective change.
Letting context run too long without reset | Old assumptions linger and the agent starts optimizing against stale constraints. | Reset with a fresh summary, current files, and the latest accepted decisions.
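
The rule "one scoped task with clear inputs, constraints, and expected output" can be turned into a template that refuses to render an under-scoped prompt. The sketch below is an assumption about how such a template might look; the `ScopedTask` shape and `renderHandoff` name are hypothetical.

```typescript
// Hypothetical shape for a scoped handoff.
interface ScopedTask {
  title: string;
  inputs: string[]; // files or docs the agent should read
  constraints: string[]; // what must not change
  expectedOutput: string; // the reviewable result
}

// Renders the handoff prompt, or throws if any required part is missing.
function renderHandoff(task: ScopedTask): string {
  if (
    task.inputs.length === 0 ||
    task.constraints.length === 0 ||
    task.expectedOutput.trim() === ""
  ) {
    throw new Error("Under-scoped handoff: inputs, constraints, and expected output are required");
  }
  const bullets = (items: string[]): string => items.map((item) => `- ${item}`).join("\n");
  return [
    `Task: ${task.title}`,
    "Inputs:",
    bullets(task.inputs),
    "Constraints:",
    bullets(task.constraints),
    `Expected output: ${task.expectedOutput}`,
  ].join("\n");
}
```

Failing loudly on a missing field pushes the scoping work back to the operator before any agent starts guessing at intent.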

The Operating Loop

1. Plan

Scope the task in chat. Write the prompt.

2. Execute

One agent, one task, one repo state.

3. Build

npm run build + lint must pass.

4. Browser

Playwright or manual mobile check.

5. Commit

Clean git commit with descriptive message.

6. Next

Load updated project state. Repeat.

↺ repeat from Plan
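
The Build and Browser steps amount to a gate that stops at the first failure. A minimal sketch follows, assuming the project defines `lint` and `build` npm scripts and uses the Playwright test runner; the `Runner` type is a hypothetical seam so the gate logic stays testable without actually spawning processes.

```typescript
// A runner executes one command and returns true when it exits with code 0.
type Runner = (command: string) => boolean;

// Gate commands in loop order: lint and build first, then browser tests.
// Assumption: "lint" and "build" exist as npm scripts in the project.
const gateCommands: string[] = ["npm run lint", "npm run build", "npx playwright test"];

interface GateResult {
  passed: boolean;
  failedAt?: string; // first command that failed, if any
}

// Runs each command in order and stops at the first failure,
// so later checks never run against a broken build.
function runGate(run: Runner, commands: string[] = gateCommands): GateResult {
  for (const command of commands) {
    if (!run(command)) {
      return { passed: false, failedAt: command };
    }
  }
  return { passed: true };
}
```

In a real script the runner might wrap Node's `child_process.spawnSync` and compare the returned `status` to 0. The manual mobile check stays outside the script.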

No parallel agents on the same repo. No vague mega-prompts. No uncommitted handoff mess. One prompt, one task, one commit.
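
The "no uncommitted handoff" rule can be checked mechanically before switching tools. The sketch below takes the text printed by `git status --porcelain`, which is empty for a clean worktree and otherwise lists one changed or untracked path per line. The function names are hypothetical.

```typescript
// Parses `git status --porcelain` output into the list of dirty paths.
// Each line has two status characters, a space, then the path.
function dirtyPaths(porcelainOutput: string): string[] {
  return porcelainOutput
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => line.slice(3));
}

// A handoff to the next tool is allowed only from a clean worktree.
function canHandOff(porcelainOutput: string): boolean {
  return dirtyPaths(porcelainOutput).length === 0;
}
```

Listing the dirty paths, and not only returning a boolean, makes it easier to decide whether the leftover changes belong in the current commit or should be discarded.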

I use tools like Lovable, Stitch, Canva, and Gemini for fast layout exploration, visual direction, and rapid idea testing, not as final authority over the product. Production-facing work is tightened through local development, Claude Code/Codex refactors, Supabase review, Vercel deployment checks, and manual QA.


Real Example — Cadence Task 8

Teacher UI Polish: one task moving through the full workflow.

1. Claude Chat: Planning

Redesign teacher dashboard student cards from simple list to 4-column × 2-row layout with book-level color coding, parent status, practice frequency, and next lesson time.

→ Output: Design lock document (cadence_task_8_design_lock.md)


2. Claude Code: Execution

Implement BookChip component, refactor TeacherDashboard, add bookColors.ts constants, wire up v_student_weekly_log_count view.

→ Output: 8 files changed, new component, constants file, view integration

BookChip.tsx · bookColors.ts · TeacherDashboard refactor

3. Playwright + Manual: QA

Verify card rendering at a 380px viewport width on iPhone Safari. Check that book colors match the spec. Confirm that parent status shows the correct one of its three states.

→ Output: All checks pass, with one spacing issue caught and fixed on mobile.

✓ 4/4 Playwright tests passed

4. Git: Commit

feat: teacher dashboard student card 4×2 layout with BookChip component

→ Output: Clean commit. Project state updated. Ready for next task.
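
The commit subject above follows a `type: description` shape. A small check like the one below could guard that shape before a commit lands. It is a sketch under assumptions: the list of allowed types and the 72-character limit are common conventions, not rules stated in this workflow.

```typescript
// Assumed set of commit types; the example above uses "feat".
const allowedTypes: string[] = ["feat", "fix", "refactor", "chore", "docs", "test"];

// Accepts subjects shaped like "type: description" or "type(scope): description".
// Assumption: subjects longer than maxLength are rejected.
function isCleanSubject(subject: string, maxLength = 72): boolean {
  const match = /^([a-z]+)(\([\w-]+\))?: (.+)$/.exec(subject);
  if (match === null) {
    return false;
  }
  const type = match[1] ?? "";
  return allowedTypes.indexOf(type) !== -1 && subject.length <= maxLength;
}
```

A check like this could run from a Git `commit-msg` hook, although the source does not say whether one is used.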



Four Lessons

1. Planning artifacts are shared memory

Eight markdown files (~40 pages) served as shared context for every AI agent. Each session started by loading the current state instead of re-explaining the product. Without this, every conversation starts from zero.
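
Loading current state at the start of a session could be as simple as joining the planning docs into one preamble. The sketch below is illustrative: the `PlanningDoc` shape, the character budget, and any file names other than `cadence_task_8_design_lock.md` are hypothetical.

```typescript
// One planning artifact: its file name and markdown body.
interface PlanningDoc {
  name: string;
  body: string;
}

// Joins planning docs, in the order given, into a single context preamble.
// Assumption: a character budget stands in for whatever limit the model imposes.
function buildContext(docs: PlanningDoc[], maxChars = 60000): string {
  const sections = docs.map((doc) => `## ${doc.name}\n${doc.body.trim()}`);
  const context = sections.join("\n\n");
  if (context.length > maxChars) {
    throw new Error(`Context is ${context.length} chars, over the ${maxChars} budget`);
  }
  return context;
}
```

Keeping the order fixed means every agent sees the same document first, which helps when comparing how different models respond to the same state.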

2. Sequential over parallel

Running multiple agents on the same codebase at the same time creates merge conflicts, duplicated work, cold context reloads, and compounding bugs. Sequential execution with clean handoffs is slower per-task but faster per-project because it protects the operator's review bandwidth.

3. Context degrades at model boundaries

When work crosses from Claude Code to Codex or back, context is partially lost. The planning docs bridge that gap. Without them, each agent reinvents decisions the previous one already made.

4. Selective debate improves decisions

Using ChatGPT to challenge Claude's architecture suggestions (or vice versa) caught assumptions that a single model wouldn't question. But debate must be bounded — unlimited back-and-forth wastes time.
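
Bounding the debate can be made explicit with a round limit. In the sketch below the `Critic` callback stands in for the second model, and the default of two rounds is an assumption, not a number taken from this workflow.

```typescript
// A critic reviews the proposal and returns an objection, or null when it has none.
type Critic = (proposal: string, round: number) => string | null;

interface DebateResult {
  rounds: number; // rounds actually used
  objections: string[]; // objections collected along the way
  settled: boolean; // true when the critic ran out of objections
}

// Runs at most `maxRounds` critique rounds, then stops regardless of outcome.
function boundedDebate(proposal: string, critic: Critic, maxRounds = 2): DebateResult {
  const objections: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    const objection = critic(proposal, round);
    if (objection === null) {
      return { rounds: round, objections, settled: true };
    }
    objections.push(objection);
  }
  // Round budget spent: the operator decides with the objections collected so far.
  return { rounds: maxRounds, objections, settled: false };
}
```

An unsettled result is not a failure. It hands the collected objections to the operator, who makes the final call.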


What This Proves

Product judgment

Scope decisions, feature boundaries, and deliberate “no” choices that protected the core product model across hundreds of commits.

Technical coordination

Multiple AI agents (ChatGPT, Claude Chat, Claude Code, Codex) assigned distinct roles with no overlapping authority and no unscoped execution.

QA discipline

Every task verified through lint, build, Playwright automation, and manual browser/mobile testing before commit.

AI-native execution

AI generated the code. The operator controlled the architecture, sequencing, permissions, and quality gates.