Production AI Agent Architecture: Lessons From Building CrewKit

By Sharath Challa · 20 min read
#AI Agents · #Architecture · #Multi-Agent · #Production · #Engineering

Most multi-agent systems look great in demos and break the moment a real customer puts real work through them. The team has agents. The agents talk. The dashboard shows green. Then something fails — an agent loops forever, work gets dropped between handoffs, two agents fight over the same task, the LLM bill triples in a week — and the entire abstraction collapses.

We've shipped enough of these to have opinions. CrewKit is the multi-agent platform we built at V12 Labs after watching the same problems recur across client engagements. This post is about the architecture decisions that came out of those failures — what we tried, what didn't work, and the structure we landed on.

If you're building a production AI agent system, the patterns below should save you a few months.

The mistake almost every multi-agent system makes

The default mental model goes like this: "I have a complex business workflow, so I'll have a planner agent decompose it, a researcher agent gather information, a writer agent produce output, and a reviewer agent check quality. They'll talk to each other through messages."

This works in a notebook. It does not survive contact with production. Here's why.

Agents are not the primitive. Work is.

When you make agents the primitive, you've made the most volatile, non-deterministic, expensive part of your stack the source of truth. Agents have no durable identity, no persistent state, no enforceable boundaries. They are stateless functions that hallucinate. Building your system around them means every recovery from failure, every audit, every "what happened?" question goes through prompt archaeology.

The systems that hold up in production invert this. The durable, deterministic primitives — work units, ownership boundaries, evidence, approval gates, a hypothesis tree about the business — live in a real database and define the system. Agents become a runtime concern: the executors, not the structure.

CrewKit's authority order

When we redesigned CrewKit, we wrote down the order of authority before writing any code:

Founder Truths → Canonical Database Schema → Architecture → Runtime

Read that left to right. Product invariants come first. They define what the system must feel like to a customer regardless of any implementation detail. Then the database schema, which is the actual system model — not just persistence. Then the architecture document, which explains how the schema realizes the truths. Only at the end does the runtime — the agents, the prompts, the LLM calls — get to exist.

This is unusual for AI products. It's standard for serious distributed systems.

When something conflicts, the rule is explicit: code defers to architecture, architecture defers to schema, schema defers to truths. You can refactor the runtime without breaking the system because the system isn't in the runtime.

The persistent primitives that actually matter

Here's what lives in the database in CrewKit. Notice what's missing: there's no agents table at the top of the model. Agents exist, but they're not the abstraction.

Trigger. A normalized entry into the system. Could be a manual request, a scheduled cron, a watch signal on some external state, a webhook, or an approval decision. Every entry into execution is a trigger. This means the same downstream machinery handles "user clicked Start" and "monitoring alert fired at 3am" — the system doesn't have to care.

Op. The primary unit of execution. An op carries a goal, ownership, lifecycle state, recovery metadata, and a completion summary. Every meaningful business action maps to an op. If you can't answer "what is this op, why does it exist, who owns it, what state is it in, what's the outcome?" then it shouldn't exist.

Task. Claimable work items beneath an op. Tasks support concurrent execution, lease-based claims, retries, and recovery. Critically — tasks are an execution mechanism, not the top-level product abstraction. We made this mistake early. When tasks become the product surface, you start optimizing for task throughput instead of business outcomes.

Pod. A durable team and policy boundary. A pod owns business scope, budget scope, autonomy scope (full-auto, draft-approve, report-only), and coordination scope. Pods are how you give different parts of the business different trust levels without one global config. The marketing pod can be full-auto for outreach drafts. The finance pod is draft-approve. They don't share budgets or rules.

Specialist. A durable business-aligned role generated from business understanding. Specialists have specs, versions, and materializations — they evolve over time as the system learns. This is the structural adaptation layer.

Evidence. The grounding layer. Every meaningful output is backed by evidence with sources, confidence, and conflict tracking. When two pieces of evidence disagree, the system represents the conflict explicitly rather than picking a winner silently.

Clearance. The explicit human approval boundary. Not a flag on a task — a first-class entity with its own queue, lifecycle, and audit trail.

Outcome. The canonical learning primitive. What actually happened after the work shipped. This is what closes the loop from "we did the thing" to "did the thing work?"

Automation. The proactive trigger-to-action primitive. Watches for conditions and creates triggers. The thing that lets the system act on its own without losing the audit trail. (More on this below — it's bigger than it sounds.)

If you squint, you'll notice this is just a well-structured operational database for a business. That's the point. The reason it works is the same reason any business runs on operational databases instead of email threads: structure is what survives chaos.
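
To make that concrete, here's a sketch of what a slice of this could look like on LibSQL. The table and column names are illustrative guesses, not CrewKit's actual DDL; the point is the shape: work, leases, and evidence as rows, with confidence as a real column.

```typescript
// Illustrative only: a slice of the kind of schema described above.
// Table and column names are assumptions, not CrewKit's real DDL.
import { createClient } from "@libsql/client";

const db = createClient({ url: "file:crewkit.db" });

await db.executeMultiple(`
  CREATE TABLE IF NOT EXISTS ops (
    id                 TEXT PRIMARY KEY,
    goal               TEXT NOT NULL,
    pod_id             TEXT NOT NULL,  -- ownership boundary
    state              TEXT NOT NULL,  -- pending | running | blocked | done | failed
    source_trigger_id  TEXT NOT NULL,  -- every op traces back to a trigger
    completion_summary TEXT
  );

  CREATE TABLE IF NOT EXISTS tasks (
    id               TEXT PRIMARY KEY,
    op_id            TEXT NOT NULL REFERENCES ops(id),
    state            TEXT NOT NULL,    -- open | claimed | done | failed
    lease_owner      TEXT,             -- which worker holds the claim
    lease_expires_at INTEGER,          -- unix ms; expired leases are reclaimable
    attempts         INTEGER NOT NULL DEFAULT 0
  );

  CREATE TABLE IF NOT EXISTS evidence (
    id             TEXT PRIMARY KEY,
    op_id          TEXT REFERENCES ops(id),
    source         TEXT NOT NULL,
    confidence     REAL NOT NULL,      -- 0..1, first-class
    conflicts_with TEXT                -- explicit conflict tracking
  );
`);

// Lease-based claim: atomically take an open task, or one whose lease expired.
const claimed = await db.execute({
  sql: `UPDATE tasks
        SET state = 'claimed', lease_owner = ?, lease_expires_at = ?, attempts = attempts + 1
        WHERE id = (
          SELECT id FROM tasks
          WHERE state = 'open' OR (state = 'claimed' AND lease_expires_at < ?)
          LIMIT 1
        )
        RETURNING id`,
  args: ["worker-1", Date.now() + 60_000, Date.now()],
});
console.log(claimed.rows[0]?.id ?? "no claimable task");
```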

Genesis: the part most multi-agent systems skip

Almost every multi-agent framework treats setup as a wizard. Pick a name. Set a goal. Click Start. The system has zero understanding of the business it's about to operate.

CrewKit has a phase for this called Genesis. It's not a setup wizard. It's the first serious attempt to understand the business before any work happens. Genesis produces:

  • Business profile, offers, positioning summary
  • ICP hypotheses and jobs-to-be-done hypotheses
  • Channel map and competitor set
  • Constraints and risks
  • Workstreams and priorities
  • Known unknowns and critical gaps
  • Pod recommendations and specialist recommendations

The output of Genesis is not a settings JSON. It's structured business memory plus a set of evidence-grounded hypotheses, plus an honest list of what the system doesn't know yet.

The rule we landed on: Genesis must be allowed to fail. If understanding is weak, Genesis should say "I don't have enough confidence here" rather than fabricate a profile. We've seen agent systems that pretend to understand a business based on five minutes of crawling a marketing website. They produce work for three months that's plausible and wrong. The ICP is off. The positioning is off. Every piece of output is contaminated by bad assumptions invisible to the customer.

The fix is to make uncertainty a first-class concept. Genesis writes business memory entries with confidence levels. The system reads those confidences before acting. Low-confidence assumptions get verified before they become the basis for outbound work. This is slow. It's also the reason customers will trust the output enough to ship it.
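
A minimal sketch of what that gating can look like. The entry shape and the 0.7 threshold are illustrative assumptions, not CrewKit's real values:

```typescript
// Sketch of the "uncertainty is first-class" rule. Shapes and
// thresholds here are assumptions for illustration.
interface BusinessMemoryEntry {
  key: string;          // e.g. "icp.primary_segment"
  claim: string;        // what Genesis believes
  confidence: number;   // 0..1, written by Genesis
  sources: string[];    // where the belief came from
}

function canActOn(entry: BusinessMemoryEntry, threshold = 0.7): boolean {
  // Low-confidence assumptions must be verified before they become
  // the basis for outbound work.
  return entry.confidence >= threshold;
}

const icp: BusinessMemoryEntry = {
  key: "icp.primary_segment",
  claim: "SMB healthcare consultants, 2-20 seats",
  confidence: 0.45, // Genesis is allowed to say "I'm not sure"
  sources: ["website_crawl", "founder_interview"],
};

if (!canActOn(icp)) {
  // Don't fabricate: queue verification instead of acting on the claim.
  console.log(`Verify before use: ${icp.key} (confidence ${icp.confidence})`);
}
```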

The Mind: how Chief actually thinks

This is the part of CrewKit that's genuinely different from everything else in the multi-agent space, and it's the part we figured out last.

The Chief — the meta-agent that sits above pods and specialists — needs to maintain a coherent picture of what the business should be doing this week to grow revenue. That's not a prompt. That's a structured, persistent, evolving model. We call it the Mind, and it has three primitives.

Branches. A revenue hypothesis tree. The root branch is "grow revenue." Children are segments, channels, and experiments. Each branch carries:

  • A hypothesis (the thing we're testing)
  • A confidence (0..1)
  • An expected value (heuristic, not dollars)
  • A status (discovery, active, paused, pruned)
  • An evidence count
  • A source (genesis, bootstrap, learned, founder)

A branch represents a coherent bet about how the business grows. "SMB consultants in healthcare via cold email" is a branch. "Enterprise SaaS via partner referrals" is another. The tree is durable. It evolves over time as branches gain or lose confidence.
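
For concreteness, the field list above transcribes almost directly into a row type. This is an illustrative shape, not CrewKit's actual schema:

```typescript
// A branch row, as described above. Field names are illustrative.
type BranchStatus = "discovery" | "active" | "paused" | "pruned";
type BranchSource = "genesis" | "bootstrap" | "learned" | "founder";

interface Branch {
  id: string;
  parentId: string | null; // null only for the root "grow revenue" branch
  hypothesis: string;      // "SMB consultants in healthcare via cold email"
  confidence: number;      // 0..1
  expectedValue: number;   // heuristic, not dollars
  status: BranchStatus;
  evidenceCount: number;
  source: BranchSource;
}
```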

Thoughts. Atomic units of Chief's reasoning. Every tick of the mind loop produces one thought row. A thought has:

  • An angle (freshness, gap, action, contradiction, opportunity, outcome, bootstrap, idle)
  • A trigger source (op_completion, automation_completion, clearance_resolved, inbound, founder_poke, idle)
  • The inputs it read (which branches, which memory items)
  • A natural-language narrative (the actual reasoning, in prose)
  • An extracted decision (one-line)
  • A confidence
  • An action type (observe, spawn_op, spawn_automation, request_clearance, update_branch, update_memory, ask_founder, noop)
  • The branch it affects, if any
  • A needs_attention flag if the founder should see it highlighted

Every meaningful thing Chief considers becomes a thought. The thought stream is queryable. The founder can scroll the mind feed and see the actual reasoning, with attention surfaced when something is uncertain or risky.

Moves. The bridge from thoughts to execution. When a thought decides to spawn a concrete primitive — an op, an automation, a clearance — it creates a move. A move records:

  • The thought that spawned it
  • The branch it's testing
  • The hypothesis being tested ("this segment will respond to this messaging")
  • The status of the spawned primitive
  • The outcome summary when the primitive completes
  • A confidence delta — how much this outcome moved the branch's confidence

This is the learning substrate. Every op, automation, and clearance carries a source_thought_id. When the op completes, the system looks up its source thought, finds the move, finds the branch, computes a confidence delta, and updates the tree. Then a new thought fires, with angle outcome, observing what just happened.
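
Here's that flow as a sketch. The data-access helpers and the delta bounds are hypothetical; the wiring (op to thought to move to branch) mirrors the text:

```typescript
// Sketch of the outcome flow: op completes -> source thought -> move
// -> branch confidence delta -> new outcome thought. All helpers below
// are hypothetical stand-ins.
declare function getOp(id: string): Promise<{ sourceThoughtId: string; completionSummary: string }>;
declare function getMoveByThought(thoughtId: string): Promise<{ id: string; branchId: string; confidenceDelta: number | null }>;
declare function getBranch(id: string): Promise<{ id: string; confidence: number }>;
declare function scoreOutcome(summary: string): number; // bounded, e.g. [-0.1, +0.1]
declare function saveBranch(b: { id: string; confidence: number }): Promise<void>;
declare function saveMove(m: { id: string; confidenceDelta: number | null }): Promise<void>;
declare function enqueueThought(t: { angle: string; triggerSource: string; opId: string }): Promise<void>;

async function onOpCompleted(opId: string): Promise<void> {
  const op = await getOp(opId);
  const move = await getMoveByThought(op.sourceThoughtId);
  const branch = await getBranch(move.branchId);

  // Score the outcome into a bounded delta and update the tree.
  const delta = scoreOutcome(op.completionSummary);
  branch.confidence = Math.min(1, Math.max(0, branch.confidence + delta));
  move.confidenceDelta = delta;

  await saveBranch(branch);
  await saveMove(move);

  // The completion itself becomes the next trigger for the mind loop.
  await enqueueThought({ angle: "outcome", triggerSource: "op_completion", opId });
}
```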

The loop:

  1. Trigger arrives (founder poke, op completion, idle tick, etc.).
  2. Chief reads relevant branches, memory, and recent thoughts.
  3. Chief produces a thought — narrative reasoning over current state.
  4. The thought decides on an action: spawn an op, spawn an automation, update memory, request clearance, ask the founder, or just observe.
  5. If the action spawns work, a move is created linking the thought to the new primitive.
  6. The primitive runs. When it completes, the move's outcome flows back as a confidence delta on the affected branch.
  7. The completion fires a new outcome-angle thought, which can spawn the next move.
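
Sketched in code, one tick looks roughly like this. The helper functions are hypothetical stand-ins; the dispatch mirrors the action types listed earlier:

```typescript
// One tick of the mind loop, sketched. Helpers are hypothetical.
type ThoughtAction =
  | "observe" | "spawn_op" | "spawn_automation" | "request_clearance"
  | "update_branch" | "update_memory" | "ask_founder" | "noop";

interface Thought {
  id: string;
  action: ThoughtAction;
  branchId: string | null;
  narrative: string; // the actual reasoning, in prose
}

declare function readContext(): Promise<unknown>; // branches, memory, recent thoughts
declare function reason(trigger: unknown, ctx: unknown): Promise<Thought>; // one LLM call, one thought row
declare function saveThought(t: Thought): Promise<void>;
declare function spawnPrimitive(t: Thought): Promise<{ id: string }>;
declare function createMove(m: { thoughtId: string; branchId: string | null; primitiveId: string }): Promise<void>;
declare function applyBranchOrMemoryUpdate(t: Thought): Promise<void>;
declare function flagForFounder(t: Thought): Promise<void>;

async function mindTick(trigger: { source: string }): Promise<void> {
  const ctx = await readContext();
  const thought = await reason(trigger, ctx);
  await saveThought(thought); // every tick leaves an auditable row

  switch (thought.action) {
    case "spawn_op":
    case "spawn_automation":
    case "request_clearance": {
      const primitive = await spawnPrimitive(thought);
      // A move links reasoning to execution so outcomes can flow back.
      await createMove({ thoughtId: thought.id, branchId: thought.branchId, primitiveId: primitive.id });
      break;
    }
    case "update_branch":
    case "update_memory":
      await applyBranchOrMemoryUpdate(thought);
      break;
    case "ask_founder":
      await flagForFounder(thought);
      break;
    default:
      break; // observe / noop: the thought row is the output
  }
}
```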

This is what makes the system actually feel like an always-on operator instead of a chat interface with extra steps. The Chief is not generating a one-shot plan and executing it. It's running a continuous reasoning loop over a structured hypothesis tree, leaving an auditable trail, and updating its picture of the business as evidence comes in.

Three knowledge layers, not one

Most agent systems have "memory." It's a vector store with conversation snippets in it. Performance is bad in subtle ways and you can't tell why.

CrewKit has three durable knowledge layers and treats them as architecturally distinct:

Business Memory. Structured, business-specific understanding. ICP, offers, constraints, positioning. This is what Genesis produces and what the system reads before deciding anything strategic.

Knowledge Base. Entities, facts, and relationships, with confidence, sources, and conflict tracking. When Chief needs to know "what tools does Acme Corp's stack run on?" it reads KB facts, not vector chunks.

Memory Passages. Durable recall memory — actual passages of text indexed for semantic retrieval. Used for the things that genuinely benefit from fuzzy similarity: matching past interaction patterns, surfacing similar previous conversations, retrieving open-ended context.

The rule: the system should not collapse all knowledge into one undifferentiated memory mechanism. Vector search is wrong for facts. Structured tables are wrong for fuzzy recall. Both are wrong for "what does this customer believe about their own business?" — that's business memory, and it has its own retrieval shape.

When you collapse these into one thing, you optimize for the wrong tradeoff every time. When you separate them, each layer can be queried correctly, updated correctly, and verified correctly.
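
One way to picture the separation is as a router that refuses to answer a query with the wrong layer. The layer interfaces below are illustrative assumptions:

```typescript
// Illustrative routing over the three layers. Structured reads for
// beliefs and facts; vector search only for fuzzy recall.
declare const businessMemory: { get(key: string): Promise<unknown> };
declare const kb: { lookup(entity: string, attribute: string): Promise<unknown> };
declare const passages: { semanticSearch(text: string, limit: number): Promise<unknown[]> };

type KnowledgeQuery =
  | { kind: "belief"; key: string }                     // Business Memory
  | { kind: "fact"; entity: string; attribute: string } // Knowledge Base
  | { kind: "recall"; text: string; limit: number };    // Memory Passages

async function retrieve(q: KnowledgeQuery) {
  switch (q.kind) {
    case "belief":
      return businessMemory.get(q.key);          // structured, keyed read
    case "fact":
      return kb.lookup(q.entity, q.attribute);   // table lookup, never vectors
    case "recall":
      return passages.semanticSearch(q.text, q.limit); // fuzzy similarity is correct here
  }
}
```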

Automations: the proactive layer

Automations are the unified trigger-to-action primitive. They cover three things that are usually separate systems:

  • Watch — monitor an external signal, fire when conditions match
  • Schedule — fire on a cron or interval
  • Event-driven — fire when an internal event happens

In CrewKit they're all the same primitive with the same lifecycle, the same approval semantics, the same budget semantics, and the same learning semantics. An automation is:

  • Tunable — its behavior can be adjusted based on what it's learning
  • Capped — budget and rate limits live with the automation, not in some sidecar
  • Approval-aware — it knows when its outputs need clearance before going out
  • Auditable — every run produces a row, with inputs, outputs, and outcomes
  • Learnable — its recommendation_trust evolves over time as the founder approves or rejects what it suggests

The reason this matters: in most multi-agent systems, "the agent that watches for new leads" is one cron job, "the agent that drafts outreach" is a separate prompt, "the approval queue" is a third subsystem. They don't share state. They don't learn together. When the founder rejects a drafted email three times, nothing changes — the watcher keeps generating leads that produce drafts the founder will reject again.

In CrewKit, automations are first-class entities that compose with the rest of the schema. The watcher writes evidence and triggers an op. The op produces a draft. The draft hits clearance. The clearance decision flows back as an interaction event. The automation's recommendation_trust adjusts. The next thought from the mind loop sees the trust change and updates the corresponding branch's confidence. The whole system gets smarter from one rejection.
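
A sketch of what the trust update could look like. The exponentially weighted moving average, the weights, and the thresholds are all assumptions; the behavior they encode (repeated rejections demote the automation) is what the text above describes:

```typescript
// Illustrative recommendation_trust update. Form and constants are
// assumptions, not CrewKit's actual policy.
interface Automation {
  id: string;
  recommendationTrust: number; // 0..1, evolves with founder decisions
  autonomy: "full-auto" | "draft-approve" | "report-only";
}

function onClearanceResolved(a: Automation, approved: boolean): Automation {
  // EWMA toward 1 on approval, toward 0 on rejection.
  const alpha = 0.2;
  const trust = a.recommendationTrust * (1 - alpha) + (approved ? 1 : 0) * alpha;

  // Trust feeds back into autonomy: losing trust demotes the automation.
  const autonomy = trust < 0.3 ? "report-only"
                 : trust < 0.7 ? "draft-approve"
                 : a.autonomy;

  return { ...a, recommendationTrust: trust, autonomy };
}
```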

The learning loop

Learning in most agent systems is "we'll fine-tune the prompt next sprint." That's not learning. That's prompt drift.

In CrewKit, learning is durable state change. The inputs are concrete:

  • Op outcomes
  • Approvals, rejections, and edits
  • Interaction events
  • Recommendation responses (yes / no / changed)
  • Downstream signals (did the email get a reply? did the lead convert?)

The outputs are also concrete:

  • Confidence adjustments on KB facts and branches
  • Policy changes on pod rules
  • Automation tuning (rate limits, thresholds, content templates)
  • Specialist evolution (new versions of specialist specs based on what's working)
  • Memory refinement (passages get re-ranked, deprecated, updated)
  • Structural adaptation (a pod gets created, retired, or reshaped)

Every one of these is a database row. Every one is queryable. You can ask "why does this automation now require approval when last week it didn't?" and walk the chain backwards through policy_changes, find the rejection that triggered it, and see the founder edit that started the policy change.
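
Walking that chain backwards is just a query. Table and column names below are assumptions consistent with the earlier schema sketch, not CrewKit's real ones:

```typescript
// Why does this automation now require approval? Walk the audit chain.
// Names are illustrative.
import { createClient } from "@libsql/client";

const db = createClient({ url: "file:crewkit.db" });

const trail = await db.execute({
  sql: `SELECT pc.id, pc.reason, pc.created_at,
               c.decision    AS triggering_decision,
               c.resolved_by AS decided_by
        FROM policy_changes pc
        LEFT JOIN clearances c ON c.id = pc.source_clearance_id
        WHERE pc.automation_id = ?
        ORDER BY pc.created_at DESC`,
  args: ["auto_outreach_drafts"],
});

for (const row of trail.rows) console.log(row);
```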

The rule: learning should not be treated as vague prompt drift. If a behavior change can't be traced to a database row, it shouldn't be happening.

The canonical execution loop

Every meaningful behavior in CrewKit reduces to the same loop:

  1. Trigger arrives. Manual, scheduled, watched, webhook, automation — same surface.
  2. Trigger is routed. Which pod, which op type, which constraints.
  3. Op is created or selected. Either net new or matched to an in-flight op.
  4. Work is assigned. To a pod, a specialist, decomposed into tasks.
  5. Execution happens. Async. Agents run, claim tasks, write evidence.
  6. Evidence, notes, activities, and artifacts are written. Continuously.
  7. Clearance is requested if required. Pod policy decides.
  8. Op completes or fails. Centralized lifecycle.
  9. Outcome is recorded. What actually happened downstream.
  10. Learning and structural updates apply. Confidence adjustments, policy changes, specialist versions, branch updates from the move's outcome.

There is no second hidden orchestration loop. There is no "well in this case the agent just calls another agent directly." If you find yourself building a shortcut around this loop, you have a design problem, not an implementation problem.
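
Centralized lifecycle ownership can be as blunt as a transition table plus one function every handler must call. The states and edges below are illustrative, not CrewKit's exact lifecycle:

```typescript
// One transition table, one function allowed to move an op between
// states. States and edges are illustrative.
type OpState = "pending" | "running" | "awaiting_clearance" | "done" | "failed";

const ALLOWED: Record<OpState, OpState[]> = {
  pending: ["running"],
  running: ["awaiting_clearance", "done", "failed"],
  awaiting_clearance: ["running", "done", "failed"],
  done: [],
  failed: [],
};

function transitionOp(current: OpState, next: OpState): OpState {
  if (!ALLOWED[current].includes(next)) {
    // No handler anywhere else gets to invent a transition.
    throw new Error(`Illegal op transition ${current} -> ${next}`);
  }
  return next;
}
```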

What we tried that didn't work

Most of what's in the architecture is there because we hit a wall with something else first.

We tried prompt-based agent handoffs. The planner agent's prompt said "delegate to the researcher when you need facts, then delegate to the writer when you have facts." It worked sometimes. It also produced infinite loops, dropped handoffs, and impossible-to-debug failures because the entire control flow lived in three LLM calls. We replaced this with explicit task creation in the database.

We tried letting agents share context through long conversation history. Token costs exploded. Latency exploded. And the agents started "remembering" things that weren't there because they were pattern-matching on conversation shape instead of real state. We replaced this with structured evidence and KB facts. Agents read the database, not each other's transcripts.

We tried global budgets. Worked great until we wanted one pod to be aggressive and another conservative. Pods now own their own budgets, and the budget is part of pod governance, not a separate bookkeeping system bolted on top.

We tried treating "approval needed" as a boolean on tasks. Then we needed to know who could approve, what the SLA was, what happens if approval is rejected vs. ignored. Clearance became a first-class entity with its own queue and lifecycle. The "approval flag" approach scales to your first three customers and then collapses.

We tried letting Chief modify the agent roster on the fly through prompts. This was actually fine for a single-tenant prototype. As soon as we cared about audit trails — "why does this customer's instance have a Closer specialist that no one explicitly added?" — we needed specialists to be schema-tracked entities with versions and materializations.

We tried prompt-only learning. "Just put the rejection reason in the next prompt as context" worked for one rejection, then forgot it the next time the agent fired. We moved learning into durable state changes — confidence deltas, policy_changes rows, recommendation_trust updates — and the system actually got smarter between sessions.

We tried one giant memory store. Vector DB, everything thrown in. Retrieval was bad in different ways depending on what was being asked. We split into three layers (Business Memory, KB, Memory Passages) and each one suddenly worked correctly.

We tried treating Chief as just another agent with a bigger prompt. It generated plans that read well and didn't compose with the rest of the system. We added the Mind layer — branches, thoughts, moves — and Chief became a continuous reasoning loop over durable structure instead of a one-shot planner.

The pattern across all of these: the things you push into prompts get expensive and unobservable. The things you push into the database get cheap and queryable. Push hard.

Anti-patterns we now refuse

These are the architectural anti-patterns we won't tolerate in CrewKit, and we'd recommend any team building production agent systems treat them the same way:

  • Re-centering the system on missions as the universal primitive. Missions sound product-friendly and are architecturally lazy. Use ops.
  • Treating orchestration as the product. Orchestration is a means. The product is business outcomes.
  • Inventing core abstractions that do not map to the DB. If it's not in the schema, it doesn't exist.
  • Hiding state transitions in random handlers. Lifecycle is centrally owned and explicit.
  • Letting runtime behavior drift from the schema. The schema is the contract.
  • Using prompt-only learning instead of durable system updates. Already covered.
  • Hiding uncertainty. Confidence is a first-class field. Low-confidence claims are not stated as fact.
  • Creating per-task fake specialists instead of durable roles. Specialists are versioned, durable, and aligned to real workstreams.

Runtime ownership: who does what

Architecture is one thing. The actual runtime is another. Here's how CrewKit splits responsibilities:

The Dashboard owns: API routes, op lifecycle, task lifecycle, all DB writes and reads, tool execution registry, the mind loop scheduler, and most business control behavior. This is a Next.js application with the database next to it. Boring on purpose.

OpenClaw Gateway owns: agent execution runtime, transport, session handling, the tool-call interaction loop, and agent spawning. This is where the LLM calls live and where agents actually run.

LibSQL is the persistent system store. SQLite-compatible, fast, easy to back up, easy to reason about.

Caddy sits in front for TLS and reverse proxy.

The split matters. The Dashboard is the application control plane — it owns the system model. The Gateway is a runtime — it doesn't decide what work means, it executes work. If you let your runtime own your system model, you've inverted the dependency and you're going to have a bad time when you change LLM providers, change agent frameworks, or refactor.
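
One way to keep the dependency pointed the right way is to make the Gateway consume a narrow API the Dashboard owns. The endpoint names here are assumptions, not CrewKit's actual interface:

```typescript
// The Gateway (runtime) consumes a contract owned by the Dashboard
// (control plane). It never touches the database directly and never
// defines what an op or task means. Names are illustrative.
interface ControlPlaneApi {
  claimTask(workerId: string): Promise<{ taskId: string; goal: string } | null>;
  writeEvidence(taskId: string, e: { source: string; confidence: number; body: string }): Promise<void>;
  completeTask(taskId: string, summary: string): Promise<void>;
}

declare function runAgent(goal: string): Promise<string>; // hypothetical LLM execution

async function gatewayWorker(api: ControlPlaneApi, workerId: string): Promise<void> {
  const task = await api.claimTask(workerId);
  if (!task) return;

  const result = await runAgent(task.goal); // the only non-deterministic step
  await api.writeEvidence(task.taskId, { source: "agent", confidence: 0.8, body: result });
  await api.completeTask(task.taskId, result.slice(0, 200));
}
```

Swap the LLM provider or the agent framework and this contract holds; the system model never notices.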

Why this matters for your buyer

If you're a founder or a CTO evaluating whether to build something like this in-house or hire someone — the architectural choices above are the difference between "we shipped a demo" and "we shipped something a paying customer can run their business on."

The teams that ship demos optimize for the agent layer. They pick a multi-agent framework, write clever prompts, build a dashboard that shows agent activity, and call it a system.

The teams that ship production systems optimize for the system model. They write down their truths. They design a database that captures their primitives. They build a Mind that maintains a real picture of the business. They keep agents in the runtime and out of the architecture.

If you're heading down the first path, you'll know within a few months. The first paying customer will surface a dozen "what happened?" questions you can't answer. The bill will be larger than expected. Failures will be hard to reproduce. Trust will erode.

The second path is slower at the start and faster forever after. We rebuilt CrewKit toward this shape because the first version of the platform went down the first path.

Decision filter

Before adding or changing anything in CrewKit, we ask six questions:

  1. Which founder truth does this serve?
  2. Which DB primitive does it map to?
  3. Does it fit the canonical execution loop?
  4. Does it increase real capability or only orchestration complexity?
  5. Does it preserve evidence, trust, and learning?
  6. Would this still make sense if runtime implementation changed tomorrow?

If the answers are weak, the change is probably drift. We've used these questions to kill more "cool ideas" than I can count, and the system is stronger for it.

The shortest version of this advice

If we had to compress everything we learned into seven rules:

  1. Make the database your system model, not a side effect of agent behavior.
  2. Keep agents out of the architecture. They belong in the runtime.
  3. Build a Genesis phase that's allowed to fail. Don't act on weak understanding.
  4. Build a Mind, not a planner. A continuous reasoning loop over a hypothesis tree beats one-shot plans every time.
  5. Separate your knowledge layers. Business memory, KB, and recall passages are not the same thing.
  6. Make automations first-class with their own learning, budgets, and approval semantics.
  7. Treat learning as durable state change, not prompt drift.

These look obvious written down. They are not obvious in practice — every multi-agent codebase we've inherited from another team violates at least four of them.

We help startups build this kind of system

V12 Labs builds production AI agent systems for startups and growing companies. CrewKit is the platform we use for managed deployments — fully isolated environments, real ownership boundaries, explicit governance, durable learning. We also build custom AI agent systems on the same architectural principles when clients need something specific to their business.

If you're trying to figure out whether your AI agent project should be built on a framework, on a managed platform, or custom from scratch — book a call and we'll talk through the architecture before anyone writes code.

The right answer depends on your business, your trust requirements, and what you're actually trying to ship. But the architecture — the part above — is the same either way. Get that right and the implementation is just typing.