Why Architecture Reviews Need Contracts, Not Chat

Architecture reviews have a translation problem.

Humans can leave a thread of “consider X” and “what about Y” and resolve the rest in a meeting. But if you want an LLM to participate in a workflow that resembles engineering - PRs, ADRs, ticketing, CI gates - fluent feedback isn’t enough. You need output that downstream systems (and humans) can reliably act on.

Most “LLM architecture review” demos stop at persuasive prose. The result reads like an experienced engineer, but it isn’t shaped like an artifact: it’s hard to rank, route, deduplicate, or turn into work without a second manual pass.

Multi-agent helps because architecture review is a bundle of lenses - security, scalability, operability, cost, data integrity, failure recovery - each with its own heuristics and thresholds. But the real differentiator isn’t “one agent vs. many”. It’s contracts. With PydanticAI, you define the schema the system must emit and validate every response; Claude supplies the reasoning, but the contract forces it into a machine-actionable shape.

This article shows how to build a multi-agent architecture reviewer that produces a structured review artifact: normalized findings with severity, evidence, and recommendations, plus clarifying questions and explicit “needs human judgment” flags. Think less chatbot, more review report.

The full runnable example lives in the companion repository.

In the next sections, we’ll define the minimal topology (planner → specialists → synthesizer) and the shared contracts, then walk through one review end to end, from design doc to structured report.


What Multi-Agent Buys You (and What It Costs)

Multi-agent systems are easy to overuse. If all you need is a single round of feedback - “scan this design doc for obvious risks” or “suggest alternatives for this storage layer” - a well-constructed prompt with a structured output schema can get you most of the value with a fraction of the complexity. The moment you add agents, you add orchestration, state, and failure handling; if you don’t get clear returns, you’ve just built a slower, more expensive version of a single call.

So what does multi-agent actually buy you for architecture review? Primarily: separation of concerns and parallel perspectives. A single “review this architecture” prompt tends to collapse into a handful of generic patterns: it repeats the same risks across categories, it misses domain-specific edge cases, and it blurs “ask a question” with “assert a fact”. Splitting the work into roles lets you give each agent sharper instructions and a narrower slice of context, and it gives you a natural place to express uncertainty (“as the security reviewer, I need X to conclude Y”).

Monolithic prompts also fail in predictable ways. They produce long outputs that are hard to rank; they mix high-confidence issues with speculative ones; they contradict themselves; and they often lack traceable evidence tied to the input. Those failures aren’t just aesthetic - they make downstream automation unreliable. If you can’t consistently tell what is a “P0” vs. a “P3”, or what is a requirement vs. a suggestion, you can’t turn the review into an engineering workflow.

The costs are real. Fan-out (running multiple specialists) increases token spend and latency, and it introduces coordination overhead: you need to manage shared context, avoid duplication, and merge disagreements into a single report. Sequential pipelines can reduce duplication and keep context tighter, but they can amplify early mistakes (if the first step narrows scope incorrectly, everything downstream inherits that error). In practice, you’re trading a single model call for a small distributed system.

For a lean architecture reviewer, a minimal topology works well:

  • A planner reads the input and decides which lenses to apply (security, scalability, operability, data, cost), plus any clarifying questions.
  • A small set of specialists each run one lens and emit findings in a shared schema.
  • A synthesizer merges the results: deduplicates, ranks, resolves contradictions (or preserves dissent explicitly), and produces the final structured review artifact.

The decision heuristic is simple: stay lean unless you need genuinely different perspectives. If your review output is consistently repetitive, shallow, or internally inconsistent, specialist roles can help. If you’re already getting high-quality structured output from one call, don’t add agents - add better contracts, better evidence requirements, and better “unknowns” handling.


Contracts First: Pydantic Models as the Agent API

If you only take one idea from this article, it should be this: the schema is the product. The model is a reasoning engine, but your system’s reliability comes from the contract you force that reasoning to satisfy. Most agent demos skip this and then wonder why their outputs can’t be trusted in an engineering workflow.

Architecture review is especially sensitive to this because it’s full of ambiguity. Some findings are hard requirements (“this violates a compliance constraint”), some are conditional (“if traffic can spike 10×, you need backpressure”), and some are simply questions (“what is your RTO/RPO?”). Without a contract that distinguishes these categories - and forces the model to attach evidence or explicitly mark unknowns - you end up with prose that sounds plausible but is operationally unusable.

In Pydantic terms, you want to model the output as a small set of types that match how engineers actually consume reviews. A typical core might include:

  • A top-level ArchitectureReview artifact (metadata, overall risk, summary).
  • A list of Finding objects with fields like title, severity, category, and recommendation.
  • An evidence or references field that ties the claim to something in the input (or marks it as an inference).
  • A separate list of questions for missing information that blocks a confident conclusion.

The exact fields are less important than the discipline: every claim must land somewhere explicit. Severity can’t be buried in adjectives. Recommendations can’t be scattered across paragraphs. Uncertainty can’t be implied; it needs a home in the schema.

Contracts also apply to inputs. Each agent role should have a clear definition of what context it is allowed to assume and what it must treat as unknown. That’s the difference between a specialist saying “use mTLS” by default versus saying “if there are multi-tenant boundaries or untrusted networks, consider mTLS; otherwise justify why it’s unnecessary”. The more explicit you are about inputs (constraints, traffic assumptions, data classification, SLOs), the less your agents have to guess.

Once you have contracts, you can make the system resilient with validation and repair loops. With PydanticAI-style structured outputs, you can:

  • Validate every model response against the schema.
  • Retry on validation failure with a targeted instruction (“output must include evidence for each finding”).
  • Apply a small “repair” step when the model is close but not quite compliant (missing a field, wrong enum value).
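
The loop above can be sketched without committing to any agent framework. This is a minimal illustration, not PydanticAI's own retry machinery: `call_model` is a hypothetical stand-in for a real model call (prompt in, JSON text out), and `MiniFinding`, `repair`, and `validated_finding` are names invented for the example.

```python
import json
from typing import Callable, Literal

from pydantic import BaseModel, ValidationError


class MiniFinding(BaseModel):
    """Illustrative, trimmed-down finding contract."""
    title: str
    severity: Literal["p0", "p1", "p2", "p3"]
    recommendation: str


def repair(raw: object) -> object:
    """Cheap deterministic fix for a near-miss (here: severity casing)."""
    if not isinstance(raw, dict):
        return raw
    fixed = dict(raw)
    if isinstance(fixed.get("severity"), str):
        fixed["severity"] = fixed["severity"].strip().lower()
    return fixed


def validated_finding(
    call_model: Callable[[str], str],  # hypothetical model call
    prompt: str,
    max_attempts: int = 3,
) -> MiniFinding:
    """Validate each response, repair near-misses, retry with a targeted hint."""
    hint = ""
    for _ in range(max_attempts):
        raw_text = call_model(prompt + hint)
        try:
            return MiniFinding.model_validate(repair(json.loads(raw_text)))
        except (json.JSONDecodeError, ValidationError) as exc:
            # Feed the concrete error back instead of a generic "try again".
            hint = f"\n\nYour previous output was invalid: {exc}. Fix only that."
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Fake model: first reply is missing a field, second has the wrong enum casing.
replies = iter([
    '{"title": "No backpressure", "severity": "P1"}',
    '{"title": "No backpressure", "severity": "P1", '
    '"recommendation": "Add a bounded queue."}',
])
finding = validated_finding(lambda prompt: next(replies), "Review the doc.")
```

The first reply triggers a retry; the second is accepted after the repair step lowercases the severity. Keeping `max_attempts` small matters: a response that fails validation three times usually points at the contract or the prompt, not at bad luck.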

This isn’t about being pedantic. It’s what turns “LLM output” into “typed data” you can route downstream: create issues in a tracker, post a summarized comment to a PR, generate an ADR checklist, or feed a dashboard that tracks the recurring classes of architectural risk your org keeps rediscovering.

Multi-agent orchestration only works well when everyone speaks the same language. The contract is that language.

If you want to go deeper on modeling, validation, and the patterns that make Pydantic contracts reliable in production, see Practical Pydantic - a hands-on guide to data validation in Python, from core concepts through real-world APIs and pipelines.

Code example: a minimal contract set

The working example in models.py defines a small set of Pydantic models that act as the “agent API” for planning, specialist review, and synthesis:

from enum import Enum
from typing import Literal

from pydantic import BaseModel, Field


class Lens(str, Enum):
    security = "security"
    scalability = "scalability"
    operability = "operability"
    data_integrity = "data_integrity"


class Severity(str, Enum):
    p0 = "p0"
    p1 = "p1"
    p2 = "p2"
    p3 = "p3"


class Evidence(BaseModel):
    kind: Literal["quote", "observation", "inference"] = "observation"
    detail: str = Field(..., description="Quote or observation tied to the provided input.")


class Finding(BaseModel):
    title: str
    category: str = Field(..., description="Short category label (e.g. authz, backpressure).")
    severity: Severity
    evidence: list[Evidence] = Field(default_factory=list)
    recommendation: str


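class PlannerOutput(BaseModel):
    lenses: list[Lens]
    scope_notes: str = ""
    clarifying_questions: list[str] = Field(default_factory=list)


# Assumed shape: reviewer.py imports SpecialistOutput, but this excerpt does not
# show it. One lens, its findings, and any questions that block a conclusion.
class SpecialistOutput(BaseModel):
    lens: Lens
    findings: list[Finding] = Field(default_factory=list)
    questions: list[str] = Field(default_factory=list)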


class ArchitectureReview(BaseModel):
    summary: str
    overall_risk: Literal["low", "medium", "high"] = "medium"
    findings: list[Finding] = Field(default_factory=list)
    questions: list[str] = Field(default_factory=list)

Role Design: Planner, Specialists, Synthesizer

The fastest way to ruin a multi-agent system is to give every agent the same vague job: “review the architecture”. You’ll pay for multiple calls and still get duplicated, generic output. Role design is where multi-agent becomes an engineering tool instead of a prompt trick.

A useful rule is: each role must have a one-sentence job description that is both necessary and non-overlapping. If two roles can produce the same kind of output, you haven’t created separation of concerns - you’ve created redundancy.

For a lean architecture reviewer, three roles are enough:

Planner

The planner’s job is to decide what review should happen given the input and constraints. It does not emit a full review. It produces:

  • Which lenses to apply (security, scalability, operability, cost, data integrity, compliance, etc.).
  • Any clarifying questions that block a confident review (“What’s the expected peak RPS?”, “What is the data classification?”).
  • Optional scoping notes (“Focus on failure recovery and multi-region; ignore UI concerns.”).

This role is where you avoid wasted work. If the architecture is a batch pipeline with no public ingress, a deep web security pass is noise. If it’s a multi-tenant SaaS, ignoring tenant boundaries is negligence. The planner sets those priorities explicitly.

Specialists

Each specialist’s job is to run one lens and emit findings in the shared contract. Specialists should not:

  • Re-scope the review (“I think we should also do operability”).
  • Invent missing context as if it were true.
  • Produce long narrative prose that the synthesizer can’t merge.

They should be opinionated within their lens, but disciplined about uncertainty. A good specialist output contains high-signal findings and clear questions when key context is missing. The contract is the forcing function: each finding needs a category, a severity, evidence, and a recommendation.

Synthesizer

The synthesizer’s job is to produce the final ArchitectureReview artifact. That means:

  • Deduplicate overlapping findings across specialists.
  • Resolve contradictions when possible, or preserve dissent when it matters (“Security flags P0 unless X; Scalability says acceptable if Y”).
  • Rank and prioritize based on severity and expected impact.
  • Produce a concise summary that is consistent with the structured findings, not an independent “new” review.

The synthesizer is also where you enforce global policy: severity definitions, house style, and what counts as acceptable evidence. In other words, it turns a bag of lens-specific opinions into a single report that an engineering team can act on.
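
Part of the synthesizer's job can be done deterministically before any model call. The sketch below dedupes and ranks findings in plain Python; `RawFinding`, `merge_findings`, and the "same category plus same normalized title" matching rule are assumptions made for illustration, not the companion repository's logic.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"p0": 0, "p1": 1, "p2": 2, "p3": 3}


@dataclass(frozen=True)
class RawFinding:
    lens: str
    title: str
    category: str
    severity: str  # one of SEVERITY_ORDER's keys


def _key(f: RawFinding) -> tuple[str, str]:
    # Assumption: same category + same normalized title means "same finding".
    return (f.category.lower(), " ".join(f.title.lower().split()))


def merge_findings(findings: list[RawFinding]) -> list[RawFinding]:
    """Drop duplicates (keeping the most severe copy), then rank by severity."""
    best: dict[tuple[str, str], RawFinding] = {}
    for f in findings:
        k = _key(f)
        current = best.get(k)
        if current is None or SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[current.severity]:
            best[k] = f
    # sorted() is stable, so findings with equal severity keep first-seen order.
    return sorted(best.values(), key=lambda f: SEVERITY_ORDER[f.severity])


merged = merge_findings([
    RawFinding("scalability", "No backpressure on ingest", "backpressure", "p2"),
    RawFinding("security", "API keys never rotate", "authn", "p1"),
    RawFinding("operability", "no  backpressure on Ingest", "backpressure", "p1"),
])
```

Here the two backpressure findings collapse into one, and the p1 copy wins. Note that a severity mismatch between two lenses is exactly the kind of disagreement the synthesizer may need to preserve, so a fuller version would likely record the collapsed copies rather than discard them.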

Prompt boundaries

Prompt boundaries are not decorative; they prevent scope creep and hallucinated authority. Each role should have explicit “must not” constraints. Examples:

  • The planner must not emit findings.
  • Specialists must not rewrite the contract or invent missing facts.
  • The synthesizer must not add new findings that were not supported by specialist outputs or input evidence (unless explicitly marked as an inference with low confidence).

When roles are crisp, orchestration becomes straightforward: you know what inputs each step needs and what outputs it is allowed to produce. When roles are fuzzy, you’ll spend your time chasing inconsistencies and blame-shifting between agents.

Code example: defining role agents

In the working example (reviewer.py), each role is an Agent with an output_type set to one of the contracts, and all roles share the same dependency type (ReviewDeps) carrying the design doc.

Passing deps=ReviewDeps(design_doc=...) alone is not enough: PydanticAI does not automatically inject dependencies into the prompt. Use dynamic instructions to attach the document to every run:

from __future__ import annotations

import os
from dataclasses import dataclass

from dotenv import load_dotenv
from pydantic_ai import Agent, RunContext

from models import ArchitectureReview, Lens, PlannerOutput, SpecialistOutput


load_dotenv()

DEFAULT_MODEL = os.getenv("MODEL", "anthropic:claude-sonnet-4-6")

if not os.getenv("ANTHROPIC_API_KEY"):
    raise RuntimeError(
        "ANTHROPIC_API_KEY is not set. Export it or add it to a .env file in the project root."
    )


@dataclass(frozen=True)
class ReviewDeps:
    design_doc: str


planner = Agent(
    DEFAULT_MODEL,
    deps_type=ReviewDeps,
    output_type=PlannerOutput,
    instructions=(
        "You are an architecture review planner. "
        "Given the design doc, choose which review lenses to run, "
        "write scoping notes, and list clarifying questions. "
        "Prefer asking questions over guessing."
    ),
)


@planner.instructions
def planner_design_doc(ctx: RunContext[ReviewDeps]) -> str:
    return f"Design doc to review:\n\n{ctx.deps.design_doc}"


def make_specialist(lens: Lens) -> Agent[ReviewDeps, SpecialistOutput]:
    specialist = Agent(
        DEFAULT_MODEL,
        deps_type=ReviewDeps,
        output_type=SpecialistOutput,
        instructions=(
            f"You are the {lens.value} specialist for an architecture review.\n"
            "Return findings using the output schema.\n"
            "- Every finding should include concrete evidence tied to the input.\n"
            "- If you lack evidence, ask a question instead of inventing facts.\n"
            "- Be concise; prioritize the highest-impact issues."
        ),
    )

    @specialist.instructions
    def specialist_design_doc(ctx: RunContext[ReviewDeps]) -> str:
        return f"Design doc to review:\n\n{ctx.deps.design_doc}"

    return specialist


synthesizer = Agent(
    DEFAULT_MODEL,
    deps_type=ReviewDeps,
    output_type=ArchitectureReview,
    instructions=(
        "You are the synthesizer for a multi-agent architecture review.\n"
        "Merge specialist outputs into one ArchitectureReview:\n"
        "- Deduplicate overlapping findings.\n"
        "- Rank by severity and impact.\n"
        "- If specialists disagree, either resolve via evidence or preserve uncertainty.\n"
        "- Keep the summary consistent with the structured findings."
    ),
)


@synthesizer.instructions
def synthesizer_design_doc(ctx: RunContext[ReviewDeps]) -> str:
    return f"Design doc that was reviewed:\n\n{ctx.deps.design_doc}"

...


Routing Topology: Fan-Out, Sequence, and When to Stop

With roles defined, the next question is routing: in what order do you run them, what state do you pass, and when do you stop? “Agent frameworks” often treat routing as a generic problem (graphs, tool routers, memory stores). Architecture review is narrower. You want a topology that is predictable, auditable, and cheap enough to run often.

There are three common patterns that fit this use case:

Fan-out then synthesize (standard)

Planner → Specialists (parallel) → Synthesizer.

This is usually the sweet spot. The planner scopes and selects lenses, specialists run independently in parallel, and the synthesizer merges. Parallelism gives you speed and reduces the chance that one lens anchors another. The cost is duplication and conflict, which you then handle in synthesis.

Gated sequence (when context is expensive)

Planner → Specialist A → Specialist B → … → Synthesizer.

Sequential routing makes sense when later steps depend on structured state produced earlier (e.g., the planner extracts a component inventory; specialists review component-by-component). The risk is error propagation: a missed component early can cause systematic blind spots.

Two-pass loop (only when you need it)

Planner → Specialists → Synthesizer → (optional) targeted re-asks → Synthesizer.

If you do loops, keep them narrow. The goal isn’t “let the agents think longer”. It’s to repair specific defects: missing evidence, unclear severity, or unresolved contradictions. Targeted re-asks are cheaper and more reliable than open-ended “review again”.

State passing: structured, not conversational

The “chat history soup” failure mode is real: you pass the entire transcript to every agent and hope they find what they need. The result is inconsistent emphasis and increasing token waste. For architecture review, treat state as data:

  • The raw input (design doc excerpt, constraints, assumptions).
  • A structured planner output (selected lenses, clarifying questions, scope notes).
  • A shared contract for specialist findings.
  • The synthesizer’s merged artifact.

This keeps each step anchored to the same fields, and it makes runs auditable: you can see which agent produced which finding and why.

Stopping conditions

In a lean system, stopping conditions should be boring and strict. Common rules:

  • Max rounds: no unbounded loops; if you need a third pass, you likely need a better contract or better inputs.
  • Empty findings: if specialists return no issues, don’t “try harder” unless the planner flagged missing context.
  • Low-confidence signals: if evidence is missing, prefer explicit questions/unknowns over additional speculative rounds.
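
These rules are simple enough to encode as a pure function that the orchestrator consults after each round. A minimal sketch: `RoundState`, `next_action`, and the `MAX_ROUNDS` value are illustrative choices, not part of the companion code.

```python
from dataclasses import dataclass

MAX_ROUNDS = 2  # assumption: one initial pass plus at most one targeted re-ask


@dataclass
class RoundState:
    round_number: int           # 1-based count of completed rounds
    finding_count: int          # findings returned by specialists this round
    missing_evidence: int       # findings with no evidence attached
    planner_flagged_gaps: bool  # planner reported missing context


def next_action(state: RoundState) -> str:
    """Return 'stop' or 'ask_questions' using strict, boring rules."""
    if state.round_number >= MAX_ROUNDS:
        return "stop"  # max rounds: never loop unbounded
    if state.finding_count == 0:
        # Empty findings: only dig further if context was flagged as missing.
        return "ask_questions" if state.planner_flagged_gaps else "stop"
    if state.missing_evidence > 0:
        return "ask_questions"  # prefer explicit questions over speculation
    return "stop"
```

Because the function has no model in the loop, the stopping behavior is testable and cannot drift with prompt changes.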

Reference topology

The reference topology we’ll use in the end-to-end walkthrough is:

Planner → (Security, Scalability, Operability, Data/Integrity) specialists in parallel → Synthesizer.

If you add anything, add it reluctantly - and only after you can name the specific failure mode it fixes.

Code example: orchestration (planner → specialists → synthesizer)

The same example wires the routing logic directly in Python, passing structured state (planner output, specialist JSON) rather than a growing chat transcript:

...

def run_review(design_doc: str) -> tuple[PlannerOutput, list[SpecialistOutput], ArchitectureReview]:
    deps = ReviewDeps(design_doc=design_doc)

    plan = planner.run_sync(
        "Plan the review lenses and questions for this design doc.",
        deps=deps,
    ).output

    specialist_outputs: list[SpecialistOutput] = []
    for lens in plan.lenses:
        specialist = make_specialist(lens)
        specialist_outputs.append(
            specialist.run_sync(
                (
                    "Review this architecture with your lens. "
                    f"Scope notes: {plan.scope_notes or '(none)'}"
                ),
                deps=deps,
            ).output
        )

    synthesis_input = (
        "Synthesize the final ArchitectureReview from:\n\n"
        "Planner clarifying questions:\n- "
        + "\n- ".join(plan.clarifying_questions or ["(none)"])
        + "\n\nSpecialist outputs:\n"
        + "\n\n".join(o.model_dump_json(indent=2) for o in specialist_outputs)
    )
    final = synthesizer.run_sync(synthesis_input, deps=deps).output
    return plan, specialist_outputs, final
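
The loop above calls the specialists one after another. To get the parallel fan-out described earlier, the same step can be expressed with asyncio.gather. This is a sketch with a stub coroutine: `run_specialist` is a hypothetical stand-in for awaiting one specialist agent (in PydanticAI, that would mean awaiting the agent's async `run` method instead of calling `run_sync`), and the sleep only simulates latency.

```python
import asyncio


async def run_specialist(lens: str, scope_notes: str) -> dict:
    """Hypothetical stand-in for awaiting one specialist agent."""
    await asyncio.sleep(0.01)  # simulates model latency
    return {"lens": lens, "findings": [], "scope": scope_notes}


async def fan_out(lenses: list[str], scope_notes: str) -> list[dict]:
    # gather() runs the coroutines concurrently and returns results in the
    # same order as its arguments, so outputs stay aligned with the lenses.
    return await asyncio.gather(
        *(run_specialist(lens, scope_notes) for lens in lenses)
    )


outputs = asyncio.run(
    fan_out(["security", "scalability", "operability"], "(none)")
)
```

By default, the first exception raised by any specialist propagates out of gather. Passing `return_exceptions=True` returns exceptions alongside successful results instead, which lets the synthesizer proceed with the lenses that did complete.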

End-to-End Example: One Review from Design Doc to Structured Output

An architecture review only becomes real when you can run it on a concrete input and get a structured artifact out the other side. This section sketches a single “happy path” run.

Walkthrough inputs

At minimum, you want three kinds of input:

  • Architecture description: the system diagram in prose - components, dependencies, data flows, and boundaries.
  • Constraints: what is non-negotiable (compliance, latency targets, cloud restrictions, tenancy model).
  • Known risks / focus areas: what the team is already worried about (migration, multi-region, PII, cost).

The biggest determinant of review quality is whether these are explicit. If the input does not state assumptions, the reviewer will either guess (bad) or ask a lot of questions (good but slower). Your contract should reward “ask a question” rather than “invent a fact”.
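
One way to make those three inputs explicit is to give them their own small type and render missing sections as stated unknowns rather than silently dropping them. A sketch, with `ReviewInput` and its section titles as assumptions for illustration:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReviewInput:
    architecture: str
    constraints: list[str] = field(default_factory=list)
    focus_areas: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render to prompt text; empty sections are stated, not omitted."""

        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items)
            return f"## {title}\n{body or '- (not provided; treat as unknown)'}"

        return "\n\n".join([
            f"## Architecture\n{self.architecture.strip()}",
            section("Constraints", self.constraints),
            section("Known risks / focus", self.focus_areas),
        ])


doc = ReviewInput(
    architecture="Public REST API behind an API gateway; async workers.",
    constraints=["99.9% availability target for ingestion endpoint"],
).render()
```

The explicit "(not provided; treat as unknown)" marker gives the agents something concrete to react to, which nudges them toward asking a question instead of guessing.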

Reference implementation sketch

Conceptually, you define three things:

  1. Schemas (Pydantic models) for the planner output, specialist findings, and the final review artifact.
  2. Agents bound to those schemas (planner agent, specialist agents, synthesizer agent).
  3. Orchestration that routes structured state between them.

Claude is the model behind each agent; PydanticAI is the layer that forces responses to fit the schema and provides retries/repairs when they don’t.

Clone the repository, set ANTHROPIC_API_KEY, and run python run_review.py to reproduce the walkthrough below.

Happy path routing

The run looks like this:

  1. Planner call: read the input, select lenses (e.g., Security, Scalability, Operability, Data/Integrity), and emit clarifying questions if required.
  2. Specialist fan-out: run each lens with the same input plus planner scope. Each specialist emits a list of Finding objects (and questions/unknowns if the contract supports them).
  3. Synthesizer merge: merge the lists into a final ArchitectureReview artifact: dedupe, rank, and normalize severity.

If you log inputs and outputs at each step, this pipeline is easy to debug: you can see whether the planner scoped incorrectly, whether a specialist missed evidence, or whether synthesis merged incorrectly.

Example input and output

The final artifact should be something you can paste into a PR comment and parse as data. A good output has:

  • A short summary (what’s good, what’s risky).
  • Ranked findings with clear severities and categories.
  • Evidence tied to the input (or explicit “inference” flags).
  • Actionable recommendations (what to change, what to measure, what to decide).
  • A small set of clarifying questions that genuinely block conclusions.

The companion repository includes a toy design doc at sample_design.md - short on purpose, but enough for specialists to anchor findings to real statements:

# Example design doc (toy)

We are building a multi-tenant SaaS API that ingests events from customer apps.

## Components

- Public REST API behind an API gateway.
- Worker service that processes events asynchronously.
- Postgres for tenant metadata and configuration.
- S3 for raw event payload storage.
- Redis for rate limiting and job deduplication.

## Constraints / assumptions

- Tenants are identified by an API key.
- Peak: 50k events/sec across tenants, spikes up to 5×.
- PII may be present in event payloads.
- 99.9% availability target for ingestion endpoint.

## Known risks / focus

- We previously had incidents from retry storms.
- We need to support data deletion by tenant (GDPR-style).

Running python run_review.py against that file produces out_review.json. Here is an excerpt of the structured review (truncated for readability):

{
  "summary": "This multi-tenant SaaS event-ingestion platform has a **high overall risk profile** driven by four converging concern areas: (1) an undefined API key lifecycle… (2) absent tenant isolation controls… (3) PII in S3 with no encryption-at-rest strategy… (4) a GDPR deletion requirement spanning Postgres, S3, and Redis with no coordination mechanism…",
  "overall_risk": "high",
  "findings": [
    {
      "title": "GDPR Deletion Lacks Cross-Store Atomicity, Completeness Guarantee, and Audit Trail",
      "category": "compliance",
      "severity": "p0",
      "evidence": [
        {
          "kind": "quote",
          "detail": "'We need to support data deletion by tenant (GDPR-style)' is listed as a known risk but no deletion workflow, ordering, or rollback strategy is documented."
        },
        {
          "kind": "observation",
          "detail": "Data is spread across three independent stores — Postgres, S3, and Redis — with no described coordination mechanism."
        }
      ],
      "recommendation": "Implement a saga/orchestration pattern for tenant deletion that tracks per-store deletion state…"
    },
    {
      "title": "API Key Lifecycle Management Undefined — No Rotation, Revocation, or Scoping Controls",
      "category": "authn",
      "severity": "p1",
      "evidence": [
        {
          "kind": "quote",
          "detail": "Tenants are identified by an API key — no mention of key rotation, revocation, expiry, or scoping anywhere in the design."
        }
      ],
      "recommendation": "Implement a full API key lifecycle: scoped creation, server-side HMAC-SHA256 hashing, rotation policies, and immediate revocation propagation via Redis…"
    }
  ],
  "questions": [
    "What queue technology sits between the API gateway and the worker (SQS, Kafka, RabbitMQ, etc.)?",
    "…"
  ]
}

Notice how each finding carries explicit evidence kinds (quote, observation, inference) and a ranked severity - exactly the shape your contracts enforce, and exactly what makes the output routable into a PR comment or issue tracker without a second translation pass.

What did that run cost? On this toy doc, a full pipeline (planner + four specialists + synthesizer, using claude-sonnet-4-6) came to about $0.45. That’s reasonable for an occasional architecture review on a real design doc; it’s expensive if you run it on every small PR. Treat this as one data point - cost scales with document length, lens count, validation retries, and model choice - not a fixed price tag. It’s another reason to keep the topology lean and scope lenses deliberately.

The point isn’t the exact field names; it’s that the artifact can be routed into real workflows without a human reformatting it.
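
As an illustration of that routing step, the sketch below turns a review payload into text for a PR comment using only the standard library. The layout and the `render_pr_comment` name are assumptions; the field names follow the excerpt above, and the sample payload is made up for the example.

```python
import json


def render_pr_comment(review_json: str, max_findings: int = 5) -> str:
    """Render a review payload as a short PR comment (layout is illustrative)."""
    review = json.loads(review_json)
    lines = [f"Architecture review: overall risk {review['overall_risk'].upper()}", ""]
    # Assumes findings arrive already ranked by the synthesizer.
    for f in review["findings"][:max_findings]:
        lines.append(f"- [{f['severity'].upper()}] {f['title']} ({f['category']})")
        lines.append(f"  Recommendation: {f['recommendation']}")
    if review.get("questions"):
        lines += ["", "Open questions:"]
        lines += [f"- {q}" for q in review["questions"]]
    return "\n".join(lines)


sample = json.dumps({
    "summary": "High risk around tenant deletion.",
    "overall_risk": "high",
    "findings": [{
        "title": "GDPR deletion lacks cross-store coordination",
        "category": "compliance",
        "severity": "p0",
        "evidence": [],
        "recommendation": "Track per-store deletion state.",
    }],
    "questions": ["What queue sits between the gateway and the worker?"],
})
comment = render_pr_comment(sample)
```

The same payload could feed an issue tracker by mapping one finding to one ticket; no model call is involved in either direction.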


Failure Modes: Hallucinations, Conflicts, and Unknowns

Architecture review is a high-trust activity. When a human reviewer says “this will fail under load”, you can ask why, argue about assumptions, or request a benchmark plan. When a model says it, you get a different problem: the statement is often well-phrased but its epistemic status is unclear. Is it anchored in the input? Is it an inference? Is it a generic warning? If you don’t design for that, you’ll end up with a reviewer that either hallucinates confidently or hedges uselessly.

This section is a set of failure modes worth designing against up front.

Unsupported claims: make evidence a first-class field

The simplest guardrail is structural: require each finding to carry evidence. “Evidence” can be a quote, a reference to a section of the input, or a concrete observation about the described architecture. If the input does not contain enough evidence, the finding should not pretend otherwise - it should downgrade severity or convert into a clarifying question.

This one constraint changes behavior. Models are much less likely to invent specifics when they must attach them to an evidence slot. And when they do invent, it becomes visible: the evidence field will be empty, vague, or obviously unrelated.

Specialist disagreement: preserve dissent when it matters

Parallel specialists will disagree. Sometimes that’s a bug (one misunderstood the architecture). Sometimes it’s the point (tradeoffs are real). Synthesis should not always force consensus. A useful pattern is:

  • If the disagreement is resolvable by the input, resolve it and cite the evidence.
  • If the disagreement is resolvable by a missing fact, emit a question and present the conditional conclusions (“If X, then P0; if not, then P2”).
  • If it’s a genuine tradeoff, preserve dissent explicitly and explain the consequence of each choice.

The goal is not to “sound decisive”. It’s to help engineers decide with clarity about what hinges on what.

Unknowns: treat “needs human” as a valid outcome

Most model failures under architecture review are failures of uncertainty handling. The model would rather guess than admit it doesn’t know. Your contract should give it a safe place to put uncertainty: unknown, assumption, or needs_human fields that are treated as valid outputs, not errors.

This is also where you differentiate between “missing input” and questions the document can’t settle at all. Missing input can be fixed by asking a question. The second kind might require a benchmark, a threat model, or a human policy decision. Your reviewer should surface that explicitly instead of burying it in hedged prose.
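
A sketch of what that "safe place" could look like in the contract. The `Unknown` and `SpecialistNotes` models and the three `kind` values are hypothetical; they are one possible encoding of the distinction above, not fields from the companion repository.

```python
from typing import Literal

from pydantic import BaseModel, Field


class Unknown(BaseModel):
    """Illustrative: an explicit home for what the reviewer cannot conclude."""
    topic: str
    kind: Literal["missing_input", "needs_measurement", "needs_human"]
    blocks: list[str] = Field(
        default_factory=list,
        description="Titles of findings whose severity depends on this unknown.",
    )
    next_step: str = Field(
        ..., description="Question to ask, benchmark to run, or decision to make."
    )


class SpecialistNotes(BaseModel):
    """Hypothetical container: unknowns are regular output, not an error path."""
    unknowns: list[Unknown] = Field(default_factory=list)


notes = SpecialistNotes.model_validate({
    "unknowns": [
        {
            "topic": "RTO/RPO",
            "kind": "missing_input",
            "next_step": "Ask the team for recovery targets.",
        },
        {
            "topic": "Redis under a 5x spike",
            "kind": "needs_measurement",
            "next_step": "Load-test rate limiting at 250k events/sec.",
        },
    ]
})
# Only missing-input unknowns become clarifying questions for the author.
questions = [u.next_step for u in notes.unknowns if u.kind == "missing_input"]
```

Splitting by `kind` lets downstream code route each unknown differently: questions go back to the doc author, measurements become tasks, and policy decisions go to a named human.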

Guardrails without over-engineering

You don’t need a full evaluation harness to be safer than the average demo. A few cheap guardrails go a long way:

  • Schema constraints: enums for severity/categories; required evidence fields; bounded list sizes.
  • Rubric checks: simple consistency rules (“P0 findings must include a clear blast radius and an action”).
  • Spot re-asks: targeted second-pass prompts when specific fields are weak (“rewrite evidence”, “justify severity”, “convert speculative claims into questions”).

The point is to fix predictable defects deterministically, not to create an open-ended “think harder” loop.
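
The first two guardrails can be sketched in a few lines of Pydantic. `StrictFinding` is an illustrative, stricter variant of a finding model, and the word-count rule in `rubric_violations` is a deliberately crude placeholder for a real "P0 needs a concrete action" rubric.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class StrictFinding(BaseModel):
    title: str = Field(min_length=5, max_length=120)
    severity: Literal["p0", "p1", "p2", "p3"]
    evidence: list[str] = Field(min_length=1, max_length=5)  # required, bounded
    recommendation: str = Field(min_length=1)


def rubric_violations(f: StrictFinding) -> list[str]:
    """Illustrative rubric rule layered on top of schema validation."""
    problems = []
    if f.severity == "p0" and len(f.recommendation.split()) < 5:
        problems.append("p0 findings need a concrete action, not a one-liner")
    return problems


# Schema constraint: a finding with no evidence is rejected outright.
try:
    StrictFinding(
        title="Unbounded retries", severity="p0", evidence=[], recommendation="Fix it."
    )
except ValidationError as exc:
    schema_errors = [err["loc"] for err in exc.errors()]

# Rubric check: schema-valid, but the recommendation is too thin for a p0.
ok = StrictFinding(
    title="Unbounded retries",
    severity="p0",
    evidence=["We previously had incidents from retry storms."],
    recommendation="Fix it.",
)
reask = rubric_violations(ok)
```

A non-empty `reask` list is the trigger for a spot re-ask: the violation message can be sent back as the targeted instruction for that one finding.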

What to log (so you can debug)

If you deploy this, you want logs that help you answer: “which step failed, and how”? At minimum:

  • The input digest (so you can correlate runs without storing sensitive docs verbatim).
  • Planner output (selected lenses, questions, scoping decisions).
  • Each specialist’s structured findings (including validation failures/retries).
  • Synthesizer merge decisions (deduping and any conflict resolution notes).

With that, you can debug multi-agent runs like any other pipeline: identify the step that produced bad data, tighten the contract or prompt for that role, and move on.
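
A minimal sketch of such a log record, using only the standard library. The field names and the 16-character digest length are arbitrary choices for illustration.

```python
import hashlib
import json
import time


def input_digest(design_doc: str) -> str:
    """Stable fingerprint of the input; the document itself is not logged."""
    return hashlib.sha256(design_doc.encode("utf-8")).hexdigest()[:16]


def log_record(step: str, digest: str, payload: dict, retries: int = 0) -> str:
    """One JSON line per pipeline step (field names are illustrative)."""
    return json.dumps(
        {
            "ts": time.time(),
            "step": step,
            "input": digest,
            "retries": retries,
            "output": payload,
        },
        sort_keys=True,
    )


digest = input_digest("We are building a multi-tenant SaaS API...")
line = log_record("planner", digest, {"lenses": ["security", "scalability"]})
```

Because every step logs the same digest, all records for one run can be joined without the design doc ever appearing in the log store.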


Operational Heuristics: Prompt Pack and Debugging

If you treat your architecture reviewer like a one-off prompt, it will behave like one. If you treat it like a component in an engineering system, it becomes maintainable.

Build a “review prompt pack”

Instead of hand-editing prompts in code, keep a small prompt pack with:

  • The role definitions (planner, each specialist, synthesizer).
  • Your rubric snippets (what counts as severity P0/P1/P2, what categories you care about).
  • One or two output examples that demonstrate the contract “done right”.

This does two things. First, it creates a shared artifact for the team - people can review and improve it like any other engineering asset. Second, it makes drift obvious: if the output starts violating the rubric, you can update the pack instead of chasing ad-hoc prompt edits scattered through the code.

Version your contracts like an API

Once downstream systems depend on your schema, it becomes an API. Treat it that way:

  • Make breaking changes intentionally (renames, enum changes, required fields).
  • Consider adding a schema_version to the top-level review artifact.
  • Keep migration logic simple: prefer additive changes early, and prune later once consumers catch up.

Most failures in production-like agent systems aren’t “the model got dumber”. They’re “the contract moved and the assumptions didn’t”.
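
A sketch of the versioning idea for stored or in-flight review payloads. The version numbers and the v1-to-v2 change (adding `questions`) are hypothetical, chosen only to show an additive migration.

```python
CURRENT_SCHEMA_VERSION = 2  # illustrative version number


def upgrade_review(review: dict) -> dict:
    """Upgrade an older review payload to the current shape, one step at a time."""
    data = dict(review)  # shallow copy so the caller's payload is not mutated
    version = data.get("schema_version", 1)  # assumption: unversioned means v1

    if version == 1:
        # Hypothetical v1 -> v2 change: 'questions' was added with a safe default.
        data.setdefault("questions", [])
        version = 2

    if version != CURRENT_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version}")

    data["schema_version"] = version
    return data


old = {"summary": "Looks fine.", "overall_risk": "low", "findings": []}
new = upgrade_review(old)
```

Additive changes like this one need only a default. Renames and enum changes need a real transformation step, which is a good reason to make them rarely and on purpose.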

Debugging checklist: contract vs. reasoning vs. routing

When something goes wrong, you want a fast way to localize the problem:

  • Contract failure: validation errors, missing fields, wrong enum values. Fix with stricter schemas, clearer instructions, or repair prompts.
  • Reasoning failure: the model followed the schema but produced low-quality findings. Fix with a better rubric, better lens prompts, and stricter evidence requirements.
  • Routing failure: the right work didn’t run (wrong lenses selected), or state was passed incorrectly. Fix the planner logic and the state model; don’t patch around it in specialist prompts.

This is why structured state passing matters: you can inspect each stage and see whether the pipeline is broken structurally or semantically.
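
The checklist can be partly automated over stored stage results. The sketch below is one way to do it under stated assumptions: StageResult and triage are hypothetical names, and "a finding with empty evidence" is only a crude proxy for a reasoning failure, since real quality problems still need a human or an evaluation set to detect.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureKind(str, Enum):
    ROUTING = "routing"
    CONTRACT = "contract"
    REASONING = "reasoning"
    NONE = "none"


@dataclass
class StageResult:
    lens: str
    validation_errors: list[str] = field(default_factory=list)
    findings: list[dict] = field(default_factory=list)


def triage(expected_lenses: set[str], results: list[StageResult]) -> FailureKind:
    """Return the first failure class that applies, checked from structure to semantics."""
    ran = {result.lens for result in results}
    if expected_lenses - ran:
        # Some lens that should have run never did: planner or state problem.
        return FailureKind.ROUTING
    if any(result.validation_errors for result in results):
        return FailureKind.CONTRACT
    if any(not f.get("evidence") for result in results for f in result.findings):
        # Schema-valid but unsupported findings: treat as a reasoning problem.
        return FailureKind.REASONING
    return FailureKind.NONE


results = [
    StageResult("security", findings=[{"title": "No auth", "evidence": "Section 3"}]),
    StageResult("cost", validation_errors=["severity: unexpected value 'HIGH'"]),
]
print(triage({"security", "cost"}, results))
```

Routing is checked first on purpose: if the wrong lenses ran, the quality of what did run tells you little.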

Keep it lean: don’t add features until you feel pain

It’s tempting to add memory, retrieval (RAG), tool routers, and evaluation harnesses immediately. Most of that is premature for a reviewer that’s still proving it can produce a single reliable structured artifact.

Add only what fixes a named problem:

  • Add memory when you have multi-step interactions that truly benefit from long-lived context.
  • Add evaluation when you’re shipping frequent prompt/contract changes and need regression protection.
  • Add retrieval when your reviewer needs access to external specs, policies, or service inventories that are too large to paste into the input.

Extend one lens at a time

The clean way to extend this system is to add a specialist, not to bloat existing ones. If you want a “Compliance” lens, define:

  • A new role prompt for compliance.
  • The same output contract as the other specialists.
  • A planner rule for when to include that lens.

Because the contract is stable and synthesis already knows how to merge, you get extensibility without rewriting orchestration.
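
As a sketch of how small that change can be, the registry below adds a compliance lens as one extra entry. Lens, keyword_rule, and select_lenses are hypothetical names, and the keyword match is a deterministic stand-in for whatever rule your planner actually applies (which may itself be a model call).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Lens:
    name: str
    role_prompt: str
    applies: Callable[[str], bool]  # planner rule: should this lens run on the doc?


def keyword_rule(*keywords: str) -> Callable[[str], bool]:
    """Build a rule that fires when any keyword appears in the doc (case-insensitive)."""
    def rule(doc: str) -> bool:
        text = doc.lower()
        return any(keyword in text for keyword in keywords)
    return rule


LENSES = [
    Lens("security", "You review designs for security risks.", lambda doc: True),
    # The new lens: one role prompt plus one planner rule, same output contract.
    Lens(
        "compliance",
        "You review designs for regulatory and policy risks.",
        keyword_rule("pii", "gdpr", "retention"),
    ),
]


def select_lenses(doc: str) -> list[str]:
    return [lens.name for lens in LENSES if lens.applies(doc)]


print(select_lenses("We store PII for 30 days."))
print(select_lenses("A read-through cache in front of the catalog service."))
```

Nothing downstream changes: the new specialist emits the same finding shape, so the synthesizer needs no new code.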


Closing: From Demo to Engineering Workflow

The pattern in this article is deliberately simple: contracts + roles + a boring topology. That combination is what turns “LLM feedback” into something you can actually integrate into engineering work.

Contracts are the differentiator. They force the reviewer to produce findings as data, not prose. Roles keep each agent honest: the planner scopes, specialists apply lenses, and the synthesizer merges into a single artifact. The topology stays lean so you can run it often and debug it when it misbehaves.

The practical question is where to plug this in. Architecture review is not a single event; it happens at different points in a system’s lifecycle. A structured reviewer can support a few common workflows:

  • Design reviews: run it on a design doc draft to surface missing assumptions and obvious risks before a meeting.
  • PRs for architectural changes: attach the structured artifact as a PR comment, with a short summary plus ranked findings.
  • ADRs: use the questions and “needs human judgment” fields to drive what the ADR must explicitly decide.
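
For the PR case, turning the artifact into a comment is a few lines once findings are data. The sketch below assumes each finding is a dict with severity, title, and recommendation keys; those key names and render_pr_comment are illustrative, and posting the comment to your code host is left out.

```python
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}


def render_pr_comment(summary: str, findings: list[dict]) -> str:
    """Render a short summary followed by findings ranked by severity."""
    # sorted() is stable, so findings of equal severity keep their original order.
    ranked = sorted(findings, key=lambda f: SEVERITY_ORDER[f["severity"]])
    lines = ["Architecture review", "", summary, ""]
    for finding in ranked:
        lines.append(
            f"- [{finding['severity']}] {finding['title']}: {finding['recommendation']}"
        )
    return "\n".join(lines)


comment = render_pr_comment(
    "2 findings, 1 blocking.",
    [
        {"severity": "P2", "title": "No cost estimate",
         "recommendation": "Add a monthly cost projection."},
        {"severity": "P0", "title": "No rollback plan",
         "recommendation": "Document how to revert the migration."},
    ],
)
print(comment)
```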

The point is not to replace humans. It’s to make the review loop tighter and more consistent - and to ensure the output is shaped like something your team can act on.

If you outgrow the lean version, the next steps are straightforward:

  • Add an evaluation harness with a small set of “golden” design docs and expected findings, so prompt/contract changes don’t regress silently.
  • Add organization-specific retrieval (policies, SLO templates, service inventories) when you repeatedly see “unknown” due to missing institutional context.
  • Expand lens coverage one specialist at a time, keeping the contract stable.
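
The evaluation harness can start very small. The sketch below assumes findings are dicts with category and severity keys and uses a stub in place of the real reviewer; GoldenCase, run_suite, and stub_reviewer are hypothetical names. It checks only that expected findings are present, which is a deliberately loose assertion given that model output varies between runs.

```python
from dataclasses import dataclass
from typing import Callable

Reviewer = Callable[[str], list[dict]]


@dataclass(frozen=True)
class GoldenCase:
    name: str
    doc: str
    expected: frozenset[tuple[str, str]]  # (category, severity) pairs that must appear


def run_suite(cases: list[GoldenCase], review: Reviewer) -> list[str]:
    """Return one message per expected finding that the reviewer did not produce."""
    failures = []
    for case in cases:
        got = {(f["category"], f["severity"]) for f in review(case.doc)}
        for category, severity in sorted(case.expected - got):
            failures.append(f"{case.name}: missing {category}/{severity}")
    return failures


def stub_reviewer(doc: str) -> list[dict]:
    # Stand-in for the real pipeline; always reports a single security finding.
    return [{"category": "security", "severity": "P1"}]


CASES = [
    GoldenCase("open-admin-api", "Admin API has no auth.",
               frozenset({("security", "P1")})),
    GoldenCase("no-runbook", "No on-call or runbook is defined.",
               frozenset({("operability", "P0")})),
]

print(run_suite(CASES, stub_reviewer))
```

Run in CI against the real reviewer, an empty list means no expected finding went missing after a prompt or contract change.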

The best call to action is also the simplest: ship the smallest reviewer that returns structured findings with evidence. Run it on one real design doc. If the output is useful, you’ll know exactly what to improve next. If it isn’t, don’t add more agents - tighten the contract and the inputs until it becomes reliable.

Try the code: github.com/nunombispo/multi-agent-architecture-reviewer-article - clone, point run_review.py at your own design doc, and iterate on contracts and lenses from there.

Want to sharpen your Pydantic skills? This article leans on schemas as the backbone of agent reliability. For a full treatment of validation, serialization, and real-world Pydantic patterns, check out Practical Pydantic: The Missing Guide to Data Validation in Python on Leanpub.


Nuno Bispo

Solutions Architect · Senior Python & AI Engineer · AI Audits · Helping teams fix what they shipped too fast
Netherlands