22 Jun 2026 9 min read Python

PydanticAI vs LangChain - Choosing an Agent Framework for Production, Not Demos

In a recent audit, a team showed me an AI assistant they'd built on top of their company knowledge base. The demo had landed well: ask how to use a feature, and it walked through the exact pain point their support queue kept seeing. Leadership signed off.

In production, the same agent told a user to open a menu option that didn't exist. Not a vague answer - a specific UI path, stated with confidence. Nobody caught it in testing. It surfaced when I audited the system, not when a user complained.

The prototype passed testing because nobody was checking whether the answer matched the product. In production, that gap becomes a liability: the model invents UI paths, and your backend has no schema to reject them.

When you're choosing an agent framework, popularity is the wrong scorecard. Pick the one that fails loudly in development and gracefully in production - or you'll find out in audit.

What "Production-Ready" Actually Requires

Tutorial agents are built to impress in a fifteen-minute demo. Production agents run unattended, handle bad inputs, and ship answers your backend has to trust. The gap between those two goals is where most teams stumble - and it's rarely visible until something reaches a user.

When I audit agent codebases, I evaluate five things the tutorials skip:

Structured, validated outputs: Can your system reject an invented menu path before it becomes user-facing advice?
Dependency injection for testing: Can you swap the knowledge base for a mock in CI without rewiring the agent?
Retry and error handling: When the model returns malformed output, does the framework retry - or do you ship a parser exception?
Observability hooks: Can you trace which document grounded a bad answer when support escalates?
Type-checker support: Will static analysis catch a breaking API change before deploy, or after the agent silently misbehaves?

If you want to score your own system, the Production Readiness Audit applies this kind of rubric across deployment, observability, and failure modes, and ends with a prioritized remediation plan.

Side-by-Side: The Same Agent, Two Frameworks

The first item on the rubric is structured, validated outputs. The clearest way to see the framework difference is to build the same agent twice.

The task: answer natural-language questions about a CSV of sales data. The agent calls a tool to query the file, then returns a structured answer your API can pass downstream without a second parsing step.

LangChain

import pandas as pd
from langchain.agents import create_agent
from langchain.tools import tool

df = pd.read_csv("sales.csv")  # placeholder path for the sales data

@tool
def query_sales_csv(region: str) -> str:
    """Return total revenue for a region in the sales CSV."""
    total = df.loc[df["region"] == region, "revenue"].sum()
    return f"{region}: ${total:,.0f}"

agent = create_agent("anthropic:claude-sonnet-4-6", tools=[query_sales_csv])
result = agent.invoke({
    "messages": [{"role": "user", "content": "What was Q1 revenue in Europe?"}],
})

answer = result["messages"][-1].content  # str — you validate the shape yourself

This is the pattern most tutorials teach. The tool works, the agent runs, the demo looks fine. But answer is a string (or a list of content blocks, depending on the model and its output). Nothing in this flow checks that the response contains a real region name, a numeric revenue, or the right currency. If the model formats the answer as prose instead of data, your code finds out in production - or in audit.
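To make "you validate the shape yourself" concrete, here is a minimal sketch of the hand-rolled check this flow leaves to you. It assumes the model echoes the tool's own "Region: $amount" format, which nothing guarantees; ParsedSalesAnswer and parse_sales_answer are hypothetical names, not part of LangChain.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedSalesAnswer:
    """Hypothetical container for a reply that passed the shape check."""

    region: str
    total_revenue: float


# Matches the tool's own format, e.g. "Europe: $1,250,000".
_ANSWER_RE = re.compile(r"^(?P<region>[A-Za-z ]+): \$(?P<amount>[\d,]+(?:\.\d+)?)$")


def parse_sales_answer(text: str) -> ParsedSalesAnswer:
    """Parse a reply, or raise ValueError if it is not in the expected shape."""
    match = _ANSWER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected answer shape: {text!r}")
    amount = float(match["amount"].replace(",", ""))
    return ParsedSalesAnswer(region=match["region"], total_revenue=amount)
```

A reply like "Europe: $1,250,000" parses; a prose reply raises ValueError. Every new answer format means another regex, which is the maintenance cost a schema is meant to remove.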

LangChain does support a response_format parameter with Pydantic models. It's opt-in, and most teams I audit haven't wired it up yet.

PydanticAI

import pandas as pd
from pydantic import BaseModel
from pydantic_ai import Agent

df = pd.read_csv("sales.csv")  # placeholder path for the sales data

class SalesAnswer(BaseModel):
    region: str
    total_revenue: float
    currency: str = "USD"

agent = Agent("anthropic:claude-sonnet-4-6", output_type=SalesAnswer)

@agent.tool_plain
def query_sales_csv(region: str) -> float:
    """Return total revenue for a region in the sales CSV."""
    return float(df.loc[df["region"] == region, "revenue"].sum())

result = agent.run_sync("What was Q1 revenue in Europe?")
answer = result.output  # SalesAnswer — validated before your code runs

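Here, validation isn't a step you add later - it's the contract. output_type=SalesAnswer tells the agent what shape to return. If the model produces something that doesn't match - a wrong field, a missing revenue, a non-numeric total - PydanticAI catches it before your application code touches it. One caveat: region: str accepts any string, so rejecting an invented region takes a constrained type (an enum or Literal) or an output validator that checks the value against the CSV. Either way, you get a SalesAnswer object your type checker understands, not a string you hope to parse.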

Same task, same tool, same model. The difference is what happens after the LLM responds: LangChain hands you text and trusts you'll validate it; PydanticAI hands you a typed object or fails loudly once its retries run out.

Dependency Injection & Testability

Validated outputs tell you the shape is right. Dependency injection lets you control the data behind the answer - and test against it without touching live systems on every CI run.

Agent tools don't operate in a vacuum. They read from databases, knowledge bases, and internal APIs. In production, those dependencies are real. In tests, they need to be fake - predictable, fast, and free. The question is whether your framework makes that swap explicit or forces you to hack around it.

PydanticAI: dependencies as a first-class parameter

PydanticAI declares what an agent needs via deps_type. Tools receive a RunContext and pull dependencies from ctx.deps. At run time, you pass the real implementation; in tests, you pass a fake.

from dataclasses import dataclass
from pydantic_ai import Agent, RunContext

@dataclass
class SalesDataSource:
    def revenue_for(self, region: str) -> float:
        return float(df.loc[df["region"] == region, "revenue"].sum())

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    deps_type=SalesDataSource,
    output_type=SalesAnswer,
)

@agent.tool
def query_sales_csv(ctx: RunContext[SalesDataSource], region: str) -> float:
    return ctx.deps.revenue_for(region)

# Production: agent.run_sync(prompt, deps=SalesDataSource())
# Test:       agent.run_sync(prompt, deps=FakeSalesData(revenue=1_250_000))

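The type checker enforces the contract. If a tool expects SalesDataSource and you pass something else, mypy catches it before merge, so FakeSalesData needs to subclass SalesDataSource or share a protocol with it. Your test injects FakeSalesData(revenue=1_250_000) and asserts the agent's structured output matches - no CSV file required. Injected deps only replace the data, though: to keep the network and the API key out of CI as well, swap the model for a stub such as PydanticAI's TestModel or FunctionModel using agent.override.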
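FakeSalesData is used above without a definition. One possible shape, sketched under assumptions: SalesSource is a hypothetical Protocol that both the real source and the fake satisfy, and lookup_revenue is a stand-in for the tool body so the example runs without PydanticAI installed. Whether your type checker accepts a Protocol as deps_type is worth verifying; subclassing SalesDataSource is the other option.

```python
from dataclasses import dataclass
from typing import Protocol


class SalesSource(Protocol):
    """Hypothetical structural type: anything that reports revenue by region."""

    def revenue_for(self, region: str) -> float: ...


@dataclass
class FakeSalesData:
    """Test double that returns the same canned revenue for every region."""

    revenue: float

    def revenue_for(self, region: str) -> float:
        return self.revenue


def lookup_revenue(source: SalesSource, region: str) -> float:
    """Stand-in for the tool body; in the agent this would read ctx.deps."""
    return source.revenue_for(region)


fake = FakeSalesData(revenue=1_250_000)
print(lookup_revenue(fake, "Europe"))
```

Because the fake ignores the region argument, a test built on it checks the wiring between tool and data source, not the filtering logic. That logic still needs its own test against real rows.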

LangChain: it works, but the seams are yours to find

LangChain agents can be tested, but the pattern most tutorials teach has no injection point. The tool closes over a module-level dependency, and tests reach for unittest.mock.patch:

from unittest.mock import patch
from langchain.agents import create_agent
from langchain.tools import tool

data_source = SalesDataSource()  # module-level — no framework injection point

@tool
def query_sales_csv(region: str) -> str:
    """Return total revenue for a region in the sales CSV."""
    return f"{region}: ${data_source.revenue_for(region):,.0f}"

agent = create_agent("anthropic:claude-sonnet-4-6", tools=[query_sales_csv])

# Test: patch the module where data_source lives
with patch("myapp.sales_agent.data_source", FakeSalesData(revenue=1_250_000)):
    result = agent.invoke({"messages": [{"role": "user", "content": "Q1 Europe?"}]})

Here you're patching a string path - "myapp.sales_agent.data_source" - that must name the exact module where the tool looks up data_source. Rename the file, change the import structure, or run tests in parallel that share a patched global, and you get flakes or false greens.
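One way teams build their own seam here, sketched under assumptions: construct the tool function through a factory so the test passes the fake explicitly instead of patching a string path. make_query_sales_csv is a hypothetical helper, not a LangChain API, and the step of wrapping the returned function with @tool and passing it to create_agent is omitted so the sketch runs without LangChain installed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeSalesData:
    """Test double that returns the same canned revenue for every region."""

    revenue: float

    def revenue_for(self, region: str) -> float:
        return self.revenue


def make_query_sales_csv(data_source) -> Callable[[str], str]:
    """Build the tool function around an explicitly supplied data source."""

    def query_sales_csv(region: str) -> str:
        """Return total revenue for a region in the sales CSV."""
        return f"{region}: ${data_source.revenue_for(region):,.0f}"

    return query_sales_csv


# No module-level global, so there is nothing to patch by string path.
query = make_query_sales_csv(FakeSalesData(revenue=1_250_000))
print(query("Europe"))  # Europe: $1,250,000
```

The trade-off is that the agent has to be built per data source, and the seam is a convention your team maintains, not something the framework enforces.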

If you've fought flaky agent tests, you've lived this. The test doesn't fail because the agent logic is wrong; it fails because the test setup is fighting the framework's defaults.

PydanticAI doesn't eliminate the need to write tests. It gives you a seam that was designed for swapping. That's the difference between "we test our agents in CI" and "we test our agents in CI reliably."

Error Handling and Retries in Practice

When the model returns garbage, what happens next? In production, "garbage" isn't always obvious - it's a well-formed JSON object with total_revenue: "approximately high" or a region name the CSV doesn't contain. The framework should catch that and recover, not pass it to your API.
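Both kinds of garbage can be caught at the schema level. The sketch below uses plain Pydantic, with no agent involved, to show which field fails in each case. The region list is hypothetical (in practice it would come from the CSV), and StrictSalesAnswer and failing_fields are illustrative names, not part of the example agent above.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

# Hypothetical region list; in practice it would be derived from the CSV.
Region = Literal["Europe", "North America", "Asia"]


class StrictSalesAnswer(BaseModel):
    region: Region
    total_revenue: float
    currency: str = "USD"


def failing_fields(payload: dict) -> list[str]:
    """Return the names of fields that fail validation (empty if valid)."""
    try:
        StrictSalesAnswer(**payload)
    except ValidationError as exc:
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


print(failing_fields({"region": "Europe", "total_revenue": 1250000}))
print(failing_fields({"region": "Europe", "total_revenue": "approximately high"}))
print(failing_fields({"region": "Atlantis", "total_revenue": 1250000}))
```

The first payload passes, the second fails on total_revenue, and the third fails on region. A hardcoded Literal goes stale when the data changes, so a check against the live region list may fit better in a real system.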

PydanticAI: validation failures feed back to the model

When output_type is a Pydantic model, schema violations don't reach your application code. PydanticAI sends the validation error back to the model and retries: