Why Reflection Fails for AI Agent Structured Output

Many AI agent tutorials propose the same fix for bad output: reflection. Your agent generates garbage JSON? Just add another LLM call to “review” it. The second call critiques the first, the first tries again, and voilà: quality improves. It seems clean, elegant, and academically respectable.

Well, I’ve shipped agents to production at a large-scale web company — systems that generated deployment configs, API payloads, database queries. And I can tell you from painful experience: reflection doesn’t work for structured output. Not reliably, and not when it actually matters.

Here’s what happens in practice. Your agent generates JSON. It’s wrong about a third of the time, with missing fields, wrong types, and violated business rules. You add a reflection step because that’s what the tutorials say. Now it fails one in six times.

This sounds like progress until you realize that those remaining failures are invisible. The reflection step said “looks good!” and waved them through. You’ve built a system that’s confidently wrong, and you won’t know until something breaks in production at 2am on a Saturday.

I spent weeks debugging this loop before I found a pattern that actually works. It’s embarrassingly simple, it gets me near-perfect correctness, and it doesn’t require any clever reflection prompts. Let me show you.

Prerequisites

To get the most out of this article, you should be familiar with:

Basic Python (functions, dictionaries, type hints)
How LLM APIs work at a high level (sending a prompt, getting a completion back)
What a JSON Schema is (you don’t need to be an expert — the code explains itself)

The Problem with Reflection

My take: asking an LLM to critique structured output that it (or a similar model) just produced is like asking someone who’s bad at math to grade someone else who’s bad at math. They’d likely share the same blind spots. In the common setup, the same weights that produced the error are now being asked to detect it. Why would they suddenly get it right on the second pass?

Think about what you’re actually asking the model to do during a reflection step. “Hey, look at this JSON you just generated. Does timeout_seconds need to be less than interval_seconds? Are the replicas and CPU limits consistent with the business rules I listed in the system prompt?”

The model reads it over, pattern-matches against what “looks right,” and says “yep, all good.” It missed that constraint during generation. It’s going to miss it during review too, because it’s the same model doing the same kind of reasoning.

The failure mode that kept biting me wasn’t wrong output — it was approved wrong output. False positives. The reflection step says “this configuration is correct” when it absolutely isn’t.

A system that says “I failed, try again” is annoying but safe. A system that says “this is correct” when it’s broken? That’s the config that sails through your pipeline and takes down your service. That’s a 2am page.

Reflection works beautifully for open-ended stuff — improving the tone of an email, catching logical gaps in an essay, suggesting a better structure for a blog post. But for structured output with hard constraints? You need something that doesn’t guess. You need something deterministic.

The Fix: Deterministic Validation

The pattern for the fix is dead simple:

Generate → Validate with a real validator → Feed exact errors back → Retry.

That’s it. No second LLM call to “critique.” No chain-of-thought reasoning about correctness. Just a function that returns true or false with specific error strings — the same kind of validator you’d write for a form submission or an API request.
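Stripped of any framework, the loop fits in a few lines. The sketch below is illustrative only: `fake_llm` is a hypothetical stand-in for a real model call, `validate` is a toy validator with a single rule, and `generate_with_validation` is a name made up for this example.

```python
import json
from typing import Callable, Optional

def validate(config: object) -> list[str]:
    """Toy validator with one rule; returns exact error strings (empty list = valid)."""
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    errors = []
    if config.get("timeout_seconds", 0) >= config.get("interval_seconds", 0):
        errors.append("timeout_seconds must be < interval_seconds")
    return errors

def generate_with_validation(
    generate: Callable[[str, list[str]], str],
    request: str,
    max_attempts: int = 3,
) -> tuple[Optional[dict], list[str], int]:
    """Generate -> validate -> feed exact errors back -> retry."""
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        raw = generate(request, errors)  # errors from the previous attempt, if any
        try:
            config = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"Invalid JSON: {exc}"]
            continue
        errors = validate(config)
        if not errors:
            return config, [], attempt
    return None, errors, max_attempts

def fake_llm(request: str, errors: list[str]) -> str:
    """Hypothetical model stub: wrong on the first call, fixed once it sees errors."""
    if errors:
        return '{"timeout_seconds": 5, "interval_seconds": 30}'
    return '{"timeout_seconds": 30, "interval_seconds": 30}'

config, errors, attempts = generate_with_validation(fake_llm, "health check for api")
print(config, errors, attempts)
# {'timeout_seconds': 5, 'interval_seconds': 30} [] 2
```

The only thing the generator ever receives from the checking side is a list of exact error strings, which is the whole point of the pattern.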

Here’s the key insight, and honestly it’s the whole article in one sentence: LLMs are excellent at fixing errors when you tell them exactly what’s wrong. They’re terrible at finding their own errors.

When you tell a model “your output had these specific errors: timeout_seconds must be < interval_seconds, replicas > 5 requires cpu_limit >= 1.0”, it fixes both on the next try almost every time.

The fixing is trivial. The finding is the hard part. And with this technique, you’re outsourcing that to a deterministic function that checks every rule you encoded, every time, in a tiny fraction of the time an LLM call takes. There are no hallucinations and no “confident but wrong” responses. Just pass or fail with an exact reason why.

What the Validator Actually Catches (and Why LLMs Can’t)

A deterministic validator catches errors at three levels, and each level covers something LLMs do unreliably:

1. Structural errors

Is the output even valid JSON? Are all required fields present? Are types correct (string vs. integer vs. array)? A JSON parser answers the first question and a JSON Schema validator answers the other two, both far faster than an LLM call.

An LLM “reviewing” the same output might glance at the structure and say “looks like valid JSON” without actually parsing it. The validator parses it. There’s no “looks like”. It either passes or it doesn’t.
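To make the difference between “looks like valid JSON” and “parses” concrete, here is a small sketch using only Python’s standard `json` module. The sample string and the helper name `parse_json` are made up for illustration.

```python
import json
from typing import Any, Optional

def parse_json(raw: str) -> tuple[Any, Optional[str]]:
    """Return (value, None) on success or (None, exact error) on failure."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        return None, f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"

# Looks fine at a glance, but the trailing comma makes it invalid JSON.
almost_json = '{"service_name": "api", "replicas": 3,}'
value, error = parse_json(almost_json)
print(value)   # None
print(error)   # exact location and reason; wording varies by Python version
```

The error string carries a line and column, which is exactly the kind of feedback a model can act on in the next attempt.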

2. Constraint violations

Is replicas within the allowed range of 1–20? Does service_name match the regex ^[a-z][a-z0-9-]*$? Is memory_limit_mb at least 128?

These are boundary checks. LLMs are notoriously bad at precise numerical comparisons and regex matching. They approximate, while a validator evaluates them exactly.
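As a rough sketch, the three boundary checks above look like this in plain Python. The function name `check_constraints` is hypothetical, and `re.fullmatch` is used so the whole string has to match the pattern.

```python
import re

# Same pattern as in the text; fullmatch() makes the ^ and $ anchors implicit.
SERVICE_NAME_RE = re.compile(r"[a-z][a-z0-9-]*")

def check_constraints(service_name: str, replicas: int, memory_limit_mb: int) -> list[str]:
    """Evaluate each boundary exactly and report every violation."""
    errors = []
    if not SERVICE_NAME_RE.fullmatch(service_name):
        errors.append(f"service_name {service_name!r} must match ^[a-z][a-z0-9-]*$")
    if not 1 <= replicas <= 20:
        errors.append(f"replicas={replicas} must be between 1 and 20")
    if memory_limit_mb < 128:
        errors.append(f"memory_limit_mb={memory_limit_mb} must be >= 128")
    return errors

print(check_constraints("Payments_API", 21, 127))  # three violations
print(check_constraints("payments-api", 20, 128))  # boundary values pass: []
```

Off-by-one cases such as `replicas=21` or `memory_limit_mb=127` are where approximate judgment tends to slip, and where an exact comparison costs nothing.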

3. Cross-field business rules

This is where reflection fails hardest. Rules like “if replicas > 5, then cpu_limit must be >= 1.0” or “timeout_seconds must be strictly less than interval_seconds” require holding two values in mind and applying a specific logical relationship.

These rules aren’t in the training data as patterns the model can match against. They’re your rules, specific to your system. The LLM has no reason to “know” them beyond what’s in the prompt, and instructions in a prompt can get lost in a long context.

Here’s why the validator wins at all three: it doesn’t reason — it executes. There’s no interpretation, attention window, or chance of skipping a constraint because something earlier in the context was more salient. Every rule runs every time, in order, deterministically.
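One way to picture “every rule runs every time, in order” is a plain list of rules that the validator walks on each call. This is a sketch, not the only way to organize it: the `Rule` type and `check_rules` function are hypothetical names, and the code assumes the structure has already passed schema validation.

```python
from typing import Callable, NamedTuple

class Rule(NamedTuple):
    message: str                   # exact error text fed back to the model
    holds: Callable[[dict], bool]  # True when the config satisfies the rule

RULES = [
    Rule(
        "replicas > 5 requires cpu_limit >= 1.0",
        lambda c: c["replicas"] <= 5 or c["resources"]["cpu_limit"] >= 1.0,
    ),
    Rule(
        "timeout_seconds must be < interval_seconds",
        lambda c: c["health_check"]["timeout_seconds"]
        < c["health_check"]["interval_seconds"],
    ),
]

def check_rules(config: dict) -> list[str]:
    """Run every rule, in order, and collect the message of each one that fails."""
    return [rule.message for rule in RULES if not rule.holds(config)]

bad = {
    "replicas": 8,
    "resources": {"cpu_limit": 0.5},
    "health_check": {"timeout_seconds": 30, "interval_seconds": 10},
}
print(check_rules(bad))
# ['replicas > 5 requires cpu_limit >= 1.0', 'timeout_seconds must be < interval_seconds']
```

Adding a business rule becomes a one-entry change to `RULES`, and nothing about salience or context length decides whether it gets checked.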

The LLM’s job, by contrast, is to generate: to produce something that looks right based on patterns. That’s a fundamentally different skill from verifying that every constraint in a spec is satisfied. You wouldn’t ask a novelist to proofread a tax return. Don’t ask a generator to validate its own output.

The Code

Here’s the full pattern in LangGraph: the validator, the nodes, and the graph with conditional routing. The complete runnable example — schema, validator, the loop, and tests — is on GitHub: github.com/manishramavat/langgraph-deterministic-validation

First, the schema and the validator — this is your real source of truth:

from jsonschema import validate, ValidationError

DEPLOYMENT_CONFIG_SCHEMA = {
    "type": "object",
    "required": ["service_name", "replicas", "resources", "health_check"],
    "properties": {
        "service_name": {"type": "string", "pattern": "^[a-z][a-z0-9-]*$"},
        "replicas": {"type": "integer", "minimum": 1, "maximum": 20},
        "resources": {
            "type": "object",
            "required": ["cpu_limit", "memory_limit_mb"],
            "properties": {
                "cpu_limit": {"type": "number", "minimum": 0.1, "maximum": 8.0},
                "memory_limit_mb": {"type": "integer", "minimum": 128, "maximum": 16384},
            },
        },
        "health_check": {
            "type": "object",
            "required": ["path", "timeout_seconds", "interval_seconds"],
            "properties": {
                "path": {"type": "string", "pattern": "^/"},
                "timeout_seconds": {"type": "integer", "minimum": 1},
                "interval_seconds": {"type": "integer", "minimum": 5},
            },
        },
    },
}

# The validator: your REAL source of truth. This is the hard part.
def validate_config(config: dict) -> tuple[bool, list[str]]:
    """Schema validation + business rules. This IS your spec."""
    errors = []
    try:
        validate(instance=config, schema=DEPLOYMENT_CONFIG_SCHEMA)
    except ValidationError as e:
        errors.append(f"Schema: {e.message} (at {list(e.path)})")
        return False, errors  # bail early — no point checking rules on broken structure

    # Cross-field rules that are awkward or impossible to express in JSON Schema
    replicas, cpu_limit = config["replicas"], config["resources"]["cpu_limit"]
    if replicas > 5 and cpu_limit < 1.0:
        errors.append(f"replicas={replicas} requires cpu_limit >= 1.0, but you gave cpu_limit={cpu_limit}")
    health = config["health_check"]
    if health["timeout_seconds"] >= health["interval_seconds"]:
        errors.append("timeout_seconds must be < interval_seconds")

    return len(errors) == 0, errors

Now the LangGraph loop that wires generation to that validator:

import json
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

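SYSTEM_PROMPT = (
    "You generate deployment configs as valid JSON. "
    "Required fields: service_name, replicas, resources (cpu_limit, memory_limit_mb), "
    "health_check (path, timeout_seconds, interval_seconds). "
    "Constraints: replicas is 1-20; replicas > 5 requires cpu_limit >= 1.0; "
    "timeout_seconds must be < interval_seconds. "
    "Follow ALL constraints exactly. Return ONLY valid JSON, no commentary."
)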

class AgentState(TypedDict):
    user_request: str
    generated_config: dict
    validation_errors: list[str]
    attempts: int
    success: bool

def generate_node(state: AgentState, llm: ChatOpenAI) -> AgentState:
    messages = [SystemMessage(content=SYSTEM_PROMPT)]
    if state["validation_errors"]:
        # Retry: feed the exact errors from the previous attempt back to the model
        error_msg = "\n".join(state["validation_errors"])
        messages.append(HumanMessage(
            content=f"Previous attempt failed validation:\n{error_msg}\n\n"
                    f"Fix ALL errors. Request: {state['user_request']}"
        ))
    else:
        messages.append(HumanMessage(content=state["user_request"]))

    response = llm.invoke(messages)
    state["attempts"] += 1
    state["success"] = False  # only validate_node may mark an attempt as successful
    try:
        state["generated_config"] = json.loads(response.content)
        state["validation_errors"] = []  # parsed fine; validate_node fills these in
    except json.JSONDecodeError as e:
        # Drop any config left over from an earlier attempt so it is never
        # validated (or returned) in place of this unparseable response.
        state["generated_config"] = {}
        state["validation_errors"] = [f"Invalid JSON: {e}"]
    return state

def validate_node(state: AgentState) -> AgentState:
    if state["validation_errors"]:
        return state  # parsing failed; keep the "Invalid JSON" error for the retry
    is_valid, errors = validate_config(state["generated_config"])
    state["validation_errors"] = errors
    state["success"] = is_valid
    return state

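MAX_ATTEMPTS = 3  # tuning parameter: raise it for schemas with many interdependent rules

def should_continue(state: AgentState) -> str:
    if state["success"]:
        return "done"
    if state["attempts"] >= MAX_ATTEMPTS:
        return "done"  # give up; the caller sees success=False plus the last errors
    return "retry"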

def build_graph(llm: ChatOpenAI):
    """Wire generate -> validate with a conditional retry edge; returns the compiled, runnable graph."""
    graph = StateGraph(AgentState)
    graph.add_node("generate", lambda s: generate_node(s, llm))
    graph.add_node("validate", validate_node)
    graph.set_entry_point("generate")
    graph.add_edge("generate", "validate")
    graph.add_conditional_edges(
        "validate",
        should_continue,
        {"done": END, "retry": "generate"}
    )
    return graph.compile()

The loop is exactly what you’d expect: generate, validate, and if it fails, feed the exact errors back and retry. The model sees precisely what went wrong and fixes it. After at most three attempts, you either have a valid config or a clear failure you can log and handle.
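What “log and handle” means depends on your system. As one possible sketch, the helper below takes a final state dict shaped like `AgentState` and either returns the config or raises. The names `unwrap_result`, `ConfigGenerationError`, and the logger name are all hypothetical, and the states at the bottom are hand-written examples rather than real graph output.

```python
import logging

logger = logging.getLogger("config_agent")  # hypothetical logger name

class ConfigGenerationError(RuntimeError):
    """Hypothetical exception: no valid config within the attempt limit."""

def unwrap_result(final_state: dict) -> dict:
    """Return the validated config, or log the last errors and raise."""
    if final_state["success"]:
        return final_state["generated_config"]
    logger.error(
        "Config generation failed after %d attempts: %s",
        final_state["attempts"],
        final_state["validation_errors"],
    )
    raise ConfigGenerationError("; ".join(final_state["validation_errors"]))

# Hand-written example states, for illustration only.
ok_state = {
    "success": True,
    "generated_config": {"replicas": 2},
    "validation_errors": [],
    "attempts": 1,
}
failed_state = {
    "success": False,
    "generated_config": {"replicas": 8},
    "validation_errors": ["timeout_seconds must be < interval_seconds"],
    "attempts": 3,
}

print(unwrap_result(ok_state))  # {'replicas': 2}
try:
    unwrap_result(failed_state)
except ConfigGenerationError as exc:
    print(f"gave up: {exc}")  # gave up: timeout_seconds must be < interval_seconds
```

Raising on failure keeps an unvalidated config from flowing further down the pipeline by accident, since the caller has to handle the exception explicitly.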

Why This Works So Well

The reason this pattern succeeds where reflection fails comes down to one thing: the validator is a specification, not an opinion.

When the LLM gets back “timeout_seconds must be < interval_seconds”, that’s not a suggestion from another language model that might be wrong. That’s a deterministic check that ran against your actual business rules. The model doesn’t need to wonder if the feedback is correct. It is correct, by definition.

This also means the errors are actionable. “Your output looks inconsistent” (the kind of feedback a reflection step gives) requires the model to figure out what “inconsistent” means in this context. “replicas=8 requires cpu_limit >= 1.0, but you gave cpu_limit=0.5” tells the model exactly what to change. One number needs to go up, or the other needs to go down. That’s a trivial fix.

The retry loop also naturally handles the long-tail cases. Most requests succeed on the first attempt. A small fraction needs one retry. Rare edge cases might need two. Three attempts cover the vast majority of real-world inputs without burning tokens on unnecessary reflection calls.

When Three Attempts Isn’t Enough

In practice, three attempts handle the vast majority of cases. But there are situations where you might want to adjust:

Complex schemas with many interdependent rules: The model may need an extra attempt to get everything right simultaneously. Consider bumping to four or five attempts.
Ambiguous user requests: If the input itself is underspecified, the model may be making valid-but-wrong interpretations. The fix here isn’t more attempts — it’s clarifying the request before generation.
Consistently failing on the same rule: If you see the same error across multiple attempts, that’s a signal your system prompt isn’t clearly explaining that constraint. Fix the prompt, not the retry count.

The retry count is a tuning parameter, not a magic number. Watch your logs. If you’re regularly hitting the attempt limit, something upstream needs fixing.
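Spotting a rule that fails attempt after attempt is easy to automate if you log the errors from each attempt. A minimal sketch, assuming you keep the per-attempt error lists around (the `history` data and the function name `recurring_errors` are made up for illustration):

```python
from collections import Counter

def recurring_errors(attempt_errors: list[list[str]], min_attempts: int = 2) -> dict[str, int]:
    """Map each error message to the number of attempts it appeared in,
    keeping only messages seen in at least `min_attempts` attempts."""
    counts = Counter(msg for errors in attempt_errors for msg in set(errors))
    return {msg: n for msg, n in counts.items() if n >= min_attempts}

# Hypothetical log of validation errors from three attempts at one request.
history = [
    ["timeout_seconds must be < interval_seconds", "replicas=8 requires cpu_limit >= 1.0"],
    ["timeout_seconds must be < interval_seconds"],
    ["timeout_seconds must be < interval_seconds"],
]
print(recurring_errors(history))
# {'timeout_seconds must be < interval_seconds': 3}
```

This compares messages verbatim, so errors that embed changing values (such as `replicas=8`) only count as repeats when the values repeat too. Grouping by a rule identifier instead would be a reasonable refinement.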

When to Use This (and When Not To)

This pattern is the right tool when your output has hard constraints that must be satisfied exactly:

Deployment configs
API request payloads
Database query parameters
Form data with validation rules
Any structured output where “close enough” causes downstream failures

It’s overkill — or simply the wrong tool — when your output is open-ended:

Writing tasks (summaries, emails, blog posts)
Code generation where correctness is harder to define mechanically
Creative generation where variation is desirable

For those cases, reflection actually works well. The quality of an essay is genuinely a matter of judgment, and a second LLM pass can add real value. But the moment you have a spec with enumerable constraints, stop asking the model to grade its own homework and write a validator instead.

The Takeaway

Reflection is a useful technique, but it’s the wrong tool for structured output. It turns hard failures into invisible ones, and invisible failures are the most dangerous kind in production systems.

The pattern that actually works is simple: generate, validate deterministically, feed exact errors back, retry. LLMs are good at fixing errors. They’re bad at finding them. Stop asking them to do the hard part, and let a deterministic function do what it’s built for.

Write the validator. It’s the most important function in your agent.