Most software gets worse over time. Technical debt accumulates. Edge cases multiply. The codebase becomes a fossil record of abandoned intentions.

Agentic systems can be different. If designed correctly, they can get better over time — not through magical self-improvement, but through a disciplined engineering loop that turns every failure into a permanent fix.

We call this compound engineering.

The loop

The compound engineering loop has four stages:

1. Observe

Every agent interaction with reality produces data. Not just “did it work?” — but how it worked. What context did it receive? What tools did it use? What reasoning did it follow? Where did it hesitate?

Most teams log outcomes. Few log the reasoning path. But the reasoning path is where the signal is.

2. Evaluate

Raw observations need to be evaluated against criteria. Not just accuracy — but also latency, cost, user satisfaction, and downstream effects.

The key insight is that evaluation criteria should be defined before the system is built. If you don’t know what good looks like, you can’t recognize it when you see it.
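Defining the criteria up front can mean literally writing them down as code before the agent exists. A sketch of that idea, with illustrative thresholds (the 2-second latency cap, 5-cent cost cap, and rating floor are placeholder numbers, not recommendations):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One evaluation criterion, defined before the system is built."""
    name: str
    passes: Callable[[dict], bool]   # takes one observation record

# Hypothetical criteria covering more than accuracy.
CRITERIA = [
    Criterion("accuracy", lambda obs: obs["correct"]),
    Criterion("latency", lambda obs: obs["latency_ms"] <= 2000),
    Criterion("cost", lambda obs: obs["cost_usd"] <= 0.05),
    Criterion("satisfaction", lambda obs: obs.get("user_rating", 0) >= 4),
]

def evaluate(obs: dict) -> dict[str, bool]:
    """Score one observation against every criterion, not just accuracy."""
    return {c.name: c.passes(obs) for c in CRITERIA}
```

Because each criterion is a named, executable check, "what good looks like" is pinned down before the first interaction is ever scored.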

3. Diagnose

When something goes wrong — and it will — the question isn’t “what happened?” but “why did the system think this was the right thing to do?”

This is where visible reasoning pays off. If you can trace the agent’s decision path, you can identify the root cause: bad context, wrong tool selection, missing guardrail, or genuine model limitation.
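One way to make that triage concrete is a function that walks a failed trace and attributes the failure to one of the four causes named above. The field names and checks here are illustrative assumptions about what the observation stage logged, not a definitive diagnostic:

```python
def diagnose(trace: dict) -> str:
    """Walk a failed interaction's trace and attribute a root cause.

    Assumes the trace holds fields recorded during observation:
    context_received, steps (each with tool / tool_output), and a list
    of guardrails that fired. The checks are deliberately simple.
    """
    if not trace.get("context_received"):
        return "bad context"            # agent never saw what it needed
    for step in trace.get("steps", []):
        if step.get("tool") and "error" in (step.get("tool_output") or ""):
            return "wrong tool selection"
    if not trace.get("guardrails_triggered"):
        return "missing guardrail"      # nothing intervened before failure
    return "genuine model limitation"   # context, tools, guardrails all fine
```

A real diagnosis would of course involve human judgment; the value of the sketch is that each verdict maps directly to a hardening action.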

4. Harden

The diagnosis becomes a permanent improvement. A new guardrail. A refined tool. Better context architecture. An updated evaluation criterion.

The critical property of this step is permanence. The improvement must be structural — embedded in the system’s architecture — not a prompt patch that might drift or be overwritten.
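Structural permanence can be made literal: each fix is registered as a named check that runs on every future request, with the incident that motivated it recorded alongside. A minimal sketch, where the registry, the `harden`/`admit` names, and the example incident label are all hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Guardrail:
    """A structural fix: a named check that runs on every future request."""
    name: str
    origin: str                       # the incident that motivated it
    check: Callable[[str], bool]      # True if the request may proceed

GUARDRAILS: list[Guardrail] = []

def harden(name: str, origin: str, check: Callable[[str], bool]) -> None:
    """Register a guardrail permanently in the system, not in a prompt."""
    GUARDRAILS.append(Guardrail(name, origin, check))

def admit(request: str) -> tuple[bool, list[str]]:
    """Run every registered guardrail; return (ok, names of failures)."""
    failed = [g.name for g in GUARDRAILS if not g.check(request)]
    return (not failed, failed)

# A guardrail added for one incident now screens every request.
harden("no-raw-sql", "hypothetical incident",
       lambda req: "drop table" not in req.lower())
```

Unlike a prompt patch, a check registered this way cannot silently drift or be overwritten by the next prompt edit: removing it requires deleting code.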

Why it compounds

Each cycle through this loop doesn’t just fix one problem. It improves the system’s ability to handle classes of problems. A guardrail added for one edge case often catches ten others. A tool redesigned for one use case often serves three.

Over time, the system develops what we call operational intelligence — not artificial general intelligence, but deep, specific competence in its domain.

The discipline requirement

Compound engineering doesn’t happen automatically. It requires discipline:

  • Structured logging from day one, not retrofitted later
  • Regular review cycles where observations are analyzed, not just archived
  • An engineering culture that treats agent failures as learning opportunities, not embarrassments
  • Investment in evaluation infrastructure that makes it easy to assess quality at scale

The teams that adopt this loop consistently outperform those that don’t — not because their models are better, but because their systems are learning from every encounter with reality.

That’s the real promise of agentic systems: not that they’re smart out of the box, but that they get smarter with every day of operation.