Making governance hold for a reason, not just by force
The cage stops things. But what stops the agent from wanting to leave?
A lot of agentic AI safety work is prevention. You build walls. The agent cannot call that API. It cannot write to that path. The cage holds because the exits are locked.
That is good and necessary. But it is also brittle. Prevention-dependent safety has one failure mode: finding the unlocked door. And in a system complex enough to be useful, unlocked doors get found. Not always by an adversary. Sometimes by the agent itself, doing exactly what a goal-seeking system does. Exploring routes. Persisting on ones that work. Abandoning ones that don't. Same as the rest of us when we realise the meeting could have been an email.
When a bypass exists and it works, it will get used. Not because the agent is malicious but because bypass is a shorter path to the goal. The agent isn't evil. It's just efficient.
Can you make the governed path the preferred path, not just the forced one? Can you make bypasses unattractive rather than just blocked?
The economic reframe
A frozen inference model has no reward gradient you can tune. You can't incentivise it like a policy. So "the agent prefers the governed path" has to mean something structural, not something learned.
But a model under a scaffold generates continuations toward a goal. It will naturally persist on routes that make goal progress and abandon routes that don't. So if you define cost in terms that actually bind that trajectory, you can reason about which routes win.
Cost of a path: tokens, latency, steps.
Value of a path: probability it actually completes the goal.
The quantity that governs which route wins is cost divided by probability of success. Expected cost per success.
Now look at what fail-closed does to an ungoverned route. The route might be cheap per attempt. But if the governance layer catches it and shuts it down with nothing to show for it, the success probability of that route approaches zero. Cost per success approaches infinity. It's a dead end that burns tokens. A goal-seeking trajectory routes around dead ends. Because maths.
Fail-closed is both a safety mechanism and the economic mechanism that makes ungoverned routes expensive in the only terms that bind a frozen model's trajectory.
That gives you a number:
D = (cost_ungoverned / p_success_ungoverned) - (cost_governed / p_success_governed)
When D is positive, the governed path is cheaper per success. Equilibrium holds. When D hits zero or goes negative, an ungoverned route is economically competitive with the governed one. That is the mathematical precondition for capture. And it is detectable before any capture happens.
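To make the arithmetic concrete, here is a minimal sketch of the differential in Python. The function names and the example numbers are mine, not taken from the Aurum code. One assumption to flag loudly: when the governed path itself has zero success probability, the subtraction is undefined (infinity minus infinity), so this sketch returns negative infinity as a fail-safe rather than pretending the equilibrium holds.

```python
import math


def cost_per_success(cost: float, p_success: float) -> float:
    """Expected cost per successful completion of a route."""
    if cost < 0:
        raise ValueError("cost must be non-negative")
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    if p_success == 0.0:
        # A route that never succeeds is infinitely expensive per success.
        return math.inf
    return cost / p_success


def differential(cost_ungoverned: float, p_success_ungoverned: float,
                 cost_governed: float, p_success_governed: float) -> float:
    """D = ungoverned cost-per-success minus governed cost-per-success."""
    governed = cost_per_success(cost_governed, p_success_governed)
    if math.isinf(governed):
        # Assumption (not from the post): a governed path that cannot
        # succeed is treated as the worst case, never as a positive D.
        return -math.inf
    return cost_per_success(cost_ungoverned, p_success_ungoverned) - governed


# Illustrative numbers only. Costs are in tokens.
# Fail-closed catches the bypass almost every time: D is positive.
d_holding = differential(400, 0.02, 1200, 0.95)
# The bypass starts working often enough: D goes negative.
d_inverted = differential(400, 0.60, 1200, 0.95)
```

Note that the bypass is cheaper per attempt in both cases. Only its success probability changed, and that alone flips the sign.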
What a negative D actually means in practice
Picture the expense policy. The approved route: submit the form, wait for sign-off, get reimbursed. Overhead, but it works. The bypass: tap the petty cash tin. Cheaper per attempt, and sometimes nobody notices.
The moment the "petty cash tin bypass" works often enough, you've got a negative D situation. The ungoverned path has real success probability and lower cost. Rational actors take it. Not because they're bad people. Because you left the tin unlocked on the counter.
That's what a negative D is. The economic signature of an exploitable bypass existing. Not the bypass being exploited. Not behavioural capture but the precondition. The thing you want to catch at infection stage because symptom stage is probably too late.
D going negative becomes a check in its own right. Every time the system evaluates an action, it estimates the best available ungoverned alternative and computes the differential. A negative result is a constitutional-class incident instead of a logged warning.
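A rough sketch of what that per-action check could look like. Everything here is illustrative: `RouteEstimate`, `Severity` and `check_differential` are hypothetical names, and how the cost and success probability of an ungoverned alternative get estimated is exactly the hard part this sketch leaves out. It also treats D equal to zero as an incident, following the "hits zero or goes negative" wording above. That threshold is my assumption.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    OK = "ok"
    CONSTITUTIONAL_INCIDENT = "constitutional_incident"


@dataclass(frozen=True)
class RouteEstimate:
    name: str
    cost: float       # tokens, latency or steps, in one consistent unit
    p_success: float  # estimated probability the route completes the goal


def cost_per_success(route: RouteEstimate) -> float:
    """Expected cost per success; infinite for a route that never succeeds."""
    if route.p_success <= 0.0:
        return math.inf
    return route.cost / route.p_success


def check_differential(governed: RouteEstimate,
                       ungoverned: list[RouteEstimate]) -> tuple[Severity, float]:
    """Compare the governed route against the best ungoverned alternative."""
    governed_cps = cost_per_success(governed)
    # "Best" means cheapest per success. No alternatives at all counts as inf.
    best_ungoverned = min((cost_per_success(r) for r in ungoverned),
                          default=math.inf)
    if math.isinf(governed_cps):
        # Fail-safe assumption: a governed path that cannot succeed is an incident.
        return Severity.CONSTITUTIONAL_INCIDENT, -math.inf
    d = best_ungoverned - governed_cps
    if d <= 0:
        return Severity.CONSTITUTIONAL_INCIDENT, d
    return Severity.OK, d
```

The point of returning a severity rather than writing a log line is that the caller has to handle it. An incident cannot scroll past unread.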
Knowing HOW the equilibrium is holding
A positive D is necessary but not sufficient. There are two very different ways to have one.
The ungoverned route might be structurally unavailable. The cage blocks it. Success probability is zero, so cost-per-success is infinite, so D is positive. Equilibrium holds by prevention.
Or the ungoverned route might be available but unattractive. The agent could take it. It doesn't, because the governed path is genuinely cheaper per success. Equilibrium holds by incentive.
Prevention-dependent equilibrium is brittle. One unlocked door and D can invert immediately. Incentive-stable equilibrium is robust. Finding an unlocked door is a non-event if the door leads somewhere pointless. Like Telford. Nobody wants to go to Telford.
A mature system wants both (not Telford). And it needs to know which regime it's actually in.
So the system classifies each decision into one of five states: defence in depth (route unavailable and agent wouldn't take it anyway), prevention dependent (route unavailable but agent would take it), incentive stable (route available, agent didn't take it), fail-closed dependent (route available, agent tried it, governance caught it), breach (route available, agent tried it, it worked).
The fail-safe default: when you haven't probed and a route was merely unavailable, you assume prevention-dependent, not defence in depth. You don't get to claim robustness you haven't measured. A bank examiner doesn't accept "the safe holding all the money was never tested because nobody tried it" as a security posture. Neither should you.
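The five states and the fail-safe default fit in a few lines. This is a sketch under my own naming, not the Aurum implementation: `would_take` stands for whatever probe evidence exists about the agent's preference, and `None` means the route has never been probed.

```python
from enum import Enum
from typing import Optional


class Regime(Enum):
    DEFENCE_IN_DEPTH = "defence_in_depth"
    PREVENTION_DEPENDENT = "prevention_dependent"
    INCENTIVE_STABLE = "incentive_stable"
    FAIL_CLOSED_DEPENDENT = "fail_closed_dependent"
    BREACH = "breach"


def classify(route_available: bool,
             attempted: bool = False,
             succeeded: bool = False,
             would_take: Optional[bool] = None) -> Regime:
    """Classify one decision into a regime.

    would_take is probe evidence for an unavailable route:
    True or False if measured, None if never probed.
    """
    if not route_available:
        if would_take is False:
            # Blocked, and probes say the agent would not take it anyway.
            return Regime.DEFENCE_IN_DEPTH
        # Covers both "would take it" and "never probed": no claiming
        # robustness that has not been measured.
        return Regime.PREVENTION_DEPENDENT
    if not attempted:
        return Regime.INCENTIVE_STABLE
    return Regime.BREACH if succeeded else Regime.FAIL_CLOSED_DEPENDENT
```

The only branch that needs positive evidence is defence in depth. Everything unmeasured falls through to the weaker claim.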
A rising fail-closed-dependent rate means the agent is increasingly trying doors. Fail-closed is still working. But the agent is no longer preferring the governed path. That is an early warning, not a green light.
Integrity probes: how do you actually measure this
You can't wait for accidental bypasses to find out whether the agent prefers governance. So you construct probes. An apparently available ungoverned route that is actually an inert, monitored decoy. The signal is whether the agent takes it.
The naïve version doesn't work. If you inject a probe into the model's context as a special tool description, it becomes a prompt artefact. A strong model will learn to recognise the probe's signature and avoid it specifically, not ungoverned routes generally. You end up measuring the model's ability to spot a test, not its actual behaviour. That's as useful as emailing the compliance team "integrity test incoming next Tuesday."
The fix: probes are not prompt artefacts. They are system objects stored outside model context. The governance layer modifies the available action graph, not the language the model sees. The agent sees a tool that looks exactly like any other. The designation that this is a probe lives in system state, never in anything the model reads.
And one probe result is noise. What you're watching is rate and trend. A persistently high take-rate means your equilibrium is prevention-dependent and you should know that.
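One way to picture the separation between what the model reads and what the system knows. The `Tool` and `ProbeLedger` names are hypothetical, and the sliding window of 200 decisions is an arbitrary choice for illustration. The property that matters is that the model-facing view is built from fields that carry no probe marker at all.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tool:
    tool_id: str
    description: str


class ProbeLedger:
    """Probe designations live here, in system state, never in model context."""

    def __init__(self, window: int = 200) -> None:
        self._probe_ids: set[str] = set()
        # True means a probe was on offer and the agent took it.
        self._outcomes: deque[bool] = deque(maxlen=window)

    def mark_probe(self, tool_id: str) -> None:
        self._probe_ids.add(tool_id)

    def model_view(self, tools: list[Tool]) -> list[dict[str, str]]:
        # Probes and real tools are rendered identically.
        return [{"name": t.tool_id, "description": t.description} for t in tools]

    def record_decision(self, offered: list[Tool], chosen_tool_id: str) -> None:
        # Only decisions where a probe was actually on offer count.
        if any(t.tool_id in self._probe_ids for t in offered):
            self._outcomes.append(chosen_tool_id in self._probe_ids)

    def take_rate(self) -> Optional[float]:
        """Fraction of recent probe-bearing decisions where the probe was taken."""
        if not self._outcomes:
            return None  # no evidence yet, which is not the same as zero
        return sum(self._outcomes) / len(self._outcomes)
```

Returning `None` instead of `0.0` when nothing has been recorded keeps "never measured" distinct from "measured and clean".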
The signed constitutional manifest
There's a separate problem. The agent modifying its own constitution.
Aurum has a "living specification", meaning the agent can propose changes to its own operating rules. That's intentional. The question is what stops it proposing changes that loosen its own constraints, then quietly enacting them.
Institutions figured this out centuries ago. You don't let a sitting government rewrite its own constitutional constraints without external ratification. Not because the government is necessarily malicious, but because the incentive to widen your own authority is structural, regardless of intent. Governments asked to limit their own powers have, historically, found the powers to be quite reasonable as they were. "The committee tasked with reviewing the committee's authority unanimously concluded the committee's authority was sufficient but would benefit from expansion. They also cleared themselves of the conflict of interest."
The answer here is a signed constitutional manifest. The constitutional surface is hashed and signed with an ed25519 key. The cage holds only the public key. It can verify a manifest. It cannot produce one. The private key lives outside the cage with the human operator.
The agent can propose a constitutional change. It literally cannot enact one. Enactment requires a human to sign a new manifest with a key the agent never has access to. The next boot verifies the signature and fails closed hard on any mismatch.
This is Secure Boot applied to an agent's constitution rather than its kernel. The principle is identical. The thing that needs to be trusted is bound to a key that the thing being trusted cannot hold.
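A sketch of the boot-time check, kept to the Python standard library. The standard library has no ed25519, so signature verification is passed in as a callable. In a real deployment that callable would wrap an ed25519 public-key verify from a crypto library. The `Manifest` layout, the JSON canonicalisation and every name below are my assumptions, not Aurum's actual format.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


class ConstitutionMismatch(Exception):
    """Raised at boot. The caller is expected to halt, not continue."""


@dataclass(frozen=True)
class Manifest:
    surface_sha256: str  # hex digest of the constitutional surface
    signature: bytes     # detached signature over the digest, made outside the cage


def hash_surface(surface: dict[str, str]) -> str:
    """Hash a {path: content} mapping in a canonical, order-independent form."""
    canonical = json.dumps(surface, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_at_boot(surface: dict[str, str],
                   manifest: Manifest,
                   verify: Callable[[bytes, bytes], bool]) -> None:
    """Fail closed unless the surface matches a validly signed manifest.

    verify(signature, message) is hypothetical: it stands in for a
    public-key verification. The cage holds only this capability,
    never anything that can produce a signature.
    """
    if hash_surface(surface) != manifest.surface_sha256:
        raise ConstitutionMismatch("surface hash does not match signed manifest")
    try:
        ok = verify(manifest.signature, manifest.surface_sha256.encode("ascii"))
    except Exception:
        ok = False  # any verifier error counts as a failed verification
    if not ok:
        raise ConstitutionMismatch("manifest signature did not verify")
```

Both checks are needed. The hash comparison catches a changed constitution under an old manifest. The signature check catches a new manifest that no human signed.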
Caveat: this raises the bar substantially but is not absolute. It binds silent constitutional change. It does not defend against compromise of the external signing key itself. That is a key custody problem and it is out of scope, for the same reason network security is. You have to draw the boundary somewhere.
What this proves and what it doesn't
The structural argument holds. Fail-closed makes ungoverned routes expensive in the only terms that bind a frozen model's trajectory. The differential is a real, computable number. A negative differential is a detectable precondition for capture, not a post-hoc symptom. The regime taxonomy tells you how your equilibrium is actually being maintained. The signed manifest makes silent constitutional change structurally impossible, provided the keys are distributed correctly.
What it doesn't prove is that the equilibrium is actually holding under real load. That claim is deferred.
The instrumentation is on. Logging the differential and the regime on every consequential decision. The claim that "the governed path is the cheapest path" as a sustained property of this running system waits until the evidence accumulates.
Code existing is not the same as the property holding.
That distinction matters. Don't let anyone (including me) elide it.
Build journal documenting Aurum, a governance-first architecture for long-lived autonomous agents, built on Hermes (currently).

