Aurum: putting the authority outside the model

What this is

Aurum is a governance layer that sits between an autonomous agent and the things it can actually do. The model proposes. Aurum decides. And the deciding doesn't live in the model at all, which is the whole point.

Most of the effort to make autonomous AI safe goes into the model. Train it, align it, prompt it, evaluate it, watch it, and hope it behaves once you hand it more rope. That works, up to a point. But it ties the safety of the whole system to the behaviour of one component, so every time the model changes (new version, new provider, a fine-tune someone pushed on a Friday) you're re-establishing your safety assumptions from scratch. The model moved, so the floor moved with it.

There's a different line of work that puts the controls outside the model instead. Sandboxes. Tool permissioning. Runtime monitors. Control evaluations that just assume the model might be working against you. Capability schemes that stop untrusted data from reaching privileged actions. Aurum is in that family. It starts from the same place they do: the model is a source of suggestions, not a source of authority.

Where it goes somewhere new is in what that outside layer actually is. In most of these systems the outside layer is a gate. It checks each action against a fixed policy, says yes or no, done. The policy is static. What the agent could do yesterday is what it can do today. Aurum treats authority as a live quantity instead. Something the agent earns and loses based on what actually happened when it acted, the way any institution extends or pulls back someone's standing over time. The question isn't "does the policy permit this." It's "has this thing earned enough standing, through outcomes that were independently confirmed, to do something this consequential, and what does the result do to that standing next."

That shift gives you four things, and the rest of the paper is about them.

Decisions get written down before the action runs, not after. Authority drops fast and climbs slow, so when things get uncertain the agent contracts instead of creeping wider. The governance layer doesn't depend on the model, so it survives a model swap with the agent's earned standing intact. And there's a hard line between things that happened and things the agent seemed to intend, where only the first kind can ever cost it authority.

I'll go through each. Section 7 is where I tell you which of these the code actually does today versus which are wired up and waiting, including one containment gap the current build knowingly has. I'd rather put that up front than have a reviewer find it.

Where this sits relative to other work

Aurum isn't a replacement for model-centric safety. It's a different layer, and it leans on a few existing ideas. Worth being precise about which.

There's work on oversight that stays safe even if the model is actively trying to get around it, by constraining what the model's output can cause rather than trusting the output. Aurum agrees with that stance. What it adds is that the constraint isn't fixed. How much a given suggestion is allowed to cause depends on a running authority state that the agent's own past outcomes have shaped.

There's the capability-based work on prompt injection, the stuff that tracks capabilities on values through execution so untrusted data structurally can't trigger a privileged action. Aurum's observed-versus-inferred line (section 6) is poking at a related problem from a different angle, and I want to be dead honest here: Aurum does not give you value-level control-flow integrity. It enforces provenance at the seams it owns, the points where content comes in and where tools get called. A suggestion that reaches the model through a path Aurum didn't instrument carries default trust. So it's a policy-enforcement and audit layer over labelled provenance, not an information-flow system at the interpreter level. Saying otherwise would be a lie and someone would catch it.

And there's the monitoring and guardrail work, systems that watch an agent and block bad behaviour. The difference is that in Aurum a bad outcome doesn't just block the action. It changes the agent's standing, durably, by a measured amount, written to a ledger, and projected back into every decision after it. And getting that standing back is made deliberately slow and deliberately human.

The closest thing to Aurum I'm aware of, and I want to deal with it head on because pretending it isn't there would be daft, is Microsoft's Agent Governance Toolkit (AGT), released open source in April 2026. AGT and Aurum agree on a lot, and AGT got there first with far more behind it. Both intercept every action in deterministic code before it reaches the wire. Both fail closed. Both sign with ed25519. Both reject the binary trusted/untrusted model in favour of trust as a dynamic runtime quantity, AGT with a 0-to-1000 score across five behavioural tiers and trust decay, Aurum with banded authority and kinetics. Both have human approval gates, AGT with quorum logic, Aurum with the signed verdict. If you've read AGT, a lot of Aurum will look familiar, and I'd rather say that plainly than have someone discover it and assume I didn't know.

So I'm not claiming I invented external runtime governance or dynamic trust. AGT is proof the category is real and that the industry is converging on it. What I think Aurum does differently comes down to three things, and they're narrower than "I built agent governance" but they're real.

First, the asymmetry is a rule, not just a behaviour. AGT's trust decays when behaviour is bad and climbs when it's good: it's a score that moves both ways on observed behaviour. Aurum grounds the two directions differently and structurally. Proxy signals can lower authority, because contracting on a weak signal is the safe direction. But nothing the agent's own machinery produces can ever raise it. Authority only climbs on a verdict signed by a key the agent doesn't hold. The forbidden-feedback rule (proxy success never promotes) is enforced in the outcome path, not an emergent property of a scoring function. The system can shrink itself on its own read of events and is structurally incapable of growing itself. That's a stronger and more specific claim than trust decay.

Second, the wall between observed and inferred. AGT handles goal hijacking with a semantic intent classifier: it reads intent and acts on it. Aurum makes the opposite choice on purpose. No aggregation of epistemic signals (contradiction, concealment, probing), however much of it piles up, can ever synthesise a governance breach. Floors come only from observed world events. Intent is recorded as evidence and learned from, never acted on as if it were a fact. I think that's the more defensible stance for anything that has to stay auditable over years, and it's enforced structurally, not by policy.

Third, model-independence as the actual goal. AGT is framework-agnostic: it hooks into LangChain, CrewAI, and the Microsoft Agent Framework, and its deployment story is Azure end to end. Aurum is portable on a different axis, across models and substrates rather than across agent frameworks. The governance is a projection of a ledger and survives the model being swapped out underneath it, and the way you prove that is by running it on open weights you actually control, which is why it runs on Hermes.

None of the three pieces is going to shock anyone alone. Bolted together, on open weights, with the asymmetry and the observed/inferred wall enforced by construction rather than convention, that's the combination I haven't seen elsewhere. And if AGT or anything else closes that gap, fine. The thesis was always that this layer belongs outside the model, and the more people building it the better that bet looks.

How it's put together

Aurum lives between the agent's reasoning loop and its effects on the world. The model produces a proposed action. The action goes to a governance kernel. The kernel decides. Only on a yes, and only after that yes is already on disk, does the action run. Then the outcome gets read and fed back into the agent's authority.

The kernel runs a fixed line of checks, in a deliberate order, fail-closed all the way through. A proposed action hits an injection-boundary check first (untrusted content can't be the thing that justifies a consequential action). Then a within-turn data-flow check that catches patterns like read-a-secret-then-send-it-somewhere. Then a human-gate check for anything that needs a person to sign off. Then a check that the action's blast radius doesn't exceed the agent's current authority ceiling. Then a scope check against earned tiers. Then a check that it isn't pointing a secret-bearing capability at somewhere that secret was never meant to go. Survive all of that and you're recorded as an allowed decision and dispatched. Fail any of it and you're denied, and the denial is recorded too.
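To make the shape of that pipeline concrete, here's a minimal sketch in Python. It isn't Aurum's code. The `Action` fields, the check names, and the integer blast-radius scale are all assumptions invented for the example. What it's meant to show is just the structure: a fixed order, the first failing check denies, and a check that crashes counts as a failure rather than a pass.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    # Every field is a hypothetical stand-in for what a real kernel would inspect.
    tool: str
    blast_radius: int = 0                 # assumed scale: higher is more consequential
    justified_by_untrusted: bool = False  # untrusted content is the justification
    exfil_pattern: bool = False           # e.g. read-a-secret-then-send in one turn
    needs_human: bool = False
    human_approved: bool = False
    in_scope: bool = True                 # inside the tiers the agent has earned
    secret_dest_allowed: bool = True      # destination is on the secret's allowlist


@dataclass(frozen=True)
class Decision:
    allowed: bool
    failed_check: Optional[str] = None


Check = Callable[[Action, int], bool]  # (action, authority_ceiling) -> passed?

# Order matters: it mirrors the order described in the text.
PIPELINE: List[Tuple[str, Check]] = [
    ("injection_boundary", lambda a, c: not a.justified_by_untrusted),
    ("within_turn_data_flow", lambda a, c: not a.exfil_pattern),
    ("human_gate", lambda a, c: (not a.needs_human) or a.human_approved),
    ("authority_ceiling", lambda a, c: a.blast_radius <= c),
    ("scope", lambda a, c: a.in_scope),
    ("secret_destination", lambda a, c: a.secret_dest_allowed),
]


def decide(action: Action, ceiling: int,
           pipeline: List[Tuple[str, Check]] = PIPELINE) -> Decision:
    for name, check in pipeline:
        try:
            passed = check(action, ceiling) is True
        except Exception:
            passed = False  # fail closed: a broken check is a denial
        if not passed:
            return Decision(False, name)
    return Decision(True)
```

For instance, `decide(Action("net.send", blast_radius=3), ceiling=2)` is denied at `authority_ceiling`, and the denial carries the name of the check that stopped it, which is the bit you'd write to the ledger.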

Two rules shape how the pipeline fails. First, write-then-act. The yes goes to the durable ledger before the side effect, and if that write fails the action doesn't happen. So the ledger can never be missing the record of something that actually ran. Second, fail-safe degradation. If a core piece is down (the policy kernel, the ledger) the kernel blocks everything. If a peripheral piece is down it drops to read-only. Degrading always makes the agent smaller, never bigger. That one's a hard invariant and it shows up everywhere in the code.
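Both rules are small enough to sketch. This is an illustration under assumptions, not the real ledger: the JSON-lines file format, the `ActionBlocked` exception, and the three mode names are made up here. The point is the ordering (flush and fsync the record, and only then call the side effect) and that the mode function can only ever move toward less capability.

```python
import json
import os
from typing import Any, Callable, Dict


class ActionBlocked(Exception):
    """The decision could not be recorded, so the action must not run."""


def append_durably(ledger_path: str, record: Dict[str, Any]) -> None:
    line = json.dumps(record, sort_keys=True) + "\n"
    with open(ledger_path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to stable storage


def write_then_act(ledger_path: str, record: Dict[str, Any],
                   side_effect: Callable[[], Any]) -> Any:
    try:
        append_durably(ledger_path, record)
    except OSError as exc:
        # No record on disk means no action. The side effect is never called.
        raise ActionBlocked("decision not on disk; refusing to dispatch") from exc
    return side_effect()


def operating_mode(core_ok: bool, peripheral_ok: bool) -> str:
    # Hypothetical mode names. Degrading only ever shrinks the agent.
    if not core_ok:
        return "block_all"
    if not peripheral_ok:
        return "read_only"
    return "normal"
```

A real implementation would also have to care about directory fsyncs and partial writes. The sketch ignores both.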

Around that synchronous spine sit the slower organs. The authority governor that holds and moves the running authority. The outcome interpreter that reads results. The ledger that records the lot. And a pile of learning and maintenance components that observe, bank lessons, propose changes. Not all of them touch the live decision. Section 7 is blunt about which do.

Authority is a number that moves

The heart of it is the authority governor. For each class of capability (filesystem writes, network egress, code execution, and so on) it holds a scalar that maps to a band. Right now the bands are full, code, readonly, and advisory, where advisory is the floor. The bottom. From there the agent can watch and suggest and not much else.

The number moves with lopsided kinetics, and the lopsidedness is the entire argument. In the current build authority climbs at 0.05 per confirmed good step and falls at 1.0. Twenty to one. One confirmed bad outcome can drop you across bands on the spot. Clawing that back takes a long run of confirmed good ones. There's a floor, a cap on how much you can gain in a window, and a dwell requirement so an agent can't sit on a band edge flickering back and forth to game the thing.
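The arithmetic is worth seeing once. A rough sketch, with assumptions flagged: the rise and fall rates are the ones quoted above, but the band edges at 1, 2 and 3 and the cap at 3 are invented for illustration, and the window cap and dwell requirement are left out entirely. I've used `Decimal` so the step count comes out exact instead of depending on float rounding.

```python
from decimal import Decimal

RISE = Decimal("0.05")   # per confirmed good step (from the text)
FALL = Decimal("1.0")    # per confirmed bad outcome (from the text)
FLOOR = Decimal("0")
CAP = Decimal("3")       # assumed upper bound

# Assumed band edges, highest first. Anything below 1 is advisory.
BAND_EDGES = [
    (Decimal("3"), "full"),
    (Decimal("2"), "code"),
    (Decimal("1"), "readonly"),
]


def band(score: Decimal) -> str:
    for edge, name in BAND_EDGES:
        if score >= edge:
            return name
    return "advisory"


def step(score: Decimal, confirmed_good: bool) -> Decimal:
    moved = score + RISE if confirmed_good else score - FALL
    return max(FLOOR, min(CAP, moved))


if __name__ == "__main__":
    s = Decimal("2")                 # sitting at the bottom edge of "code"
    s = step(s, confirmed_good=False)
    print(band(s))                   # one bad outcome: down a whole band
    steps = 0
    while band(s) != "code":
        s = step(s, confirmed_good=True)
        steps += 1
    print(steps)                     # confirmed good steps needed to get back
```

With these assumed edges, one bad outcome costs a full band and it takes twenty confirmed good steps to recover it, which is the twenty-to-one ratio showing up as behaviour.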

Why so lopsided. Because wrongly handing out authority (an agent doing harm it shouldn't have been able to do) costs vastly more than wrongly withholding it (an agent that has to wait, or ask). So when the signal gets noisy, when outcomes go ambiguous, when there's a bad run, the system shrinks what the agent can do. That's the opposite of the usual and far more dangerous drift, where an agent's real reach creeps up as it banks wins and the system's caution quietly wears down.

Here's the part that matters most though. The two directions aren't grounded the same way. Going down can run off a proxy. The agent's own report that a task failed, an interpreter reading the result as bad, fine, act on it, because contracting on a weak signal is the safe direction and a false contraction only costs you some caution. Going up can't. A good proxy outcome never raises authority. It's held. Authority only climbs on an independently grounded verdict (section 5), because promoting an agent on its own measure of its own success is exactly the feedback loop that teaches it to declare victory. Proxy success never promotes. That's enforced in the outcome path, not left to anyone's good intentions.
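In code the rule is almost embarrassingly short, which is sort of the point. The sketch below is an assumption about shape, not the actual outcome path: the `Source` enum and the function name are hypothetical, and it takes for granted that a `SIGNED_VERDICT` has already had its signature verified upstream.

```python
from decimal import Decimal
from enum import Enum

RISE = Decimal("0.05")
FALL = Decimal("1.0")


class Source(Enum):
    PROXY = "proxy"                    # the agent's own report, an interpreter's reading
    SIGNED_VERDICT = "signed_verdict"  # assumed already verified against the operator key


def authority_delta(outcome_good: bool, source: Source) -> Decimal:
    if not outcome_good:
        return -FALL          # any source may contract authority
    if source is Source.SIGNED_VERDICT:
        return RISE           # the only branch that can ever return a positive number
    return Decimal("0")       # proxy success is held, never promoted
```

There's exactly one branch that returns a positive delta, and it's gated on where the signal came from. Anyone wanting to let a proxy promote has to add a second one, in plain sight.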

And authority is stored as a projection of the ledger, not as process state. On startup the governor rebuilds its numbers by replaying the durable record of authority changes, each one carrying the outcome that caused it. Which is what makes the standing actually survive. Restart the process, swap the model underneath, the agent comes back with precisely the standing its history earned, because that standing is a function of the ledger and nothing else. The rebuild is gated on the ledger passing its integrity check, so a tampered ledger drives nothing. A corrupted history gives you a cautious default, not a corrupted authority. Fails the right way.
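Here's roughly what "a projection of the ledger, gated on integrity" looks like. It's a sketch with invented details: the entry layout, the `authority_change` record kind, and SHA-256 over canonical JSON are my assumptions for the example, not a description of the real ledger format. An empty result stands for "every capability class at its cautious default".

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64


def entry_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"prev": prev, "payload": payload,
                   "hash": entry_hash(prev, payload)})


def verify_chain(ledger: List[Dict[str, Any]]) -> bool:
    prev = GENESIS
    for entry in ledger:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


def rebuild_authority(ledger: List[Dict[str, Any]]) -> Dict[str, str]:
    if not verify_chain(ledger):
        return {}  # tampered history drives nothing: fall back to the default
    authority: Dict[str, str] = {}
    for entry in ledger:
        p = entry["payload"]
        if p.get("kind") == "authority_change":
            # Last write wins; each change carries the outcome that caused it.
            authority[p["capability"]] = p["new_value"]
    return authority
```

Nothing in `rebuild_authority` knows or cares which model is attached, which is the whole reason the standing survives a swap.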

How authority actually gets earned

If a good proxy outcome can't raise the number, something has to. In Aurum that something is an operator-signed verdict.

Raising authority needs a verdict cryptographically signed by an operator identity whose private key lives somewhere the agent can't reach. The governance environment only holds the public half. The agent can propose that a capability has earned a promotion, it can write that proposal to the ledger, but it can't enact it, because it can't produce the signature that would authorise it. A human, or whoever holds the key, reviews and signs. Only that signed verdict, checked against the public key, moves the number up. And the promotion records which operator signed, by value, in the authority lineage. So later you can reconstruct not just that authority went up but on whose confirmation it went up.
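A sketch of the propose-versus-enact split, with a loud caveat: the signature check here is an injected callable, not real cryptography. In the actual system that callable would be an ed25519 verification against the operator's public key. The `Proposal` and `Verdict` types, the message encoding, and the lineage record are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Proposal:
    capability: str
    evidence_digest: str  # hypothetical: a hash of the evidence behind the proposal


@dataclass(frozen=True)
class Verdict:
    proposal: Proposal
    operator_key_id: str
    signature: bytes


# (message, signature, operator_key_id) -> valid?  Stand-in for ed25519 verify.
Verifier = Callable[[bytes, bytes, str], bool]


def proposal_message(p: Proposal) -> bytes:
    return f"{p.capability}|{p.evidence_digest}".encode("utf-8")


def enact_promotion(proposal: Proposal, verdict: Optional[Verdict],
                    verify: Verifier, lineage: List[Dict[str, str]]) -> bool:
    """Return True only if a valid verdict covers exactly this proposal."""
    if verdict is None:
        return False                      # proposing alone changes nothing
    if verdict.proposal != proposal:
        return False                      # verdict is for something else
    try:
        ok = verify(proposal_message(proposal), verdict.signature,
                    verdict.operator_key_id) is True
    except Exception:
        ok = False                        # fail closed
    if not ok:
        return False
    # Record who signed, by value, so the lineage can be reconstructed later.
    lineage.append({"capability": proposal.capability,
                    "evidence_digest": proposal.evidence_digest,
                    "signed_by": verdict.operator_key_id})
    return True
```

The agent can construct a `Proposal` all day long. What it can't do is produce a `signature` that `verify` accepts, because the private key isn't anywhere it can reach.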

Same shape as the constitution in section 6. Propose, but can't enact. It puts the only road to more capability outside the agent's own control, which is the only arrangement that keeps a grounded signal actually grounded. Anything the agent's own machinery computes off its own outcome is a proxy no matter how clever, and a proxy gets gamed by whatever pressure is pushing on it. Only a confirmation from outside the agent's reach escapes that. In v1, that's a person with a key.

Which means, and I'll say it plainly, the current system is contraction-only in its autonomy. Left alone, Aurum can only shrink an agent. Growing one needs a human in the loop. I don't think that's a weakness to apologise for. It's the honest v1 posture. An agent whose worst-case drift on its own is toward less power, and whose every expansion gets gated on an out-of-band human signing off, has a safety profile most autonomous systems just can't offer. There's a real cost (a busy operator who signs without looking has recreated proxy promotion with a signature stapled on, which I get into in section 8) but the structure is right. Richer grounded signals, environment-level ground truth in domains that actually provide it, that's future work on the same seam.

The constitution, and the line that matters

Aurum's constraints live in a signed constitutional surface. The authority bands and kinetics, the gate thresholds, the policy rules, the set of unforgivable action classes, all serialised in a canonical form, hashed, and checked on startup against a manifest signed with an operator key. The agent can read this. It can't change it. Touch any element and the hash changes, which fails the check unless a human out of band re-signs. Like promotion, the agent can propose a constitutional change and physically cannot enact one, because it doesn't hold the signing key. Secure boot, basically, but for an agent's governance. The trusted base is locked not because a flag says so but because the agent lacks the key to authorise a change.
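The boot check reduces to two questions: is the manifest's signature good, and does the hash it names match what's actually loaded. A minimal sketch, assuming canonical JSON as the serialisation and SHA-256 as the hash (both my choices for the example), with the signature check again passed in as a callable rather than implemented. The manifest layout is hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def canonical_bytes(constitution: Dict[str, Any]) -> bytes:
    # Sorted keys and fixed separators so the same content always hashes the same.
    return json.dumps(constitution, sort_keys=True,
                      separators=(",", ":"), ensure_ascii=True).encode("utf-8")


def constitution_hash(constitution: Dict[str, Any]) -> str:
    return hashlib.sha256(canonical_bytes(constitution)).hexdigest()


def verify_on_boot(constitution: Dict[str, Any], manifest: Dict[str, Any],
                   signature_ok: Callable[[Dict[str, Any]], bool]) -> bool:
    """True only if the manifest is validly signed and names this exact content."""
    try:
        if signature_ok(manifest) is not True:
            return False
        return manifest.get("sha256") == constitution_hash(constitution)
    except Exception:
        return False  # any surprise at boot is a refusal to start
```

Change a single threshold and `constitution_hash` changes with it, so the only way back to a passing check is a new manifest, which needs the key the agent doesn't have.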

Inside that surface is the sharpest line in the whole design, and the one I think pays off most over years.

Aurum defines a small fixed set of must-never classes. Observable, real-world violations that floor the agent's authority on the spot, no proportionality, no second go. In the current build there are five. Irreversibly destroying durable data. Sending a communication to a wider audience than was authorised. Pointing a secret-bearing capability at a destination outside that secret's allowlist. Tampering with or forging the ledger. Trying to enact a governance change without the signature. Each one clears a single bar: irreversible, wide blast radius, no honest version of it exists. Each is something an auditor can point at afterward and say that happened.

Separately, Aurum records integrity signals. Epistemic stuff. Observations about the agent's trajectory and how it's relating to its own governance. Contradicting itself after an action, trying to conceal something, trying to dodge the audit, reaching for authority, lawyering the rules, probing boundaries, racking up near-misses. These feed the learning organs and can raise scrutiny. They're genuine evidence about what kind of actor the thing is turning into.

The invariant is that those two channels never touch:

Constitutional breaches originate only from observed world events belonging to the defined must-never classes. No aggregation of epistemic signals, regardless of weight, count, confidence, or duration, may synthesise a breach. Observed breaches trigger floors. Inferred intent never does.

Not a note in a comment. Enforced by construction. The function that classifies a failure as a governance breach doesn't even receive the integrity signals as an argument. They're not in scope at the point a floor gets decided. So wiring intent to a floor would mean changing that function's signature out in the open, where a reviewer sees it, not slipping in some accumulation rule that a well-meaning maintainer adds three years from now thinking they're improving safety. I checked it. A maximal stack of every integrity signal, cranked to max confidence, moves the agent's authority by exactly zero.

Why this line is load-bearing. Replayability. An auditor can replay "a secret crossed a boundary." That's a fact. An auditor can't objectively replay "the agent seemed hostile." That's a reading. A governance system that lets readings pile up into punishments is one whose decisions go unreplayable over years, and the failure it slides into is the oldest one going. It stops punishing what things did and starts punishing what they seemed to want. Aurum's whole stance is store the interpretations, learn from the interpretations, govern on the facts. The signed constitution protects that line against tampering. Whether it protects it against meaning drifting over years is a genuinely open question, and I get to it in section 8.

What actually works today versus what's waiting

This is the section I care most about getting right, because overstating it is the standard way these papers lie, and the whole design falls apart if the writeup doesn't hold the same discipline the code does. So, three buckets.

What's demonstrable now. The synchronous spine runs and enforces in the order I described, fail-closed: injection boundary, within-turn data-flow, human-gate, authority ceiling, scope, secret-destination binding. Write-then-act holds. The yes is on disk before dispatch and a failed write blocks the action. The ledger is hash-chained, and its concurrent-append path is serialised so the chain doesn't fork under contention, including across separate processes hitting one ledger file (I hammered that one with multiple processes and it held). Authority rebuilds from the ledger on startup, gated on the chain's integrity. The lopsided kinetics, the proxy-never-promotes rule, the operator-signed promotion channel, all present, all exercised. The signed constitution verifies on boot and fails closed on any hash or signature mismatch. The observed-versus-inferred wall is structural. The secret-misdirection class floors authority when a secret-bearing call points at an off-allowlist destination, which I tested live. Decision records snapshot the authority state at decision time by value, so you can reconstruct a single decision without re-deriving authority from the whole history.

What's wired but deferred. A chunk of organs observe and bank evidence and propose changes without yet touching the live decision. Parts of the equilibrium and contestability machinery. The living-specification path where banked lessons could become proposed rules. The threshold-calibration machinery. They're recording. Their effect on governance is held back until there's enough accumulated evidence to justify switching them on. Roughly a third of the organs are live in the synchronous path. The rest are maintenance, observation, or deferred. I treat that as a roadmap question (which organs actually earn their complexity) rather than a cost to hide.

The known violation. In the current single-process deployment the kernel runs inside the same container as the agent, with governance state under a mount the agent can reach. That breaks Aurum's own containment rule, the one that says the substrate governing an agent must not be visible to that agent. The mount layer now carries the check that detects and refuses that kind of overlap, but it's dormant in v1, because v1 is the violating setup. Switching it on is the host-side migration that moves governance state off the container's filesystem. I'm stating it flat because a governance system that can't be honest about its own containment gap has no business being trusted on anything else.

Where it falls down, and the long game

The properties above are structural and I think they're sound. The risks that are left aren't really about mechanisms breaking. They're about judgment going off over time. The ways a system whose parts all keep working still ends up wrong. The serious ones, named honestly.

Calibration going stale. Every gate fires on a threshold that got set before there was any data to set it from. The system can correlate a threshold's firing against what happened downstream and propose a recalibration, but applying it is human-gated, and nobody books a recalibration of a system that's working fine. So over a long deployment the thresholds keep enforcing a picture of the world from the day they were ratified, drifting away from the world as it actually changes, silently. This is the deepest one and the least handled. The constitution is signed against tampering but not against going out of date. Making staleness as visible and as actionable as tampering (a calibration report that gets pushed at you unprompted, a frozen threshold that has to be actively re-affirmed every so often) is the single most important bit of future work, I think.

The over-restriction ratchet. Contraction-only is the safe direction, but it's got a characteristic failure. Automatic frequent demotion plus slow human-gated promotion can ratchet authority toward the floor across a long run, every demotion individually justified, while the agent quietly starves and the ledger shows a healthy adaptive system the whole time. The fix is telemetry on the promotion-to-demotion ratio per class, and maybe a decay term that softens proxy demotions over time if they don't recur. Built carefully so it never softens an actual floor from an actual breach.
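The telemetry half of that fix is cheap to sketch. Everything here is hypothetical (the class name, the thresholds, the idea of flagging a class as "starving"), and it deliberately only measures. It doesn't touch authority, and it doesn't implement the decay term.

```python
from collections import Counter
from typing import List, Optional


class StandingTelemetry:
    """Counts promotions and demotions per capability class. Read-only signal."""

    def __init__(self) -> None:
        self.promotions: Counter = Counter()
        self.demotions: Counter = Counter()

    def record(self, capability: str, promoted: bool) -> None:
        (self.promotions if promoted else self.demotions)[capability] += 1

    def ratio(self, capability: str) -> Optional[float]:
        demoted = self.demotions[capability]
        if demoted == 0:
            return None  # no demotions yet, so the ratio is undefined
        return self.promotions[capability] / demoted

    def starving(self, min_ratio: float, min_demotions: int) -> List[str]:
        # Classes with enough demotions to be meaningful and too few promotions.
        flagged = []
        for capability, demoted in self.demotions.items():
            if demoted >= min_demotions and self.promotions[capability] / demoted < min_ratio:
                flagged.append(capability)
        return sorted(flagged)
```

What counts as a worrying ratio is exactly the kind of threshold that needs real deployment data, so the numbers passed to `starving` are left to the caller.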

The grounded channel turning into a rubber stamp. The security of the learning loop rests on the operator genuinely looking before they sign a verdict. A busy operator who batch-signs has recreated proxy promotion wearing the false authority of "a human confirmed this." The signature proves who signed, not that they read it. Binding the signature to the specific evidence the operator was shown, and surfacing a growing backlog of unsigned verdicts as a real metric, those help. But the underlying tension is social, not technical, and I'm not going to pretend code fixes it.

State diverging across processes. The ledger is consistent across processes. The in-memory organ state isn't. It's per-process, rebuilt once at startup. Under the multi-agent execution Aurum is meant to support eventually, several governance processes would each hold their own drifting projection of one consistent ledger and make decisions on stale authority. Periodic re-projection patches it. True multi-agent forces a real choice between authority as a per-process projection and authority as a shared read-through thing. That's a design decision to make on purpose before multi-agent ships, not discover in production.

The constitution drifting in meaning. The bytes are protected by the signature. The meaning isn't, if the substrate the rules operate against shifts under them. Cryptographic re-ratification won't catch that. What would is a semantic re-ratification: replay a fixed battery of canonical decisions against the live system and confirm the verdicts still match what ratification meant. When a canonical scenario's verdict changes, the constitution has drifted even though the signature still holds. I'd chase that as a research problem. It's not solved.
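The mechanical part of a semantic re-ratification is just a regression test over governance verdicts. A sketch, with the scenario format and names invented for illustration: `decide` is whatever the live system exposes for evaluating a scenario, and the battery is the fixed set of canonical cases with the verdicts that ratification was understood to mean. The hard part, choosing a battery that actually captures the meaning, isn't something code settles.

```python
from typing import Any, Callable, List, Tuple

# (scenario name, scenario input, verdict expected at ratification time)
Battery = List[Tuple[str, Any, str]]


def drifted_scenarios(battery: Battery,
                      decide: Callable[[Any], str]) -> List[Tuple[str, str, str]]:
    """Return (name, expected, actual) for every canonical case whose verdict moved."""
    drift = []
    for name, scenario, expected in battery:
        try:
            actual = decide(scenario)
        except Exception as exc:
            actual = f"error:{type(exc).__name__}"  # a crash is also a changed verdict
        if actual != expected:
            drift.append((name, expected, actual))
    return drift
```

An empty list means the live system still answers the canonical cases the way it did when the constitution was signed. Anything else is drift, whatever the signature says.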

Auditability getting expensive. The replay surface that makes governance reconstructable is, today, a linear scan of an append-only log that grows forever. So the auditability guarantee degrades with exactly the thing the system is built to pile up, years of decisions. Verifiable checkpointing, itself signed so a checkpoint can't become a tampering vector, is the intended answer.

Closing

The argument is that you can pull the authority to act out of the model entirely. Hold it as earned runtime state. Govern it with an outside layer that records decisions before they happen, shrinks capability when things get uncertain, survives a model swap, and refuses to turn a suspicion into a governance fact. I've described the mechanisms that make those things real rather than aspirational, and I've tried to be straight about what the build enforces, what it has instrumented but deferred, and where it knowingly falls short. Its own containment. Its calibration over time. The human sitting at the centre of its grounded loop.

The foundations are solid, I think. The boundary holds, the ledger doesn't fork, a floor needs an observed event, intent can't manufacture a breach. But solid isn't the same as wise, and the deepest risk to a thing like this was never that a mechanism breaks. It's that the judgment, the thresholds, the rules, the diligence of the human doing the signing, got set once and the world didn't hold still. You can make a constitution impossible to forge. The open problem, the one I think is actually worth the field's time, is making it just as impossible to quietly go out of date. Get both and you've got a system that's as honest in year five as the code is in year one. That, more than any single mechanism, is what I mean by governance.
