The labour arithmetic of IT operations, and why automation alone never closed the loop.

By Kosi Asuzu

Filed by: operations desk · spec-id: gc-fn-01 · open / unrestricted

Estates are doubling every five years. The humans operating them are not. Every generation of automation promised relief; every generation stopped one step short of the action. Here is the arithmetic, and the reason it cannot be solved by hiring or by better tooling alone.

The growth curve no one is matching

Walk into any mid-market IT operations team and ask what the estate looked like five years ago. The answer is reliably the same shape: half the endpoints, a third of the cloud surface, a quarter of the identities, none of the SaaS sprawl, and a fraction of the regulatory perimeter that sits on top of all of it. Then ask what headcount looked like five years ago. The answer, almost without exception, is about the same.

This is the arithmetic. Estate complexity scales with revenue, with M&A, with cloud adoption, and with regulatory load. Operations headcount scales with hiring budgets, training pipelines, and the patience of CFOs. The two curves diverged a long time ago. The gap between them is currently being filled with overtime, alert fatigue, and the quiet attrition of senior operators who have decided they would rather not do this for another year.
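As a rough illustration of that divergence, the sketch below takes the doubling-every-five-years figure from the introduction and holds headcount flat. The starting values (2,000 managed objects, 10 operators) are hypothetical and chosen only to make the ratio visible; they are not measurements from any real estate.

```python
# Illustrative only: the starting figures are hypothetical, not measurements.

def estate_size(initial: float, years: float, doubling_period: float = 5.0) -> float:
    """Estate size after `years`, assuming it doubles every `doubling_period` years."""
    return initial * 2 ** (years / doubling_period)


def load_per_operator(initial_estate: float, operators: int, years: float) -> float:
    """Managed objects per operator, assuming headcount stays flat."""
    return estate_size(initial_estate, years) / operators


if __name__ == "__main__":
    for year in (0, 5, 10, 15):
        load = load_per_operator(initial_estate=2_000, operators=10, years=year)
        print(f"year {year:>2}: {load:,.0f} objects per operator")
```

Under these assumptions the load on each operator doubles with every period, which is the gap the rest of this piece is about.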

Three generations of automation, and what each one shipped

Each wave of automation tooling tried to bend that gap. None of them closed it.

Scripts and runbooks. The first wave: PowerShell, Bash, Ansible, scheduled tasks. They externalized institutional knowledge into files, which was a real gain. But scripts only run when something (a human, a cron job, a ticket) tells them to. The decision stayed with the operator.

RMM and orchestration. The second wave: remote monitoring and management (RMM) platforms, condition-triggered automation, playbook engines, "if this alert then run that script." Better. The operator is no longer in the trigger path for simple cases. But the playbooks are static, the conditions are brittle, and the moment reality drifts from the assumed shape, the playbook either fails closed or, worse, succeeds with the wrong outcome and nobody notices for a quarter.

AIOps. The third wave: pattern recognition over telemetry, alert clustering, anomaly detection, root-cause hints. This produced excellent decision support. It also produced the durable industry meme of the operator with seventeen tabs open, all of them telling her the same thing in slightly different words. AIOps refines the question. It does not answer it.

Each wave moved automation closer to the action without ever reaching it. The action stayed manual.

Where the loop kept breaking

The pattern is consistent. Each generation of tooling shrank one part of the operator's job and left a different part untouched. Scripts removed typing. RMMs removed the trigger. AIOps removed the triage. None of them removed the moment where a human reads a recommendation, makes a judgement about blast radius, and clicks execute.

That moment is where labour cost actually lives. It is also where the operator's most valuable skill (situational awareness of how this change will interact with the rest of the estate) gets used. And it does not scale with the estate: each judgement costs the same human attention as before, while the number of judgements grows with everything being managed.

The missing primitive

The reason the loop never closed is that none of these systems had a model of the estate that an autonomous agent could reason against. Scripts have no model; they have parameters. RMMs have a static playbook graph. AIOps has a stream of alerts and a learned topology over them, which is closer, but still not a model in the operational sense.

What is needed is a live, high-fidelity, queryable representation of the estate as it is right now: every endpoint, identity, configuration, dependency, change-history entry, and state transition, reconciled continuously against ground truth, and rich enough that an agent can ask "what happens if I do X?" and get a non-trivial answer before X is allowed to touch reality.
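To make the idea of a queryable representation concrete, here is a deliberately tiny sketch. It assumes the twin can be reduced to a dependency graph held in memory, and it answers one question: which objects sit downstream of a proposed change. The class name `EstateTwin`, the method `what_if`, and the example hostnames are all hypothetical; a real twin would also carry configuration, identity, and history, and would be reconciled against live telemetry.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class EstateTwin:
    """Toy in-memory twin: nodes are estate objects, edges are 'depends on' links."""

    # dependents[x] holds every object that depends directly on x (hypothetical schema).
    dependents: dict[str, set[str]] = field(default_factory=dict)

    def add_dependency(self, consumer: str, provider: str) -> None:
        """Record that `consumer` depends on `provider`."""
        self.dependents.setdefault(provider, set()).add(consumer)
        self.dependents.setdefault(consumer, set())

    def what_if(self, target: str) -> set[str]:
        """Return every object that could be affected by changing `target`."""
        affected: set[str] = set()
        queue = deque([target])
        while queue:  # breadth-first walk over direct and transitive dependents
            node = queue.popleft()
            for dependent in self.dependents.get(node, ()):
                if dependent not in affected:
                    affected.add(dependent)
                    queue.append(dependent)
        affected.discard(target)  # a dependency cycle must not list the target itself
        return affected


if __name__ == "__main__":
    twin = EstateTwin()
    twin.add_dependency("payroll-app", "sql-01")
    twin.add_dependency("sql-01", "dc-01")
    twin.add_dependency("vpn-gw", "dc-01")
    print(sorted(twin.what_if("dc-01")))
```

The size of the returned set is one crude measure of blast radius. The point of the sketch is only that the question is asked of the model, before anything is changed in the estate itself.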

That primitive is the digital twin. It is the only thing that turns a playbook engine into a judgement engine.

Autonomy is not faster automation

Faster automation means the same playbooks running with less human latency in the trigger path. That is incremental and useful and largely solved. Autonomy means the system is responsible for the decision, not just the execution. It chooses which change to execute, against which subset of the estate, in which order, with which rollback envelope, and on whose authority, and it owns the post-state.

The human is on the consent path for novel or high-blast-radius decisions, not on the execution path for routine ones. The arithmetic finally bends because the per-change human cost stops being constant.
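One way to picture the consent path is as a routing decision made per change. The sketch below is an assumption about how such a rule might look, not a description of any particular product: `ProposedChange`, `route`, the set of known change kinds, and the blast-radius threshold are all hypothetical names and values.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    EXECUTE = "execute autonomously"
    CONSENT = "ask a human for consent"


@dataclass(frozen=True)
class ProposedChange:
    kind: str          # e.g. "restart-service" (hypothetical label)
    blast_radius: int  # number of objects a rehearsal says could be affected


def route(change: ProposedChange, known_kinds: set[str], max_blast_radius: int) -> Route:
    """Send novel or high-blast-radius changes to a human; execute the rest."""
    if change.kind not in known_kinds:
        return Route.CONSENT  # novel: no precedent for this kind of change
    if change.blast_radius > max_blast_radius:
        return Route.CONSENT  # routine kind, but too much could be affected
    return Route.EXECUTE


if __name__ == "__main__":
    routine = {"restart-service", "rotate-certificate"}
    for change in (
        ProposedChange("restart-service", blast_radius=3),
        ProposedChange("restart-service", blast_radius=250),
        ProposedChange("migrate-tenant", blast_radius=3),
    ):
        print(change.kind, change.blast_radius, "->", route(change, routine, 50).value)
```

Human attention is spent only on the changes that return `Route.CONSENT`, which is what stops the per-change cost from being constant.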

What changes, operationally

The deliverable is not a chat interface or a copilot. It is a closed loop: estate → twin → agent → rehearsal → action → reconciliation → estate, running continuously, with a small number of well-defined places where a human is asked for explicit consent. Most days, that consent looks more like reviewing a pull request than approving a ticket.
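The loop can be sketched in a few lines if the estate is reduced to a dictionary of settings. Everything in the sketch is a simplifying assumption: the function name `run_loop_once`, the idea that the agent only corrects drift from a desired state, and the three callbacks (`is_safe`, `needs_consent`, `ask_human`) are hypothetical stand-ins for the rehearsal and consent steps.

```python
from __future__ import annotations

from typing import Callable


def run_loop_once(
    estate: dict[str, str],
    desired: dict[str, str],
    is_safe: Callable[[dict[str, str]], bool],
    needs_consent: Callable[[str], bool],
    ask_human: Callable[[str, str], bool],
) -> list[str]:
    """One pass of estate -> twin -> agent -> rehearsal -> action -> reconciliation."""
    twin = dict(estate)  # twin: a snapshot of ground truth
    applied: list[str] = []
    for key, value in desired.items():  # agent: look for drift to correct
        if twin.get(key) == value:
            continue
        rehearsed = {**twin, key: value}  # rehearsal: apply the change to a copy
        if not is_safe(rehearsed):
            continue  # simulated post-state rejected, so reality is never touched
        if needs_consent(key) and not ask_human(key, value):
            continue  # consent withheld
        estate[key] = value  # action: the only line that changes the estate
        twin = dict(estate)  # reconciliation: re-sync the twin with ground truth
        applied.append(key)
    return applied


if __name__ == "__main__":
    estate = {"tls": "1.0", "patch-level": "old", "firewall": "open"}
    desired = {"tls": "1.2", "patch-level": "new", "firewall": "closed"}
    applied = run_loop_once(
        estate,
        desired,
        is_safe=lambda state: True,
        needs_consent=lambda key: key == "firewall",
        ask_human=lambda key, value: False,  # the human declines in this run
    )
    print(applied, estate)
```

In this run the two routine changes are applied and the firewall change waits for a human, so consent is asked for in exactly one well-defined place.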

The operator's job changes shape, toward policy, exception-handling, and institutional design. The headcount stops needing to track the estate growth curve. The arithmetic finally works.
