Field report: first autonomous patch cycle across a 4,200-endpoint estate.

By Kosi Asuzu

filed by · operations desk spec-id · gc-fn-04 open / unrestricted

For six weeks beginning in late February, the agent ran an unattended patch reconciliation cycle across 4,200 endpoints, twelve sites, and three converged PSAs. This is what we measured, the two incidents it surfaced, and where the loop tightened.

Estate composition

The estate spans a regional managed-services portfolio: 4,212 endpoints at the start of the cycle, distributed across twelve physical sites and a fully cloud-resident core. Three previously independent PSAs had been converged into a single twin model in the preceding quarter. Identity is consolidated under a single IdP with two break-glass exceptions. Backup posture is mixed across two vendors.

Backlog at start

The opening posture was rough. The twin reported 38,041 outstanding KBs across the endpoint population, with 11.2% of endpoints in a non-compliant patch state for at least 90 days. The longest-outstanding KB had been deferred for 412 days. There were 47 endpoints in a partially-rolled-back state from a prior automation run. Drift across configuration domains averaged 4.8% per node, manageable at the per-node level, expensive in aggregate.

┌──────────────────────────────────────────────────────────────┐
│ CYCLE 01 · OPENING POSTURE                          [BASELINE] │
├──────────────────────────────────────────────────────────────┤
│ ENDPOINTS                  4 212                             │
│ SITES                         12                             │
│ OUTSTANDING KBS           38 041                             │
│ >90D NON-COMPLIANT         11.2%                             │
│ PARTIAL-ROLLBACK STATE        47 ENDPOINTS                   │
│ AVG CONFIG DRIFT            4.8% / NODE                      │
│ IDENTITIES                 5 884   (1 IDP · 2 BREAK-GLASS)   │
└──────────────────────────────────────────────────────────────┘

Cycle 01: baseline reconciliation

The first cycle was deliberately restrained. The agent ran inventory and posture reconciliation only (no remediation) and produced a ranked queue of changes by blast radius and severity. The full reconciliation took 38 minutes wall-clock. The queue surfaced 14,210 high-confidence remediations, 8,920 medium, and the rest tagged for operator review.

Cycle 02: first-pass remediation

From day three, the agent began executing high-confidence remediations against the rehearsed sandbox first, then against production. By end of week one: 9,012 KBs reconciled, 1,840 configuration drifts corrected, 47 partially-rolled-back endpoints brought back to a known-good state. Mean time to remediation, weighted by severity, was 4 minutes 12 seconds. Industry baseline for the same operations under operator-driven RMM is 27 minutes.

The aggregate effect of small reductions in MTTR is not linear. It is the difference between a backlog that grows and one that shrinks.

Two incidents

Incident 04-01. On day eleven, a planned KB rollout to a finance subnet was halted by the rehearsal layer. The rehearsal predicted a temporary auth disruption to a payments service that shared an identity dependency with the patched endpoints. The rehearsal was correct: the dependency was real, undocumented, and would have caused a window of failed transactions. The agent paused, escalated, and the change was scheduled for an off-hours window with a tighter rollback envelope. No production impact.

Incident 04-02. On day twenty-eight, a backup-vendor connector returned silently degraded data for six hours before the connector heartbeat flagged it. The twin's confidence score for backup posture dropped during that window, the agent suspended any backup-related changes, and the operator was paged on the heartbeat alarm. Total exposure: zero changes executed against stale data. Zero production impact. The connector was patched and re-baselined within the cycle.

Numbers at end of cycle

┌──────────────────────────────────────────────────────────────┐
│ CYCLE 02 · 6-WEEK SUMMARY                          [CLOSED]   │
├──────────────────────────────────────────────────────────────┤
│ KBS RECONCILED            36 902   (97.0% OF OPENING)        │
│ DRIFTS CORRECTED          11 740                             │
│ >90D NON-COMPLIANT          0.4%   (WAS 11.2%)               │
│ MTTR · WEIGHTED          04M 12S   (BASELINE 27M)            │
│ AUTONOMY SLA               99.94%                            │
│ INCIDENTS                       2   (ZERO PRODUCTION IMPACT) │
│ OPERATOR CONSENTS              28   (MOSTLY NOVEL CHANGES)   │
└──────────────────────────────────────────────────────────────┘

Post-cycle audit

The audit pass at the end of week six took 11 minutes wall-clock and reproduced the change-log from the event store. Every executed change was attributable, every rehearsal was retrievable, every consent was timestamped. The two incidents both fit cleanly inside the existing incident framework: one as a correctly-paused change, one as a correctly-suspended domain. Neither required a postmortem in the traditional sense; both became feedback into the rehearsal model.

The team's headcount did not change during the cycle. Their working hours did. By end of cycle, no one was patching anything by hand, and the shape of the on-call rotation had quietly inverted: most pages were now policy questions, not operational ones.

Field report: first autonomous patch cycle across a 4,200-endpoint estate.

Estate composition

Backlog at start

Cycle 01: baseline reconciliation

Cycle 02: first-pass remediation

Two incidents

Numbers at end of cycle

Post-cycle audit

More from the desk.

Rehearsing changes against a live twin: blast radius, simulation, and consent

Building the digital twin: from telemetry stream to live estate graph

The labour arithmetic of IT operations, and why automation alone never closed the loop