NTCTech

Posted on Jun 24 • Originally published at rack2cloud.com

Autoscaling Is an Authority System, Not a Capacity System

#devops #kubernetes #cloudnative #platform

Autoscaling authority is the condition most cloud operations teams have never formally defined. Every organization running Kubernetes has autoscaling configured. Almost none has treated those configurations as governance artifacts.

The Capacity Framing Is Wrong

Autoscaling authority governs what your execution plane is permitted to do under changing load conditions — and that framing is the one most engineering teams have never applied to it.

The standard framing is operational: autoscaling solves the overprovisioning versus underprovisioning tradeoff. Set a threshold, define bounds, let the system respond. That framing isn't wrong — it's just operating at the wrong level of analysis. Capacity is the visible output. Authority is the invisible decision structure underneath it.

Every autoscaling configuration answers a governance question before it answers a capacity question: what is this system permitted to do without asking a human? When an HPA scales a deployment from 3 to 30 replicas in response to a traffic event, no engineer approved that action. The execution plane acted under authority delegated at configuration time. That delegation is the architecture — and in most organizations, it's ungoverned.

Cloud architecture's central authority question is whether authority defined at the policy layer actually reaches the systems that execute work. Autoscaling is the most common place that question gets answered by accident.

What Autoscaling Actually Encodes

A scaling configuration is a policy artifact. Every value in it encodes a decision about what the system is authorized to do.

HPA target utilization: a decision. Min and max replica bounds: a decision. VPA update mode — Off, Initial, Recreate, Auto: a decision about whether the system may modify running pods without human approval. KEDA trigger sources and thresholds: decisions about what signals the execution plane is authorized to act on. Each of these encodes a judgment about permitted behavior. The problem isn't that the decisions are wrong. The problem is that they're almost never recognized as decisions at all.

They get written once, at initial deployment, calibrated to the workload as it existed at launch. Then they persist. Not because someone reviewed them and confirmed they were still valid. Because no one touched them, and defaults survive indefinitely unless deliberately revisited.

Silent Delegation

No architect signs a document stating: "The execution plane may increase application capacity by 500% without human review."

Yet that's exactly what many autoscaling policies authorize.

The delegation happened when the configuration was written. The authority persisted long after the original decision-maker stopped thinking about it. Teams don't experience this as a governance failure — they experience it as autoscaling working as designed.

Diagnostic: "Who authorized your execution plane to make this decision — and do they still work here?"

When Scaling Behavior Diverges From Intent

The named failure state for this pattern is Scaling Divergence: the condition where autoscaling behavior remains technically correct relative to configuration while becoming operationally incorrect relative to current intent.

Scaling Divergence doesn't announce itself. There's no alert for "autoscaling is no longer doing what you intended." The system continues functioning. Metrics look normal. Deployments scale. The divergence is only visible when you compare current scaling behavior against the workload model that originally justified the configuration — and most teams never make that comparison.

The clearest real-world scenario: a team sets HPA target utilization at 70% at application launch. At that point, the bottleneck is CPU — the application is compute-bound under normal load, and 70% is a reasonable ceiling before latency degrades. Nine months later, the application's bottleneck has shifted to database connection pool saturation. Response time is now dominated by wait time on external I/O, not CPU cycles.

The autoscaler continues making perfectly valid CPU-driven decisions. CPU utilization stays well below the threshold under rising load. The autoscaler holds replica count steady. Latency spikes. The operations team investigates CPU — the metric the autoscaler watches — and sees nothing wrong. The workload changed. The authority logic did not.

Three Failure Modes

Scaling Divergence manifests in three distinct patterns, escalating in severity and how long they typically go undetected.

01 — CEILING BLINDNESS

Max replicas were set at deployment time, but weren't derived from actual infrastructure capacity headroom. During a traffic event, the autoscaler hits the ceiling before load abates. Latency spikes. No one can confirm whether the ceiling is a deliberate safety boundary or an artifact of the original spec. The ceiling is enforcing something — nobody knows what.

02 — FLOOR DRIFT

Min replicas were set conservatively at launch to reflect early-stage load. The application has since scaled to 3× its original steady-state size. The min floor still reflects day-one sizing. During an incident-driven scale-down, the autoscaler hits the floor and holds there — at a replica count that hasn't matched actual minimum viable capacity in months. The floor feels like a safety net. It's a liability from an expired sizing decision.

03 — MULTI-CONTROLLER CONFLICT

HPA and VPA are running on the same workload without a defined coordination mode. VPA adjusts resource requests. HPA recalculates utilization ratios against the new requests. Each controller is operating correctly within its own decision boundary. The authority conflict is at the policy layer — two systems making decisions neither was designed to coordinate. Scheduler behavior becomes unpredictable and the root cause is invisible until you trace back to the configuration layer.

⚠ Common mistake: Running HPA and VPA on the same workload without setting VPA to Off or Initial mode. The default assumption is that both systems will self-coordinate. They won't. VPA modifies resource requests; HPA recalculates utilization ratios against those modified requests. The interaction is defined, but it's not intuitive, and the failure mode only surfaces under load.

What Autoscaling Governance Looks Like

If autoscaling is an authority system, it requires four properties to function as one.

01 — DECLARED INTENT — Scaling boundaries documented as decisions, not just values. Not "HPA target: 70%" in a YAML file — but what load profile that threshold was calibrated for, what behavior is expected outside that profile, and what workload characteristic it assumes. Without declared intent, there's no basis for evaluating whether current configuration is still valid.

02 — AUTHORITY OWNERSHIP — Who owns this scaling policy? Platform team? Application team? SRE? Cloud operations? If the answer is unclear, nobody owns it — which means nobody is accountable when scaling behavior diverges from intent. The ownership question determines who reviews the configuration when workload characteristics change, and who gets paged when a scaling event produces unexpected behavior.

03 — CHANGE COUPLING — Scaling configuration treated as code: changes to workload characteristics trigger review of scaling policy. The artifact is an explicit coupling between workload change events and scaling policy review. When the application's performance profile changes, the scaling authority that governs it should be reconsidered as part of the same change process, not discovered six months later during an incident.

04 — BEHAVIORAL AUDIT — Periodic comparison of actual scaling events against expected behavior under the declared intent. This is distinct from monitoring, which is reactive. Behavioral audit is deliberate review — examining whether the autoscaler is making decisions that align with the workload model the configuration assumed. Not "did it scale?" but "did it scale in the way the policy intended?"

The Operational Architecture Connection

Framework #152 Operational Authority Boundary defines the point at which authority must translate into executable operational behavior. Below the boundary: the execution plane acts as authority intends. Above it: execution proceeds independently.

Autoscaling sits directly at that boundary. Every scaling configuration is simultaneously a policy decision (above the boundary) and an executable instruction to the runtime (below it). The failure state — Authority Without Execution — typically manifests as systems that have policies with no enforcement. In autoscaling, the failure is the inverse: execution without current authority. The runtime is executing perfectly. It's just not executing what anyone still intends.

Autoscaling Authority Audit

Three questions that should be answerable for any autoscaling configuration in your estate:

1. What workload model does this configuration assume?
Not "what are the current values" — but what load profile, bottleneck type, and traffic pattern were used to derive those values. If nobody can answer this, the configuration is ungoverned regardless of whether it's functioning correctly.

2. When was this configuration last validated against actual workload behavior?
Not last modified. Last validated — meaning someone compared current scaling events against the intent the configuration was designed to express.

3. Who is accountable if this configuration produces scaling behavior that damages the application under load?
If that answer resolves to nobody, authority ownership is undefined. The execution plane is making decisions with no identified principal.

Architect's Verdict

Autoscaling is one of the most widely deployed authority systems in modern cloud infrastructure. Every scaling policy delegates operational decision-making to the execution plane. The question is whether the authority you intended actually reaches the runtime that makes decisions under load — and for autoscaling, it reaches the runtime once, at configuration time, and then persists until someone deliberately revisits it.

The operational failure most teams encounter isn't that autoscaling doesn't work. It's that it works exactly as configured, and the configuration no longer reflects anything anyone deliberately decided. That gap — between technical correctness and operational intent — is what Scaling Divergence names. It doesn't show up in scaling metrics. It shows up in incidents where the autoscaler performed as designed and the outcome was still wrong.

Every autoscaling system either has explicit authority or it has defaults. Defaults are simply authority decisions that survived long enough to become invisible.

Originally published at rack2cloud.com

DEV Community