# From Alerting to Inference: Metrics Never Stopped Mattering
Your LLM is slow. Users are complaining. Queues are growing. Someone on the team is already profiling the model, looking at batch sizes, considering a bigger GPU.
Nine times out of ten, the answer is already in the metrics.
I’ve spent most of my career staring at metrics. Bare-metal servers, Kubernetes clusters, managed services on public cloud. And if there’s one thing I keep re-learning, it’s that when the infrastructure seems to be lying to you, it’s usually because you’re not asking the right questions.
This isn’t a new lesson. Same lesson, different domain.
## How Metrics Changed Their Job
When I started doing infrastructure work, metrics had exactly one purpose: tell me when something is broken. A threshold gets crossed, an alert fires, someone wakes up. That was the whole deal.
Over the years, I watched metrics go from passive alarms to active participants in how systems behave. The role kept expanding, each step building on top of the previous one.
```mermaid
graph LR
    A["Alerting<br/>something broke"] --> B["Capacity Planning<br/>how much do we need?"]
    B --> C["Auto-scaling<br/>the system reacts"]
    C --> D["Intelligent Routing<br/>per-request decisions"]
    style A fill:#44475a,stroke:#ff5555,color:#f8f8f2
    style B fill:#44475a,stroke:#ffb86c,color:#f8f8f2
    style C fill:#44475a,stroke:#8be9fd,color:#f8f8f2
    style D fill:#44475a,stroke:#50fa7b,color:#f8f8f2
```
None of this was planned. Nobody sat in a room in 2012 and said “one day we’ll route individual inference requests based on GPU memory pressure.” It happened one problem at a time. But the foundation never changed: expose a number, collect it, act on it.
## Alerting: Where Everyone Starts
Before Prometheus, monitoring was a mess. Not because people didn’t care, but because every tool had its own data model, its own query language, its own way of defining “something is wrong.”
I’ve operated Nagios, Icinga, Zabbix. They all worked, kind of. But they were built around host-based checks: is this server up? Is this disk full? Is this process running? When you moved to dynamic environments (containers, microservices, autoscaling groups) that model fell apart fast.
Prometheus got a few things right that the others didn’t:
Pull-based collection. The monitoring system scrapes targets, not the other way around. Targets come and go (hello, Kubernetes pods), and Prometheus handles discovery.
Dimensional data model. Labels instead of hierarchical metric names. `http_requests_total{method="GET", status="200", service="api"}` is way more flexible than `servers.api01.http.get.200.count`.
PromQL. A real query language that lets you ask real questions. Not just “what’s the value now” but things like:
```promql
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{service="api"}[5m])
)
```
That gives you the p99 latency of your API over the last 5 minutes. Try doing that with Nagios.
This became a standard. Not because the CNCF said so (though Prometheus was the second project to graduate, right after Kubernetes), but because it solved a real problem in a way that made sense. Cloud providers adopted the same exposition format and remote-write protocol: Google Cloud has Managed Service for Prometheus, Azure has Prometheus-compatible monitoring, AWS has it too. They’re not running Prometheus’s TSDB under the hood, but they speak the same language. When everyone converges on the same data model, it’s usually the right one.
(Prometheus’s own storage story is a different matter. If you’ve ever tried to keep 6 months of metrics in a single Prometheus instance, you know why Thanos and Mimir exist. But that’s not the point here.)
```mermaid
graph LR
    T["Target<br/>/metrics endpoint"] --> P["Prometheus<br/>scrapes"]
    P --> R["PromQL Rule<br/>evaluated"]
    R --> AL["Alert fires"]
    AL --> AM["Alertmanager"]
    AM --> N["PagerDuty / Slack<br/>/ Email"]
    style T fill:#44475a,stroke:#8be9fd,color:#f8f8f2
    style P fill:#44475a,stroke:#50fa7b,color:#f8f8f2
    style R fill:#44475a,stroke:#ffb86c,color:#f8f8f2
    style AL fill:#44475a,stroke:#ff5555,color:#f8f8f2
    style AM fill:#44475a,stroke:#ff79c6,color:#f8f8f2
    style N fill:#44475a,stroke:#bd93f9,color:#f8f8f2
```
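The "PromQL Rule" step in that pipeline is just a rule file that Prometheus evaluates continuously. A minimal sketch, turning the latency query from above into an alert (the threshold, group name, and labels are illustrative, not from a real deployment):

```yaml
groups:
  - name: api-latency
    rules:
      - alert: ApiP99LatencyHigh
        # Same histogram_quantile query as above, turned into a condition
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket{service="api"}[5m])
          ) > 0.5
        for: 10m            # condition must hold for 10 minutes before firing
        labels:
          severity: page    # Alertmanager routes on labels like this one
        annotations:
          summary: "API p99 latency above 500ms"
```

Alertmanager then takes over: grouping, deduplication, and delivery to whatever wakes people up.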
But alerting was just the beginning. Once you have all this data collected continuously and queryable in real-time, the natural question becomes: why is a human the only thing acting on it?
## When CPU Metrics Lie
I was working with a team running a fairly standard web application on Kubernetes. They had the textbook setup: Horizontal Pod Autoscaler watching CPU utilization, scale up at 70%, scale down at 30%. By the book.
Except users were getting 503s. Latency was climbing. And the dashboards said CPU was fine, sitting happily at 40-50%.
The problem had multiple layers, and it took a while to peel them all back.
Wrong metric, wrong decisions. CPU utilization is a terrible scaling signal for most applications. It tells you how busy the processor is, not how well the application is serving users. An app can have low CPU and still be completely stuck waiting on database connections, DNS resolution, or downstream services. The HPA was scaling on a metric that had almost no correlation with user experience.
Boot time matters. When the HPA did eventually trigger (because CPU finally caught up), new pods took too long to become ready. Cold JVM, dependency initialization, health check delays. By the time new capacity was online, requests had already queued up and timed out.
The metrics you actually need. We instrumented the application to expose what actually mattered:
- Connection pool utilization (are we running out of database connections?)
- Response time percentiles per endpoint (not averages, p95 and p99)
- SQL query execution duration (is the database the bottleneck?)
Then we rebuilt the scaling signal. Instead of CPU, we used a composite: gateway-measured latency per user session, average application response time, database load, and pod readiness time. The HPA now scaled on what users were actually experiencing.
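A sketch of what that kind of scaling target looks like as an `autoscaling/v2` HPA manifest. The custom metric name is illustrative, and exposing application metrics to the HPA requires an adapter such as the Prometheus Adapter; this is the shape, not the exact production config:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    # Scale on what users experience, not on how busy the CPU is
    - type: Pods
      pods:
        metric:
          name: http_request_duration_p95_ms   # illustrative custom metric
        target:
          type: AverageValue
          averageValue: "250"
```

Keeping `minReplicas` comfortably above one also buys headroom for the slow-boot problem: scaling reacts faster when you're adding to warm capacity instead of starting from cold.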
Tracing for the last mile. We still had latency spikes that the new metrics didn’t fully explain. So we turned on distributed tracing. Service-to-service calls had a consistent overhead on external endpoints that shouldn’t have been there. The traces pointed straight at DNS.
Here’s what was happening: Kubernetes sets `ndots:5` by default in `/etc/resolv.conf` inside pods. Any hostname with fewer than five dots gets the cluster search domains appended first. So when the app tried to resolve `api.partner.com`, the resolver first tried `api.partner.com.default.svc.cluster.local`, then `api.partner.com.svc.cluster.local`, then `api.partner.com.cluster.local`. Some of those search-domain expansions were actually resolving to stale records pointing at wrong endpoints. The app would connect, get a bad response or a timeout, retry, and only then fall back to the correct resolution. And because the DNS cache TTL on those records was effectively zero, it went through this whole dance on every single call.

It’s always DNS.
Once we set `dnsConfig.options: [{name: ndots, value: "2"}]` on the pods talking to external services and added trailing dots to FQDNs in the configuration, the stale resolutions stopped and the latency spikes disappeared with them.
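In pod-spec form, that fix is a small fragment (this goes in the pod template of the affected workloads):

```yaml
# Pod spec fragment: stop expanding external FQDNs through cluster search domains
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # only names with fewer than 2 dots go through search domains
```

Cluster-internal short names like `my-service` still resolve through the search path; fully qualified external names skip the `.svc.cluster.local` detour entirely.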
System metrics (CPU, memory) tell you the server is on. Application metrics tell you if it works. Tracing tells you why it doesn’t. You need all three, and they’re useless without each other. We would have never found the ndots issue with just Prometheus dashboards.
## Auto-scaling: Metrics Become Actuators
Anyway. The point of that story isn’t just DNS. It’s that metrics stopped being something humans stare at and started being something systems consume.
The Kubernetes HPA was a first step. Crude, but it established the pattern. Then came the Prometheus Adapter and KEDA, and suddenly you could scale on any metric. Queue depth in RabbitMQ. Lag in a Kafka consumer group. Custom application latency. Whatever your application exposes, the system can react to it.
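With KEDA, "scale on queue depth" is one small manifest. A sketch for the RabbitMQ case, assuming a `worker` Deployment, a `jobs` queue, and a connection string in the `RABBITMQ_URL` env var (all three names are assumptions for illustration):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker              # Deployment to scale (assumed name)
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs           # assumed queue name
        mode: QueueLength
        value: "20"               # target backlog per replica
        hostFromEnv: RABBITMQ_URL # AMQP connection string from env var
```

Swap the trigger for `kafka` or `prometheus` and the rest stays the same: the queue backlog, not the CPU, decides how many workers exist.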
```mermaid
graph LR
    APP["App<br/>exposes metric"] --> P["Prometheus<br/>scrapes"]
    P --> HPA["HPA / KEDA<br/>evaluates target"]
    HPA --> SCALE["Scale decision<br/>+/- pods"]
    SCALE --> SCHED["Scheduler<br/>places pods"]
    SCHED --> STAB["Metric<br/>stabilizes"]
    STAB -.->|"loop"| P
    style APP fill:#44475a,stroke:#50fa7b,color:#f8f8f2
    style P fill:#44475a,stroke:#8be9fd,color:#f8f8f2
    style HPA fill:#44475a,stroke:#ffb86c,color:#f8f8f2
    style SCALE fill:#44475a,stroke:#ff79c6,color:#f8f8f2
    style SCHED fill:#44475a,stroke:#bd93f9,color:#f8f8f2
    style STAB fill:#44475a,stroke:#50fa7b,color:#f8f8f2
```
In cloud environments, same pattern, different names. Google Cloud has the HPA in GKE (using Cloud Monitoring or Managed Prometheus as signal source), custom metrics for MIG autoscaling, Cloud Run scaling on request concurrency. Always the same loop: metric, evaluation, action, feedback.
At some point it clicks: monitoring isn’t overhead. It’s input to the control plane. Without it, your autoscaler is just guessing, your HPA is a coin flip, and your capacity planning is a spreadsheet someone updates quarterly.
I’ve had this conversation with enough customers at Google Cloud to know how it usually goes. The teams that instrument early build systems that adapt. The rest build systems that page people at 3 AM and then blame the cloud provider.
(Side note: the number of times I’ve seen “we need to migrate to another cloud because this one is slow” when the actual problem was zero instrumentation and a misconfigured HPA is… concerning. But that’s a rant for another article.)
## Intelligent Routing: Metrics Drive Every Request
The Gateway API Inference Extension is a CNCF project that extends Kubernetes gateways (Envoy Gateway, kgateway, GKE Gateway) into inference-aware routing proxies. It takes the same pattern one step further: instead of asking “should we add more pods?”, the question becomes “which pod should handle THIS specific request?”
Traditional load balancing doesn’t work for LLM serving. Round-robin across GPU endpoints ignores everything that matters: one GPU might have a full request queue while another is idle. One might have the LoRA adapter for your fine-tuned model already loaded in memory, while another would need to load it from scratch. One has 90% KV-cache utilization, another sits at 20%.
The Endpoint Picker component solves this by querying real-time metrics from model servers and making routing decisions based on what it finds.
```mermaid
graph LR
    REQ["Request"] --> GW["Gateway<br/>(Envoy / GKE)"]
    GW --> EPP["Endpoint Picker<br/>queries model server metrics"]
    EPP --> F1["Filter: Queue Depth<br/>is this GPU busy?"]
    F1 --> F2["Filter: KV-Cache<br/>memory headroom?"]
    F2 --> F3["Filter: LoRA Affinity<br/>adapter already loaded?"]
    F3 --> F4["Filter: Criticality<br/>production or batch?"]
    F4 --> GPU["Optimal GPU<br/>Endpoint"]
    style REQ fill:#44475a,stroke:#f8f8f2,color:#f8f8f2
    style GW fill:#44475a,stroke:#8be9fd,color:#f8f8f2
    style EPP fill:#44475a,stroke:#50fa7b,color:#f8f8f2
    style F1 fill:#44475a,stroke:#ffb86c,color:#f8f8f2
    style F2 fill:#44475a,stroke:#ffb86c,color:#f8f8f2
    style F3 fill:#44475a,stroke:#ff79c6,color:#f8f8f2
    style F4 fill:#44475a,stroke:#bd93f9,color:#f8f8f2
    style GPU fill:#44475a,stroke:#50fa7b,color:#f8f8f2
```
The cascading filter logic is smart. A critical production request can tolerate a queue depth of up to 50 and higher KV-cache utilization: it has to be served. A sheddable batch request gets strict thresholds: queue depth under 5, KV-cache under 80%. If nothing meets the criteria, the request gets shed rather than degrading everyone else’s experience.
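In the Inference Extension’s API, criticality is declared per model rather than per request. A sketch based on the project’s v1alpha2 CRDs — the model and pool names are illustrative, and the API surface is still evolving, so treat this as the shape rather than a copy-paste manifest:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-prod
spec:
  modelName: llama-3-chat   # illustrative model name
  criticality: Critical     # vs Standard / Sheddable
  poolRef:
    name: vllm-pool         # InferencePool of GPU-backed endpoints
```

The Endpoint Picker reads the criticality from this object and applies the corresponding thresholds when it filters candidate endpoints.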
Loading a fine-tuned LoRA adapter onto a GPU is expensive: it eats memory and takes time. If an endpoint already has the adapter loaded from a previous request, routing there avoids that overhead entirely. This is an optimization no traditional load balancer could make, because it requires understanding what’s happening inside the model server.
And where does that understanding come from? The model servers (vLLM, Triton) expose metrics on a /metrics endpoint. The Endpoint Picker polls them on a tight interval (default is 50ms) and uses that cached state to make routing decisions. Same pattern as before, just running way faster.
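To make that concrete, here is roughly what the Endpoint Picker sees when it scrapes a vLLM endpoint — plain Prometheus exposition format, with gauges for exactly the signals the filters need. Metric names follow vLLM’s conventions but vary by version, and the values are invented for illustration:

```text
# HELP vllm:num_requests_running Number of requests currently running on GPU
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 4.0
# HELP vllm:num_requests_waiting Number of requests waiting in the queue
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 2.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage (1.0 means 100%)
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.37
```

Queue depth and KV-cache pressure come straight out of gauges like these; the routing layer is just a very fast consumer of the same `/metrics` endpoint everything else scrapes.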
Metrics went from “tell me when the disk is full” to “decide in real-time which GPU should process this specific inference request based on queue depth, memory pressure, and adapter affinity.”
## The Thread
Twenty years ago, metrics paged someone at 3 AM. Ten years ago, they scaled pods. Now they route individual inference requests to specific GPUs based on memory pressure and adapter affinity.
The pattern didn’t change. It just got applied to harder problems. And Prometheus ended up giving us the common language that made this possible. Autoscalers, routing proxies, inference gateways, they all speak it now.
I still see teams treating monitoring as an afterthought. Deploy without instrumentation, set up HPA on CPU and call it a day, skip tracing because “we’ll add it later.” Then something breaks in production and nobody can explain why, because there’s nothing to look at.
The GPUs got faster, the workloads got weirder, but the job hasn’t changed. Put a number on it, collect it, do something useful with it.
The opinions expressed here are my own and don’t represent my employer’s positions.