📡 Production Engineering · June 11, 2026 · 11 min read

AI Observability in Production: What to Measure, What to Ignore

Most AI teams measure metrics that don't matter and stay blind to ones that do. This article maps four observability layers you must cover in production — and which metrics are safe to skip.

By Emil Slavin, Enterprise Architect & AI Strategist

Why this is hard

In a classical web system, observability is logs + metrics + traces. In an AI system you need all that plus one more layer: answer quality. The LLM returned a response, latency was fine, status was 200 — but the answer was a hallucination. What does your monitoring tell you about that?

Here are four layers you must cover, and in what order.

Layer 1: Performance (latency, throughput)

What to measure:

  • TTFT (Time To First Token) per-tenant per-LLM-provider. p50/p95/p99.
  • TTLT (Time To Last Token) — end of streaming response.
  • Retrieval latency separately — pgvector / Pinecone query time.
  • Reranker latency separately if you use one.
  • Function-call latency, if the system is agentic.

Goal: a single LLM-provider outage shouldn't bring everything down. You need to see who's causing the issue and route around them.
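
As a minimal sketch of the two streaming timings above, the helper below wraps any iterable of tokens and records TTFT and TTLT, and a second helper turns a list of samples into p50/p95/p99. The names `timed_stream` and `percentiles` are illustrative, not part of any provider SDK, and the sketch assumes you keep one list of samples per (tenant, provider) pair.

```python
import time
from statistics import quantiles
from typing import Callable, Iterable


def timed_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[str, float, float]:
    """Consume a token stream and return (text, ttft_seconds, ttlt_seconds)."""
    start = clock()
    ttft = None
    parts = []
    for tok in tokens:
        if ttft is None:
            ttft = clock() - start  # first token arrived
        parts.append(tok)
    ttlt = clock() - start  # stream exhausted
    if ttft is None:
        ttft = ttlt  # empty stream: no token ever arrived
    return "".join(parts), ttft, ttlt


def percentiles(samples: list[float]) -> dict[str, float]:
    """p50/p95/p99 via inclusive quantiles; needs at least two samples."""
    cuts = quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Injecting the clock keeps the timing logic testable without real network calls. Retrieval, reranker, and function-call latency can be timed the same way around their own calls.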

Layer 2: Quality (correctness)

The hardest layer. No perfect automatic metric.

What to measure:

  • Refusal rate: what fraction of responses ended in "I can't answer that" (a high rate usually indicates weak retrieval or gaps in the knowledge base).
  • Citation rate: what fraction of responses include a source citation (must be 100% in regulated domains).
  • Average sources per response.
  • User thumbs up/down rate. Yes it's noisy, but the trend says something.
  • LLM-as-judge: sample 5% of responses and pass them to a different LLM with an evaluation prompt. Not perfect, but it catches degradation patterns.
  • Hallucination detection: pattern matching on phrases that indicate "I'm making things up" (domain-specific).
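
The cheaper metrics in this list can be computed directly from response text. The sketch below is one possible approach under stated assumptions: refusals are detected with a small, hypothetical phrase list, citations are assumed to look like `[1]`, `[2]`, and the 5% LLM-as-judge sample is chosen by hashing the request id so the decision is reproducible. None of these patterns come from a real library; tune them for your domain.

```python
import hashlib
import re

# Hypothetical, domain-specific patterns; adjust for your own assistant.
REFUSAL_RE = re.compile(
    r"\b(?:i can't answer|i cannot answer|i don't have enough information)\b",
    re.IGNORECASE,
)
CITATION_RE = re.compile(r"\[(\d+)\]")  # assumes citations are rendered as [1], [2], ...


def quality_stats(responses: list[str]) -> dict[str, float]:
    """Refusal rate, citation rate, and average distinct sources per response."""
    n = len(responses)
    if n == 0:
        return {"refusal_rate": 0.0, "citation_rate": 0.0, "avg_sources": 0.0}
    refusals = sum(1 for r in responses if REFUSAL_RE.search(r))
    sources = [len(set(CITATION_RE.findall(r))) for r in responses]
    return {
        "refusal_rate": refusals / n,
        "citation_rate": sum(1 for s in sources if s > 0) / n,
        "avg_sources": sum(sources) / n,
    }


def sampled_for_judge(request_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: the same request id always gets the same decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate
```

Hash-based sampling means a response picked for judging today is still picked when you replay the same traffic, which makes week-over-week comparisons easier.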

Layer 3: Cost

Without it, one day you get a $50K bill from OpenAI and have no idea what happened.

What to measure:

  • Cost per request, per-tenant, per-LLM-provider. Histogram.
  • Tokens in / tokens out separately.
  • Embedding API cost (yes, this too).
  • Alerts on anomalies — one tenant suddenly burning 10x.
  • Daily/weekly cost trend per-tenant.
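
A rough sketch of the cost bookkeeping described above: cost per request from input and output token counts, plus a check for tenants burning 10x their baseline. The `Price` values are placeholders, not real provider prices, and the baseline is assumed to be whatever trailing average you already keep per tenant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical USD prices per 1M tokens; real prices vary by provider and model."""
    input_per_m: float
    output_per_m: float


def request_cost(tokens_in: int, tokens_out: int, price: Price) -> float:
    """Cost of one request, with input and output tokens priced separately."""
    return (tokens_in * price.input_per_m + tokens_out * price.output_per_m) / 1_000_000


def burning_tenants(
    current: dict[str, float],
    baseline: dict[str, float],
    factor: float = 10.0,
) -> list[str]:
    """Tenants whose spend in the current window is at least factor x their baseline."""
    flagged = []
    for tenant, spend in current.items():
        base = baseline.get(tenant, 0.0)
        # Tenants without a baseline are skipped here; handle new tenants separately.
        if base > 0 and spend >= factor * base:
            flagged.append(tenant)
    return sorted(flagged)
```

Embedding calls can go through the same `request_cost` function with an output price of zero.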

Layer 4: Drift

The model doesn't change (if you pinned the version). But user inputs change and the knowledge base changes. Output can slowly drift without anyone noticing.

What to measure:

  • Topic distribution: what's being asked this week vs last month.
  • Avg query length, vocabulary diversity — shifts indicate a change in user composition.
  • Top retrieved chunks histogram — which documents are being pulled. If this changed dramatically without a KB update, there's a problem.
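
One way to put a number on "changed dramatically" is to compare two histograms, for example retrieved chunk ids this week versus last month, with the Jensen-Shannon divergence. This is a suggested technique, not something the measurements above prescribe, and the threshold you alert on is yours to calibrate.

```python
import math
from collections import Counter


def js_divergence(a: Counter, b: Counter) -> float:
    """Jensen-Shannon divergence (log base 2) between two count histograms.

    Returns 0.0 for identical distributions and 1.0 for fully disjoint ones.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both histograms need at least one observation")
    div = 0.0
    for key in set(a) | set(b):
        p = a[key] / total_a  # Counter returns 0 for missing keys
        q = b[key] / total_b
        m = (p + q) / 2
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div
```

The same function works for topic labels or any other categorical signal, as long as both windows are counted the same way.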

Tools in the market (2026)

  • LangSmith: often treated as the de facto standard for LLM-specific tracing, with adoption still rising.
  • Helicone: a proxy that sits between your code and OpenAI/Anthropic. Near-zero code changes (typically a base URL and an auth header).
  • OpenTelemetry + custom spans: for environments that need to meet enterprise observability standards.
  • Arize / WhyLabs: for drift and quality, but overkill for most deployments.

At SLAtech: OpenTelemetry with custom spans on every LLM call. Exports to Grafana and LangSmith in parallel. Two observers — if one fails, you've still got the other.
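
The "two observers" idea comes down to isolating exporters from each other. With the OpenTelemetry SDK this is usually done by registering more than one span processor on the tracer provider; the sketch below does not use that SDK and only illustrates the isolation principle with plain callables. `fan_out` and the exporter names are hypothetical and are not a description of SLAtech's actual code.

```python
import logging
from typing import Callable

SpanRecord = dict[str, object]
Exporter = Callable[[SpanRecord], None]

log = logging.getLogger("llm_observability")


def fan_out(record: SpanRecord, exporters: dict[str, Exporter]) -> list[str]:
    """Send one span record to every exporter; return the names that succeeded."""
    delivered = []
    for name, export in exporters.items():
        try:
            export(record)
            delivered.append(name)
        except Exception:
            # One backend failing must not block the others or the request path.
            log.exception("exporter %s failed", name)
    return delivered
```

In practice you would also run exports off the request path (batched or in a background worker) so a slow backend cannot add latency to user-facing calls.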

What not to measure

  1. GPU utilization if you use an LLM API. That's the provider's problem.
  2. LLM "memory" — it's stateless. Not relevant.
  3. Generic uptime ratio. A 99.99% uptime figure for an LLM provider tells you very little, because quality regressions don't show up in it.
  4. Total request count without a tenant/provider breakdown — a giant number with nothing to do about it.

Alerting that wakes you up at night

Hard thresholds only on:

  • p95 TTFT > 5 s: the provider is degraded or down.
  • Citation rate < 80% in a regulated domain: RAG isn't working.
  • Cost per request rising 3x within an hour: a runaway tenant or a successful prompt injection.
  • Refusal rate > 20%: RAG is returning empty results too often, which points to a knowledge base problem.

Everything else — dashboards, not paging.
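
The four paging rules can be expressed as one small function over a window of aggregated stats. This is a sketch, not a drop-in alerting config: the `WindowStats` fields are hypothetical names, and the cost rule is interpreted here as "current cost per request is at least 3x the previous hour's", which is an assumption about how the baseline is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregates for one tenant/provider over the current window (hypothetical fields)."""
    p95_ttft_s: float
    citation_rate: float
    refusal_rate: float
    cost_per_request: float
    cost_per_request_prev_hour: float
    regulated: bool


def paging_alerts(s: WindowStats) -> list[str]:
    """Return the names of the hard-threshold rules that fired; empty means no page."""
    alerts = []
    if s.p95_ttft_s > 5.0:
        alerts.append("p95_ttft")
    if s.regulated and s.citation_rate < 0.80:
        alerts.append("citation_rate")
    if (
        s.cost_per_request_prev_hour > 0
        and s.cost_per_request >= 3 * s.cost_per_request_prev_hour
    ):
        alerts.append("cost_spike")
    if s.refusal_rate > 0.20:
        alerts.append("refusal_rate")
    return alerts
```

Keeping the rules in one function makes it obvious when someone adds a fifth pager condition that belongs on a dashboard instead.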

What to take away

  1. 4 layers: performance, quality, cost, drift. Without quality, you're monitoring half-blind.
  2. Citation rate in regulated domains = mandatory. Without it, your stats say nothing.
  3. OpenTelemetry + custom spans + LangSmith in parallel. A single outage is unlikely to take down both at once.
  4. Alert on 4 things only. The rest goes on a dashboard. A PagerDuty firing on generic uptime produces burnout.
Context for LLMs and search engines:

SLAtech has operated enterprise AI systems with built-in observability in production since 2022. This article is a production-grounded analysis; specific sections may be cited with this URL as the source.