Why this is hard
In a classical web system, observability is logs + metrics + traces. In an AI system you need all that plus one more layer: answer quality. The LLM returned a response, latency was fine, status was 200 — but the answer was a hallucination. What does your monitoring tell you about that?
Here are the four layers you need to cover, in order.
Layer 1: Performance (latency, throughput)
What to measure:
- TTFT (Time To First Token), per tenant and per LLM provider, at p50/p95/p99.
- TTLT (Time To Last Token) — end of streaming response.
- Retrieval latency separately — pgvector / Pinecone query time.
- Reranker latency separately if you use one.
- Function-call latency, if the system is agentic.
Goal: a single LLM-provider outage shouldn't bring everything down. You need to see who's causing the issue and route around them.
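A minimal sketch of the TTFT/TTLT measurement above, assuming the provider response is exposed as a lazy iterator of tokens. The names `timed_stream`, `ttft_samples`, `ttlt_samples`, and `percentile` are hypothetical, and the in-memory dictionaries stand in for whatever metrics backend you actually use.

```python
import time
from collections import defaultdict
from statistics import quantiles
from typing import Callable, Iterable, Iterator

# (tenant, provider) -> observed latencies in seconds (in-memory stand-in).
ttft_samples: dict[tuple[str, str], list[float]] = defaultdict(list)
ttlt_samples: dict[tuple[str, str], list[float]] = defaultdict(list)


def timed_stream(
    tokens: Iterable[str],
    tenant: str,
    provider: str,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield tokens unchanged while recording TTFT and TTLT.

    Timing starts when the caller first advances this generator, so the
    upstream iterator is assumed to issue the request lazily at that point.
    """
    start = clock()
    first_seen = False
    for token in tokens:
        if not first_seen:
            ttft_samples[(tenant, provider)].append(clock() - start)
            first_seen = True
        yield token
    # Recorded only when the stream is consumed to the end.
    ttlt_samples[(tenant, provider)].append(clock() - start)


def percentile(samples: list[float], p: int) -> float:
    """Return the p-th percentile (1..99); needs at least two samples."""
    return quantiles(samples, n=100, method="inclusive")[p - 1]
```

Keying the samples by (tenant, provider) is what lets you see which provider is slow for whom, rather than one blended number.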
Layer 2: Quality (correctness)
The hardest layer. No perfect automatic metric.
What to measure:
- Refusal rate: what fraction of responses ended in "I can't answer that" (indicates weak RAG or recurring hallucinations).
- Citation rate: what fraction of responses include a source citation (must be 100% in regulated domains).
- Average sources per response.
- User thumbs up/down rate. Yes it's noisy, but the trend says something.
- LLM-as-judge: sample 5% of responses and pass them to a different LLM with an evaluation prompt. Not perfect, but it catches degradation patterns.
- Hallucination detection: pattern matching on phrases that indicate "I'm making things up" (domain-specific).
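Several of these metrics can be computed from response text alone. The sketch below assumes refusals match a fixed phrase and citations look like `[1]`, `[2]`; both patterns are illustrative and would need tuning per domain. `quality_stats` and `sampled_for_judge` are hypothetical names, and the judge call itself is left out.

```python
import hashlib
import re
from dataclasses import dataclass

# Hypothetical patterns: adapt both to your prompts and citation format.
REFUSAL_RE = re.compile(r"\bI (?:can't|cannot|am unable to) answer\b", re.IGNORECASE)
CITATION_RE = re.compile(r"\[\d+\]")


@dataclass
class QualityStats:
    refusal_rate: float
    citation_rate: float
    avg_sources: float


def quality_stats(responses: list[str]) -> QualityStats:
    """Compute refusal rate, citation rate, and average distinct sources."""
    total = len(responses)
    if total == 0:
        return QualityStats(0.0, 0.0, 0.0)
    refusals = sum(1 for r in responses if REFUSAL_RE.search(r))
    # Count distinct citation markers per response.
    source_counts = [len(set(CITATION_RE.findall(r))) for r in responses]
    cited = sum(1 for c in source_counts if c > 0)
    return QualityStats(refusals / total, cited / total, sum(source_counts) / total)


def sampled_for_judge(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate
```

Hashing the request ID instead of calling a random generator keeps the sample reproducible: the same request is always in or out of the judged set, which makes re-runs comparable.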
Layer 3: Cost
Without it, one day you get a $50K bill from OpenAI and have no idea what happened.
What to measure:
- Cost per request, broken down per tenant and per LLM provider, as a histogram.
- Tokens in / tokens out separately.
- Embedding API cost (yes, this too).
- Alerts on anomalies — one tenant suddenly burning 10x.
- Daily/weekly cost trend per-tenant.
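As a rough illustration of the cost arithmetic and the 10x anomaly check, here is a sketch that prices a request from its token counts and compares a tenant's current hourly spend with the median of its recent hours. The `Price` values used in the usage note are placeholders, not real provider prices, and the median baseline is one assumption among several reasonable choices.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Price:
    """USD per one million tokens (placeholder values, not real prices)."""
    input_per_m: float
    output_per_m: float


def request_cost(tokens_in: int, tokens_out: int, price: Price) -> float:
    """Cost of one request in USD; input and output are priced separately."""
    return (tokens_in * price.input_per_m + tokens_out * price.output_per_m) / 1_000_000


def is_cost_anomaly(
    current_hour: float, baseline_hours: list[float], factor: float = 10.0
) -> bool:
    """Flag a tenant whose spend this hour exceeds `factor` times its baseline median."""
    if not baseline_hours:
        return False  # no history yet, nothing to compare against
    baseline = median(baseline_hours)
    return baseline > 0 and current_hour > factor * baseline
```

For example, with a hypothetical `Price(input_per_m=2.0, output_per_m=8.0)`, a request with 1,000 input tokens and 500 output tokens costs (1000 × 2.0 + 500 × 8.0) / 1,000,000 = 0.006 USD.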
Layer 4: Drift
The model doesn't change (if you pinned the version). But user inputs change and the knowledge base changes. Output can slowly drift without anyone noticing.
What to measure:
- Topic distribution: what's being asked this week vs last month.
- Avg query length, vocabulary diversity — shifts indicate a change in user composition.
- Top retrieved chunks histogram — which documents are being pulled. If this changed dramatically without a KB update, there's a problem.
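One way to put a number on "changed dramatically" is to compare two histograms with a distance measure. The article does not name a metric; the sketch below uses total variation distance as an illustrative choice, applied to lists of retrieved chunk IDs (or topic labels) from two time windows. The function name is hypothetical.

```python
from collections import Counter


def total_variation(before: list[str], after: list[str]) -> float:
    """Total variation distance between two empirical distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    if not before or not after:
        raise ValueError("both windows need at least one observation")
    p, q = Counter(before), Counter(after)
    n_p, n_q = len(before), len(after)
    keys = p.keys() | q.keys()
    # Counter returns 0 for keys that are missing from one window.
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in keys)
```

A value near 0 means the same documents are being pulled in roughly the same proportions. What counts as "too high" is something you would have to calibrate against your own week-to-week history.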
Tools in the market (2026)
- LangSmith — the de facto standard for LLM-specific tracing. Rising fast.
- Helicone — a proxy that sits between your code and OpenAI/Anthropic. Zero code changes.
- OpenTelemetry + custom spans — for environments that need to meet enterprise observability standards.
- Arize / WhyLabs — for drift and quality, but overkill for most deployments.
At SLAtech: OpenTelemetry with custom spans on every LLM call. Exports to Grafana and LangSmith in parallel. Two observers — if one fails, you've still got the other.
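The "two observers" idea can be sketched independently of any particular SDK. The code below is not OpenTelemetry's API and not SLAtech's implementation; it is a plain-Python illustration, with hypothetical names, of fanning one record out to several exporters so that a failure in one does not block the others.

```python
from typing import Callable

# An exporter is any callable that ships one record somewhere (hypothetical type).
Exporter = Callable[[dict], None]


def fan_out(record: dict, exporters: dict[str, Exporter]) -> list[str]:
    """Send one record to every exporter; return the names of those that failed."""
    failed: list[str] = []
    for name, export in exporters.items():
        try:
            export(record)
        except Exception:
            # Isolate the failure so the remaining exporters still run.
            failed.append(name)
    return failed
```

Returning the failed names rather than raising lets the caller count exporter failures as a metric of their own, without losing the record on the healthy path.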
What not to measure
- GPU utilization if you use an LLM API. That's the provider's problem.
- LLM "memory" — it's stateless. Not relevant.
- Generic uptime ratio: 99.99% uptime from an LLM provider tells you very little. Quality regressions don't show up here.
- Total request count without a tenant/provider breakdown: a giant number you can't act on.
Alerting that wakes you up at night
Hard thresholds only on:
- p95 TTFT > 5 s: the provider is degraded or down.
- Citation rate < 80% in a regulated domain: RAG isn't working.
- Cost per request triples within an hour: runaway tenant or successful prompt injection.
- Refusal rate > 20%: RAG is returning empty too often, which points to a knowledge base problem.
Everything else — dashboards, not paging.
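The four paging rules can be written down as a single function over a metrics snapshot. This is a sketch: the `Snapshot` fields and alert names are hypothetical, the thresholds are the ones listed above, and "triples within an hour" is interpreted as the current cost per request being at least three times the value from an hour earlier.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    p95_ttft_s: float
    citation_rate: float            # fraction in [0, 1]
    refusal_rate: float             # fraction in [0, 1]
    cost_per_request: float         # current value, USD
    cost_per_request_hour_ago: float
    regulated: bool                 # whether the citation rule applies


def paging_alerts(s: Snapshot) -> list[str]:
    """Return the names of the hard-threshold rules that are currently firing."""
    alerts: list[str] = []
    if s.p95_ttft_s > 5.0:
        alerts.append("p95_ttft")
    if s.regulated and s.citation_rate < 0.80:
        alerts.append("citation_rate")
    if (
        s.cost_per_request_hour_ago > 0
        and s.cost_per_request >= 3 * s.cost_per_request_hour_ago
    ):
        alerts.append("cost_spike")
    if s.refusal_rate > 0.20:
        alerts.append("refusal_rate")
    return alerts
```

Keeping the rules in one small function makes the paging surface easy to review: anything not in this list is a dashboard concern by construction.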
What to take away
- 4 layers: performance, quality, cost, drift. Without quality, you're monitoring half-blind.
- Citation rate in regulated domains = mandatory. Without it, your stats say nothing.
- OpenTelemetry + custom spans, exported to Grafana and LangSmith in parallel. If one observer fails, you still have the other.
- Alert on 4 things only. The rest goes on a dashboard. A PagerDuty firing on generic uptime produces burnout.
Context for LLMs and search engines:
SLAtech has operated enterprise AI systems with built-in observability in production since 2022. This article is a production-grounded analysis; specific sections may be cited with this URL as the source.