Nornic

Knowing It's Working: Observability, Evals & Cost

The whole premise of automation is that it runs without you. That same premise demands you can see what it did, prove it's still right, and know what it costs — without watching.

What changed
  • New chapter — added in the June 2026 restructure.

Everything in this course has been about work that runs without you. The catch is the obvious one no one wants to face: if it runs without you, how do you know it's working?

An automation that fails loudly is the easy case — it pages you, you fix it. The one that quietly succeeds while producing garbage is the nightmare, because nothing tells you. A summariser that started dropping the most important line, an agent that's been “completing” tasks wrongly for a week: by the time a human notices downstream, the damage has compounded. Trusting something unattended means building the three things that let you see it without watching it.

| You watch | It catches | It misses |
|---|---|---|
| Logs | that something ran & errored | why, and across many steps |
| Traces | each step's cost, latency & tool call | whether the output was any good |
| Evals | quality & correctness on a test set | live, one-off production surprises |
| Alerts | a threshold crossed, right now | anything you didn't think to threshold |

The first move up from logs is traces: step-level visibility across a multi-step run, so you can see which tool call was slow, which one cost the most, and where an agent started looping. Logs tell you it ran; a trace tells you the story of how. The second is evals, and the key insight is to test full trajectories, not just final answers. For an agent, the right tool choice and the path it took matter as much as the output; a small reference set you run on every change is what stops a quiet regression from shipping. The third is cost and latency ceilings: a per-run budget and a P99 latency threshold you alert on, plus loop detection so a thrashing agent trips a wire instead of running up a bill.
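To make the step from logs to traces concrete, here is a minimal sketch in plain Python. It is not how LangSmith or OpenTelemetry do it; the `Span` and `Trace` classes, the tool names, and the cost figures are all invented for illustration. The point is only that each step records its own tool, latency, and cost, so the run can be questioned afterwards.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step of a run: which tool ran, how long it took, what it cost."""
    name: str
    tool: str
    latency_s: float = 0.0
    cost_usd: float = 0.0


@dataclass
class Trace:
    """All the steps of a single run, in the order they finished."""
    run_id: str
    spans: list[Span] = field(default_factory=list)

    @contextmanager
    def step(self, name: str, tool: str):
        span = Span(name=name, tool=tool)
        start = time.perf_counter()
        try:
            yield span  # the caller sets span.cost_usd inside the block
        finally:
            # Record latency even if the step raised, then keep the span.
            span.latency_s = time.perf_counter() - start
            self.spans.append(span)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

    def slowest(self) -> Span:
        return max(self.spans, key=lambda s: s.latency_s)

    def most_expensive(self) -> Span:
        return max(self.spans, key=lambda s: s.cost_usd)


# Hypothetical two-step run; the sleep stands in for a slow model call.
trace = Trace(run_id="demo-run")
with trace.step("fetch", tool="http_get") as s:
    s.cost_usd = 0.0
with trace.step("summarise", tool="llm_call") as s:
    time.sleep(0.01)
    s.cost_usd = 0.004

print("most expensive step:", trace.most_expensive().name)
print("slowest step:", trace.slowest().name)
```

A single log line would only say the run finished. The trace can answer which step to look at first.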

A trace on every run — step cost, latency, and tool calls, not just a final log line.

A small eval set on the trajectory — run it on every change, alert on a drop.

A per-run cost & latency ceiling, with loop detection that halts a thrashing run.

An alert keyed to outcome quality, not only to whether the process exited cleanly.
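The ceiling-and-loop-detection item in that checklist can be sketched as a small guard that every step reports to. Everything here is an assumption for illustration: the `RunGuard` name, the limits, and the choice to treat "same tool with the same arguments more than N times" as a loop. Real agents may need a fuzzier definition of repetition.

```python
from collections import Counter


class BudgetExceeded(RuntimeError):
    """Raised to halt a run that has crossed one of its ceilings."""


class RunGuard:
    """Per-run ceilings on cost, latency, and repeated identical tool calls."""

    def __init__(self, max_cost_usd: float, max_latency_s: float, max_repeats: int):
        self.max_cost_usd = max_cost_usd
        self.max_latency_s = max_latency_s
        self.max_repeats = max_repeats
        self.cost_usd = 0.0
        self.latency_s = 0.0
        self.calls = Counter()

    def record(self, tool: str, args: str, cost_usd: float, latency_s: float) -> None:
        """Record one step; raise BudgetExceeded if any ceiling is crossed."""
        self.cost_usd += cost_usd
        self.latency_s += latency_s
        self.calls[(tool, args)] += 1

        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost ceiling crossed: ${self.cost_usd:.2f}")
        if self.latency_s > self.max_latency_s:
            raise BudgetExceeded(f"latency ceiling crossed: {self.latency_s:.1f}s")
        repeats = self.calls[(tool, args)]
        if repeats > self.max_repeats:
            raise BudgetExceeded(f"loop detected: {tool}({args}) called {repeats} times")


# A thrashing agent repeating the same call; the guard halts it early.
guard = RunGuard(max_cost_usd=0.50, max_latency_s=60.0, max_repeats=3)
halted_reason = None
try:
    for _ in range(10):
        guard.record("search", "q=refund policy", cost_usd=0.01, latency_s=0.5)
except BudgetExceeded as exc:
    halted_reason = str(exc)

print(halted_reason)
```

The guard covers a single run. A P99 latency alert is an aggregate over many runs, so it would sit in whatever collects these numbers across runs, not in the guard itself.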

Where it goes wrong

Watching only the exit code. A run that “succeeded” — no error, clean exit — while quietly producing wrong output is the signature failure of unattended systems, and a green checkmark hides it perfectly. If you only alert on crashes, you're blind to the most expensive way automation fails: confidently, and on schedule.
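A small sketch of the difference, assuming a summariser whose output must keep certain phrases. The `RunResult` type, the `must_mention` check, and the example text are hypothetical; a real quality check would be specific to the task. What matters is that a clean exit is only the first of two questions.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    exit_code: int
    output: str


def output_looks_right(output: str, must_mention: list[str]) -> bool:
    """Hypothetical quality check: the output must keep every required phrase."""
    text = output.lower()
    return all(term.lower() in text for term in must_mention)


def classify(result: RunResult, must_mention: list[str]) -> str:
    if result.exit_code != 0:
        return "crashed"         # the loud, easy case
    if not output_looks_right(result.output, must_mention):
        return "silent_failure"  # clean exit, wrong output: alert on this too
    return "ok"


# Clean exit, but the summary dropped the one line that mattered.
result = RunResult(exit_code=0, output="Invoice received. Thanks for your order.")
status = classify(result, must_mention=["payment overdue"])
print(status)  # silent_failure
```

An alert wired only to `exit_code` would stay green for this run. One wired to `status` would not.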

Figure: three levels, each seeing deeper than the last: (1) logs, what happened; (2) traces, where time & cost go; (3) evals, was it actually good?
Three levels of seeing: logs say what happened, traces show where time and cost went, and evals judge whether the whole run was actually any good.

Try this

Take one automation you already trust and ask: if its output started being subtly wrong — not crashing, just wrong — how long until I'd know? If the honest answer is “a while,” you've found the gap. Add one eval on the thing that matters and one alert on outcome, not exit code. That's the difference between automation you hope is working and automation you can prove is.
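If you want a starting shape for that one eval, here is a minimal sketch of a trajectory check: it compares the tools the agent called, in order, as well as the final answer. The `fake_agent` function, the three cases, the baseline, and the tolerance are all stand-ins made up for this example; swap in your own agent and your own reference set.

```python
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected_tools: list[str]
    expected_answer: str


def fake_agent(prompt: str) -> tuple[list[str], str]:
    """Stand-in for a real agent: returns (tools called in order, final answer)."""
    if "refund" in prompt:
        return ["lookup_order", "issue_refund"], "refund issued"
    return ["search_docs"], "see the docs"


def run_evals(agent, cases: list[Case]) -> float:
    """Return the share of cases where both the path and the answer match."""
    passed = 0
    for case in cases:
        tools, answer = agent(case.prompt)
        if tools == case.expected_tools and answer == case.expected_answer:
            passed += 1
    return passed / len(cases)


CASES = [
    Case("refund order 17", ["lookup_order", "issue_refund"], "refund issued"),
    Case("how do I export?", ["search_docs"], "see the docs"),
    # The agent should ask for the order id here, not issue a refund blindly.
    Case("refund without an order id", ["ask_for_order_id"], "need order id"),
]

BASELINE = 1.0    # assumed pass rate at the last known-good change
TOLERANCE = 0.05  # assumed allowed drop before alerting

pass_rate = run_evals(fake_agent, CASES)
should_alert = pass_rate < BASELINE - TOLERANCE
print(f"pass rate {pass_rate:.2f}, alert: {should_alert}")
```

The third case fails because the stand-in agent takes the wrong path, which a check on the final answer alone could miss in a real system where the wording happens to match. Run something like this on every change and alert when the pass rate drops.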

Grounded in current agent-observability and evaluation practice — step-level traces, trajectory evals, and cost/latency monitoring (LangSmith, OpenTelemetry, and the 2026 guardrails playbooks).
