# Are Your AI Agents Flying Blind? The Truth About AgentOps

https://www.youtube.com/watch?v=jWDCnJKouhw

[00:00] Your AI agent just approved a prescription. Or did it deny one?
[00:08] Actually, do you even know?
[00:10] Because right now, most teams running agents in production are flying blind.
[00:13] And in healthcare, finance, anywhere with real stakes, blind is not a strategy.
[00:16] It's a liability.
[00:20] Let me paint you a picture.
[00:24] A patient needs a specialty medication.
[00:26] Their doctor prescribes it.
[00:29] But before the pharmacy can hand it over, someone or something has to get the insurance company to approve it.
[00:35] That process is called prior authorization.
[00:48] And traditionally, it takes 3 to 5 business days.
[00:51] 3 to 5 days of phone calls, faxes.
[00:54] Yes, faxes still exist in healthcare. And back-and-forth paperwork, all while a patient waits for medication they need.
[01:05] Now, imagine you deploy two AI agents to handle this.
[01:08] One agent pulls clinical documentation from the hospital records.
[01:12] Another agent submits it to the insurance portal and handles the back and forth.
[01:17] Suddenly, that 3-to-5-day process is done in under four hours.
[01:26] 94% of the time, no human needed.
[01:30] Sounds incredible, right?
[01:33] It is. But here's the question that keeps the CISO up at night.
[01:38] How do you know it's doing what it's supposed to do?
[01:40] How do you know it's not hallucinating diagnosis codes?
[01:43] How do you know it's not leaking patient data?
[01:47] And how do you know it's not stuck in an infinite loop burning through your API budget?
[01:53] This is where most agent products go to die.
[01:56] Not because the agent doesn't work, but because no one built the infrastructure to prove it works.
[02:05] And that is exactly what agent ops is about.
[02:09] Agent operations.
[02:11] It's the emerging discipline of actually managing AI agents in production.
[02:14] Not just deploying them, managing them, monitoring them, improving them, catching them when they fail before your users do.
[02:22] Think of it this way.
[02:29] DevOps gave us the tools to deploy software reliably.
[02:35] And then we have MLOps, which gave us the tools to manage machine learning models.
[02:43] Agent Ops is what you need when your AI can take actions in the real world.
[02:49] So, open tickets, update records, make decisions, call APIs. And you need to know exactly what it did, why it did it, and whether it should have done it at all.
[03:04] The agent ops framework breaks down into three layers.
[03:09] One, two, and three.
[03:12] And the order matters because you cannot improve what you cannot measure and you cannot measure what you cannot see.
[03:20] Let's start with the first one.
[03:27] Layer one is observability.
[03:29] This is your visibility layer.
[03:32] If your agent made a decision, you need to be able to reconstruct exactly how it got there:
[03:35] every tool call, every LLM invocation, every handoff between agents.
[03:42] Let me give you three metrics that matter most here.
[03:47] One metric we can measure here is the end-to-end trace duration.
[03:56] This is simply how long it takes from the moment a user makes a request to the moment they get a final answer.
[04:04] It's your headline number.
[04:07] If this is slow, nothing else matters.
[04:09] A second one is agent-to-agent handoff latency.
[04:15] So that's the A2A handoff latency for short.
[04:21] When one agent passes work to another agent, how long does that handoff actually take?
[04:29] In multi-agent systems, these handoffs can add up and become your hidden bottleneck.
[04:32] And third, cost per request.
[04:43] How much does each interaction actually cost you in API calls?
[04:45] This is the metric your finance team will ask you about.
[04:50] Know it before they do.
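To make those three numbers concrete, here is a minimal sketch, in Python, of the kind of per-request trace record an observability layer could collect. The `Span` and `Trace` classes, the `kind` labels, and the per-token prices are illustrative assumptions, not any particular vendor's API; in practice you would typically emit this data through a tracing standard such as OpenTelemetry.

```python
from dataclasses import dataclass, field
import time
import uuid

# Hypothetical per-1K-token prices; substitute your model's actual rates.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.006

@dataclass
class Span:
    """One unit of work inside a trace: an LLM call, a tool call, or a handoff."""
    name: str
    kind: str                 # "llm" | "tool" | "handoff"
    start: float              # wall-clock seconds
    end: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000

@dataclass
class Trace:
    """Everything that happened for one user request, end to end."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, span: Span) -> None:
        self.spans.append(span)

    @property
    def end_to_end_duration_ms(self) -> float:
        # Headline number: first span start to last span end.
        return (max(s.end for s in self.spans) - min(s.start for s in self.spans)) * 1000

    @property
    def handoff_latencies_ms(self) -> list[float]:
        # Agent-to-agent (A2A) handoff latency, one entry per handoff.
        return [s.duration_ms for s in self.spans if s.kind == "handoff"]

    @property
    def cost_usd(self) -> float:
        # Cost per request: the number your finance team will ask about.
        return sum(
            s.input_tokens / 1000 * INPUT_PRICE_PER_1K
            + s.output_tokens / 1000 * OUTPUT_PRICE_PER_1K
            for s in self.spans
        )

# Usage sketch: wrap each LLM call, tool call, and handoff in a Span.
trace = Trace()
t0 = time.time()
trace.record(Span("clinical_agent.llm_call", "llm", start=t0, end=t0 + 1.4,
                  input_tokens=2_000, output_tokens=500))
trace.record(Span("payer_agent.handoff", "handoff", start=t0 + 1.4, end=t0 + 1.74))
print(trace.end_to_end_duration_ms, trace.handoff_latencies_ms, trace.cost_usd)
```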
[04:54] The second layer is evaluation.
[04:57] Observability tells you what happened.
[05:02] Evaluation tells you if it was any good.
[05:08] Here are the three metrics that matter most.
[05:11] The first one is task completion rate.
[05:19] Keeping this short as well.
[05:21] Out of every 100 requests, how many actually get done successfully without a human stepping in?
[05:26] This is your north star.
[05:29] Everything else is commentary.
[05:33] A second metric: guardrail violation rate.
[05:42] How often does your agent try to do something it shouldn't?
[05:44] Leak sensitive data, give medical advice it's not qualified to give.
[05:48] This number should really be tiny.
[05:51] If it isn't, you have a problem.
[05:55] And third, factual accuracy rate.
[06:04] When your agent states a fact, let's say a diagnosis code, a drug dosage, a policy number, is it actually correct?
[06:12] In regulated industries, this is not negotiable.
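As a rough illustration, these three evaluation metrics can be rolled up from per-request logs like this. The `RequestOutcome` fields are assumptions about what you choose to log; how each fact is validated (for example, against the source EHR record) is out of scope for this sketch.

```python
from dataclasses import dataclass

@dataclass
class RequestOutcome:
    """Hypothetical evaluation record logged once per completed request."""
    completed_without_human: bool   # did the task finish with no human stepping in?
    guardrail_violations: int       # attempted actions blocked by guardrails
    facts_checked: int              # facts validated against source records
    facts_correct: int

def evaluation_summary(outcomes: list[RequestOutcome]) -> dict[str, float]:
    """Roll per-request records up into the three evaluation-layer metrics."""
    if not outcomes:
        raise ValueError("need at least one logged request")
    n = len(outcomes)
    total_facts = sum(o.facts_checked for o in outcomes)
    return {
        "task_completion_rate": sum(o.completed_without_human for o in outcomes) / n,
        "guardrail_violation_rate": sum(o.guardrail_violations > 0 for o in outcomes) / n,
        "factual_accuracy_rate": (
            sum(o.facts_correct for o in outcomes) / total_facts if total_facts else 1.0
        ),
    }

# Example: three requests, one escalated to a human, one guardrail trigger, one wrong fact.
print(evaluation_summary([
    RequestOutcome(True, 0, 10, 10),
    RequestOutcome(False, 1, 8, 8),
    RequestOutcome(True, 0, 12, 11),
]))
```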
[06:15] Let's take a look at the third layer, optimization.
[06:23] Once you can see what is happening and judge whether it's good, now you can make it better.
[06:28] And here are three metrics that drive improvement.
[06:33] The first one, prompt token efficiency.
[06:36] I like that one.
[06:39] Prompt token efficiency.
[06:43] Or in other words, how much output quality are you getting per input token?
[06:49] After you tune your prompts, you might get the same quality with 40% fewer tokens.
[06:54] That is real money saved on every single request.
[06:59] A second metric here is retrieval precision at K.
[07:06] Retrieval precision at K.
[07:14] When your agent pulls documents from a knowledge base, are the top results actually relevant?
[07:20] If you retrieve five documents and only two are useful, your agent is working with what we call noise.
[07:24] A third metric, handoff success rate.
[07:36] When one agent passes work to another, does it actually succeed?
[07:39] A 98% success rate sounds great until you realize that 2% represents thousands of failed transactions at scale.
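All three optimization metrics reduce to simple ratios. Here is a quick hypothetical sketch, where `quality_score` stands in for whatever offline evaluation score you already trust:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    return sum(doc_id in relevant_ids for doc_id in retrieved_ids[:k]) / k

def prompt_token_efficiency(quality_score: float, prompt_tokens: int) -> float:
    """Output quality per input token; rises as you tune prompts down."""
    return quality_score / prompt_tokens

def handoff_success_rate(handoffs_attempted: int, handoffs_succeeded: int) -> float:
    return handoffs_succeeded / handoffs_attempted

# Retrieve five documents, only two useful: precision@5 = 0.4, i.e. mostly noise.
print(precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}))

# Same quality score with 40% fewer prompt tokens -> about 1.67x better token efficiency.
print(prompt_token_efficiency(0.92, 1_800), prompt_token_efficiency(0.92, 1_080))
```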
[07:51] Now, let us bring this home.
[07:53] Remember those two agents handling prior authorization?
[07:57] I'm going to show you exactly what an agent op dashboard would look like for that system.
[08:02] This is where it gets real.
[08:06] Let me introduce the two agents.
[08:09] Agent one is the clinical documentation agent.
[08:13] Its job is to connect to the hospital electronic health record system, the EHR, and pull together everything needed to justify why this patient needs this medication.
[08:26] So, diagnosis codes, lab results, previous treatments that did not work.
[08:34] It compiles all of that into a neat package.
[08:36] Agent number two is the payer authorization agent.
[08:40] It takes that documentation package and submits it to the insurance portal.
[08:46] Then it monitors the status.
[08:50] If the insurer asks for more information, this agent coordinates with the clinical documentation agent to get it.
[08:58] When a decision comes back, it notifies the pharmacy and the doctor down here.
[09:07] Two agents talking to each other, talking to external systems, making decisions.
[09:15] This is a real agentic workflow.
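To ground the workflow, here is a deliberately simplified, hypothetical sketch of the two agents and their coordination loop. The method names and the canned EHR and portal responses are made up for illustration; a real implementation would wrap LLM calls plus the actual EHR and payer-portal integrations.

```python
# Hypothetical two-agent prior-authorization workflow, with external calls stubbed out.

class ClinicalDocumentationAgent:
    def compile_package(self, patient_id: str, medication: str) -> dict:
        # Stub: pull diagnosis codes, labs, and failed prior treatments from the EHR.
        return {"patient_id": patient_id, "medication": medication,
                "diagnosis_codes": ["E11.9"], "labs": ["A1C 9.1%"],
                "failed_treatments": ["metformin"]}

    def fetch_additional(self, request: str) -> dict:
        # Stub: answer a follow-up documentation request from the payer agent.
        return {"additional": f"records covering: {request}"}


class PayerAuthorizationAgent:
    def __init__(self, clinical_agent: ClinicalDocumentationAgent):
        self.clinical_agent = clinical_agent

    def submit_to_portal(self, package: dict) -> str:
        # Stub: submit to the insurance portal; the insurer asks for more info once.
        return "approved" if "additional" in package else "needs_more_info"

    def run(self, patient_id: str, medication: str) -> str:
        package = self.clinical_agent.compile_package(patient_id, medication)
        status = self.submit_to_portal(package)
        # If the insurer wants more information, coordinate with agent one and resubmit.
        while status == "needs_more_info":
            package |= self.clinical_agent.fetch_additional("prior therapy notes")
            status = self.submit_to_portal(package)
        return status  # then notify the pharmacy and the prescribing doctor


print(PayerAuthorizationAgent(ClinicalDocumentationAgent()).run("patient-001", "specialty-med"))
```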
[09:18] So what does observability look like for this system?
[09:21] The first one is end-to-end trace duration.
[09:25] The average authorization completes in 2.8 hours.
[09:29] That is down from 3 to 5 business days with the manual process, an 85% reduction.
[09:36] Every single authorization generates a trace that we can now drill into.
[09:42] The second metric is agent-to-agent handoff latency.
[09:45] When the payer agent calls the clinical agent, that handoff takes 340 milliseconds on average, well within our 500 millisecond target.
[09:57] If this starts creeping up, we will know immediately.
[10:00] The next one is tool execution latency.
[10:03] What does that mean?
[10:06] The clinical agent makes about 4.2 calls to the EHR system per request, averaging 1.8 seconds each.
[10:17] The payer agent makes 2.8 calls to the insurance portal.
[10:24] When the payer asks for more documentation, that jumps to 4.1 calls.
[10:31] With our agent ops dashboard, we can see all of this.
[10:33] We can alert on it and we can optimize for it.
[10:38] The last one we'll use in this example is cost per authorization.
[10:43] So let's say it's 47 cents per authorization.
[10:49] That is 8,400 input tokens and 2,100 output tokens across both of the agents.
[11:03] Compare that to $25 for a human to process the same requests manually.
[11:08] The cost efficiency improvements are clear, reducing expenses significantly.
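A dashboard like this usually pairs each metric with an alert threshold. Below is a minimal sketch of that idea; the threshold values are assumptions based on the targets mentioned in this walkthrough (the roughly four-hour turnaround and the 500 millisecond handoff target) plus a placeholder cost budget.

```python
# Hypothetical alert thresholds; tune these to your own SLAs and budgets.
THRESHOLDS = {
    "end_to_end_duration_hours": 4.0,   # prior-auth turnaround target from the example
    "handoff_latency_ms": 500.0,        # the 500 ms A2A handoff target
    "cost_per_request_usd": 1.00,       # placeholder budget guardrail
}

def check_metrics(metrics: dict[str, float]) -> list[str]:
    """Return one human-readable alert per metric that breached its threshold."""
    return [
        f"ALERT: {name}={value} exceeds threshold {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

# Reading taken from the dashboard walkthrough above: everything is green.
print(check_metrics({
    "end_to_end_duration_hours": 2.8,
    "handoff_latency_ms": 340.0,
    "cost_per_request_usd": 0.47,
}))  # -> []
```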
[11:14] Now let's talk evaluation.
[11:18] Is this system actually doing a good job?
[11:21] Let's take a look at the task completion rate.
[11:25] 94.2% of prior authorization requests complete without any human touching them.
[11:27] The other 5.8% escalate to specialists.
[11:31] That's usually weird edge cases or payer system outages.
[11:34] We know exactly which ones and why.
[11:36] Then there is factual accuracy.
[11:41] The clinical documentation agent extracts diagnosis codes and lab values from the patient records.
[11:44] Diagnosis code accuracy is 99.4%.
[11:47] Lab value accuracy is 99.8%.
[11:50] These are not guesses.
[11:53] We can validate against the source records.
[11:56] How about guardrail violations?
[11:59] 0.8% of requests trigger a guardrail.
[12:02] Usually incomplete patient identifiers or missing clinical codes.
[12:05] Those get automatically held for human review.
[12:20] No PHI leaks, no compliance violations, because we built the safety net before we needed it.
[12:26] Next, we have clinical appropriateness.
[12:29] A panel of pharmacists reviews 5% of submissions.
[12:32] 97.3% are rated clinically appropriate.
[12:35] That is not the agent grading itself.
[12:38] That is humans validating the output.
[12:40] And finally, we have the first-pass approval rate.
[12:43] 78% of authorizations get approved on the first submission.
[12:45] No back and forth, no requests for more information.
[12:47] The industry average for manual submissions is actually 52%.
[12:51] The agents are not just faster, they're simply better.
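The factual accuracy figures above come from checking what the agent extracted against the source records. Here is a toy sketch of that comparison; the codes are made-up examples, and a real validator would do more than a position-by-position match.

```python
def field_accuracy(extracted: list[str], source_of_truth: list[str]) -> float:
    """Fraction of extracted values that match the source record, position by position."""
    matches = sum(a == b for a, b in zip(extracted, source_of_truth))
    return matches / max(len(extracted), 1)

# Toy example: the agent got two of three diagnosis codes exactly right.
agent_codes = ["E11.9", "I10", "Z79.4"]
ehr_codes   = ["E11.9", "I10", "Z79.84"]
print(f"diagnosis code accuracy: {field_accuracy(agent_codes, ehr_codes):.1%}")  # 66.7%
```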
[12:54] Finally, let's look at optimization.
[12:56] How do we make this system better over time?
[12:59] We do this by looking at prompt token efficiency.
[13:18] We started with prompts that were 1,800 tokens long.
[13:22] After tuning, we get them down to 1,100 tokens with the exact same quality score.
[13:28] That is a 39% cost reduction on every single request.
[13:35] Multiply that by thousands of authorizations per day.
[13:37] And then, what about flow-step efficiency?
[13:43] The optimal path through this workflow takes six steps.
[13:48] We are currently averaging 7.2 steps.
[13:52] That 1.2x overhead mostly happens when the initial EHR query comes back incomplete and triggers a follow-up.
[14:01] Now we know exactly where to focus our optimization effort.
[14:03] We can also look at retrieval precision.
[14:06] The clinical agent retrieves the top five most relevant clinical notes for each authorization.
[14:14] Precision at five is 0.84, meaning 4.2 of those five documents are actually relevant to the decision on average.
[14:26] We can work on pushing that even higher.
[14:29] Then there's the handoff success rate.
[14:32] 98.7% of handoffs between the two agents complete successfully.
[14:39] The 1.3% that fail are almost always due to EHR system unavailability.
[14:46] Now we know to build better retry logic. And last but not least, improvement velocity.
[14:48] The team ships three optimizations per week.
[14:53] Prompt tweaks, retrieval tuning, flow adjustments.
[14:58] Every single week the system gets a little faster, a little cheaper, a little bit more accurate.
[15:08] That is not magic. That is agent ops.
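As a quick sanity check on the optimization numbers quoted in this section, the arithmetic works out as follows (a throwaway sketch using only the figures already stated above).

```python
def pct_reduction(before: float, after: float) -> float:
    return (before - after) / before

print(f"prompt tokens: {pct_reduction(1_800, 1_100):.0%} reduction")          # ~39%
print(f"flow steps: {7.2 - 6:.1f} extra per request, {7.2 / 6:.2f}x optimal")
print(f"precision@5 of 0.84 -> {0.84 * 5:.1f} of 5 retrieved docs relevant")
```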
[15:10] Let's recap the system level improvements enabled by agent ops.
[15:13] Processing time reduced by 85%.
[15:18] First-pass approval improved by 50%.
[15:26] And per-authorization API costs brought down to 47 cents.
[15:32] That's pretty great.
[15:35] Staff who used to process these manually are now handling the complex cases that actually need human judgment, and patients are getting their medications faster.
[15:46] None of this would be possible without the observability to see what is happening, the evaluation to know if it is good and the optimization to make it better.
[15:58] So that is agent ops 101.
[16:02] Three layers, observability, evaluation, and optimization.
[16:05] The playbook for taking your agents from demo to production, from hope to proof, from fingers crossed to dashboard green.
[16:15] But here's the thing.
[16:18] AI agents are scaling rapidly, and this really highlights the need for operational frameworks like agent ops.
[16:22] $5 billion in agents shipped in 2024.
[16:30] $50 billion by 2030.
[16:35] A lot of teams are going to ship agents.
[16:39] Most of them are going to struggle to operate them.
[16:42] The ones who invest in agent ops early are the ones who will still be running those agents a year from now: confidently, reliably, and at scale.
