# Transcript

**Ankor**: [01:56] So let me first hit kind of the main takeaways. Let's start with what we expect to end with. **We're going to cover verification of agentic AI.** There's a lot of conversation around how AI can do everything under the sun and probably even solve world hunger. **We're going to talk about today how every AI output needs to actually be verified and without that, it probably can't be operationalized in an enterprise environment.**

**Ankor**: [02:32] We will cover, let me actually get the marker working. We will also cover how enterprises, your peers, are deploying these AI solutions and how they are trying to increase trust, reliability, and what we call verification of the output. And then finally, we'll actually make it real. We'll talk about use cases that are relevant to insurance and asset management, some of which are actually being deployed. And we'll talk about the underlying verification architecture there and actually make it real with some demos of what verification looks like in a deployed environment.

**Ankor**: [03:20] Before we get there, just a very quick piece around Straive. **We are really a one-stop data and AI operationalization specialist. We bring two capabilities which we think are very critical: the cutting-edge capability to build the AI solution, but then the operational muscle and the work in the trenches to make sure what gets built actually gets operationalized and delivers impact.** We work at scale. We work across 350 plus clients. So the perspective we're going to bring you is built on research, built on real experiences across your peers as well as across industries. And really, **it's a perspective built on building and deploying AI at speed to deliver impact, but also knowing what it takes to be able to trust the AI output.**

**Ankor**: [04:30] I'm Ankor Rai, I'm the CEO at Straive, over 25 years of experience across data and then really the last five to eight years across helping enterprise clients deploy AI. I'm based out of New York, in London today. I'll introduce the Straive team for efficiency. Anand Subramanian, he leads our R&D and does the coolest work in the company. He also likes to call himself an LLM psychologist. I think he's the LLM whisperer. And some of his posts and work he does on LinkedIn might be interesting for all of you at a later time. And then we also have Manish on the call. He is the client partner and really understands and delivers impact. In today's presentation, Anand and I will tag team.

**Ankor**: [05:37] So let's get to defining the problem. What are we even talking about? We'll start with some examples from the past where AI solutions without robust verification have created reputational issues, regulatory issues, and operational issues. These are famous ones, really from the public domain, and we're going to derive some insight out of them. The UnitedHealthcare example is very famous. An AI it deployed denied post-acute care to elderly patients, overriding physician recommendations. A large proportion of these denials were reversed on appeal. The NYC chatbot actually advised business owners to violate city health codes because it interpreted those codes wrongly. In the McDonald's example, they deployed AI at the drive-through; there were a variety of errors in order processing, and they had to withdraw it. In fact, there's another company, Presto, which delivers the technology for drive-throughs, and as part of their regulatory submissions, they disclosed that 70% of drive-through orders processed by the AI actually needed intervention from their team in the Philippines.

**Ankor**: [07:08] So let's step back. These are examples of where the AI didn't deliver as expected. It's very easy to say, "Oh, it must have been a fault in the model," or "It must have been something that was not done." **The core issue we're going to tackle today is that it wasn't a fault with the data pipeline. It wasn't a fault with the way the model was built. It wasn't negligence from these respective teams. The underlying AIs are probabilistic.** Which means you can build the pipeline with the most diligence possible, and unless every single output has a verification architecture, you run the risk of delivering an output that is not valid.

**Ankor**: [08:00] So we wanted to define this problem using examples from the public domain before we go in deeper. So the point is, we are somewhere over here. We've worked with enterprises across, done a lot of research, we understand the opportunity over here, and we are at the cusp of what we would call really an agentic AI explosion. Which means the investments are exponentially increasing, the use cases are exponentially increasing. Why do we title this "It could get worse"? Obviously, it's coming off from the previous page. **Without actually having a robust verification architecture, these investments will fail because, like McDonald's which had to actually deploy the AI and then withdraw it, we will not be able to deploy these AIs in enterprise production environments.** What's worse, we'll deploy them and we'll run large risks. I think everyone's aware of this. This is kind of the optimism around AI: 78% of enterprises across the board have really deployed AI and are exploring more. Here's where it gets interesting: only 14% actually believe they have enterprise-level governance and verification.

**Ankor**: [09:28] But now we're going to start peeling the onion. Of all the deployed agents in production today, 74% rely on manual verification. Which means you have an AI output and seven to eight out of every 10 deployments actually have human experts looking at the output. And that's why it could get worse. **If you assume agentic AI being deployed across a variety of use cases, it is untenable to rely on humans across every use case in this way.**

**Ankor**: [10:11] So what's going on across your peers today? We've taken a few examples, but really, pick any peer, and a lot of these are standard ones they have deployed. We're going to go deeper: we're going to look at the verification strategies used there, and then we're going to try to derive industry-wide observations and insights that are going to drive how we think about this in the future. Rather than calling out any particular company, I'm going to talk about the use cases in general, because you're going to recognize these.

**Ankor**: [10:41] So at Morgan Stanley, there's an AI that basically assists across a lot of data and it's kind of an enterprise-wide chatbot. Here's what's interesting: it's not an AI that can do whatever it wants. It is pointed towards a curated library of internal research documents to bring the data. At JPMorgan, it's used, and actually let's look at this one along with HSBC. At JPMorgan, it's used for AML surveillance, fraud detection. At HSBC, it's used for AML transaction monitoring. And in both cases, it either hits a human expert who's using this for assistance or there is a human investigator who approves this. So again, a human in the loop. At BlackRock and UBS, it's used for internal research, for analyzing portfolio risk, and every output is actually validated across multiple models. There's deep stress testing and there are human experts.

**Ankor**: [11:49] What are the insights we derive out of all of these use cases? **Out of all deployed AI, 95% have an end user who's a human. The corollary of that is only 5% of agents today, their output directly enters another agent or enters another system.** Think about all your software in the environment today. They are not all validated by humans. Your software talks to other systems, talks to other software, and that's what basically builds an automated pipeline. Where we stand today, only 5% of agent output enters another system. 95% of that actually is consumed by a human. So it's a very complicated workflow that finally looks and feels like a chatbot.

**Ankor**: [12:47] Let's peel the onion further. When we look at actually the way the agents are coded, over two-thirds of agents only allow less than 10 autonomous steps. And what that means is, this is an example when you get into coding of how the autonomy of these agents is significantly constrained. And if we look at these two, the third insight is pretty obvious. **Most of the benefit from agentic AI today is really in increasing productivity by automating routine labor.** Things like proactive identification of failures or for that matter risk mitigation are really a very small proportion of the use cases.

**Ankor**: [13:41] Which brings us to what are the strategies we can leverage today to verify agentic AI output. The first is actually we restrict the autonomy through constrained deployment. 80% of deployed agents across enterprises today have static workflows. What's the meaning of a static workflow? We looked at some of the examples from your peers: knowledge and data grounding. We are not allowing the AI to decide where to get its data from. So in a truly autonomous environment, what you would do is you would expose your agent to APIs that could access the web or expose it to all the data sources that are available internally or externally and you would tell the AI to get the best answer. That's not the autonomy we're giving to our agents today. **We're actually telling our agents, we're pointing it to specific data sources so that we can track the quality of the output.**

**Ankor**: [14:51] Even more important, and this links to kind of the 10 autonomous steps per agent at any point, we're constraining the agent with specific domain steps. Let me give you an example of this. When we think about claims adjudication, the input to the agent is not just a claim where we tell it, "Now, should this claim be approved or not?" Agents are actually coded with specific rule-based steps, which is: first, go to these specific databases and look at whether there is coverage for this claim. So you're basically looking at the coverage, you're looking at the procedure, you're looking at the deductible. The second step, look at medical necessity. So the agent is being told to look at medical necessity and it is being coded as to how it looks at that, which is, if there was an MRI scan, was this actually necessary? Did the previous history justify that? **So you're automating a workflow today, but you're not giving the agent autonomy.** And these two are where the way you're coding the agent, you're constraining its functionality to be able to better trust the output.
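The constrained, rule-ordered workflow described above can be sketched in code. This is a hypothetical illustration, not any insurer's actual system: the step order is hard-coded, the data sources are fixed, and the model would only be called inside a step (stubbed here as a plain function), never to decide what to do next.

```python
# Hypothetical sketch of a constrained claims-adjudication agent: the steps
# and their order are fixed in code; the agent has no autonomy over them.
from dataclasses import dataclass

@dataclass
class Claim:
    member_id: str
    procedure: str      # e.g. "MRI"
    amount: float
    deductible_met: bool

# Step 1: coverage check against a specific, pre-approved data source.
COVERED_PROCEDURES = {"MRI", "X-RAY", "CT"}  # stand-in for a coverage database

def check_coverage(claim: Claim) -> bool:
    return claim.procedure in COVERED_PROCEDURES and claim.deductible_met

# Step 2: medical-necessity check. In a real deployment this step would
# call an LLM with the member's history; a keyword stub stands in here.
def check_medical_necessity(claim: Claim, history: list) -> bool:
    if claim.procedure == "MRI":
        return any("injury" in h or "persistent pain" in h for h in history)
    return True

def adjudicate(claim: Claim, history: list) -> str:
    # The workflow is static: each step is explicit and auditable.
    if not check_coverage(claim):
        return "DENY: not covered"
    if not check_medical_necessity(claim, history):
        return "DENY: not medically necessary"
    return "APPROVE"
```

Because the steps are fixed, every denial carries a specific, traceable reason, which is exactly what makes the output easier to trust.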

**Ankor**: [16:13] Even after the output is ready, multiple strategies are used in tandem to validate the output. Over 74% of agents have human experts in the loop. I talked to you about the McDonald's example. The company Presto which provides the platform, they actually have agents in the Philippines who have to intervene to look at a large proportion of the orders processed by the AI. That's an example of human experts in the loop. But what we talked about here, there are multiple strategies in tandem. What does that mean? Along with human experts in the loop, over 50% of these agents utilize an LLM-as-a-judge strategy. What does that mean? For a particular task of the agent, so let's say the agent was a research agent, so you basically wanted it to produce a research note on a particular company or if it was a private investment that you were looking to make, you wanted it to aggregate data and produce an opinion. In that particular case, what would be done is you would prepare a golden dataset. Which is, you would prepare kind of 100 or 200 companies, you would provide the input data and you would look at the output. And you would then have a second LLM learn from this golden dataset and judge the new output that was produced. But notice one thing: the new output wouldn't be out of these 100 to 200. So all the LLM would be looking at is the actual output produced by the production agent and judging whether it sort of matched with the golden dataset.
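The LLM-as-a-judge pattern just described can be sketched as follows. Everything here is illustrative: the golden examples, the prompt wording, and the `call_llm` stub are assumptions; a real deployment would call an actual second model in place of the stub.

```python
# Hedged sketch of LLM-as-a-judge: a second model is shown a small golden
# dataset and asked to grade a new production output against that standard.

GOLDEN_EXAMPLES = [
    {"input": "Summarize Q3 results for AcmeCo",
     "good_output": "Revenue up 12% YoY; margin flat; guidance raised."},
    # ...in practice, on the order of 100-200 curated examples
]

def call_llm(prompt: str) -> str:
    # Stub standing in for the judge model so this sketch runs on its own:
    # it keyword-matches the production OUTPUT section at the end of the prompt.
    output_section = prompt.rsplit("OUTPUT:", 1)[-1]
    return "PASS" if "revenue" in output_section.lower() else "FAIL"

def judge(production_input: str, production_output: str) -> str:
    examples = "\n".join(
        f"INPUT: {ex['input']}\nGOOD OUTPUT: {ex['good_output']}"
        for ex in GOLDEN_EXAMPLES
    )
    prompt = (
        "You are a judge. Here are examples of good outputs:\n"
        f"{examples}\n\n"
        "Does the new output below match that standard? Answer PASS or FAIL.\n"
        f"INPUT: {production_input}\nOUTPUT: {production_output}"
    )
    return call_llm(prompt)
```

Note that the judge never needs the new input to appear in the golden set; it only needs the golden set as a standard to grade against.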

**Ankor**: [18:13] Another example of LLM-as-a-judge: in the health claims adjudication example, you would produce a golden dataset of the audit trail the LLM was expected to produce when a claim was approved or rejected. Then, every time the production agent accepted or rejected a claim, it would produce an audit trail, and this LLM-as-a-judge would judge whether it was appropriate or not. Cross-referencing: if the LLM is providing sales intelligence, the output is compared against existing data in the repository to check whether what the LLM is referencing is valid. This one is probably the simplest: standard rules. If a claim is approved, there is a rule-based look at what the LLM has approved; if it approves a particular claim for a particular member, there is a check outside the LLM production loop of whether the claim met the approval limits, along with some other rule-based checks. And then finally, simulations are an example.
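The rule-based check that sits outside the LLM loop can be sketched like this. The tiers, limits, and field names are illustrative assumptions: the point is that whatever the model decides, a plain-code gate re-checks hard constraints deterministically.

```python
# Hedged sketch of a rule-based check outside the LLM production loop:
# the limits and fields below are made up for illustration.

APPROVAL_LIMITS = {"standard": 5_000, "gold": 25_000}

def rule_check(decision: dict) -> list:
    """Return the list of rule violations for an LLM-approved claim."""
    violations = []
    limit = APPROVAL_LIMITS.get(decision["member_tier"], 0)
    if decision["approved_amount"] > limit:
        violations.append("amount exceeds tier approval limit")
    if decision["approved_amount"] < 0:
        violations.append("negative amount")
    if not decision.get("audit_trail"):
        violations.append("missing audit trail")
    return violations
```

Because these checks are ordinary code, they are deterministic: the same decision always produces the same list of violations, which is precisely what the probabilistic model alone cannot guarantee.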

**Ankor**: [19:33] This is how output is validated today. **The big insight out of this is it is fundamentally different from how software goes into production today.** What do we mean by CI/CD verification? Today, for normal software, we have a pretty standard, well-accepted process. Every time you make a change, there's a whole host of regression tests, there are edge cases, the software that has been changed applies against those test cases against the edge cases. You expect an output. If the new software produces that output, you know it's good to go and you're able to put it into production. **This sort of process doesn't work for agentic AI. Why? Because the outputs are non-deterministic.** For a software, once I check the pipeline, if it produces a result for a certain test case at one time, I know for a fact that it is going to deterministically produce the same output 100 times over for that same input. That is not the case for agentic AI outputs. They are non-deterministic because of the inherent probabilistic nature of the underlying algorithm.

**Ankor**: [21:00] Success can't be defined comprehensively. This one is actually a feature of LLMs. They can solve a variety—they take in input from a user. You can speak to an LLM. It's not like software today where you have certain clicks that you have to execute rigid processes. But just by virtue of that feature, which means that you can ask it anything, you can make it do anything, it can understand you, what it also means is that you cannot—with limitless input possible, you cannot define success. And as a result, you can't put together a comprehensive list of test cases that convince you that we have covered all possible inputs and outputs for the AI. And incomplete edge cases. I think this one flows from the fact that success can't be defined. In a normal software, you would look at all the edge cases and you would kind of test them. Over here, there are almost infinite edge cases because you can't define the input.

**Ankor**: [22:06] **So the big takeaway from here is: we have agents in deployment today, but the way we deploy them is by not utilizing the full power that LLMs can bring—we actually restrict their autonomy—and in addition, for every output that is produced by the LLM, we have different layers of verification in place to make sure we can trust the output.** And now take these two green strategies along with the fact that over 75% of LLMs still terminate in a human. After all of this.

**Ankor**: [22:56] So then we come to, how do we get to the future? Because this is...

**Host**: [23:00] Ankor, is it just me or did we lose you for a second there?

**Ankor**: [23:05] Did you lose me?

**Host**: [23:07] For a second. Just for a second. Just wanted you to be aware. I don't think we missed much, but we did lose you for a second there.

**Ankor**: [23:15] Okay. Thank you, Host, for pointing that out. If it happens again, just tell me again. But I'm guessing, by the way you talked about it, that I'm not going to repeat too much. Really, the summary of this is: **if we continue to deploy LLMs by constraining their power or by having these layers of verification before we can trust their output, this curve, this explosion, is not going to be possible.**

**Ankor**: [23:45] Which then brings us to: what is the future of verifiable autonomy? And really, there are five different but interconnected vectors as we look to the future.

**Ankor**: [23:57] The first: **we are going to have to move beyond verifying every single output to standardized pipeline testing.** What does that mean? Today, every single output needs to be tested. We need to move towards the software testing procedures where, if we test the pipeline once before deployment, we can trust that every single output produced by that pipeline, unless we make any adjustment to it, can be trusted. That's not the case today. So how are we going to get there? Second: we need a better understanding and definition of agent failure modes. Let's dig a little deeper into what this means. Today, there's a lot of focus, even from the frontier models, on every new use case that they can solve. We actually don't have a good understanding of what particular inputs actually cause the agent to fail, to hallucinate. We treat it as, "Oh, it's probabilistic." That's basically like an assembly line which produces 5% defects and we say, "Oh, we'll look at the defective pieces and move them off the assembly line." That works for an assembly line (actually, today it doesn't even work for an assembly line), but it definitely doesn't work for software being put into production. And so there's a lot of research underway not just to define the new use cases for every frontier model, but also to define what causes the frontier models to fail in particular use cases. Once we're able to define that, we can put guardrails in place.

**Ankor**: [25:46] Third, and this is somewhat technical. Think about how we design workflows today: there's a production workflow, so we do requirements, we design, then we develop. And then there's a whole process of different specialized teams which do quality control and verify the pipeline. When it comes to LLM agents, there is initial success in the quality and trust of the output if the same agent does production and, for every subtask, verifies its own output each time it produces one. To use a real-world analogy, it's continuous testing within the same agent. That's not the way it's typically done today: there are agents which produce output, and there is a separate quality layer. Here, verification is integrated into the same agent's subtasks.

**Ankor**: [26:52] When this happens, two things will become true. One, today 5 to 7% of agents interface with other software. That number will increase significantly when we can trust the pipeline and don't need humans to validate the output. And lastly, just as CI/CD testing is a very important step in deploying software, when it comes to agentic AI, verification is a standard layer. It will be part of production. And what we mean by that is, I'm not going to go through all the layers of the enterprise stack; there's a business application layer, agent orchestration, the foundational models, and data. But **verification is a layer which will be part of the production loop and every output that gets produced will pass through a verification layer so that you can trust it.** Over time, we expect the need for human validation to drastically come down over the next few years, and a lot of the other programmatic verification strategies to become more dominant. We also expect newer strategies to increase, for example, agents doing production and verification integrated together, along with a better understanding of the failure modes of agents. As a result, part of this verification layer will be triage: in certain cases the output will be automatically ignored, and the corollary is that in most cases the output can be trusted and passed programmatically into other systems.

**Ankor**: [28:45] So this covered the concept, the problem of why verification is critical, how your peers and really across industries verification is being executed today, where we see the future, and why verification is absolutely critical and how it becomes a layer in the entire enterprise AI stack. What we now wanted to do was actually make this real by showing you AI solutions at work and how verification is built into them today and how we see the future. For this, I'm going to hand it over to Anand who heads our R&D to take us through this. And all of you can hear me, right?

**Host**: [29:41] Yes.

**Ankor**: [29:42] Okay, great. So I'm going to hand it over to Anand. And one point, some of the things which we're going to see here, so when I introduced Straive, we are working with technology teams where we're helping deploy a variety of AI solutions across real estate, private credit, sales, and Anand is going to cover a few of them as we go along too, in addition to other use cases that have been deployed across your peers. So Anand, with that let me hand it to you. I think you'll guide us if there are questions and when we need to answer them, maybe we can take them at the end. Anand, over to you.

**Host**: [00:27] Thanks. Folks, chime in as questions come up, so you don't lose your train of thought. We'll let it be interactive. Yes.

**Anand**: [00:34] Perfect. And do feel free to put in any questions on the chat window as we go along, as and when they come up, we can take those up as well. What I'll do is walk through some demos showing how what Ankor just described is working in action or, in some cases, not working and how do we fix those, which is the whole point of verification.

**Anand**: [00:54] So let's take sales intelligence and I'll show how knowledge and data grounding is working with an example from something that we've deployed at PGIM. This is the anonymized version of it. It's a sales intelligence tool that, for instance, would be able to tell us something along the lines of, for notionally Apollo Global Management, how they might deal with a client like Abu Dhabi Investment Authority.

**Anand**: [01:18] This is a report which I'll be going into in a little bit of detail, not too much. My focus is on how the verifiability, specifically through grounding, comes in. But among other things, it says look, your wallet share is 0.4%. There's massive headroom here for growth. And what you probably need to do is unlock a CEO-level meeting; the last meeting that has happened has been in 2019, and that relationship definitely needs to be renewed. Now, how is it doing this? What I'm going to do is, from that top level, dive down to the lowest level.

**Anand**: [01:54] Step one, it goes through a series of sources like the annual review, their website, our website, and so on, as well as a series of internal sources that includes the CRM for instance, any local documents that we have, and identifies a series of claims, signals so to speak. So number one, it's identified as an organization that the sheikh has been confirmed as an MD and has high confidence around this. That, for instance, the ADIA office is operational at the Al Maqam Tower and again with high confidence, but there's only medium confidence that the five-year expected returns, which it gathered from the annual review, will be between 12 to 15% and so on. Anything with low confidence it knocks off. Now how does it know this? Because we are asking the LLMs to not just share the result along with the citation, but also its level of confidence on this conclusion. And it tends to do a reasonable, not perfect, but reasonable job, enough for us to get a first cut filter.
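The grounding pattern just described, asking the model to return each claim with a citation and a self-reported confidence, then dropping anything low-confidence, can be sketched as follows. The JSON shape and example signals are assumptions for illustration, not PGIM data.

```python
# Hedged sketch of confidence-tagged signal extraction: the model is asked
# for claim + citation + confidence; we keep only medium/high-confidence
# signals, each of which carries a source a human can click through.
import json

RAW_MODEL_OUTPUT = json.dumps([
    {"claim": "Sheikh confirmed as MD", "source": "annual_review.pdf p.3",
     "confidence": "high"},
    {"claim": "5-year expected returns 12-15%", "source": "annual_review.pdf p.11",
     "confidence": "medium"},
    {"claim": "Planning a Tokyo office", "source": "blog post",
     "confidence": "low"},
])

def filter_signals(raw: str, keep=("high", "medium")) -> list:
    signals = json.loads(raw)
    # Drop anything the model itself was unsure about; every kept signal
    # retains its citation for human spot-checking.
    return [s for s in signals if s["confidence"] in keep]
```

The self-reported confidence is only a first-cut filter, as noted above; the citation is what makes the surviving signals actually verifiable.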

**Anand**: [03:05] **The important thing is this increases the verifiability. At any point we can just come back, take a look at the annual review, see if this is in fact warranted, and then make changes.** But on top of these 65 signals is built the strategic narrative, which starts with a series of observable facts. Just the facts, no interpretation here. It's saying, for instance, let's take one, that the existing ADIA relationship comprises $2 billion in legacy. How do we know this? That's coming from document 52, or signal 52. And each of these then contributes to a set of inferences. These inferences are a set of fact one, fact two, fact three, leading to a conclusion.

**Anand**: [03:49] Let's take some example. It's saying that the new private equity head, Faris, is a relationship asymmetry opportunity, but the window is closing. How do we know this? He was appointed in Q3 2024 replacing the incumbent, that's a fact, we know where that's from. HarbourVest is advising them on the PE shortlist, Carlyle and KKR reported they are already on their shortlist, that's the closing window bit, and it concludes saying the failure to secure a meeting with him before the incumbents... that's clearly an indication that the relationship window is time-bounded, and it has high confidence around this. But there are some others, I'm just trying to see, yeah, here's another one where the structure is logical but we aren't really sure about the signal.

**Anand**: [04:33] So it's building the second layer with a series of confidence signals which we can verify, and third, here are a set of things we know we don't know. Who leads ADIA's Innovation and Technology team for instance? Why did we lose to KKR Capital Markets on Atlas SP, and so on. So A, facts, B, on top of these observable facts a set of inferences, as well as known unknowns.

**Anand**: [05:01] Which then builds up into a series of actionable insights. Here's something that we may want to do. And how do we know this? Here are the signals, here's the topic, and I'm not going to go into the details of this, but this builds it up to another level of what actions we need to perform, prioritized by urgency. Now not everything that needs to be acted on is immediate, some of them do have a longer time window. So depending on what we need to look at immediately and who needs to be looped into this, that gives us a view that prioritizes the actions, which is then summarized at a higher level into the strategic actions. And this captures the decision makers, influencers, what we need to do with all of them.

**Anand**: [05:44] **What's the premise here? That if we want something that's verifiable: A, show the original source, show the grounding. Ask for a level of confidence, which it tends to do a reasonable job of, and show it clearly enough to a human so that they can say, "Yeah, this makes sense," or "Yeah, I can just click here and go check."**

**Anand**: [06:07] Let's take this to the next stage. And before I go there, I'll just reiterate, any questions or thoughts or comments that come up, please just feel free to put them on the chat window.

**Anand**: [06:16] The next thing we're going to look at is contract analysis. The difference here is, what if this was not produced by an AI, there is a current process, we still want to verify it. Could AI help with the verification? And to what extent can we automate that validation, specifically using the LLM as a judge.

**Anand**: [06:36] Let's take a sample contract. We generated a synthetic contract, which could be a supplemental accident and life certificate. This document has the usual terms. Now what we want to do is check against a checklist whether this is in fact covering all the kinds of clauses that we need to verify against. So enter our contract analysis application, and in this particular case, there are a set of 10 contracts that we fed to it a few hours ago against 20 checklist items. What are those checklist items? Here are some examples: the insurer, policy owner, and insured identification, are they mentioned? Is the benefit schedule and coverage amount mentioned? Is there a reinstatement after lapse that's covered, and so on. And you can see that against these 10 contracts, it's found that some of these terms are not fully covered. There are a few contracts that don't have them, maybe with good reason.

**Anand**: [07:33] But how is it going about doing this? Taking a contract like this, what it does is, and let me in fact show you, this is the matrix, by the way, of the individual contracts against the specific checks. So we know, for instance, that contract 7 is okay on the policy number, issue date, etc., because, and here's the citation, same principle as before, section 1.2 has this. Which means I can just go to contract 2, or wait, I can just open that here, right, contract 2. And what was it that it said on section 2? Policy number, blah blah blah. If I wanted to cross-check, I can check if that is in fact... oh yeah, okay, it's in the header right up there, as well as the policy number.

**Anand**: [08:19] **So fact-checking becomes easier for human as well as for agent, and that also allows us to identify with some confidence that there's a gap.** This is not present because it's saying there is no appeal or reconsideration process that it could find in contract 3, letting us go down there.

**Anand**: [08:39] Now how does this work? What we do is take one of these documents, so I'm going to take, let's say, contract 4 as an example. It converts the PDF into text and runs it against the checklist items. The analysis goes through the items one by one, and it's going to start running. It's saying, okay, insurer, policy owner, and insured identification... yeah, I get this, I get this. Oh, but the privacy and data use notice, it did not find. So this has completed in 6 seconds at a cost of 3 cents: the validation of a contract against this set of terms, citing exactly where each clause is.
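The checklist loop just walked through can be sketched like this. In the deployed tool each check is an LLM call that returns a citation; here simple phrase matching stands in so the sketch runs standalone, and the checklist items are illustrative.

```python
# Simplified sketch of contract-vs-checklist analysis: each item records
# either a citation (here, just the matched phrase) or a gap to flag.

CHECKLIST = [
    ("policy_identification", "policy number"),
    ("benefit_schedule", "benefit schedule"),
    ("reinstatement", "reinstatement after lapse"),
]

def analyze_contract(contract_text: str) -> dict:
    results = {}
    lowered = contract_text.lower()
    for item, phrase in CHECKLIST:
        found = phrase in lowered
        results[item] = {
            "covered": found,
            # A real system cites the section number; we just note the phrase.
            "citation": phrase if found else None,
        }
    return results
```

The output matrix of contracts versus checks falls out of running this over each contract: covered items carry citations for spot-checking, and gaps are surfaced explicitly.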

**Anand**: [09:17] **Could this be wrong? Yes, potentially, and I'll come to how we deal with that. But it increases the human verifiability in two ways: first, by pinpointing with a great degree of confidence where it's found things, via the citations; and second, by flagging what's missing, which reduces what the human has to check.**

**Anand**: [09:41] Question from Host: Does the solution reconcile event attributes from the signals, inferences and actions back to the ontology or semantic layer of the data sources? Meaning if we have the data in a structured way, is it able to reconcile? It can. In that particular case, there was no ontology. What I'm going to show is how we can automatically construct a set of almost programmatic rules, ontology-like, from unstructured documents in maybe 5 minutes or less.

**Anand**: [10:14] Let's talk about now, what happens if even this is wrong? I'm going to take a customer support example, not going through the demo itself, but showing what the post-facto analysis showed. And this will touch upon the cross-referencing technique that Ankor mentioned earlier.

**Anand**: [10:32] **The premise is that we do have hallucinations.** When we take a series of simple customer support messages like "When will I receive my order?", did it actually classify it against the delivery period category or did it make a mistake? So in this particular case, the model that we are looking at here, GPT 4.1 mini, made a mistake and put it against "track order" whereas the query should have gone into the delivery period queue. And there are many other mistakes that the model made, because these are some of the more complex questions, but for the simpler questions the number of mistakes is far fewer.

**Anand**: [11:10] But if we take any of these models, on average the error rate is pretty high. I'll come to the actual error rates. The error rate was about 14% for the average model. Not great. But what we then looked at was: what if we take two models and cross-check them? No two models are fully independent. If we take, for instance, the bottom model here, Google Gemma 3, and correlate it against, say, the first model here, Anthropic Claude 3.5 Haiku, the correlation is moderate, about 0.2. That's not a very high correlation, which means that if one model gets something wrong, the other model doesn't tend to get the same thing wrong as often.

**Anand**: [11:58] Now with that kind of cross-correlation, by having two models cross-check each other, the error rate comes down to 3.7%. Why? Because we only take a case to the next stage if both models agree, and the chance that both models independently agree on the same wrong answer is 3.7%. But we therefore have to create a manual queue for when they disagree, and that takes the manual effort up to 12.6%. We could triple check. Quadruple check. Quintuple check. And when we do that, because the models' errors are largely independent, the error rate drops to 0.7%, but it still means we've got to have a manual process covering 28% of the effort.
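The arithmetic behind cross-checking can be sketched analytically. This is an illustrative back-of-envelope, not Straive's actual benchmark: it assumes two models with a 14% error rate each, an error correlation of 0.2, and that when both models are wrong they always agree on the same wrong label, which is why the printed numbers differ somewhat from the 3.7% and 12.6% measured in the real multiclass experiment.

```python
# Illustrative back-of-envelope for two-model cross-checking.
# Numbers are assumptions for illustration, not measured benchmarks.

p_err = 0.14   # per-model error rate
rho = 0.2      # correlation between the two models' error indicators

# Construction: with probability rho the second model copies the first's
# error outcome, otherwise it errs independently, so corr(E1, E2) = rho.
p_both_wrong = rho * p_err + (1 - rho) * p_err**2
p_both_right = rho * (1 - p_err) + (1 - rho) * (1 - p_err)**2
p_agree = p_both_right + p_both_wrong      # assumes identical wrong answers
p_manual = 1 - p_agree                     # disagreements go to a human queue
residual_error = p_both_wrong / p_agree    # error rate among auto-accepted items

print(f"auto-accepted: {p_agree:.1%}, manual queue: {p_manual:.1%}")
print(f"residual error on auto-accepted items: {residual_error:.1%}")
```

Even under these simplified assumptions the qualitative story matches the talk: pairing two weakly correlated models cuts the residual error well below either model's 14%, at the cost of a manual queue for disagreements.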

**Anand**: [12:37] **Now imagine somebody comes to you and says, "I'll give you 99.3% accuracy and save you 72% of your effort." So you've got to put in three people instead of ten, at 99.3% accuracy. In some processes, you'd grab it.** It makes perfect sense. And the worst case scenario is that it's just providing you an additional signal.

**Anand**: [12:59] **So this kind of cross-referencing further increases the extent to which we can start automating the validation, but still doesn't take it to the point where it's almost programmatic. Meaning I can say with confidence, "Yes, this is right," or "No, this is not." And that's the kind of thing we expect with code. And therefore, we actually start using code for this.**

**Anand**: [13:25] Now, large language models are language models. They are not calculating machines, they are not numerical models; what they are good with is language, including programming languages and domain-specific languages. So can we start applying them there? Here's an example, specifically in claims adjudication. Cambridge came up with a Prolog-based programming language called InsureLLM. InsureLLM is a way of representing insurance claims in a programmatic structure.

**Anand**: [13:56] Let me show you how this works. Let's take a specific contract. This is an Autoguard vehicle insurance, it has an insurer, policy number, blah blah blah, and could I make this a little smaller? Yeah, I think I can. And this has terms and conditions like the driver, the insured driver must be at least 21 years of age, the vehicle ownership, the coverage only applies if the insured is operating a vehicle they legally own, blah blah blah, and all of these are terms as part of the contract.

**Anand**: [14:28] What we can do, and this was done using an LLM, is convert this into a series of InsureLLM rules. Let's take a look at the structure of an InsureLLM claim. It says: a valid claim is one where the driver is fully eligible, which is true if they are age eligible, meaning the age is at least 21, and their driving experience is over 12 months. The class verification passes if they are over 25 years old with at least 24 months of experience, or their age is between 21 and 25 with over 12 months of experience. In the latter case, we also need a novice endorsement, meaning a young-driver endorsement. So one particular section of that claim has effectively been turned into a Prolog-based program.
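To give a feel for what such generated rules look like when rendered as ordinary code rather than Prolog, here is a minimal sketch. The function names, field names, and thresholds follow the example narrated in the talk; they are illustrative, not the actual InsureLLM schema.

```python
# Hypothetical re-rendering of the driver-eligibility rules in plain Python.
# Names and thresholds follow the talk's example, not a real InsureLLM schema.

def age_eligible(age: int) -> bool:
    return age >= 21

def class_verified(age: int, experience_months: int,
                   novice_endorsement: bool) -> bool:
    if age > 25 and experience_months >= 24:
        return True
    # Drivers aged 21-25 need over 12 months of experience
    # plus a young-driver (novice) endorsement on the policy.
    if 21 <= age <= 25 and experience_months > 12:
        return novice_endorsement
    return False

def driver_fully_eligible(age: int, experience_months: int,
                          novice_endorsement: bool = False) -> bool:
    return age_eligible(age) and class_verified(age, experience_months,
                                                novice_endorsement)

print(driver_fully_eligible(34, 120))        # an older, experienced driver
print(driver_fully_eligible(22, 18, False))  # young driver, no endorsement
```

The Prolog representation buys the same thing this sketch does: once the rules exist as code, checking a claim is deterministic evaluation, not generation.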

**Anand**: [15:24] Now if we have this, then the other thing we can do, and by the way, the same content can be presented in a variety of different formats. We can look at this as a decision tree, or as a flow chart that says step by step how you go about granting this claim. All of this is derived from those InsureLLM rules, which we can then also validate against a claim.

**Anand**: [16:05] So for example, let's take James Ellison's claim. This is the claims form that James submitted, and we have a coverage checklist. We know their age and so on, so at least it's submittable as a form. This then gets converted into a series of facts in the InsureLLM language: the age is 34, the driving experience in months is 120, and so on and so forth, which allows us to do a programmatic check on the claim. Step by step: are they age eligible? License compliant? Etc. **This is not LLM generated. This is a program running a check, and it cannot go wrong. More importantly, it is able to tell us why a claim is rejected. And if it's accepted, it is able to tell us that it meets all the criteria.**

**Anand**: [16:54] But let's take Marcia. Marcia's claim goes through the same set of checks and mostly seems fine, but there is an issue here. Let's zoom in. That is a DUI violation: the blood alcohol content should be less than 0.08, but it's 0.14, and this claim therefore, if we go to the end, is rejected. One particular branch failed. Specifically, it's not a valid claim because the incident circumstances check failed, because the sobriety verification failed, because the sobriety check failed, because the blood alcohol content of 0.14 is above the 0.08 limit, a DUI violation. There could be multiple failure points. Let's take another contract, George Whitefield, and George's claim. This one is failing multiple checks: continuous enrollment, the pre-existing-condition waiting period, and so on. We just go through the details and find, here is the verifiable reason why this particular claim was denied.
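The "why" of a rejection falls out naturally when each rule reports its own failure chain. Here is a minimal sketch of that pattern for the sobriety branch; the rule names and fact keys are illustrative, not the actual generated InsureLLM predicates.

```python
# Minimal sketch of an explainable rule check: each rule either passes
# (empty reason list) or returns the chain of reasons for its failure.
# Rule names, fact keys, and thresholds are illustrative.

BAC_LIMIT = 0.08

def sobriety_check(facts: dict) -> list[str]:
    bac = facts["blood_alcohol_content"]
    if bac >= BAC_LIMIT:
        return [f"blood alcohol content {bac} >= {BAC_LIMIT}: DUI violation"]
    return []

def incident_circumstances_covered(facts: dict) -> list[str]:
    # Wrap inner failures so the full chain of "because..." is preserved.
    return [f"sobriety verification failed: {r}" for r in sobriety_check(facts)]

def validate_claim(facts: dict) -> tuple[bool, list[str]]:
    reasons = incident_circumstances_covered(facts)
    return (len(reasons) == 0, reasons)

ok, why = validate_claim({"blood_alcohol_content": 0.14})
print(ok)
print(why[0] if why else "all checks passed")
```

Because each layer only wraps the reasons bubbling up from below, the final output reads exactly like the chain in the demo: not valid, because circumstances failed, because sobriety failed, because of the BAC limit.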

**Anand**: [17:53] **Now, it's possible that the LLM goes wrong. The way it goes wrong, however, is perhaps by misunderstanding the policy.** So in the construction of this InsureLLM structure of the contract, it may have gone wrong. But this is a one-time check: we can have someone go through the policy and check whether the structure of the rules it has automatically generated is in fact correct. Once that's done, we can start applying those rules against claims, and that verification is auditable.

**Anand**: [18:27] This too can be further automated. What I mean by that is if there are mistakes that it makes in the process, we can have it run against a few checks, verify if it's making a mistake or not, and have it self-correct. And that takes us to the next level of verification, which is simulation and automation, which is where coding agents start coming in. That's what I'm going to go into with the investment research example.

**Anand**: [18:53] Again, at any point, please feel free to put in questions or comments. This is the last demo that I'll be showing. What we have here is an API agent. The API agent is able to communicate with a variety of APIs, and I'm going to specifically show how it can connect to the Federal Reserve Economic Data, which is what FRED stands for. This is a dataset an analyst might query to find out, for instance, what the 10-year treasury yield has been over the last six months. Has the 10-year treasury yield dipped below the 2-year treasury yield, so the curve has inverted and we are going into a different interest rate regime? Those sorts of things. Of course, you could pull data from all kinds of APIs and data sources, but let's start with a simple question: what is the 10-year treasury constant maturity rate over the last week, on a daily basis?
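For reference, the FRED request the agent ultimately has to construct looks something like the following. DGS10 is FRED's series ID for the 10-year Treasury constant maturity rate; the API key and the date range here are placeholders, not values from the demo.

```python
# Building the FRED series/observations request the agent must arrive at.
# DGS10 = 10-year Treasury constant maturity series; key and dates are
# placeholders for illustration.
from urllib.parse import urlencode

BASE = "https://api.stlouisfed.org/fred/series/observations"

params = {
    "series_id": "DGS10",
    "api_key": "YOUR_FRED_API_KEY",   # placeholder, not a real key
    "file_type": "json",
    "observation_start": "2026-02-04",  # example one-week window
    "observation_end": "2026-02-11",
}
url = f"{BASE}?{urlencode(params)}"
print(url)
```

Sending a GET request to that URL returns daily observations as JSON, which is the payload the agent then parses, charts, or passes to the next step.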

**Anand**: [19:58] Now I could pick either a cheap model or a good model, and I'm intentionally going to pick a cheap model to demonstrate where it might fail. There are simple cases where it might work, and there are reasons why we might want to deploy a less expensive model at scale, but good models are becoming cheaper as well, so a good model is what we would normally go with. To illustrate a point, though, I'm going to pick a cheap model. And what it's doing is not directly using its own intelligence: it is instead thinking about how it can query the API and dynamically writing the program to send a request to the API. I've given it an API key out here, so it has access to connect to the API. But because I've used a cheap model, there's a good chance there's an error in the code, and it has in fact failed.

**Anand**: [20:48] **Now this is where the error message comes in handy. In a few seconds it should think about the error and... yeah, there is a validator that says let's revise this. The reason is probably that the API key parameter name, which it guessed, is wrong, and the error message says the API key is not set. Great.** So it continues and rewrites a different piece of code, and now it's running the code, and here we are. As of the 11th, the rate was 4.21, then it became 4.27, 4.28, etc. Which is something it can then send to the next step in the agent: convert that into a chart, give me additional calculations, and a whole variety of things.

**Anand**: [21:36] **And the validator once again checks and says "Fine, I've got your results and therefore we are done." This ability to loop through continuously is the core of agentic behavior. The ability to write code and self-correct in the process is what gives it a significant leg up on the verifiability side, because writing code fails in very different ways. It either works or it doesn't work. It's not hallucinating in the traditional sense, and that's the kind of reliability we want. And the self-correction means that we have the ability to constantly iterate and go forward.**
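The write-run-read-the-error-revise loop described above can be sketched in a few lines. Here the FRED endpoint is replaced by a stub and the "revision" step is hard-coded; in the real agent, the revision is the LLM rewriting its own request from the error message.

```python
# Sketch of the validator loop: run, inspect the error, revise, retry.
# fake_api is a stub standing in for the real endpoint; the first attempt
# uses a wrongly guessed parameter name, mimicking the demo's failure.

def fake_api(params: dict) -> dict:
    if "api_key" not in params:
        raise ValueError("error: API key is not set")
    return {"observations": [{"date": "2026-02-11", "value": "4.21"}]}

def revise(params: dict, error: str) -> dict:
    # A real agent would ask the LLM to rewrite the request based on the
    # error message; here we just rename the wrongly guessed parameter.
    if "API key" in error and "apikey" in params:
        params = dict(params)
        params["api_key"] = params.pop("apikey")
    return params

params = {"series_id": "DGS10", "apikey": "PLACEHOLDER"}  # wrong guess
result = None
for attempt in range(3):
    try:
        result = fake_api(params)
        break                      # validator: got results, we are done
    except ValueError as e:
        params = revise(params, str(e))

print(result["observations"][0]["value"])
```

The structure is the whole point: because the failure mode is a concrete error rather than a plausible-sounding wrong answer, the loop has something objective to react to on every iteration.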

**Anand**: [22:11] There is METR, which publishes an extremely interesting chart on long-horizon tasks, showing how long agents can work independently. And this time horizon is constantly increasing. As of 2026, the latest model, Claude Opus 4.6, can work for 12 hours continuously. We just saw something work for what, 12 seconds? Well, okay, take the whole thing, a minute.

**Ankor**: [22:49] Anand, there's some questions coming up on the chat too, so just a heads up about 6 minutes left, and then we can cover it. Back to you.

**Anand**: [22:56] Yeah, I should not take more than 16 seconds. So, this trend is exponential, and that's on a logarithmic scale, so in a few months we should be seeing agents work for days, then weeks. **And with that, the kind of self-correction ability they have, because it's so robust, means that tasks we aren't even thinking of yet will start becoming far more doable. So we're looking forward to a fairly exciting future out there.**

**Anand**: [23:32] Maybe questions now, or you want to wrap up and then we take questions, Ankor?

**Ankor**: [23:36] [inaudible] You're on mute.

**Ankor**: [23:42] I think we can include some of the questions that came, because they're more organizational as part of the wrap up, and then see if we have anything else. So first, thanks Anand. I think as one gets into kind of the real deployment of verification, it can get technical and it can get, what I call "trenchy," in the trenches. **But you know what they say about great NFL teams and Super Bowls? They're won in the trenches. So this is what makes sure stuff is actually operationalizable.**

**Ankor**: [24:14] But I think there was a question on the chat, and while I typed out part of it, let me talk about how we are seeing different organizations progress towards verification. So first I think it's pretty clear, but if you talk to your peers and your friends especially at some of the frontier model companies, verification is actually the biggest problem they are trying to solve. Along with AGI. So you kind of have to balance both of them. But when we come to kind of client enterprises, the pathway towards verification actually is not very dissimilar to how enterprises have looked at deploying quantitative models of the past.

**Ankor**: [24:59] So after the GFC, the Global Financial Crisis, when the OCC and the Fed passed their rulings, partly as part of the "too big to fail" response, they basically said every quantitative model and every tool that is deployed needs to be validated and monitored. Today, on the AI cusp, there's a lot of focus on, "Oh, what else can it do?" And that happens with any new technology. Very quickly we're going to end up having to institute similar processes to what we had for model validation. What do I mean?

**Ankor**: [25:34] **Before any model goes into production, it actually needs a separate validation layer. And just as with the second line of defense, where the developer is different from the validator, there needs to be, and will be, a separate team validating AI models and LLM-based models.** If you step back and think about it, that's not what happens today. When chatbots go into production, there isn't a separate organizational layer validating them to that degree. And if you go into any financial services company, there's a lot of push, pull and friction between the second line of defense and the production teams. So that's a very standard process.

**Ankor**: [26:14] Quarterly monitoring goes into place, which is very similar to what happens with ML models. We worked across financial companies in 2017 where, when ML models were deployed, they had exactly the same sort of problem we're talking about today. The OCC came in and said, "How can I trust the output?" And very similar things were done: look at the data, look at the process, look at quarterly monitoring reports. So those are two very big pieces.

**Ankor**: [26:38] Organizationally, we think a lot of organizations have started moving towards a Chief AI Officer. But it comes from a perspective of, "Let's control all the AI development we're doing, bring more method to the madness." **A big part of the AI Officer's responsibility is going to be verification. You'll end up having development across the enterprise, but before it goes into production, verification is going to become a very, very standard practice.** And that dovetails into the verification layer being part of the stack, and a specific part of the organization being responsible for verification, different from the developers, different from the product teams, different from the businesses.

**Ankor**: [27:28] **And when you step back, this is not actually rocket science that I'm talking about. You go back to every quantitative analytical advancement, finally you have separate organizations making sure that you can trust the output. That's where we see the future.** We're really, I would say very passionate, and we think this is very, very interesting. While it might be a little sobering, we feel we're bringing more sense to this topic rather than saying, "Oh it's all about hallucination." There's method to kind of cut through this to make AIs really operationalizable. That's what we had.

**Host**: [28:03] That makes sense. Are there partners of yours that have made more headway in operationalizing these environments than others? So far, or is everyone kind of early stages?

**Ankor**: [28:13] I think everyone's in the early stages, just because agents haven't become standard; we're on the cusp of this. Across any organization, if you have less than 100 different workflows where agents are deployed, you're finding different verification strategies, and over the next 12 to 18 months they are going to come together into standardized layers.

**Host**: [28:34] I think that makes sense. I know we're at time. Thank you so much.

**Ankor**: [28:39] Great. For any other questions, you know how to reach us. We will be very responsive on anything around AI. Thank you very much for this opportunity. We loved it.

**Host**: [28:50] Appreciate it too. Take care.
