Why AI Workflows Give Different Answers Every Time — and How to Make Them Deterministic

Written by: Chris Walker, VP, Head of Product Marketing

Published: June 10, 2026

The short answer: AI workflows drift because language models are probabilistic. Every time you describe a task in plain language you leave gaps, and the model fills them with a slightly different — but equally reasonable — interpretation on each run. The fix isn't a more deterministic model. It's a complete specification: validate the system's interpretation of your request before it executes, and the same workflow returns the same answer every time.

Most teams adopting AI for analytics run into the same uncomfortable surprise. They build an automated workflow, it works beautifully in the demo, and then a month later someone notices the numbers don't match. Same question. Same data. Different answer. Nobody changed anything, yet the output drifted.

This isn't a bug you can patch. It's a structural property of how conversational AI works, and understanding it is the key to building workflows you can actually trust to run your business.

Figure: The same question, run three times, returns three defensible answers, because the model reinterprets the gaps each time. Query: "share growth vs competitors this Q?" Run 1: +2.3 pts. Run 2: −0.8 pts. Run 3: +1.1 pts. Average: +0.87 pts across 3 runs.

The promise and the catch

The industry has converged on a term for what people want here: agentic workflows. The idea is simple and appealing. Instead of an analyst manually pulling data, segmenting it, applying business rules, and assembling a report every week, you describe the process once and let AI agents execute it on demand — reliably, consistently, the same way every time.

That last part is where things break down.

Large language models are probabilistic by design. When you describe a task in plain language, the model fills in the gaps you left unspecified — and you always leave gaps. The model makes a reasonable interpretation, executes, and produces a result. Run it again tomorrow, and the model may make a slightly different reasonable interpretation. Both are defensible. Neither is wrong. But they don't match, and in a business context, "close enough but inconsistent" is often worse than useless.
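To make the mechanism concrete, here is a toy sketch in Python. It is not how any real model works internally: `sample_interpretation` and the option weights are hypothetical stand-ins for a model choosing among equally reasonable readings of an underspecified request.

```python
import random

# Hypothetical readings of "share growth this quarter"; the weights are invented.
BASELINE_OPTIONS = {
    "prior quarter": 0.5,
    "same quarter last year": 0.5,
}


def sample_interpretation(rng: random.Random) -> str:
    """Pick one reading at random, the way a sampled model output would."""
    options = list(BASELINE_OPTIONS)
    weights = list(BASELINE_OPTIONS.values())
    return rng.choices(options, weights=weights, k=1)[0]


def distinct_readings(runs: int) -> set[str]:
    """Run the toy sampler `runs` times, each run with its own seed."""
    return {sample_interpretation(random.Random(seed)) for seed in range(runs)}
```

Each individual reading is reasonable. The trouble is that a few hundred runs will, in practice, surface both of them, and nothing in the request says which one was meant.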

A concrete example: one phrase, six hidden decisions

Consider a commercial pharma team running a brand performance analysis. They ask for "share growth versus competitors this quarter." Seems clear. But the model has to silently decide:

Growth versus the prior quarter, or the same quarter last year?
Which competitors — the top three by volume, or the full therapeutic class?
Volume share or value share?

Each of these is a judgment call, and the model makes a different call depending on phrasing, context, and chance. One run says the brand gained 2.3 points. Another says it lost ground. The underlying data never changed. Only the interpretation did.
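A small worked sketch shows how far these choices can move the number. Everything here is invented for illustration: the `SALES` figures, the period labels, and the function names are assumptions, not real brand data. The competitors are pre-aggregated into "top three" and "rest of class" to keep the arithmetic short.

```python
from itertools import product

# Invented toy figures: (brand, top three competitors combined, rest of class).
SALES = {
    "this_quarter": {"volume": (300, 500, 200), "value": (500, 1000, 500)},
    "prior_quarter": {"volume": (280, 520, 200), "value": (520, 980, 500)},
    "same_quarter_last_year": {"volume": (320, 480, 200), "value": (480, 1020, 500)},
}


def share(period: str, metric: str, full_class: bool) -> float:
    """Brand share in percent for one period under one reading."""
    brand, top_three, rest = SALES[period][metric]
    market = brand + top_three + (rest if full_class else 0)
    return 100 * brand / market


def share_growth(baseline: str, metric: str, full_class: bool) -> float:
    """Change in share, in percentage points, from the baseline to this quarter."""
    change = share("this_quarter", metric, full_class) - share(baseline, metric, full_class)
    return round(change, 2)


def all_readings() -> dict[tuple[str, str, bool], float]:
    """Evaluate every combination of the three hidden decisions."""
    baselines = ("prior_quarter", "same_quarter_last_year")
    metrics = ("volume", "value")
    return {
        (b, m, fc): share_growth(b, m, fc)
        for b, m, fc in product(baselines, metrics, (True, False))
    }
```

With these made-up numbers, the eight combinations produce eight different figures, ranging from -2.5 to +2.5 points, all computed from the same underlying data.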

Now multiply that across dozens of recurring workflows and hundreds of runs per quarter, and you have a real problem. Decisions get made on results that aren't reproducible. Audits become impossible. Trust erodes.

The traditional fix, and why it hurts

The established answer to this problem is the DAG — a directed acyclic graph. If you've worked with data engineering or workflow orchestration tools, you've seen these: explicit, step-by-step flowcharts where every operation, condition, and path is defined in advance. Nothing is left to interpretation because nothing is interpreted. The logic is hard-coded.

DAGs are reliable. That's exactly why banks, insurers, and data teams have relied on them for decades. Run a well-built DAG a thousand times and you get the same result a thousand times.
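For readers who haven't seen one, a minimal sketch of a hard-coded DAG might look like this, using `graphlib` from the Python standard library. The step names and the numbers are hypothetical, and passing each result straight to the next step only works because this example is a simple linear chain.

```python
from graphlib import TopologicalSorter

# Hypothetical weekly report: each step maps to the steps it depends on.
DEPENDENCIES = {
    "pull_data": set(),
    "segment": {"pull_data"},
    "apply_rules": {"segment"},
    "build_report": {"apply_rules"},
}

# Every operation is fixed in advance; nothing is interpreted at run time.
STEPS = {
    "pull_data": lambda _: [120, 80, 40],                  # invented values
    "segment": lambda rows: [r for r in rows if r >= 50],  # keep the large ones
    "apply_rules": lambda rows: [r * 2 for r in rows],     # a made-up rule
    "build_report": lambda rows: sum(rows),
}


def run_dag() -> int:
    """Execute every step in dependency order, feeding each result forward."""
    result = None
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        result = STEPS[name](result)
    return result
```

Calling `run_dag()` returns 400 on every call, because every decision was made when the graph was written.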

But that reliability comes at a steep cost. Building a DAG requires technical expertise most business users don't have. Every edge case has to be anticipated and encoded up front. Changing the workflow means going back into the definition and rewiring it. The people who understand the business problem — the brand manager, the RevOps leader, the category analyst — usually can't build or modify these workflows themselves. They file a request and wait. The thing that was supposed to make analysis faster becomes another queue.

So teams face an unappealing choice. Conversational AI is accessible but inconsistent. DAGs are consistent but inaccessible. For a long time, you had to pick one.

Figure: The old trade-off and the bridge across it: describe a workflow conversationally, run it deterministically. Conversational AI is accessible but inconsistent. A DAG is consistent but inaccessible. Missions are described like the first and run like the second.

The real root cause: under-specification

Here's the insight that unlocks a better path. The inconsistency problem isn't really about the AI being unpredictable. It's about the specification being incomplete.

When a user describes a workflow in their own words, the description is almost never detailed enough to execute unambiguously. People speak in shorthand because they carry the missing context in their heads. They know "30 days" means calendar days, that "blocked deals" means deals missing a champion, that "effectiveness" means incremental lift against a control group. The model doesn't know any of that. So it guesses — and a guess that changes between runs is the source of the drift.

The fix, then, isn't to make the AI more deterministic. It's to make the specification complete before anything runs.
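One way to picture "complete" is a specification in which every definition is an explicit field and anything still unresolved is easy to detect. The sketch below is an illustration only: `StalledDealSpec` and its fields are hypothetical names, not a description of how any product stores its specifications.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class StalledDealSpec:
    """Hypothetical spec; None marks a definition the user has not resolved."""
    window_days: int = 30
    day_type: Optional[str] = None            # "calendar" or "business"
    blocker_definition: Optional[str] = None  # e.g. "missing champion"


def unresolved(spec: StalledDealSpec) -> list[str]:
    """Names of the fields a model would otherwise have to guess."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]
```

A spec built from the shorthand alone, `StalledDealSpec()`, reports `day_type` and `blocker_definition` as unresolved. Once both are filled in, the list is empty and there is nothing left to guess.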

Validate the interpretation before you execute

The approach we use in Tellius — in the agentic workflows we call Missions — is to insert a deliberate checkpoint between what the user describes and what the system executes.

When a user describes a workflow conversationally, the system doesn't immediately run it. Instead, it parses the request, generates its own structured interpretation of every step, and shows that interpretation back to the user — explicitly flagging the places where it had to make an assumption.

Picture a B2B SaaS RevOps leader automating a quarterly pipeline review. They type: "Pull all deals above 500K, segment by stage and region, flag anything that hasn't moved in 30 days, and surface the top blockers."

Rather than executing on its best guess, the system responds with something like:

Here's how I'm interpreting this workflow:

Filter opportunities with deal value greater than $500,000
Group by current stage and assigned region
Flag opportunities with no activity logged in the last 30 calendar days
For each flagged deal, extract the most recent activity note as the blocker

Should I adjust any of these definitions before running?

Now the ambiguity is visible. The user can correct it in seconds: "By blockers I mean missing champion contacts, not the last note." The system updates its interpretation, and only then does the workflow get locked in.
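The exchange above can be sketched in a few lines of Python. This is a simplified illustration, not the actual Missions implementation: the `Step` class, the `[assumption]` marker, and the choice of which steps are flagged are all assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    text: str
    assumed: bool = False  # True where the system had to guess


def render(steps: list[Step]) -> str:
    """Show the interpretation back to the user, marking each assumption."""
    lines = ["Here's how I'm interpreting this workflow:"]
    for i, step in enumerate(steps, start=1):
        flag = " [assumption]" if step.assumed else ""
        lines.append(f"{i}. {step.text}{flag}")
    lines.append("Should I adjust any of these definitions before running?")
    return "\n".join(lines)


def correct(steps: list[Step], index: int, new_text: str) -> list[Step]:
    """Return a new list with one step replaced by the user's definition."""
    updated = list(steps)
    updated[index] = replace(steps[index], text=new_text, assumed=False)
    return updated


draft = [
    Step("Filter opportunities with deal value greater than $500,000"),
    Step("Group by current stage and assigned region"),
    Step("Flag opportunities with no activity logged in the last 30 calendar days", assumed=True),
    Step("For each flagged deal, extract the most recent activity note as the blocker", assumed=True),
]
confirmed = correct(draft, 3, "For each flagged deal, report missing champion contacts as the blocker")
```

Printing `render(draft)` reproduces the confirmation message with two steps marked as assumptions. After the correction, only the 30-calendar-day reading is still marked, and the user can accept or change it before anything runs.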

When the planner finally executes, it isn't interpreting anymore. It's working from a comprehensive, validated specification — a blueprint with the guesswork already resolved. Underneath, a deterministic reasoning engine runs that locked specification the same way on every pass: the math is computed, not generated, so it doesn't drift between runs. And because every step traces back to defined logic, the workflow is governed and traceable — you can show a CFO exactly how a number was produced. Run it a hundred times and you get the same result a hundred times, because there's nothing left to guess.
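As a rough sketch of what "computed, not generated" can mean in practice, the example below runs a locked specification with plain computation and stamps each result with a fingerprint of the spec that produced it. The spec fields, the sample deals, and the function names are all hypothetical. One design choice worth noting: the evaluation date is passed in explicitly, so a re-run does not silently depend on the clock.

```python
import hashlib
import json
from datetime import date

# Hypothetical locked specification, as confirmed by the user.
SPEC = {"min_value": 500_000, "stale_after_days": 30, "blocker_rule": "missing_champion"}

# Each named rule maps to fixed logic, so the rule name is never reinterpreted.
BLOCKER_RULES = {"missing_champion": lambda deal: deal["champion"] is None}

# Invented sample deals.
DEALS = [
    {"id": "D1", "value": 750_000, "last_activity": date(2026, 4, 1), "champion": None},
    {"id": "D2", "value": 900_000, "last_activity": date(2026, 6, 1), "champion": "A. Rivera"},
    {"id": "D3", "value": 200_000, "last_activity": date(2026, 3, 1), "champion": None},
    {"id": "D4", "value": 600_000, "last_activity": date(2026, 4, 15), "champion": "J. Chen"},
]


def fingerprint(spec: dict) -> str:
    """Short hash of the spec; sorted keys make it independent of key order."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def run(spec: dict, deals: list[dict], as_of: date) -> dict:
    """Apply the locked rules; no step involves sampling or free-text parsing."""
    is_blocked = BLOCKER_RULES[spec["blocker_rule"]]
    stalled = [
        d for d in deals
        if d["value"] > spec["min_value"]
        and (as_of - d["last_activity"]).days > spec["stale_after_days"]
    ]
    return {
        "spec_id": fingerprint(spec),
        "as_of": as_of.isoformat(),
        "stalled": [d["id"] for d in stalled],
        "blocked": [d["id"] for d in stalled if is_blocked(d)],
    }
```

Running this a hundred times with the same spec, data, and date gives a hundred identical results, and the `spec_id` in each result records which definitions produced the number.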

Key takeaway: Determinism doesn't come from a better model. It comes from resolving ambiguity before execution — while a human is still in the loop and a thirty-second clarification can save a quarter of bad decisions.

Figure: Describe → confirm → lock → run. The checkpoint at step two is what makes every downstream run reproducible. Describe: plain language. Confirm: validate the interpretation. Lock: specification fixed. Run: same result every time.

Conversational to build, deterministic to run

What makes this work is that it captures the benefits of both worlds without the usual trade-off. The user never builds a DAG, never writes code, never anticipates edge cases in the abstract. They describe what they want in plain language and confirm that the system understood — which is a far more natural way to catch errors than reading a flowchart. Underneath, the validated spec behaves like a DAG: explicit, complete, and reproducible. Conversational to build. Deterministic to run.

The payoff shows up wherever workflows recur and decisions depend on them:

Commercial pharma. Brand teams run the same market and share analysis week after week, confident the results moved because the market moved — not because the model reinterpreted the question. When NBRx softens in a territory, the investigation that explains why is identical in method every time, so the trend is real and not an artifact of phrasing.
Brand management. Promotional effectiveness analysis becomes repeatable and auditable. When finance asks how a lift number was calculated, you can point to the locked specification and the defined logic behind every step — not a one-off interpretation that may not survive a re-run.
CPG. Category and trade-spend reviews produce consistent results across periods, which is the entire point of a period-over-period comparison. If the methodology silently shifts between quarters, the comparison is noise; when it's fixed, the movement is signal.
B2B SaaS RevOps. Pipeline health checks and forecast rollups return the same logic every quarter, so leadership trusts the trend line instead of relitigating the methodology in every QBR. The conversation moves from "is this number right?" to "what do we do about it?"

The lesson generalizes well beyond any single product. As more teams move analytical work onto AI agents, the differentiator won't be which model you use. It will be whether your workflows produce the same answer twice. The way to get there isn't to fight the probabilistic nature of language models — it's to resolve the ambiguity before execution, while the human is still in the loop.

Conversation is the right interface for describing intent. Determinism is the right behavior for executing it. The trick is knowing where one ends and the other begins.

FAQ

What does it mean for an AI workflow to be deterministic?

A deterministic workflow produces the same result every time it runs on the same data. With most AI agents, that isn't guaranteed: the model reinterprets any ambiguity in your request on each run, so results drift. A workflow becomes deterministic when the specification is complete — every definition, filter, and edge case resolved — so there is nothing left for the model to guess at execution time.

Why do AI agents give different results for the same prompt?

Because language models are probabilistic. A plain-language prompt always leaves gaps (what counts as "this quarter," which competitors, which share metric), and the model fills those gaps with a reasonable interpretation that can vary from run to run. The data didn't change; the interpretation did.

Can you make LLM-based workflows reproducible?

Yes — but not by changing the model. The reliable path is to validate the system's interpretation of the request before it executes, lock that interpretation into a complete specification, and run it on a deterministic reasoning engine. The language model helps you build the spec conversationally; it isn't left to reinterpret the spec at runtime.

How is this different from a DAG?

A DAG gives you reproducibility but requires technical expertise to build and maintain, which puts business users in a queue. The validate-then-execute approach gives you the same reproducibility while letting non-technical users describe and confirm the workflow in plain language. You get the accessibility of conversation and the consistency of a hard-coded pipeline.
