As enterprises race toward autonomous AI agents, reliability—not capability—is emerging as the biggest barrier to adoption. New research from Appier aims to address that challenge by introducing a framework designed to measure and improve how large language models (LLMs) make decisions under risk.
The company’s latest research paper, “Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models,” proposes a systematic way to evaluate AI decision behavior when incorrect responses carry real-world consequences. The study introduces a new methodology that quantifies decision quality across different risk scenarios—an approach the company says can significantly improve the reliability of autonomous AI systems.
The findings come as organizations increasingly move beyond AI copilots toward Agentic AI systems capable of acting independently within enterprise workflows.
The Reliability Problem Facing Agentic AI
AI agents are quickly becoming the next frontier of enterprise automation.
Unlike traditional AI tools that assist human decision-making, Agentic AI systems can take actions on behalf of users—executing tasks, coordinating workflows, and making decisions across digital systems.
But autonomy introduces a critical challenge: trust.
If an AI agent confidently produces a wrong answer or takes an incorrect action, the consequences can be significant—particularly in fields like marketing automation, financial decision-making, or customer engagement.
Industry data highlights the scale of the issue. A 2025 survey by McKinsey & Company found that 62% of organizations are already experimenting with AI agents, yet inaccuracy remains the most frequently cited concern when deploying AI in enterprise settings.
That problem often manifests as AI hallucinations—responses that sound plausible but contain incorrect or fabricated information.
Appier’s research focuses on a related but less discussed challenge: how AI systems decide whether to answer at all.
Beyond Correct vs. Incorrect Answers
Traditional LLM evaluations measure performance primarily through accuracy: they test whether a model's answer is correct.
But real-world enterprise scenarios are rarely that simple.
Sometimes the safest choice is not to answer at all—especially when the AI is uncertain and the cost of being wrong is high.
Appier’s research argues that existing evaluation frameworks fail to capture this nuance.
For example:
- In high-risk situations, a wrong answer might cause financial or reputational damage.
- In low-risk situations, refusing to answer may frustrate users or slow workflows.
The optimal decision therefore depends on context and risk tolerance, not just raw accuracy.
To reflect this reality, the research introduces a Risk-Aware Decision-Making framework that evaluates how well models balance three possible actions:
- providing an answer
- refusing to answer
- making an informed guess
The framework then measures whether the AI’s choice maximizes expected value under the given risk conditions.
Turning Risk Into Measurable Metrics
A key innovation in the research is translating risk into quantifiable parameters.
Instead of treating all answers equally, the framework assigns different values to outcomes:
- Rewards for correct answers
- Penalties for incorrect responses
- Costs associated with refusing to answer
These variables simulate realistic enterprise environments where mistakes carry different levels of impact.
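To make the framework concrete, here is a minimal Python sketch of how such outcome values could drive a decision rule. It is an interpretation of the framework as described in the paper, not Appier's code; the payoff numbers, the `RiskProfile` and `best_action` names, and the assumption that an uninformed guess succeeds at roughly chance level are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class RiskProfile:
    """Illustrative risk parameters; the values used below are
    assumptions for this example, not figures from the paper."""
    reward_correct: float   # payoff for a correct answer
    penalty_wrong: float    # payoff (negative) for an incorrect answer
    cost_refuse: float      # payoff for declining to answer
    chance_level: float     # success rate of an uninformed guess


def best_action(confidence: float, risk: RiskProfile) -> str:
    """Return the action with the highest expected value, where
    `confidence` is the model's estimated probability that its
    candidate answer is correct."""
    expected = {
        # Answer with the candidate, succeeding at the stated confidence.
        "answer": confidence * risk.reward_correct
        + (1 - confidence) * risk.penalty_wrong,
        # Refuse outright: a fixed, known payoff.
        "refuse": risk.cost_refuse,
        # Fall back to an uninformed guess (e.g., a random option on a
        # four-choice question succeeds about 25% of the time).
        "guess": risk.chance_level * risk.reward_correct
        + (1 - risk.chance_level) * risk.penalty_wrong,
    }
    return max(expected, key=expected.get)


# High-risk setting: a wrong answer costs four times the reward.
high_risk = RiskProfile(reward_correct=1.0, penalty_wrong=-4.0,
                        cost_refuse=0.0, chance_level=0.25)
print(best_action(confidence=0.6, risk=high_risk))  # -> refuse
print(best_action(confidence=0.9, risk=high_risk))  # -> answer
```

With these hypothetical values, the rational move flips from refusing to answering as confidence crosses a risk-dependent threshold.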
Within this structure, a language model must evaluate three factors before deciding how to respond:
- Capability – whether it can solve the task
- Confidence – how certain it is in the answer
- Risk conditions – the consequences of being wrong
The model’s decision quality is then measured by whether it chooses the option that maximizes expected reward.
In essence, the framework evaluates decision strategy, not just knowledge.
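That strategy reduces to a break-even confidence threshold. Continuing the hypothetical payoffs above, answering beats refusing only when the expected value p × reward + (1 − p) × penalty meets or exceeds the refusal payoff, which solves to a minimum confidence of (cost − penalty) / (reward − penalty):

```python
def breakeven_confidence(reward: float, penalty: float,
                         cost_refuse: float) -> float:
    """Minimum confidence at which answering matches refusing:
    solve p*reward + (1 - p)*penalty = cost_refuse for p."""
    return (cost_refuse - penalty) / (reward - penalty)


# High risk: a wrong answer costs 4x the reward -> answer only above 80%.
print(breakeven_confidence(reward=1.0, penalty=-4.0, cost_refuse=0.0))  # 0.8
# Low risk: a mild penalty -> answering pays off from ~33% confidence.
print(breakeven_confidence(reward=1.0, penalty=-0.5, cost_refuse=0.0))  # ~0.333
```

Under the high-risk payoffs, the rational policy answers only above 80% confidence; soften the penalty and the threshold drops to about 33%. One way to read the findings in the next section is that models fail to track this shifting threshold.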
A Strategic Imbalance in Today’s AI Models
Using this framework, Appier’s research uncovered a surprising pattern across many leading language models.
Rather than consistently adapting to different risk environments, models often show strategic imbalance.
In high-risk scenarios—where incorrect answers carry heavy penalties—models tend to guess too often, risking costly mistakes.
Conversely, in low-risk environments, models may become overly cautious, refusing to answer even when a response would likely be correct.
This inconsistency reduces both the safety and usefulness of AI systems.
The study suggests the issue isn’t simply a lack of knowledge. Instead, it stems from a model’s difficulty integrating multiple reasoning capabilities—such as confidence estimation and outcome evaluation—into a stable decision strategy.
A Three-Step Method for Smarter Decisions
To address this problem, the research proposes a new method called Skill Decomposition.
Rather than forcing a language model to make a single all-in-one decision, the approach separates reasoning into three structured stages:
- Task Execution – the model attempts to solve the task and generates an initial answer.
- Confidence Estimation – the system then evaluates how confident it is in that answer.
- Expected-Value Reasoning – finally, the model weighs potential outcomes based on the risk environment before deciding whether to answer, refuse, or guess.
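A rough sketch of how the three stages might be chained appears below, reusing the hypothetical `RiskProfile` and `best_action` helpers from the earlier example. The `generate_answer` and `estimate_confidence` callables are placeholders for model calls (for instance, a solve prompt and a separate self-evaluation prompt), not APIs from the paper.

```python
def risk_aware_respond(task: str, risk: RiskProfile,
                       generate_answer, estimate_confidence) -> str:
    """Illustrative three-stage pipeline: execute, estimate, decide."""
    # Stage 1: Task Execution - attempt the task and draft an answer.
    answer = generate_answer(task)

    # Stage 2: Confidence Estimation - score the draft in a separate
    # step, rather than folding it into the same generation pass.
    confidence = estimate_confidence(task, answer)

    # Stage 3: Expected-Value Reasoning - compare answering, refusing,
    # and guessing under the current risk profile (see best_action above).
    action = best_action(confidence, risk)
    if action == "refuse":
        return "I can't answer this reliably."
    return answer  # returned for both "answer" and "guess"
```

Separating the stages is the point of the method: each step is a task the model can do reasonably well in isolation, instead of one entangled judgment.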
This layered reasoning process helps the model combine its knowledge with an assessment of uncertainty and consequences.
The result, according to the research, is more rational and stable decision behavior, particularly in high-risk enterprise environments.
Why Enterprises Care About Risk-Aware AI
For companies deploying autonomous AI agents, decision reliability is a foundational requirement.
Unlike AI copilots that operate alongside humans, agent-based systems may perform actions such as:
- executing marketing campaigns
- adjusting product recommendations
- interacting with customers
- triggering operational workflows
In these contexts, inaccurate decisions can quickly propagate across systems.
Risk-aware decision frameworks therefore serve as a governance layer, ensuring that AI agents act cautiously when necessary while still maintaining operational efficiency.
This balance is critical for organizations seeking to move beyond experimentation toward large-scale AI deployment.
From Research to Enterprise Platforms
Appier says the insights from its research are already being incorporated into its enterprise products.
The company has integrated the methodology into its Agentic AI-powered platforms, including:
- Ad Cloud for AI-driven advertising optimization
- Personalization Cloud for customer engagement automation
- Data Cloud for unified data intelligence and analytics
By embedding risk-aware decision logic into these systems, the company aims to help enterprises adopt autonomous workflows while maintaining reliability and trust.
The Bigger Shift Toward Autonomous AI
The research reflects a broader transformation underway across enterprise AI.
Early generative AI deployments focused on copilots—tools designed to assist human users.
But the industry is rapidly moving toward AI agents capable of independent action, often coordinating complex workflows across software systems.
Major technology providers and startups alike are investing heavily in agent frameworks, orchestration platforms, and autonomous decision engines.
However, autonomy introduces new governance challenges that go beyond traditional machine learning evaluation metrics.
Frameworks like Appier’s risk-aware methodology may become increasingly important as organizations seek ways to measure and regulate AI behavior in real-world environments.
The Bottom Line
Appier’s latest research tackles one of the most pressing questions in enterprise AI: how to ensure autonomous systems make decisions responsibly when the stakes are high.
By introducing a quantifiable framework for evaluating risk-aware decision-making, the company provides a new lens for assessing LLM behavior beyond simple accuracy metrics.
As enterprises move from AI copilots toward fully autonomous agents, tools that improve reliability and governance could play a critical role in unlocking the next wave of AI-driven business automation.