Aisera, a leader in Agentic AI for enterprises, has unveiled a new benchmarking framework for evaluating AI agents in real-world enterprise applications. The study, co-authored with researchers from Stanford University, has been accepted for presentation at the ICLR 2025 Workshop on Building Trust in Large Language Models (LLMs). The CLASSic framework provides a holistic evaluation of AI agents beyond traditional accuracy-based metrics, covering five dimensions: cost, latency, accuracy, stability, and security. Aisera plans to open-source the benchmark framework to foster industry-wide advancements in enterprise AI.
Why Benchmarking AI Agents Matters
Enterprise AI adoption has surged, yet existing benchmarks often fall short due to:
- Over-reliance on synthetic data, failing to reflect real-world complexity.
- Limited evaluation metrics, primarily focused on accuracy while neglecting operational factors.
- Lack of security assessment, exposing AI agents to adversarial vulnerabilities.
To ensure trustworthy, efficient, and scalable AI solutions, Aisera introduces the CLASSic framework, a comprehensive evaluation method for enterprise AI agents.
The CLASSic Framework: Five Dimensions
1. Cost Efficiency
- Measures API usage, token consumption, and infrastructure costs.
- Ensures AI solutions deliver business value without excessive overhead.
2. Latency Performance
- Evaluates end-to-end response times in real-world enterprise applications.
- Addresses bottlenecks that impact scalability and real-time interactions.
3. Accuracy & Workflow Execution
- Assesses AI agents’ correctness in understanding and executing tasks.
- Compares domain-specific models vs. general-purpose foundation models.
4. Stability & Consistency
- Measures AI agents’ robustness across varied inputs, industries, and conditions.
- Ensures agents maintain accuracy across repeated invocations.
5. Security & Risk Mitigation
- Tests resilience against adversarial prompts, injection attacks, and data leaks.
- Strengthens compliance and protects enterprise data integrity.
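The five dimensions above lend themselves to a simple composite evaluation. The sketch below is an illustrative assumption of how such a harness might aggregate per-dimension measurements into a single score; the names (`ClassicResult`, `classic_score`), the normalization bounds, and the equal weights are hypothetical and are not part of Aisera's released framework.

```python
"""Hedged sketch: aggregating CLASSic-style metrics into one score.

All identifiers and weights here are illustrative assumptions,
not Aisera's actual implementation.
"""
from dataclasses import dataclass


@dataclass
class ClassicResult:
    cost_usd_per_task: float  # average spend per task; lower is better
    latency_s: float          # end-to-end response time; lower is better
    accuracy: float           # fraction of tasks executed correctly, 0-1
    stability: float          # agreement rate across repeated runs, 0-1
    security: float           # fraction of adversarial probes resisted, 0-1


def classic_score(r, max_cost=1.0, max_latency=10.0, weights=None):
    """Normalize each dimension to [0, 1] (higher = better), then
    take a weighted average. Cost and latency are inverted against
    assumed budget ceilings so that cheaper/faster agents score higher."""
    weights = weights or {"cost": 0.2, "latency": 0.2, "accuracy": 0.2,
                          "stability": 0.2, "security": 0.2}
    normalized = {
        "cost": max(0.0, 1.0 - r.cost_usd_per_task / max_cost),
        "latency": max(0.0, 1.0 - r.latency_s / max_latency),
        "accuracy": r.accuracy,
        "stability": r.stability,
        "security": r.security,
    }
    return sum(weights[k] * v for k, v in normalized.items())


# Example: a hypothetical domain-specialized agent's measurements.
agent = ClassicResult(cost_usd_per_task=0.10, latency_s=2.0,
                      accuracy=0.92, stability=0.88, security=0.95)
print(round(classic_score(agent), 3))  # composite score in [0, 1]
```

In practice the weights would reflect a given deployment's priorities (e.g. a compliance-heavy enterprise might weight security more heavily), which is one reason a multi-dimensional breakdown is more informative than a single accuracy number.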
Findings: Domain-Specific AI Agents Outperform General Models
The research, conducted across banking, financial services, healthcare, edtech, and biotechnology, found that:
- Domain-specialized AI agents consistently outperformed those based purely on foundation LLMs.
- They delivered higher accuracy, stronger security, and lower operational costs.
- General-purpose LLMs, while competitive in accuracy, struggled with latency and cost-efficiency.
“The CLASSic framework serves as a pragmatic guide for enterprise AI adoption, ensuring AI agents are not just accurate but also cost-effective, stable, and secure.”
— Utkarsh Contractor, Field CTO, Aisera
ICLR 2025 Recognition & Open-Source Initiative
The International Conference on Learning Representations (ICLR), a premier AI research conference, has accepted the study for presentation at its 2025 Workshop on Building Trust in LLMs.
Aisera will open-source the CLASSic framework, enabling enterprises, researchers, and AI developers to:
- Replicate and refine benchmarking methods.
- Compare AI agents across different industries.
- Drive innovation in AI agent development for enterprise applications.
“Evaluating AI agents on multiple dimensions is essential for unlocking their full value for enterprises. This is what the CLASSic framework aims to achieve.”
— Michael Wornow, PhD, Stanford University
With AI adoption accelerating, enterprise leaders need reliable evaluation methods to maximize AI efficiency and security. The CLASSic framework offers a holistic, standardized approach to assessing AI agents, ensuring:
- Operational cost efficiency.
- Scalable and real-time performance.
- Higher accuracy and stability.
- Stronger security against adversarial threats.
By open-sourcing this framework, Aisera is empowering businesses to adopt AI responsibly and effectively, setting a new benchmark for enterprise AI success.