Conference paper
Causally Reliable Concept Bottleneck Models
Giovanni De Felice, Arianna Casanova Flores, et al.
NeurIPS 2025
This talk will focus on designing and evaluating agentic benchmarks with a strong emphasis on in-domain evaluation and real-world task reliability. Drawing from the development of AssetOpsBench, we’ll discuss practical considerations for measuring agent behavior, task completion quality, and decision robustness. The session will highlight what works, what doesn’t, and what matters most when building benchmarks for agent-based systems.
Sarath Swaminathan, Nathaniel Park, et al.
NeurIPS 2025
Xavier Gonzalez, Leo Kozachkov, et al.
NeurIPS 2025
Max Esposito, Besart Shyti
NeurIPS 2025