Organizations deploying LLM inference often face critical decisions about hardware procurement, software stack selection, and deployment configurations. Today, these decisions are frequently made through ad-hoc testing, which consumes significant GPU resources and often leads to suboptimal outcomes. The diversity of deployment environments, model architectures, inference frameworks, and accelerator hardware makes exhaustive benchmarking impractical.
FMwork is a systematic benchmarking methodology that addresses this challenge by narrowing both the input configuration space and the output metrics space to focus on the most informative parameters and indicators. This targeted approach accelerates evaluation, reduces resource waste, and enables consistent, reproducible comparisons across platforms.
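As an illustration of what narrowing the input configuration space can buy, the short Python sketch below contrasts an exhaustive grid with a one-factor-at-a-time sweep around a pivot configuration. The knob names and values (batch size, sequence length, tensor-parallel degree, inference framework) are assumptions chosen for illustration only; FMwork's actual parameter set and selection logic are not reproduced here.

```python
from itertools import product

# Hypothetical deployment knobs (illustrative values, not FMwork's actual set).
batch_sizes = [1, 4, 16, 64, 128]
seq_lengths = [128, 512, 2048, 8192]
tp_degrees  = [1, 2, 4, 8]             # tensor-parallel GPU counts
frameworks  = ["vllm", "tgi", "trt-llm"]

# A naive exhaustive sweep benchmarks every combination of every knob.
full_grid = list(product(batch_sizes, seq_lengths, tp_degrees, frameworks))

# A narrowed sweep fixes the less informative knobs at a pivot configuration
# and varies one informative axis at a time, deduplicating the pivot point
# that each axis sweep shares.
pivot = dict(batch=16, seq=2048, tp=4, fw="vllm")
narrowed = set(
    [(b, pivot["seq"], pivot["tp"], pivot["fw"]) for b in batch_sizes]
    + [(pivot["batch"], s, pivot["tp"], pivot["fw"]) for s in seq_lengths]
    + [(pivot["batch"], pivot["seq"], t, pivot["fw"]) for t in tp_degrees]
)

print(f"exhaustive sweep: {len(full_grid)} runs")  # 5 * 4 * 4 * 3 = 240 runs
print(f"narrowed sweep:   {len(narrowed)} runs")   # 11 unique runs
```

Even on this toy grid the run count drops from 240 to 11, consistent in spirit with the order-of-magnitude savings reported below.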
In a representative study, FMwork achieved over an order-of-magnitude reduction in total benchmarking time compared to a naïve exhaustive sweep, while capturing the key trends needed for deployment decisions across NVIDIA, AMD, and Intel GPUs. By providing an open, extensible framework, FMwork benefits the broader HPC and AI community through more efficient, sustainable performance evaluation.