Tian Gao, Amit Dhurandhar, et al.
NeurIPS 2025
We present the design and implementation of a new lifetime-aware tensor offloading framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Our framework, TeraX, is developed specifically for large language model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven by our observation that only a small fraction (<5%) of tensors in LLMs are active in each training iteration, and that many inactive tensors are large and will not be used for a long period of time, creating ample opportunities to offload them to, and prefetch them from, slow SSDs without stalling the GPU training process. TeraX accurately estimates the lifetime of each tensor from the execution graph generated by PyTorch and, based on these estimates, produces an optimized tensor offloading plan. A runtime tensor migration engine fulfills the plan via GPUDirect Storage, which allows direct data transfer between GPUs and SSDs. Compared with state-of-the-art systems such as ZeRO-Offload and ZeRO-Infinity, TeraX improves the training performance of various LLMs and achieves near-ideal performance, i.e., close to that of a setup with unlimited GPU memory.
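To illustrate the lifetime-aware planning idea described in the abstract, the following is a minimal, hypothetical sketch (not TeraX's actual algorithm or API): given a topologically ordered execution graph, it computes each tensor's idle intervals and marks as offload/prefetch candidates those tensors whose idle time would hide an SSD round trip. All names, the per-op time model, and the bandwidth figure are illustrative assumptions.

```python
# Hypothetical sketch of lifetime-aware offload planning; names, the crude
# per-op time model, and the bandwidth/threshold values are assumptions,
# not taken from the TeraX paper.
from dataclasses import dataclass, field

@dataclass
class Op:
    index: int                                  # position in topological order
    reads: list = field(default_factory=list)   # tensor names read by this op
    writes: list = field(default_factory=list)  # tensor names produced by this op

def plan_offloads(ops, tensor_bytes, pcie_gbps=12.0, step_ms=5.0):
    """Flag tensors whose idle interval between two uses is long enough
    to hide an SSD write + read-back over PCIe."""
    uses = {}  # tensor name -> op indices that touch it, in execution order
    for op in ops:
        for name in op.reads + op.writes:
            uses.setdefault(name, []).append(op.index)

    plan = []
    for name, idxs in uses.items():
        size = tensor_bytes[name]
        # Estimated time (ms) to write the tensor to SSD and read it back.
        round_trip_ms = 2 * size / (pcie_gbps * 1e6)
        for prev, nxt in zip(idxs, idxs[1:]):
            idle_ms = (nxt - prev - 1) * step_ms  # assumed constant per-op time
            if idle_ms > round_trip_ms:
                plan.append({"tensor": name,
                             "offload_after_op": prev,
                             "prefetch_before_op": nxt})
    return plan

# Toy usage: act0 is untouched between op 1 and op 50, so it is a candidate.
ops = [Op(0, writes=["act0"]),
       Op(1, reads=["act0"], writes=["act1"]),
       Op(50, reads=["act0", "act1"])]
tensor_bytes = {"act0": 512 * 2**20, "act1": 256 * 2**20}
print(plan_offloads(ops, tensor_bytes))
```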
Jose Manuel Bernabé Murcia, Eduardo Canovas Martinez, et al.
MobiSec 2024
Vidushi Sharma, Andy Tek, et al.
NeurIPS 2025
Weiqin Chen, Nhan Pham, et al.
NeurIPS 2025