Using a single H100 (80GB) on Llama 3.2 70B (INT4 quantized):
The new driver introduces an experimental feature called "Direct System Access," which lets the GPU page in data directly from the system’s NVMe storage or RAM without buffering through the CPU’s L3 cache. This is a watershed moment for deep learning training. Because the feature bypasses the traditional zero-copy bottlenecks, training times for Large Language Models (LLMs) are projected to decrease not because the GPU is faster, but because it spends less time starved for data. The long-standing problem of the "data-starved GPU" is finally being addressed at the driver level.
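To make the "starving less" argument concrete, here is a minimal back-of-the-envelope sketch in Python. It does not use or model the Direct System Access feature itself. It only compares a training step where the GPU waits for each batch against one where data delivery overlaps with compute. The function names and all timings are hypothetical illustration values, not measurements.

```python
# Toy model of a "data-starved" GPU.
# All names and timings here are hypothetical illustration values,
# not measurements from any driver, toolkit, or benchmark.

def step_time(load_ms: float, compute_ms: float, overlapped: bool) -> float:
    """Return the wall-clock time of one training step in milliseconds.

    Serial:     the GPU waits for the batch to arrive, then computes.
    Overlapped: the next batch is fetched while the GPU computes, so the
                slower of the two stages sets the pace.
    """
    if overlapped:
        return max(load_ms, compute_ms)
    return load_ms + compute_ms


def gpu_utilization(load_ms: float, compute_ms: float, overlapped: bool) -> float:
    """Fraction of each step the GPU spends computing rather than waiting."""
    return compute_ms / step_time(load_ms, compute_ms, overlapped)


if __name__ == "__main__":
    compute_ms = 60.0  # assumed GPU compute time per step (unchanged throughout)
    for load_ms, overlapped in [(40.0, False), (10.0, False), (40.0, True)]:
        t = step_time(load_ms, compute_ms, overlapped)
        u = gpu_utilization(load_ms, compute_ms, overlapped)
        mode = "overlapped" if overlapped else "serial"
        print(f"load={load_ms:5.1f} ms  {mode:10s}  step={t:6.1f} ms  util={u:.0%}")
```

In this toy model the compute time never changes. The step gets shorter only because the GPU waits less, which is the effect the driver feature is claimed to deliver.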
This exclusive report breaks down the latest release, the ongoing transition to the Blackwell Ultra architecture, and the newly revealed "Green Contexts" that are redefining GPU resource management.

The Arrival of CUDA Toolkit 13.2.1