Memory thrashing: `num_alloc_retries` keeps growing, `reserved_bytes` fluctuates

Training a model that uses ZeRO-3 (DeepSpeed, roughly equivalent to FSDP) along with other forms of parallelism (tensor + sequence parallelism).

What I’ve noticed is that as I increase the model’s sequence length, throughput becomes highly unstable:

  • When I log torch.cuda.memory_stats(), specifically the counters num_device_alloc, num_alloc_retries, and num_device_free, I see these counters grow over time, and stats such as reserved_bytes.all.current fluctuate (a minimal logging sketch is included below).
  • This is in contrast to the same model run at a smaller sequence length, where those counts stay at 0, reserved_bytes.all.current reaches a plateau and stabilizes, and throughput is stable throughout.

I’ve already tried setting expandable_segments. Are there other things to try given memory thrashing / fragmentation?
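
For reference, this is roughly how I am pulling those counters each step. It is only a minimal sketch: the helper name `log_allocator_stats` is mine, and the env-var line is just the standard way of enabling expandable_segments that I mentioned above.

```python
import os

# Assumption: the allocator config must be set before the first CUDA allocation;
# setting it before importing torch keeps it safe. This mirrors the
# expandable_segments setting already described above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch


def log_allocator_stats(step: int, device: int = 0) -> None:
    """Print the caching-allocator counters that indicate thrashing.

    `log_allocator_stats` is a name made up for this sketch; call it once
    per training step (or every N steps) from the training loop.
    """
    stats = torch.cuda.memory_stats(device)
    print(
        f"step={step} "
        f"num_device_alloc={stats.get('num_device_alloc', 0)} "
        f"num_device_free={stats.get('num_device_free', 0)} "
        f"num_alloc_retries={stats.get('num_alloc_retries', 0)} "
        f"reserved_gib={stats['reserved_bytes.all.current'] / 2**30:.2f}"
    )
```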

Hello,
Here are a few additional steps to consider:

  • Enable ZeRO-3’s offloading capabilities to shift optimizer state, gradients, and parameters to CPU or NVMe memory, which can reduce GPU memory usage (a config sketch follows this list).
  • Try reducing the micro-batch size to stabilize memory usage.
  • Enable gradient checkpointing, which saves memory at the expense of extra computation.
  • Monitor memory usage with tools like torch.cuda.memory_stats() to identify unusual patterns.
  • Ensure you are using the latest version of DeepSpeed, as updates often come with performance improvements and bug fixes.
Let me know if this helps!

Thanks - fixed the issue by profiling memory usage and slightly reducing per-layer memory requirements.

Options such as updating DeepSpeed or enabling offloading are not feasible given the need for a stable training environment and our throughput requirements.
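
For anyone hitting the same thing, here is a minimal sketch of the kind of memory profiling described above. `train_step` is a hypothetical stand-in for your own training step, and the `_record_memory_history` / `_dump_snapshot` calls are the underscore-prefixed allocator APIs in recent PyTorch releases, so exact signatures may vary by version.

```python
import torch

# Record caching-allocator events for a few steps, dump a snapshot, and inspect
# it at https://pytorch.org/memory_viz to see which allocations dominate.
torch.cuda.memory._record_memory_history(max_entries=100_000)

for step in range(5):
    train_step()  # hypothetical training-step function

torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```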