I'm training a model that uses ZeRO-3 (DeepSpeed, equivalent to FSDP) along with other forms of parallelism (tensor + sequence).
What I’ve noticed is that as I increase the sequence length of the model, the throughput becomes highly unstable.
Logging torch.cuda.memory_stats, and specifically the counters num_device_alloc, num_alloc_retries, and num_device_free, I see that these counters grow over time and that stats such as reserved_bytes.all.current fluctuate.
This is in contrast to the same model run at a smaller sequence length, where the counts are 0, reserved_bytes hits a plateau and stabilizes, and throughput is stable throughout.
I’ve already tried setting expandable_segments. Are there other things to try given memory thrashing / fragmentation?
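For anyone wanting to reproduce the measurement described above, here is a minimal sketch of how the counters could be tracked per step. The helper names (`stats_delta`, `looks_like_thrashing`) and the thrashing heuristic are assumptions made for illustration, not part of PyTorch; only the key names come from `torch.cuda.memory_stats()`. The helpers operate on plain dicts so they can be checked without a GPU.

```python
from typing import Dict, Mapping, Sequence

# Keys as they appear in the dict returned by torch.cuda.memory_stats().
# The first three are the counters mentioned in the post.
WATCHED_KEYS = (
    "num_device_alloc",
    "num_device_free",
    "num_alloc_retries",
    "reserved_bytes.all.current",
)


def stats_delta(
    prev: Mapping[str, int],
    curr: Mapping[str, int],
    keys: Sequence[str] = WATCHED_KEYS,
) -> Dict[str, int]:
    """Return the per-key change between two memory_stats snapshots.

    Keys missing from a snapshot are treated as 0.
    """
    return {k: curr.get(k, 0) - prev.get(k, 0) for k in keys}


def looks_like_thrashing(delta: Mapping[str, int]) -> bool:
    """Heuristic (an assumption, not an official definition): flag a step
    in which the allocator retried an allocation or released device
    memory back to the driver."""
    return delta.get("num_alloc_retries", 0) > 0 or delta.get("num_device_free", 0) > 0


# Sketch of use inside a training loop (requires a CUDA device):
#
#   import torch
#   prev = torch.cuda.memory_stats()
#   for step, batch in enumerate(loader):
#       train_step(batch)                      # hypothetical training step
#       curr = torch.cuda.memory_stats()
#       delta = stats_delta(prev, curr)
#       if looks_like_thrashing(delta):
#           print(step, delta)
#       prev = curr
```

In a stable run the deltas should settle at zero after warm-up; steps where they do not are the ones worth correlating with throughput drops.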
Hello,
Here are a few additional steps to consider. Enable ZeRO-3’s offloading capabilities to shift optimizer and gradient states to CPU or NVMe memory, which can help reduce GPU memory usage. Try reducing the batch size to stabilize memory usage. Enable gradient checkpointing, which saves memory at the expense of extra computation. Keep monitoring memory usage with tools like torch.cuda.memory_stats to identify unusual patterns. Finally, make sure you are using the latest version of DeepSpeed, as updates often come with performance improvements and bug fixes.
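As a rough illustration of the offloading and batch-size suggestions, the sketch below builds a partial DeepSpeed config as a Python dict. The function name `build_zero3_offload_config` and the default values are assumptions chosen for the example; the keys (`train_micro_batch_size_per_gpu`, `zero_optimization`, `offload_optimizer`, `offload_param`, `pin_memory`, `nvme_path`) are DeepSpeed configuration keys. It is a fragment, not a complete config: optimizer, precision, and the tensor/sequence parallel settings are left out, and gradient checkpointing is enabled in the model code rather than here.

```python
import json
from typing import Any, Dict, Optional


def build_zero3_offload_config(
    micro_batch_size: int = 1,
    device: str = "cpu",
    nvme_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a partial DeepSpeed config with ZeRO stage 3 offloading.

    device is "cpu" or "nvme"; nvme_path is required for "nvme".
    """
    if device not in ("cpu", "nvme"):
        raise ValueError("device must be 'cpu' or 'nvme'")
    if device == "nvme" and nvme_path is None:
        raise ValueError("nvme_path is required when device is 'nvme'")

    offload: Dict[str, Any] = {"device": device, "pin_memory": True}
    if device == "nvme":
        offload["nvme_path"] = nvme_path

    return {
        # A smaller micro-batch lowers peak activation memory.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "zero_optimization": {
            "stage": 3,
            # Separate copies so the two sections can be tuned independently.
            "offload_optimizer": dict(offload),
            "offload_param": dict(offload),
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_zero3_offload_config(), indent=2))
```

Whether offloading helps with fragmentation specifically depends on the workload: it lowers resident GPU memory, which gives the caching allocator more headroom, but it adds host-device transfers that can reduce throughput.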
Let me know if this helps!