FSDP is using more GPU memory than DDP

Hi team,

I have one model that I trained twice: once under DDP and once under FSDP, with activation checkpointing enabled in both runs. With DDP I can use batch_size=512, but with FSDP I can only fit batch_size=256.
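For reference, the two setups look roughly like the sketch below. This is a minimal reconstruction, not my exact training code: the small TransformerEncoder is a stand-in for the real model, and the FULL_SHARD strategy with a single root FSDP wrap is an assumption on my part.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for the real model (placeholder, not the actual architecture).
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=12,
    ).cuda()

    use_fsdp = True  # flip to False for the DDP run
    if use_fsdp:
        # FSDP run: OOMs above batch_size=256
        model = FSDP(
            model,
            sharding_strategy=ShardingStrategy.FULL_SHARD,
            device_id=local_rank,
        )
    else:
        # DDP run: batch_size=512 fits
        model = DDP(model, device_ids=[local_rank])

    # Activation checkpointing is enabled in both runs.
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: isinstance(m, torch.nn.TransformerEncoderLayer),
    )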

After some profiling, I found that SplitWithSizesBackward looks suspicious: there are large cudaMalloc allocations attributed to it.
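For context, this is roughly how I captured the profile (a minimal sketch of a single training step; model and batch are placeholders for my actual wrapped module and input):

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Profile one forward/backward step with memory tracking enabled.
    # `model` and `batch` are placeholders for my actual model and input batch.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        profile_memory=True,   # record allocator activity per op
        record_shapes=True,
        with_stack=True,       # attribute allocations to backward nodes
    ) as prof:
        total_loss = model(batch).sum()
        total_loss.backward()

    # Sorting by CUDA memory is where SplitWithSizesBackward stood out for me.
    print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=20))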

Can anyone share some insight into this function or this allocation pattern? I would appreciate it.

The OOM happens during the backward pass:

   total_loss.backward()
  File "/usr/local/lib/python3.8/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 308.00 MiB. GPU 7 has a total capacty of 39.56 GiB of which 188.81 MiB is free. Process 457865 has 39.37 GiB memory in use. Of the allocated memory 36.82 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-12-02 23:10:31,719][thrift_logger][INFO] - Thrift loggers for 0 topics are closed
>=1 process exited with error return code 1, which will cause training to hang. Killing remaining process groups: [464460, 464461, 464463, 464464]

I'm also running into this problem. When I use FSDP with three GPUs to fine-tune the Llama 3 vision model, every GPU under FSDP uses more memory than under DDP or single-GPU training. I get an OOM error, and it happens in _flat_param.py while flattening tensors. What's more, even when I can load the model under FSDP, it uses more memory than DDP.
Is there a solution to this problem, or could it be a bug? See the sketch below for roughly where it fails for me.
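For reference, my wrapping step is roughly like this sketch (a generic transformer stand-in rather than the actual Llama 3 vision checkpoint, and the size-based auto-wrap policy is just my assumption of a typical setup); the OOM is raised inside the FSDP constructor while it flattens parameters:

    import functools
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())

    # Stand-in for the Llama 3 vision model I actually load (placeholder).
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=2048, nhead=16, batch_first=True),
        num_layers=24,
    )

    # The OOM is raised inside this call, while FSDP groups and flattens
    # parameters into FlatParameters (_flat_param.py).
    model = FSDP(
        model,
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=int(1e8)
        ),
        device_id=torch.cuda.current_device(),
    )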

Double post from here.