DistributedDataParallel consumes much more GPU memory

Does the DistributedDataParallel wrapper cost much GPU memory? In my case, the model takes around 7300MB when loaded onto a single GPU. However, when wrapped in DistributedDataParallel and run in distributed mode, it takes about 22000MB of GPU memory.
Is this caused by the DistributedDataParallel wrapper? Are there any ways to reduce memory usage? Thanks!
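In case it helps, here is a small sketch of how memory can be checked from inside the process; note that nvidia-smi also counts the CUDA context and the caching allocator's free blocks, so its numbers can be higher than what PyTorch reports:

```python
import torch

# Allocator-side view of GPU memory for the current device.
# nvidia-smi will report more, since it also includes the CUDA
# context and blocks the caching allocator holds but isn't using.
device = torch.cuda.current_device()
print(f"allocated: {torch.cuda.memory_allocated(device) / 1024**2:.0f} MB")
print(f"peak:      {torch.cuda.max_memory_allocated(device) / 1024**2:.0f} MB")
```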

That’s a huge increase in memory. Which model and batch sizes are you using?

I am using PointPillars from this repo: https://github.com/traveller59/second.pytorch, with DataParallel changed to DistributedDataParallel. The batch size is 2 per GPU.
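The change is essentially the following sketch (simplified; `build_pointpillars()` is a placeholder for the repo's actual model construction, and the script is launched with one process per GPU):

```python
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# One process per GPU; torch.distributed.launch passes --local_rank.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

# build_pointpillars() is a placeholder for the repo's model setup.
model = build_pointpillars().cuda(args.local_rank)
model = DistributedDataParallel(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
)
```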

How many nodes are you using and how many GPUs per node? Also, which communication backend are you using?

Also, it would help with debugging if you could share the code you're using to initialize and train with DistributedDataParallel.
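When you share it, it may also help to log which device each process actually uses right after wrapping the model; one common cause of inflated memory is several processes allocating on the same GPU. A sketch, assuming the process group is already initialized:

```python
import torch
import torch.distributed as dist

# After init_process_group and wrapping the model: if every rank
# prints device 0, the processes are all allocating on one GPU.
rank = dist.get_rank()
dev = torch.cuda.current_device()
alloc_mb = torch.cuda.memory_allocated(dev) / 1024**2
print(f"rank {rank}: cuda:{dev}, {alloc_mb:.0f} MB allocated")
```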