Hi guys,
I’m currently using nn.DataParallel for multi-GPU (8-GPU) training on a single node. However, after I put the data and model on devices[0], the memory usage on GPU 0 grows so large that the program exits with a CUDA out-of-memory error at the beginning of training. Can anyone help?
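For reference, my setup looks roughly like this (the model, batch, and sizes are just placeholders to show the structure):

```python
import torch
import torch.nn as nn

# placeholder model and data, just to illustrate the structure
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
criterion = nn.CrossEntropyLoss()

device = torch.device('cuda:0')  # devices[0]
model = nn.DataParallel(model, device_ids=list(range(8))).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# the whole batch lands on GPU 0 before being scattered
inputs = torch.randn(256, 512).to(device)
targets = torch.randint(0, 10, (256,)).to(device)

outputs = model(inputs)   # scatter across 8 GPUs, gather outputs back on GPU 0
loss = criterion(outputs, targets)
loss.backward()           # gradients are reduced onto GPU 0
optimizer.step()
```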
BTW, I find that if I use DistributedDataParallel instead, memory usage is fine.
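The DDP version that works for me is roughly this (one process per GPU, launched with torch.distributed.launch; again the model is a placeholder):

```python
import argparse
import torch
import torch.nn as nn
import torch.distributed as dist

# launched with: python -m torch.distributed.launch --nproc_per_node=8 train.py
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

# each process only touches its own GPU, so no single device has to
# hold the full batch or the gathered outputs
inputs = torch.randn(32, 512).to(device)
targets = torch.randint(0, 10, (32,)).to(device)
loss = nn.CrossEntropyLoss()(model(inputs), targets)
loss.backward()
```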
Environment:
PyTorch 1.0.1
CUDA 9.0