DataParallel memory consumption in PyTorch 0.4

I see — in that case it appears my code is working as expected. Thank you for following up, and I’m glad you found that NCCL 1-related issue!

Thanks SimonW (and mrdrozdov), I’m happy to see progress on this! I saw this problem in code that attempted to broadcast a single large tensor (not quite 8 GB, on a machine with two GPUs, each with 12 GB of memory) and a few smaller ones. The machine on which I observed the problem is currently running PyTorch 0.4.0, and torch.cuda.nccl.version() reports 2115. I don’t think we’ve upgraded PyTorch on this box since I reported the issue, and we certainly did not deliberately upgrade NCCL.
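For concreteness, here is roughly the pattern that hit the problem (the exact sizes and the broadcast_coalesced call are my illustrative guesses at what DataParallel does under the hood, not a verbatim excerpt from my code):

import torch
import torch.cuda.comm as comm

# Sketch of the failing pattern (sizes illustrative): one large tensor of
# roughly 8 GB in float32 plus a few small ones, broadcast from GPU 0 to
# both GPUs.
big = torch.empty(2_000_000_000, device='cuda:0')               # ~8 GB of float32
small = [torch.empty(1024, device='cuda:0') for _ in range(3)]
copies = comm.broadcast_coalesced([big] + small, devices=[0, 1])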

Are you saying that your patch fixes this issue, making it possible to broadcast large tensors? Also, am I correct in assuming that to get the patch right now I’d need to pull pytorch master from github?

The patch probably fixes a different issue than the one you have. Specifically, it fixes a case where, if NCCL 1 is used to broadcast a single tensor with more than 1 << 31 - 1 elements, it raises an opaque error message (not an OOM). So it is different. I tried to reproduce your issue with NCCL 2 but wasn’t able to.
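For reference, the shape of the case the patch addresses looks roughly like this (a sketch, not the actual test; the size is only there to get past the element-count limit):

import torch
import torch.cuda.comm as comm

# A single tensor with more than 2**31 - 1 elements going through the
# NCCL 1 broadcast path used to fail with an opaque error (not an OOM).
n = 2**31                              # just past the 32-bit element-count limit
t = torch.empty(n, device='cuda:0')    # ~8.6 GB in float32
ys = comm.broadcast(t, [0, 1])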

By the way, how did you get your PyTorch install? We ship our binaries with NCCL 1, not NCCL 2 AFAIK.

OK, still good to know there are eyeballs on this. I don’t recall whether I set up the relevant virtualenv by running “pip install torch” or by using pip to install a wheel directly from the PyTorch web site. I can tell you that I just now (on another machine, in a fresh virtualenv) installed PyTorch using the recommended pip3 command (for Linux with Python 3.6 and CUDA 9.2), and there torch.cuda.nccl.version() reports 2213. We also use (Ubuntu) apt-get to install CUDA packages on all our machines with GPUs; could that be where we are getting NCCL 2 from?
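In case it helps anyone else check their own environment, this is all I ran to see what each install is actually using (all standard PyTorch attributes):

import torch

print(torch.__version__)           # e.g. 0.4.0
print(torch.version.cuda)          # CUDA version the wheel was built against
print(torch.cuda.nccl.version())   # e.g. 2213, presumably NCCL 2.2.13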

Ah sorry, I was mistaken. I just found out that our binaries ship with NCCL 2. By the way, have you tried 0.4.1? Is the issue reproducible there?

I’ve just verified that the issue persists on 0.4.1. To be more precise, broadcast_coalesced still runs out of GPU memory when given more than one nn.Parameter to broadcast, even though there is enough memory on both GPUs to hold the copied parameters, as long as no unnecessary copying is done.
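For what it’s worth, the pattern I’m hitting boils down to something like this (the sizes are illustrative, not the exact ones from my model):

import torch
import torch.nn as nn
import torch.cuda.comm as comm

# More than one Parameter, each large, on 12 GB GPUs. A single copy of
# everything fits comfortably on each device, yet the coalesced broadcast
# still runs out of memory for me on 0.4.0 and 0.4.1.
p1 = nn.Parameter(torch.empty(1_000_000_000, device='cuda:0'))  # ~4 GB
p2 = nn.Parameter(torch.empty(500_000_000, device='cuda:0'))    # ~2 GB
copies = comm.broadcast_coalesced([p1, p2], devices=[0, 1])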

@SimonW
Hello, I am facing the same issue.
I have 2 GPUs, both with 32 GB of memory. I ran my code with a linear layer of size ~86000 * 8192, so it’s a pretty big tensor, and I got an error during broadcasting. (With only 1 GPU I don’t get the error at broadcast time; I get it after a few iterations of training and validation.)
So I reduced the size to 86000 * 2048. With 1 GPU this runs and finishes correctly with no errors (nvidia-smi shows only 12 GB of memory in use). With 2 GPUs I now get an error on loss.backward(). Just before the error, nvidia-smi shows one GPU using 6 GB of memory (which makes sense, 12/2) while the other climbs to 32 GB (no idea what’s going on there). I suspect unnecessary copying is being done. I’ve seen posts about imbalanced memory use with DataParallel, but my point is that the imbalance shouldn’t be this extreme, and if it works with 1 GPU then it should work with 2. Are there extra memory requirements with 2 GPUs? (It works with 1 but not with 2.) A rough sketch of the setup is below.
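Here is roughly what the code looks like (names, batch size, and which dimension of the linear layer is input vs. output are simplifications/guesses; the real model and training loop have more going on):

import torch
import torch.nn as nn

# The 86000 x 2048 weight is only ~0.7 GB in float32, so the parameters
# alone are nowhere near 32 GB.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2048, 86000)   # reduced from the original 8192 width

    def forward(self, x):
        return self.fc(x)

model = nn.DataParallel(Net().cuda())        # 2 GPUs, 32 GB each
out = model(torch.randn(64, 2048).cuda())    # forward pass runs
loss = out.sum()
loss.backward()                              # with 2 GPUs, this is where it fails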

I ran your commands above and this is the output

>>> print(torch.cuda.nccl.is_available(torch.randn(1).cuda()))
True
>>> print(torch.cuda.nccl.version())
2408
>>> x = torch.randn(3).cuda()
>>> ys = torch.cuda.comm.broadcast(x, [0, 1])
>>> print(x.storage().data_ptr())
139984923262976
>>> print(ys[0].storage().data_ptr())
139984923262976

Python 3.7.4