DataParallel memory consumption in PyTorch 0.4

I believe I’m seeing a regression after upgrading from PyTorch 0.3.1 to 0.4.0. Specifically, I’m trying to use nn.DataParallel to train, on two GPUs, a model with a parameter that takes up over half the memory of either GPU. When the DataParallel code replicates the model across both GPUs, it broadcasts the parameters to both and runs out of GPU memory during the broadcast operation. In 0.3.1 the broadcast operation was implemented in Python and contained logic to skip the copy when the source and target devices are the same; in 0.4 this operation has been moved into C code, and I’m guessing that this optimization was lost in the process.
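
For concreteness, here is a minimal sketch of the kind of setup I mean (BigModel and the sizes are made up for illustration; my real model is different, but the shape of the problem is the same):

import torch
import torch.nn as nn

# Hypothetical stand-in: a single ~8GB float32 parameter, i.e. more than half
# of a 12GB GPU.
class BigModel(nn.Module):
    def __init__(self):
        super(BigModel, self).__init__()
        self.big = nn.Parameter(torch.zeros(int(2e9)))

    def forward(self, x):
        return x.sum() + self.big[0]

model = BigModel().cuda()
parallel_model = nn.DataParallel(model, device_ids=[0, 1])
# The OOM happens here, while DataParallel replicates (broadcasts) the
# parameters to both GPUs, before my forward code even runs.
out = parallel_model(torch.zeros(2, 3).cuda())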

If any of the PyTorch developers are reading this, can you verify that my guess is correct? More importantly, can anyone suggest a workaround for this problem? Am I missing some way to use DataParallel (either the module or the functional form) that will replicate a model on multiple GPUs even when it has a parameter that takes up over half the memory of any single GPU? (As I write this it occurs to me that someone may object that the gradients for such a model won’t fit in GPU memory anyhow, but that isn’t so, thanks to the wonderful feature of sparse tensor gradients.)
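
To illustrate why the gradients fit even though the parameter itself is huge, here is a small example (toy table size) of the sparse-gradient behavior I’m relying on; I’m using an nn.Embedding with sparse=True purely for illustration:

import torch
import torch.nn as nn

# With sparse=True the gradient of the embedding table is a sparse tensor whose
# size scales with the rows actually touched in the batch, not the full table.
emb = nn.Embedding(1000000, 100, sparse=True).cuda()
out = emb(torch.tensor([1, 2, 3]).cuda()).sum()
out.backward()
print(emb.weight.grad.is_sparse)  # True
print(emb.weight.grad._nnz())     # 3 rows of gradient, not 1000000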

Thanks in advance for any light you can shed.

You are right that there is a regression when we broadcast the tensors, and I’m working on a fix. However, it should really only happen when you don’t have NCCL, and we ship NCCL1 in the binaries… so I’m a bit confused. Could you tell me how you installed PyTorch?

Hi @SimonW, I am not the original poster, but I downloaded the official Windows 10 build from the main site and ran into warnings that my PyTorch was not built with NCCL. I don’t have screen captures, but I am fairly positive I have run into this problem. I haven’t touched PyTorch in a while. The version I originally had was from peterjc123’s builds; once I saw official Windows support I upgraded to 0.4.0.

Thanks for the report. @peterjc123, was our Windows binary built with NCCL 1/2?

Thanks for such a quick response! I used pip to install PyTorch into a Python 3 virtual environment (as I recall, nothing fancier than "pip install torch"). If it matters, this is with Python 3.6.5 on Ubuntu 16.04; let me know if you need more details. Should I be trying to install NCCL myself?

It should be included. Could you try running

print(torch.cuda.nccl.is_available(torch.randn(1).cuda()))
print(torch.cuda.nccl.version())

and tell me the output?

@inkplay it’d be great if you could run the lines above as well!

Here you go:

Python 3.6.5 (default, May 16 2018, 13:09:02)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: print(torch.cuda.nccl.is_available(torch.randn(1).cuda()))
True

In [3]: print(torch.cuda.nccl.version())
2115

Can you also do this for me?

x = torch.randn(3).cuda()
ys = torch.cuda.comm.broadcast(x, [0, 1])  # assuming you have >= 2 GPUs
print(x.storage().data_ptr())
print(ys[0].storage().data_ptr())

Python 3.6.5 (default, May 16 2018, 13:09:02)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: x = torch.randn(3).cuda()
   ...: ys = torch.cuda.comm.broadcast(x, [0, 1])  # assuming you have >= 2 GPUs
   ...: print(x.storage().data_ptr())
   ...: print(ys[0].storage().data_ptr())
   ...:
   ...:
139669103522816
139669103522816

In [3]:

Hmm, then it is really weird, because broadcast on dense contiguous tensors doesn’t copy on the current device, as expected. Could you double-check that the OOM happens when broadcasting the modules, i.e. in replicate(...) of replicate.py?
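
For example, something along these lines should hit that path directly (just a sketch, assuming model is your module and you have GPUs 0 and 1):

import torch
from torch.nn.parallel import replicate

# Exercise the replication path that nn.DataParallel uses internally; this goes
# through Broadcast.apply / broadcast_coalesced on the parameters.
model = model.cuda()  # `model` here stands for your existing module
replicas = replicate(model, [0, 1])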

I’m working on that now. (It takes a while to run my code; the model is big.) I can tell you from memory that the offending call is to broadcast_coalesced, and that it happens inside Broadcast.forward, which is called (via Broadcast.apply) from replicate. It might take me until tomorrow to get you a call stack.

Hi. The Windows version is built without NCCL; NCCL isn’t ready for Windows yet.

C:\Anaconda3\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
  warnings.warn('PyTorch is not compiled with NCCL support')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-fdbbe4fefe36> in <module>()
      1 import torch
      2 print(torch.cuda.nccl.is_available(torch.randn(1).cuda()))
----> 3 print(torch.cuda.nccl.version())

C:\Anaconda3\lib\site-packages\torch\cuda\nccl.py in version()
     29 
     30 def version():
---> 31     return torch._C._nccl_version()
     32 
     33 

AttributeError: module 'torch._C' has no attribute '_nccl_version'

Installed PyTorch using the conda method from the main site: Windows -> conda -> Python 3.6 -> CUDA 8.0.

Edited: just read the replies from peterjc123 and the OP; it makes sense now.

OK, sorry about the delay, but here is a traceback showing the memory failure:

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "semantic_matching.py", line 849, in <module>
    train_model(model, dataset)
  File "semantic_matching.py", line 766, in train_model
    pos_energy = parallel_model(*pos_batch)
  File "/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 113, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 118, in replicate
    return replicate(module, device_ids)
  File "/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 17, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

I hope this helps.

Additional information (if it helps): I tried repeating the experiment @SimonW suggested, but with a larger tensor to broadcast. Up to a tensor of roughly 10GB the broadcast succeeds, but at 11GB there is an out-of-memory error (and I’m working with two GPUs, each having just over 12GB of memory):

In [1]: import torch

In [2]: x = torch.randn(int(2.75e9)).cuda()

In [3]: ys = torch.cuda.comm.broadcast(x, [0, 1])
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory

RuntimeError                              Traceback (most recent call last)
<ipython-input-3> in <module>()
----> 1 ys = torch.cuda.comm.broadcast(x, [0, 1])

/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/cuda/comm.py in broadcast(tensor, devices)
     19     corresponding to indices from devices.
     20     """
---> 21     return torch._C._broadcast(tensor, devices)
     22
     23

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

So, is there just an overhead of about 1GB of temporary storage that gets allocated on the source GPU of the broadcast? If so, is there any way to reduce it?
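
If it would help with the diagnosis, I can try to quantify that overhead with the allocator statistics, along these lines (a sketch; I’m assuming torch.cuda.memory_allocated and torch.cuda.memory_cached report what I think they do in 0.4.0):

import torch

x = torch.randn(int(2.5e9)).cuda()   # ~10GB source tensor on GPU 0 (this size succeeds)
before = torch.cuda.memory_allocated(0)
ys = torch.cuda.comm.broadcast(x, [0, 1])
after = torch.cuda.memory_allocated(0)
print('extra bytes allocated on GPU 0:', after - before)
print('bytes held by the caching allocator on GPU 0:', torch.cuda.memory_cached(0))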

OTOH, it occurs to me to add that the total size of all the parameters of the actual model I’m using is only about 7GB, so evidently there is additional overhead when replicating that model as opposed to just broadcasting a single tensor as in this experiment. I wish I knew what to make of that.

OK, a little more data: Consider the difference between

In [1]: import torch

In [2]: inputs = [torch.nn.Parameter(torch.zeros(1).cuda()), torch.nn.Parameter(torch.zeros(int(2e9)).cuda())]

In [3]: outputs = torch.cuda.comm.broadcast_coalesced(inputs, [0, 1])
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory

RuntimeError                              Traceback (most recent call last)
<ipython-input-3> in <module>()
----> 1 outputs = torch.cuda.comm.broadcast_coalesced(inputs, [0, 1])

/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/cuda/comm.py in broadcast_coalesced(tensors, devices, buffer_size)
     38     corresponding to indices from devices.
     39     """
---> 40     return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
     41
     42

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

and

In [1]: import torch

In [2]: inputs = [torch.nn.Parameter(torch.zeros(int(2e9)).cuda())]

In [3]: outputs = torch.cuda.comm.broadcast_coalesced(inputs, [0, 1])

In [4]:

(No error in the second case.) I also tried two bare tensors rather than Parameters, and in that case there is no out-of-memory error either. Note that the total size of a vector of float32s of length 2e9 is about 8GB. So it looks like something causes quite a bit of memory overhead when broadcast_coalesced is fed a list of more than one Parameter, for whatever that is worth.
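
As a possible stopgap I may try broadcasting each tensor separately instead of going through broadcast_coalesced, roughly like this (just a sketch of the idea, using the same inputs list as above; I haven’t verified it end to end):

import torch

# Broadcast each tensor on its own rather than coalescing, to avoid whatever
# intermediate buffer the coalesced path builds.
per_tensor = [torch.cuda.comm.broadcast(t.data, [0, 1]) for t in inputs]
# Regroup per device: outputs_per_device[d] is the tuple of copies on GPU d.
outputs_per_device = list(zip(*per_tensor))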

Any update here? I’m seeing a similar problem on Linux with PyTorch installed through conda and CUDA 9.

@SimonW I ran these lines as you suggested:

x = torch.randn(3).cuda()
ys = torch.cuda.comm.broadcast(x, [0, 1])  # assuming you have >= 2 GPUs
print(x.storage().data_ptr())
print(ys[0].storage().data_ptr())

and got similar output, indicating that a copy isn’t happening during the broadcast. Is this a problem?

Hi,
I tried to reproduce according to your instructions, but couldn’t get an OOM with the following code:

import torch
import torch.nn as nn

MB = 1024 * 1024
N = int(10.5 * MB * 1024 / 4)  # a ~10.5GB float32 tensor

x = torch.zeros(N, device='cuda')
x1 = torch.zeros(1, device='cuda')
x = nn.Parameter(x)
x1 = nn.Parameter(x1)
ts = torch.cuda.comm.broadcast_coalesced([x, x1], [0, 1])

This is on 2 GPUs, each with 12GB of memory. OOM triggers if I broadcast an 11GB tensor, but not when I broadcast a 10.5GB tensor together with a 4-byte tensor.

There are two possibilities: (1) the bug was somehow fixed, since I am using a master build, or (2) the bug only happens with NCCL 1, since I’m using NCCL 2.

I’ll find an NCCL 1 build on Monday and see if I can reproduce.

Not copying is the expected behavior. What’s your PyTorch version and NCCL version (obtained via torch.cuda.nccl.version())?
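
For example, the output of:

import torch

print(torch.__version__)
print(torch.cuda.nccl.version())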

I found that NCCL1 has a lower tensor size limit, and prepared this patch: https://github.com/pytorch/pytorch/pull/11466 . However, this can really only happen if you have a single huge tensor (not several tensors that add up to be huge). Is this the case with your code?
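
If the size limit is indeed the culprit, one possible workaround until the patch lands is to broadcast the big tensor in chunks that stay under the limit, along these lines (a rough sketch for a flat 1D tensor, not something I’ve benchmarked; chunk_elems is an arbitrary choice):

import torch

def broadcast_in_chunks(tensor, devices, chunk_elems=int(1e8)):
    # Rough sketch: pre-allocate the destination on every non-source device and
    # copy the (1D) source over in slices, so each individual broadcast stays
    # well under NCCL1's per-call size limit.
    src = tensor.get_device()
    outputs = []
    for d in devices:
        if d == src:
            outputs.append(tensor)  # no copy needed on the source GPU
        else:
            outputs.append(torch.empty(tensor.size(), dtype=tensor.dtype, device=d))
    for start in range(0, tensor.numel(), chunk_elems):
        chunk = tensor[start:start + chunk_elems]
        copies = torch.cuda.comm.broadcast(chunk, devices)
        for d, out, copy in zip(devices, outputs, copies):
            if d != src:
                out[start:start + chunk_elems].copy_(copy)
    return outputs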