Additional information (if it helps): I tried repeating the experiment SimonW suggested, but with a larger tensor to broadcast. Up to a tensor of size roughly 10GB the broadcast succeeds, but at 11GB there is an out of memory error (and I’m working with two GPUs each having just over 12GB of memory):
In [1]: import torch
In [2]: x = torch.randn(int(2.75e9)).cuda()
In [3]: ys = torch.cuda.comm.broadcast(x, [0, 1])
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
RuntimeError Traceback (most recent call last)
in ()
----> 1 ys = torch.cuda.comm.broadcast(x, [0, 1])
/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/cuda/comm.py in broadcast(tensor, devices)
     19     corresponding to indices from devices.
     20     """
---> 21     return torch._C._broadcast(tensor, devices)
     22
     23
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
So, is there just an overhead of about 1GB of temporary storage that gets allocated on the source GPU of the broadcast? If so, is there any way to reduce it?
On the other hand, I should add that the total size of all the parameters of the actual model I’m using is only about 7GB, so evidently there is additional overhead when replicating that model, compared with just broadcasting a single tensor as in this experiment. I’m not sure what to make of that.
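For what it’s worth, this is how I’m computing the model’s parameter footprint: sum `numel() * element_size()` over all parameters. A minimal sketch, using a tiny stand-in model since the actual 7GB model isn’t shown here:

```python
import torch

# Stand-in for the actual model being replicated (hypothetical; substitute yours).
model = torch.nn.Linear(4, 4)

# Total bytes occupied by all parameters: element count times bytes per element.
total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(total_bytes)  # 80: (16 weights + 4 biases) * 4 bytes each
```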
OK, a little more data: Consider the difference between
In [1]: import torch
In [2]: inputs = [torch.nn.Parameter(torch.zeros(1).cuda()), torch.nn.Parameter(torch.zeros(int(2e9)).cuda())]
In [3]: outputs = torch.cuda.comm.broadcast_coalesced(inputs, [0, 1])
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
RuntimeError Traceback (most recent call last)
in ()
----> 1 outputs = torch.cuda.comm.broadcast_coalesced(inputs, [0, 1])
/iscsi/rdata/rbeaudoin/projects/sme-memory/lib/python3.6/site-packages/torch/cuda/comm.py in broadcast_coalesced(tensors, devices, buffer_size)
     38     corresponding to indices from devices.
     39     """
---> 40     return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
     41
     42
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
and
In [1]: import torch
In [2]: inputs = [torch.nn.Parameter(torch.zeros(int(2e9)).cuda())]
In [3]: outputs = torch.cuda.comm.broadcast_coalesced(inputs, [0, 1])
In [4]:
(No error in the second case.) I also tried two bare tensors rather than Parameters, and in that case there is no out-of-memory error either. Note that a vector of 2e9 float32s occupies about 8GB. So it looks like something incurs quite a bit of memory overhead when broadcast_coalesced is fed a list of more than one Parameter, for whatever that is worth.
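As a sanity check on the sizes quoted above (pure arithmetic, no GPU needed), float32 tensors occupy 4 bytes per element:

```python
BYTES_PER_FLOAT32 = 4

# Tensor from the first experiment: 2.75e9 elements.
first = int(2.75e9) * BYTES_PER_FLOAT32
print(first / 1e9)  # 11.0 (GB)

# Tensor from this experiment: 2e9 elements.
second = int(2e9) * BYTES_PER_FLOAT32
print(second / 1e9)  # 8.0 (GB)
```

So the 11GB figure is exactly where the first experiment started failing on a 12GB card, and each Parameter in this experiment is already two thirds of a device’s memory.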