PyTorch broadcasting operation takes too much GPU memory

Hi, I'm trying to calculate a specific loss that requires a broadcasting operation, but I found that it uses too much GPU memory. Can anyone help me with it? Example code below:

import torch

x = torch.randn(32, 7862, 1, 3).cuda()
y = torch.randn(32, 1, 7862, 3).cuda()

print('before sub')
print('mem expected in MB: ', (x.nelement() + y.nelement()) * 4 / 1024 ** 2)
print('mem allocated in MB: ', torch.cuda.memory_allocated() / 1024 ** 2)
print('max mem allocated in MB: ', torch.cuda.max_memory_allocated() / 1024 ** 2)

# broadcasting x - y creates a [32, 7862, 7862, 3] intermediate
loss = torch.linalg.norm(x - y, dim=-1)

print('after sub')
print('mem expected in MB: ', (x.nelement() + y.nelement() + loss.nelement()) * 4 / 1024 ** 2)
print('mem allocated in MB: ', torch.cuda.memory_allocated() / 1024 ** 2)
print('max mem allocated in MB: ', torch.cuda.max_memory_allocated() / 1024 ** 2)

Many thanks!

@ptrblck Hello sir, could you please help me with this issue? Many thanks!

Could you explain why you think this operation takes “too much” memory?
In your example you are broadcasting the tensors to tmp = x - y with a shape of [32, 7862, 7862, 3], which will use ~22GB of memory in float32.
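For reference, that figure can be checked with a quick back-of-the-envelope calculation (plain Python, nothing framework-specific):

# elements of the broadcasted intermediate [32, 7862, 7862, 3], 4 bytes each in float32
print(32 * 7862 * 7862 * 3 * 4 / 1024 ** 3)  # ~22.1 GB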

Sorry, I described it wrongly. My question is: is there any way to reduce such a large GPU memory usage? Many thanks!

The final loss will have shape [32, 7862, 7862], so around ~7GB. You could use a for loop to iterate over e.g. dim2 of y and calculate the norm chunk by chunk. However, the loop could cause a massive slowdown.
Let me check if there are more elegant approaches.
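A minimal sketch of that loop, chunking y along dim2 so the broadcasted intermediate never grows beyond [32, 7862, chunk_size, 3] (chunked_pairwise_norm and chunk_size are illustrative names, not a PyTorch API):

import torch

def chunked_pairwise_norm(x, y, chunk_size=256):
    # x: [B, N, 1, 3], y: [B, 1, N, 3] -> result: [B, N, N]
    parts = []
    for start in range(0, y.shape[2], chunk_size):
        y_chunk = y[:, :, start:start + chunk_size, :]        # [B, 1, C, 3]
        parts.append(torch.linalg.norm(x - y_chunk, dim=-1))  # [B, N, C]
    return torch.cat(parts, dim=2)                            # [B, N, N]

x = torch.randn(32, 7862, 1, 3).cuda()
y = torch.randn(32, 1, 7862, 3).cuda()
loss = chunked_pairwise_norm(x, y)  # same values as torch.linalg.norm(x - y, dim=-1)

Note that the [32, 7862, 7862] output is still allocated in full, so this only avoids the ~22GB intermediate, not the ~7GB result; smaller chunk_size values lower the peak further at the cost of more loop iterations.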

Hello sir, thank you for your advice. I will try iterating over dim2 later to see if it works, and I will keep an eye on this post for more elegant approaches.

Thank you again for your help.