‘broadcasting’ operation occupies too much CUDA memory?

When I run my code in PyCharm, it often gives me the error “CUDA out of memory…”, so I tried to find the reason by debugging. Normally the code needs about 4513 MB of memory, but when it reaches the first line of the following function (i.e., s = 1.0 / (1.0 + torch.sum((c.unsqueeze(1) - self.msfcae.scale1_mu) ** 2, dim=2) / self.opt.alpha)), the memory rises to about 10613 MB. After I added a line of ‘torch.cuda.empty_cache()’, the usage drops to 7917 MB, but that does not fundamentally change the situation: it still gives an error in the next epoch, since the memory requirement gradually increases. I think this may come from broadcasting, since c.unsqueeze(1) and self.msfcae.scale1_mu have different shapes. By step-wise debugging I found that the subtraction c.unsqueeze(1) - self.msfcae.scale1_mu increases the memory requirement, and that the square (** 2) increases it further. I do not understand why this first line costs so much memory. Could anyone please give me some suggestions to avoid this? Thanks a lot!

def scale1_soft_assign(self,c):
        s = 1.0 / (1.0 + torch.sum((c.unsqueeze(1) - self.msfcae.scale1_mu) ** 2, dim=2) / self.opt.alpha)
        torch.cuda.empty_cache()
        s = s ** (self.opt.alpha + 1.0) / 2.0
        s = s / torch.sum(s, dim=1, keepdim=True)
        return s
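For reference, here is a minimal sketch with small placeholder shapes (not my real data, and alpha replaced by a literal 1.0) showing the intermediate shapes that the broadcasting produces:

import torch

c = torch.randn(8, 10)        # placeholder for my real c of shape [N, 10]
mu = torch.randn(20, 10)      # placeholder for self.msfcae.scale1_mu
alpha = 1.0                   # placeholder for self.opt.alpha

diff = c.unsqueeze(1) - mu    # broadcasts to [N, 20, 10]
print(diff.shape)             # torch.Size([8, 20, 10])
sq = diff ** 2                # another tensor of shape [N, 20, 10]
s = 1.0 / (1.0 + torch.sum(sq, dim=2) / alpha)   # reduced to [N, 20]
print(s.shape)                # torch.Size([8, 20])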

Could you calculate the expected memory usage based on the input shapes and the broadcasting?
Are you seeing too much memory usage or would it fit the expected size?

In addition, I debugged the ‘forward’ in https://github.com/eelxpeng/dec-pytorch/blob/master/lib/dec.py, which does the same thing as our ‘scale1_soft_assign’, and found that it does not noticeably increase the memory.

Hi, thanks for your reply. Is the tensor size responsible for the CUDA usage? The sizes of c.unsqueeze(1) and self.msfcae.scale1_mu are [3997696, 1, 10] and [20, 10], respectively. How do I calculate the expected memory usage? And may I ask another question: why does subtracting two CUDA tensors that already exist require extra CUDA memory? Will that memory be released in the next epoch, or will it grow epoch by epoch? I do not understand the mechanism behind this. Anyway, thank you for your continuous help.

Broadcasting should not increase the memory usage, but you would of course need to store the result.
Here is a small code snippet with similar shapes:

import torch

x = torch.randn(30000, 1, 10).cuda()
y = torch.randn(20, 10).cuda()
#pprint.pprint(torch.cuda.memory_stats())
print('before sub')
print('mem expected in MB: ', (x.nelement() + y.nelement()) * 4 / 1024**2)
print('mem allocated in MB: ', torch.cuda.memory_allocated() / 1024**2)
print('max mem allocated in MB: ', torch.cuda.max_memory_allocated() / 1024**2)

res = x - y
print('after sub')
print('mem expected in MB: ', (x.nelement() + y.nelement() + res.nelement()) * 4 / 1024**2)
print('mem allocated in MB: ', torch.cuda.memory_allocated() / 1024**2)
print('max mem allocated in MB: ', torch.cuda.max_memory_allocated() / 1024**2)

Output:

before sub
mem expected in MB:  1.145172119140625
mem allocated in MB:  1.1455078125
max mem allocated in MB:  1.1455078125
after sub
mem expected in MB:  24.033355712890625
mem allocated in MB:  24.03369140625
max mem allocated in MB:  24.03369140625

As you can see, the expected memory matches the allocated and max. allocated memory closely.
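Applying the same arithmetic to your real shapes (a rough estimate, assuming float32 and that the squared intermediate is materialized as its own tensor) would look like this:

N, K, D = 3997696, 20, 10          # from your post: c.unsqueeze(1) is [N, 1, D], scale1_mu is [K, D]
bytes_per_elem = 4                 # float32
diff_mb = N * K * D * bytes_per_elem / 1024**2   # (c.unsqueeze(1) - scale1_mu) -> [N, K, D]
sq_mb = diff_mb                                  # the ** 2 result has the same shape
print(diff_mb, sq_mb)              # ~3050 MB each, i.e. ~6100 MB for both intermediates

That would roughly match the jump from ~4513 MB to ~10613 MB you are seeing, so the peak usage itself looks expected for these shapes rather than like a leak.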

The memory usage should not increase over time, if you are using the same shapes.
If you see an increase in memory usage while your model is training, you might accidentally store some tensors with the attached computation graph, e.g. by doing losses.append(loss) during training.
If that’s the case, you should either .detach() the tensors or call .item() on them to get a plain Python number.
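For example, in a small dummy training loop (nn.Linear model and random data, just for illustration), the difference looks like this:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(3):
    data, target = torch.randn(4, 10), torch.randn(4, 2)
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
    losses.append(loss)           # keeps the tensor and its computation graph alive
    # losses.append(loss.item())  # stores a plain Python number instead and frees the graph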

Your example really helps my understanding. I am not a native English speaker, but I want to express my sincere thanks for your continuous support!

Good to hear the example is helpful.
I’m not a native English speaker either, and we still communicate quite well, so don’t worry about it. :wink: