Is there any way to avoid OOM?

I do an element-wise multiplication as:

    a = a.unfold(1, val*2, 1).unfold(2, val*2, 1)
    b = b.unsqueeze(dim=0).unfold(1, val*2, 1).unfold(2, val*2, 1)

Here, a is a [3,256,256,256,256] tensor and b is [1,256,256,256,256], and I get an OOM error when I run this code. I also tried an in-place operation such as a *= (a > c); although the memory consumption is reduced, I still get an OOM. Even (a > c).sum((3,4)) consumes a lot of memory. Is there any way to avoid the OOM and implement the same functionality?

a will already allocate 48 GB in float32 and b 16 GB. Additionally, a>c will create a temporary tensor, which should also consume 16 GB, so you would be at 80 GB to store these tensors alone.
Since the CUDA context and potentially other tensors would also need some memory, I don’t think you could fit it into a GPU without scaling down the problem.
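
As a rough check of the figures above, the sizes can be worked out from the shapes alone. This is only a back-of-the-envelope sketch: `tensor_gib` is a hypothetical helper, it assumes float32 (4 bytes per element) and that the tensors are fully materialized, and it counts in GiB (2**30 bytes).

```python
def tensor_gib(shape, bytes_per_element):
    """Size in GiB of a dense tensor with the given shape (hypothetical helper)."""
    n = 1
    for s in shape:
        n *= s
    return n * bytes_per_element / 2**30

# Shapes quoted in the question, assuming float32 (4 bytes per element).
a_gib = tensor_gib((3, 256, 256, 256, 256), 4)
b_gib = tensor_gib((1, 256, 256, 256, 256), 4)
print(a_gib, b_gib)  # 48.0 16.0
```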

Thank you for your reply.
What I want to do is similar to a ‘2-D convolution with a different kernel at each element location’. For now I implement this with a nested for-loop like this:

    d = torch.zeros(1,3,256,256).cuda()
    for idx in range(val, color.shape[1]-val):
        for idy in range(val, color.shape[2]-val):
            a_block = a[..., idx-val:idx+val, idy-val:idy+val].clone()
            rendered_img[..., idx-val, idy-val] = (c_1*a_block).sum((1,2))/c_1.sum()

Although this implementation avoids the OOM at this image scale, it is time-consuming and cannot handle larger images. Can you give me some advice on this? Or do I have to trade off between time and memory? If so, what if I don’t care about the timing?

If you don’t care about the timing, then the nested loop should work, since it only keeps a single window in memory at a time and so should need the least memory.
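
A possible middle ground between the full unfold and the per-pixel loop is to process a few output rows at a time. The sketch below is not the code from this thread: `chunked_window_mean` and `rows_per_chunk` are hypothetical names, and it assumes a is [C, H, W], b is [H, W], and that the weights at each location are simply the k x k window of b at that location. Since `unfold` returns views, the only large temporary is the per-chunk product.

```python
import torch

def chunked_window_mean(a, b, k, rows_per_chunk=16):
    """Weighted window mean, computed a strip of output rows at a time.

    a: [C, H, W], b: [H, W] (assumed shapes). Returns [C, H-k+1, W-k+1].
    """
    C, H, W = a.shape
    out_h, out_w = H - k + 1, W - k + 1
    out = a.new_empty(C, out_h, out_w)
    for r0 in range(0, out_h, rows_per_chunk):
        r1 = min(r0 + rows_per_chunk, out_h)
        # Input rows r0 .. r1+k-2 cover output rows r0 .. r1-1.
        a_win = a[:, r0:r1 + k - 1].unfold(1, k, 1).unfold(2, k, 1)  # [C, r, out_w, k, k] view
        b_win = b[r0:r1 + k - 1].unfold(0, k, 1).unfold(1, k, 1)     # [r, out_w, k, k] view
        num = (a_win * b_win).sum(dim=(-2, -1))  # product is materialized per chunk only
        den = b_win.sum(dim=(-2, -1))
        out[:, r0:r1] = num / den
    return out

# Small CPU example; b is kept strictly positive so the denominator is nonzero.
torch.manual_seed(0)
a = torch.rand(3, 12, 10)
b = torch.rand(12, 10) + 0.1
out = chunked_window_mean(a, b, k=4, rows_per_chunk=4)
```

With `rows_per_chunk=1` this approaches the memory use of the nested loop, and with `rows_per_chunk` equal to the output height it is the full unfold, so the chunk size is the knob for trading memory against time.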

I’m trying to add more constraints to scale down the problem. Thanks again for your reply.

I’m now scaling down the problem, but the OOM still occurs at times. Can I chunk the tensors ‘a’ and ‘b’ across multiple GPUs? I have tried, but it doesn’t seem to work.