Here, a is a [3,256,256,256,256] tensor and b is [1,256,256,256,256], and I get an OOM when I run this code. I also tried the in-place operation a *= (a > c); although the memory consumption is reduced, I still get an OOM, and even (a > c).sum((3, 4)) consumes a lot of memory. Is there any way to avoid the OOM and implement the same functionality?
a alone will already allocate 48GB in float32 and b another 16GB. Additionally, a > c will create a temporary tensor, which should also consume around 16GB, so you would be at roughly 80GB just to store these tensors.
Since the CUDA context and potentially other tensors also need some memory, I don’t think you can fit this onto a single GPU without scaling down the problem.
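For reference, the back-of-the-envelope arithmetic behind those figures can be checked in a few lines (this just multiplies out the shapes at 4 bytes per float32 element; it is not the actual allocation):

```python
# Rough memory arithmetic for the tensor shapes above (float32 = 4 bytes/element).
def size_gib(shape, bytes_per_elem=4):
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem / 2**30  # size in GiB

print(size_gib([3, 256, 256, 256, 256]))  # 48.0 GiB for a
print(size_gib([1, 256, 256, 256, 256]))  # 16.0 GiB for b
```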
Thank you for your reply.
What I want to do is similar to a ‘2-d convolution with a different kernel at each spatial location’. Currently I implement this with a for-loop like this:
d = torch.zeros(1, 3, 256, 256).cuda()
for idx in range(val, color.shape[2] - val):
    for idy in range(val, color.shape[3] - val):
        a_block = a[..., idx - val:idx + val, idy - val:idy + val].clone()
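As a side note, the window extraction this loop performs can be expressed without explicit Python loops via strided views, which create no copies until you actually reduce over the windows. A minimal NumPy sketch with toy sizes (not the actual pipeline; `img` and `val` here are stand-ins):

```python
import numpy as np

# Toy-scale illustration: every (2*val x 2*val) window around each valid
# pixel, obtained as a zero-copy strided view of the image.
val = 2
img = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)

windows = np.lib.stride_tricks.sliding_window_view(img, (2 * val, 2 * val))
print(windows.shape)  # (13, 13, 4, 4)

# windows[idx - val, idy - val] equals img[idx-val:idx+val, idy-val:idy+val]
idx, idy = 5, 7
assert np.array_equal(windows[idx - val, idy - val],
                      img[idx - val:idx + val, idy - val:idy + val])
```

Reductions over such views still materialize intermediates, so this trades the Python-loop overhead for memory pressure rather than eliminating it.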
Although this implementation avoids the OOM at this image scale, it is time-consuming and cannot handle larger images. Can you give me some advice on this? Or do I have to trade off between time and memory? If so, what if I don’t care about the time?
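One middle ground between the fully materialized a * (a > c) and a per-pixel Python loop is to process the tensor in chunks along a leading dimension, so that the boolean temporary is only ever chunk-sized. A hedged NumPy sketch with toy sizes (the function name and chunk size are illustrative, not from the original code):

```python
import numpy as np

# Chunk-wise in-place thresholding: a *= (a > c), one slice at a time,
# so the boolean mask temporary covers only a single chunk.
def threshold_inplace_chunked(a, c, chunk=64):
    for start in range(0, a.shape[0], chunk):
        block = a[start:start + chunk]   # a view, no copy
        block *= block > c               # mask temporary is chunk-sized

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 32)).astype(np.float32)
ref = a * (a > 0.5)                 # full-size reference result
threshold_inplace_chunked(a, 0.5)   # mutates a in place
assert np.array_equal(a, ref)
```

Choosing the chunk size is exactly the time/memory dial: chunk=1 approaches the loop above, while chunk=a.shape[0] reproduces the full-size temporary.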