Here, a is a [3,256,256,256,256] tensor and b is [1,256,256,256,256], and I get an OOM when I run this code. I also tried the in-place operation a *= (a > c); although the memory consumption is reduced, I still get an OOM, and even (a > c).sum((3, 4)) consumes a lot of memory. Is there any way to avoid the OOM and implement the same functionality?
a alone will already allocate 48GB in float32 and b another 16GB. Additionally, a > c will create a temporary tensor, which should also consume around 16GB, so you would be at roughly 80GB just to store these tensors.
Since the CUDA context and potentially other tensors also need some memory, I don’t think you can fit this onto a single GPU without scaling down the problem.
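For reference, the back-of-the-envelope arithmetic behind those figures can be checked in a few lines (this just multiplies out the shapes at 4 bytes per float32 element; it is not the actual allocation):

```python
# Rough memory arithmetic for the tensor shapes above (float32 = 4 bytes/element).
def size_gib(shape, bytes_per_elem=4):
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem / 2**30  # size in GiB

print(size_gib([3, 256, 256, 256, 256]))  # 48.0 GiB for a
print(size_gib([1, 256, 256, 256, 256]))  # 16.0 GiB for b
```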
Thank you for your reply.
What I want to do is similar to a ‘2-d convolution with a different kernel at each spatial location’. Currently I implement this with a for-loop like this:
d = torch.zeros(1, 3, 256, 256).cuda()
for idx in range(val, color.shape[2] - val):
    for idy in range(val, color.shape[3] - val):
        a_block = a[..., idx - val:idx + val, idy - val:idy + val].clone()
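As a side note, the window extraction this loop performs can be expressed without explicit Python loops via strided views, which create no copies until you actually reduce over the windows. A minimal NumPy sketch with toy sizes (not the actual pipeline; `img` and `val` here are stand-ins):

```python
import numpy as np

# Toy-scale illustration: every (2*val x 2*val) window around each valid
# pixel, obtained as a zero-copy strided view of the image.
val = 2
img = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)

windows = np.lib.stride_tricks.sliding_window_view(img, (2 * val, 2 * val))
print(windows.shape)  # (13, 13, 4, 4)

# windows[idx - val, idy - val] equals img[idx-val:idx+val, idy-val:idy+val]
idx, idy = 5, 7
assert np.array_equal(windows[idx - val, idy - val],
                      img[idx - val:idx + val, idy - val:idy + val])
```

Reductions over such views still materialize intermediates, so this trades the Python-loop overhead for memory pressure rather than eliminating it.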
Although this implementation avoids the OOM at this image scale, it is time-consuming and cannot handle larger images. Can you give me some advice on this? Or do I have to trade off between time and memory? If so, what if I don’t care about the time?
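One middle ground between the fully materialized a * (a > c) and a per-pixel Python loop is to process the tensor in chunks along a leading dimension, so that the boolean temporary is only ever chunk-sized. A hedged NumPy sketch with toy sizes (the function name and chunk size are illustrative, not from the original code):

```python
import numpy as np

# Chunk-wise in-place thresholding: a *= (a > c), one slice at a time,
# so the boolean mask temporary covers only a single chunk.
def threshold_inplace_chunked(a, c, chunk=64):
    for start in range(0, a.shape[0], chunk):
        block = a[start:start + chunk]   # a view, no copy
        block *= block > c               # mask temporary is chunk-sized

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 32)).astype(np.float32)
ref = a * (a > 0.5)                 # full-size reference result
threshold_inplace_chunked(a, 0.5)   # mutates a in place
assert np.array_equal(a, ref)
```

Choosing the chunk size is exactly the time/memory dial: chunk=1 approaches the loop above, while chunk=a.shape[0] reproduces the full-size temporary.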