I am trying to use the BigGAN model, but I ran into the error below at test time, when generating images with the pretrained weights. How can I solve this problem? (I don't think this is related to memory.)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_34/2818109655.py in <module>
12 # real_img = imgs.to(device)
13 # bs = imgs.size(0)
---> 14 fake_imgs = pretrained_G(imgs.cuda())
15
16

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []

/tmp/ipykernel_34/2782497050.py in forward(self, z)
66 # Feed through generator blocks
67 for idx, g_block in enumerate(self.g_blocks):
---> 68 h = g_block(h)
69 # h = g_block[1](h)
70

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []

/tmp/ipykernel_34/2262133493.py in forward(self, x)
31
32 # Compute dot product attention with query (theta) and key (phi) matrices
---> 33 beta = F.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=-1)
34
35 # Compute scaled dot product attention with value (g) and attention (beta) matrices

RuntimeError: CUDA out of memory. Tried to allocate 6.00 GiB (GPU 0; 15.90 GiB total capacity; 6.89 GiB already allocated; 2.16 GiB free; 12.68 GiB reserved in total by PyTorch)
If the model is working during training, I assume it's failing during validation?
If so, your memory usage might still be too high once you start the validation loop, and you should check whether unnecessary tensors are being kept alive (e.g. the last loss tensor, which could still be attached to the computation graph).
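For illustration, here is a minimal sketch of that failure mode and the usual fix. pretrained_G matches the name in your traceback, but criterion and val_loader are hypothetical placeholders for your loss function and validation DataLoader:

import torch

def eval_loss(pretrained_G, criterion, val_loader, device="cuda"):
    # pretrained_G, criterion, and val_loader are assumed placeholders,
    # not definitions from the original code.
    total = 0.0
    for imgs, _ in val_loader:
        imgs = imgs.to(device)
        fake_imgs = pretrained_G(imgs)
        loss = criterion(fake_imgs, imgs)
        # Accumulating the tensor itself (total += loss) would keep every
        # iteration's computation graph, and the GPU memory behind it, alive.
        # .item() extracts a plain Python float, so the graph can be freed.
        total += loss.item()
    return total / len(val_loader)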
Yes, during validation you wouldn't update the model, as that would be a data leak.
It's still unclear in which setup you are hitting the OOM.
In any case, wrap the validation loop in a with torch.no_grad() block and check the memory usage by adding print(torch.cuda.memory_summary()) to the validation code, to see how large the actual memory requirement is and which line of code raises the OOM error; see the sketch below.
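As a sketch under assumptions (model and val_loader are placeholders for your generator and DataLoader; adapt to your setup), the validation loop could look like this:

import torch

def validate(model, val_loader, device="cuda"):
    # model and val_loader are assumed placeholders.
    model.eval()  # e.g. use running stats in BatchNorm layers
    outputs = []
    with torch.no_grad():  # autograd stores no intermediate activations
        for imgs, _ in val_loader:
            fake_imgs = model(imgs.to(device))
            outputs.append(fake_imgs.cpu())  # move results off the GPU right away
            # Print the CUDA allocator's breakdown; placing this near the
            # failing line shows how much memory is in use at that point.
            print(torch.cuda.memory_summary())
    return outputs

Note that torch.no_grad() only removes the autograd overhead; if the forward activations themselves (here the 6 GiB attention matrix from torch.bmm) don't fit, you would still need to reduce the batch size or the spatial resolution fed into the self-attention block.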