I tried to add zero padding before particular convolution layers, but with the code below a CUDA OOM occurs after ~40 epochs (without the pre-hook, there is no OOM). I used `forward_pre_hook` as follows:
```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm


def conv_layer(inplanes, planes, kernel_size, stride, padding, bias, do_spectral_norm):
    conv = nn.Conv2d(in_channels=inplanes, out_channels=planes,
                     kernel_size=kernel_size, stride=stride,
                     padding=padding, bias=bias)

    if conv.kernel_size == (3, 3) and conv.stride == (2, 2) and conv.padding == (1, 1):
        conv.padding = (0, 0)

        def forward_pre_hook(module, input):
            print('forward_pre_hook has been fired.')
            # `input` is a tuple of the module's positional args,
            # so pad the first (and only) tensor in it
            modified_input = F.pad(input[0], pad=(0, 1, 0, 1))
            return modified_input

        conv.register_forward_pre_hook(forward_pre_hook)

    if do_spectral_norm:
        return spectral_norm(conv)
    else:
        return conv
```
The `conv_layer` function is called several times in the main module.
After I turned on the `forward_pre_hook`, an increase in allocated GPU memory was observed (18116 MB to 21408 MB). I think this is natural behavior, since an extra backward graph corresponding to the added padding node would be created. However, as training proceeds, it keeps increasing, to 24146 MB and then 24154 MB, and this is the part I do not understand. How can I stop the memory leak? (Note that I set `cudnn.deterministic = True` and `cudnn.benchmark = False` for exact observation of the memory usage.)
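For reference, this is roughly how I configure cuDNN and read the memory numbers above; `log_gpu_memory` is just a hypothetical helper I use for logging, not part of the training code:

```python
import torch
import torch.backends.cudnn as cudnn

# Settings used so the memory readings are reproducible across runs
cudnn.deterministic = True
cudnn.benchmark = False


def log_gpu_memory(tag=""):
    # Hypothetical logging helper: reports allocated vs. reserved
    # GPU memory in MB; does nothing on a CPU-only machine
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024 ** 2
        reserved = torch.cuda.memory_reserved() / 1024 ** 2
        print(f"{tag}: allocated={allocated:.0f} MB, reserved={reserved:.0f} MB")
```

I call `log_gpu_memory()` at the end of each epoch to get the figures quoted above.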
Could `spectral_norm` be the problem? As far as I know, the `spectral_norm` implementation uses a `forward_pre_hook` as well, so could the duplication of those hooks be the source of the problem? (FYI, I don't fully know the details inside `spectral_norm`.)
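As a sanity check on the duplication hypothesis, here is a minimal, self-contained sketch (the 16/32 channel sizes are placeholders) that registers my padding hook, wraps the layer with `spectral_norm`, and lists the pre-hooks that end up on the module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

conv = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=0)


def forward_pre_hook(module, input):
    # Same asymmetric padding as in conv_layer above
    return F.pad(input[0], pad=(0, 1, 0, 1))


conv.register_forward_pre_hook(forward_pre_hook)
conv = spectral_norm(conv)

# Both my padding hook and the SpectralNorm hook show up here
for hook_id, hook in conv._forward_pre_hooks.items():
    print(hook_id, hook)
```

They both fire on every forward, in registration order, so at least the hooks coexist; whether that coexistence is what leaks memory is exactly what I cannot tell.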
Any suggestions are welcome. Thank you.
Torch version: 1.7.1
Python version: 3.7.4