I have been using torch.nn.functional.grad.conv3d_weight to compute the gradient of the convolution kernel, but I have noticed that it uses much more memory than whatever method autograd is calling. For context, I am working on implementing a form of reversible networks. These networks do not need to store activations in the forward pass, so I am avoiding the use of PyTorch’s autograd and writing my own backward method to take advantage of the memory efficiencies of these architectures.
More specifically, the line
grad_output = grad_output.repeat(1, in_channels // groups, 1, 1, 1)
creates a very large intermediate tensor when in_channels is large.
I noticed this during training, so I have separated it out into this simple script in the hope of finding a better way to do this.
import argparse
import gc  # used by mem_report

import torch
import torch.nn.functional as F


def byte2mb(x):
    return x*1e-6


def mem_report():
    # Print the type and size of every live tensor the garbage collector knows about
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            print(type(obj), obj.size())


def check_mem(report=False):
    mem_alloc = byte2mb(torch.cuda.memory_allocated())
    mem_cached = byte2mb(torch.cuda.memory_cached())
    if report:
        mem_report()
    print('Mem Alloc: %6.2f, Mem Cached: %6.2f' % (mem_alloc, mem_cached))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--no_autograd', dest='no_autograd', action='store_true')
    args = parser.parse_args()

    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    kernel_size = 3

    # Define an input tensor and a convolution kernel
    (N, C, D, W, H) = (1,64,64,128,128)
    x = torch.rand(N, C, D, W, H).to(device)
    K = torch.rand(C, C, kernel_size, kernel_size, kernel_size, requires_grad=True).to(device)

    if args.no_autograd:
        # Compute gradients without autograd
        with torch.no_grad():
            y = F.conv3d(x, K, padding=kernel_size//2)
            dy = torch.ones_like(y)
            dK = F.grad.conv3d_weight(x, K.shape, dy, padding=kernel_size//2)
    else:
        # Compute gradients with autograd
        y = F.conv3d(x, K, padding=kernel_size//2)
        loss = y.sum()
        loss.backward()

    check_mem()
When run, I get:
$ python test_conv3d_backward.py
Mem Alloc: 537.31, Mem Cached: 807.40

$ python test_conv3d_backward.py --no_autograd
Traceback (most recent call last):
  File "test_conv3d_backward.py", line 42, in <module>
    dK = F.grad.conv3d_weight(x, K.shape, dy, padding=kernel_size//2)
  File ".../python3.7/site-packages/torch/nn/grad.py", line 287, in conv3d_weight
    grad_output = grad_output.repeat(1, in_channels // groups, 1, 1, 1)
RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 10.76 GiB total capacity; 768.42 MiB already allocated; 9.07 GiB free; 770.00 MiB reserved in total by PyTorch)
This test is not very specific, but clearly autograd is calling a different method that doesn’t require creating the 16 GiB intermediate tensor.
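For reference, the 16 GiB figure in the error can be reproduced by hand from the shapes in the script. This is a back-of-the-envelope sketch that assumes float32 tensors, groups = 1, and an output with the same spatial size as the input (kernel size 3 with padding 1):

```python
# Shapes taken from the script above
N, C, D, W, H = 1, 64, 64, 128, 128
groups = 1                      # assumption: default groups
bytes_per_float32 = 4           # assumption: default dtype torch.float32

# grad_output has the same shape as y, which here matches x
grad_output_elems = N * C * D * W * H

# repeat(1, in_channels // groups, 1, 1, 1) multiplies the channel dimension
repeated_elems = grad_output_elems * (C // groups)

grad_output_gib = grad_output_elems * bytes_per_float32 / 2**30
repeated_gib = repeated_elems * bytes_per_float32 / 2**30

print('grad_output: %.2f GiB, after repeat: %.2f GiB' % (grad_output_gib, repeated_gib))
```

So the repeat turns a 0.25 GiB tensor into one that is in_channels times larger, which matches the allocation size reported in the traceback.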
Is there any way for me to call that method directly? Ideally I would like to call the ConvBackward method that autograd is using, but I assume that this is implemented in C++. A few years ago apaszke said that it wasn’t possible, but there have been some more recent discussions where people wrote their own C++ extension.
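One possible workaround, sketched here and not tested on the full-size problem, is to let autograd handle just this one convolution: build a short-lived local graph inside torch.enable_grad() and ask torch.autograd.grad for the weight gradient. The helper name conv3d_weight_grad is hypothetical, and the sketch only handles the padding argument. It recomputes the forward convolution, so it trades extra compute and one temporary output tensor for avoiding the repeat.

```python
import torch
import torch.nn.functional as F


def conv3d_weight_grad(x, weight, grad_output, padding=0):
    """Hypothetical helper: weight gradient of F.conv3d via a local autograd graph."""
    # Detach both tensors so the temporary graph is not tied to any outer graph
    x = x.detach()
    w = weight.detach().requires_grad_(True)
    # Re-enable grad locally, even if the caller is inside torch.no_grad()
    with torch.enable_grad():
        y = F.conv3d(x, w, padding=padding)
    # Backpropagate grad_output through this single op; the graph is freed afterwards
    (grad_w,) = torch.autograd.grad(y, w, grad_output)
    return grad_w


# Small CPU-sized check against the ordinary autograd path
torch.manual_seed(0)
x = torch.rand(1, 4, 5, 6, 6)
K = torch.rand(4, 4, 3, 3, 3, requires_grad=True)

with torch.no_grad():
    dy = torch.ones(1, 4, 5, 6, 6)
    dK = conv3d_weight_grad(x, K, dy, padding=1)

F.conv3d(x, K, padding=1).sum().backward()
print(torch.allclose(dK, K.grad, atol=1e-4))
```

Because the graph only spans one layer and is released as soon as torch.autograd.grad returns, this should still fit the reversible-network setting where activations are recomputed instead of stored, though the peak memory would need to be measured on the real shapes.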