Hi,
I have been using torch.nn.functional.grad.conv3d_weight to compute the gradient of the convolution kernel, but I have noticed that it uses much more memory than whatever method autograd calls internally. For context, I am working on implementing a form of reversible networks. These networks do not need to store activations in the forward pass, so I am avoiding PyTorch’s autograd and writing my own backward method to take advantage of the memory efficiency of these architectures.
More specifically, the line in torch/nn/grad.py:
grad_output = grad_output.repeat(1, in_channels // groups, 1, 1, 1)
creates a very large intermediate tensor when in_channels is large.
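To make that concrete, here is the back-of-the-envelope size of the repeated tensor for the shapes I use in the script below (assuming float32 and groups=1):

# grad_output is (N, C_out, D, W, H) = (1, 64, 64, 128, 128) and gets
# repeated along dim 1 by in_channels // groups = 64 (float32 assumed):
N, C_out, in_channels, D, W, H = 1, 64, 64, 64, 128, 128
bytes_per_element = 4  # float32
repeated_bytes = N * C_out * in_channels * D * W * H * bytes_per_element
print(repeated_bytes / 2**30)  # -> 16.0 GiB, matching the OOM further down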
I noticed this during training, so I have separated it out into the simple script below in the hope of finding a better way to do this.
import argparse
import gc

import torch
import torch.nn.functional as F


def byte2mb(x):
    return x * 1e-6


def mem_report():
    # List all live tensors tracked by the garbage collector
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            print(type(obj), obj.size())


def check_mem(report=False):
    mem_alloc = byte2mb(torch.cuda.memory_allocated())
    mem_cached = byte2mb(torch.cuda.memory_cached())
    if report:
        mem_report()
    print('Mem Alloc: %6.2f, Mem Cached: %6.2f' % (mem_alloc, mem_cached))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--no_autograd', dest='no_autograd', action='store_true')
    args = parser.parse_args()

    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    kernel_size = 3

    # Define an input tensor and a convolution kernel
    (N, C, D, W, H) = (1, 64, 64, 128, 128)
    x = torch.rand(N, C, D, W, H, device=device)
    K = torch.rand(C, C, kernel_size, kernel_size, kernel_size,
                   device=device, requires_grad=True)

    # Compute the weight gradient manually, without autograd
    if args.no_autograd:
        with torch.no_grad():
            y = F.conv3d(x, K, padding=kernel_size // 2)
            dy = torch.ones_like(y)
            dK = F.grad.conv3d_weight(x, K.shape, dy, padding=kernel_size // 2)
    else:
        # Compute gradients with autograd
        y = F.conv3d(x, K, padding=kernel_size // 2)
        loss = y.sum()
        loss.backward()

    check_mem()
When run, I get:
$ python test_conv3d_backward.py
Mem Alloc: 537.31, Mem Cached: 807.40
$ python test_conv3d_backward.py --no_autograd
Traceback (most recent call last):
File "test_conv3d_backward.py", line 42, in <module>
dK = F.grad.conv3d_weight(x, K.shape, dy, padding=kernel_size//2)
File ".../python3.7/site-packages/torch/nn/grad.py", line 287, in conv3d_weight
grad_output = grad_output.repeat(1, in_channels // groups, 1, 1, 1)
RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 10.76 GiB total capacity; 768.42 MiB already allocated; 9.07 GiB free; 770.00 MiB reserved in total by PyTorch)
This test is not very specific, but it clearly shows that autograd is calling a different method, one that does not require creating the 16 GiB intermediate tensor.
Is there any way for me to call that method directly? Ideally I would like to call the ConvBackward method that autograd uses, but I am sure that is implemented in C++. A few years ago apaszke said that it wasn’t possible, but there have been more recent discussions where people wrote their own C++ extensions.
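For what it’s worth, the best workaround I have come up with so far is to re-run just the convolution under a local enable_grad inside my manual backward and ask torch.autograd.grad for the kernel gradient only (the helper name below is just my own), so that the regular backward path is used instead of conv3d_weight:

import torch
import torch.nn.functional as F

# Sketch of the workaround (conv3d_weight_via_autograd is my own name):
# recompute the conv with grad enabled and ask autograd only for dK.
def conv3d_weight_via_autograd(x, K, dy, padding):
    with torch.enable_grad():                   # works even inside a no_grad() block
        K_ = K.detach().requires_grad_(True)    # build a graph that tracks only the kernel
        y = F.conv3d(x.detach(), K_, padding=padding)
        (dK,) = torch.autograd.grad(y, K_, grad_outputs=dy)
    return dK

The downside is that it recomputes the forward conv and holds one extra y-sized tensor (a few hundred MB for my shapes), which is still far cheaper than the 16 GiB repeat, but I would prefer to call the backward directly if that is possible.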
Thanks