Memory allocation in ConvTranspose3d

tivaro.nl · February 6, 2019, 1:27pm

Why is the memory usage for ConvTranspose3d so high?

import torch

# Local helper module, not part of PyTorch
from utils.memory_profiling import format_memsize, tensor_size

torch.backends.cudnn.benchmark = True

x = torch.rand(1, 128, 8, 270, 480)
conv = torch.nn.ConvTranspose3d(128, 64, kernel_size=5, stride=2, padding=2, output_padding=1)

x = x.to('cuda:0')
conv = conv.to('cuda:0')
y = conv(x)

Raises:

Traceback (most recent call last):
  File "bug.py", line 15, in <module>
    conv(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/conv.py", line 895, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: CUDA out of memory. Tried to allocate 30.90 GiB (GPU 0; 10.92 GiB total capacity; 2.48 GiB already allocated; 5.83 GiB free; 1023.50 KiB cached)

x has shape (1, 128, 8, 270, 480) and should be 506.2 MiB
y should have shape (1, 64, 16, 540, 960) and should be 2025.0 MiB (about 1.98 GiB)

So why does CUDA try to allocate 30.90 GiB?
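
The size estimates above can be checked by hand. The following is a minimal sketch in plain Python of the output-size formula from the ConvTranspose3d documentation, applied to the layer as posted. The helper names `conv_transpose_out` and `mib` are made up for this illustration, and float32 (4 bytes per element) is assumed.

```python
from math import prod

def conv_transpose_out(size, kernel, stride, padding, output_padding, dilation=1):
    # Output length along one spatial dimension of a transposed convolution.
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1

def mib(shape, bytes_per_element=4):
    # Size in MiB of a dense tensor, assuming float32 by default.
    return prod(shape) * bytes_per_element / 2**20

x_shape = (1, 128, 8, 270, 480)
out_channels = 64  # second argument of ConvTranspose3d in the posted script

# Apply the formula to depth, height and width with the posted hyperparameters.
spatial = tuple(
    conv_transpose_out(s, kernel=5, stride=2, padding=2, output_padding=1)
    for s in x_shape[2:]
)
y_shape = (x_shape[0], out_channels) + spatial

print(x_shape, f"{mib(x_shape):.1f} MiB")
print(y_shape, f"{mib(y_shape):.1f} MiB")
```

For the posted layer this gives an output of shape (1, 64, 16, 540, 960), which is 2025 MiB in float32, so the input and output together still account for well under 3 GiB.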

This happens regardless of benchmark and cudnn flags (supposing I am setting them correctly).

I am using PyTorch 1.0.0 on a GeForce GTX 1080 Ti; it also occurred in 0.4.4.

ezyang · February 7, 2019, 3:02pm

Would it be possible for you to run your script under nvprof and post the output here?

tivaro.nl · February 8, 2019, 10:41am

I am having trouble running the profiler with unified-memory-profiling enabled. Is the output of the profiler without unified-memory-profiling of any use?

RuntimeError: CUDA out of memory. Tried to allocate 30.90 GiB (GPU 0; 10.92 GiB total capacity; 2.48 GiB already allocated; 7.21 GiB free; 1023.50 KiB cached)
==9== Profiling application: python bug.py
==9== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  54.839ms         3  18.280ms     960ns  54.432ms  [CUDA memcpy HtoD]
      API calls:   98.56%  3.93209s         6  655.35ms  18.299us  3.92879s  cudaMalloc
                    1.38%  54.952ms         3  18.317ms  10.057us  54.483ms  cudaMemcpyAsync
                    0.02%  940.22us       185  5.0820us     628ns  191.37us  cuDeviceGetAttribute
                    0.02%  788.03us         2  394.01us  390.98us  397.05us  cudaGetDeviceProperties
                    0.01%  365.55us         2  182.78us  182.22us  183.34us  cuDeviceTotalMem
                    0.00%  164.13us         3  54.709us  4.5400us  83.042us  cudaStreamSynchronize
                    0.00%  120.34us         1  120.34us  120.34us  120.34us  cudaMemGetInfo
                    0.00%  83.531us         2  41.765us  39.391us  44.140us  cuDeviceGetName
                    0.00%  48.126us        42  1.1450us     768ns  5.3780us  cudaGetDevice
                    0.00%  31.360us        23  1.3630us     768ns  5.5880us  cudaSetDevice
                    0.00%  10.195us        13     784ns     629ns  1.2570us  cudaGetDeviceCount
                    0.00%  2.7930us         3     931ns     768ns  1.2570us  cuDeviceGetCount
                    0.00%  2.4440us         3     814ns     699ns     907ns  cuDeviceGet
                    0.00%  1.5360us         2     768ns     698ns     838ns  cudaGetLastError
                    0.00%     978ns         1     978ns     978ns     978ns  cuInit
                    0.00%     978ns         1     978ns     978ns     978ns  cuDriverGetVersion
======== Error: Application returned non-zero code 1

If it is not, let me know and I will dig deeper to try to get the profiler to work.

ezyang · April 9, 2019, 3:13pm

Unfortunately, the non-unified profile is not too useful. I am mostly interested in knowing which kernel you are running. Perhaps you could run https://pytorch.org/docs/stable/autograd.html#profiler instead?

tivaro.nl · April 16, 2019, 8:56am

Code (with the tensor size decreased slightly so that it runs without error):

import torch

torch.backends.cudnn.benchmark = True

x = torch.rand(1, 128, 8, 180, 180)
conv = torch.nn.ConvTranspose3d(128, 64, kernel_size=5, stride=2, padding=2, output_padding=1)

x = x.to('cuda:0')
conv = conv.to('cuda:0')

with torch.autograd.profiler.profile(enabled=True, use_cuda=True) as prof:
    y = conv(x)

print(prof)

Result:

---------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                      CPU time        CUDA time            Calls        CPU total       CUDA total
---------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------
conv_transpose3d                      404852.197us     588765.916us                1     404852.197us     588765.916us
convolution                           404835.016us     588751.855us                1     404835.016us     588751.855us
_convolution                          404819.021us     588739.553us                1     404819.021us     588739.553us
contiguous                                 5.308us          5.984us                1          5.308us          5.984us
empty                                      4.051us          4.096us                1          4.051us          4.096us
_convolution_nogroup                  404767.199us     588701.587us                1     404767.199us     588701.587us
thnn_conv_transpose3d                 404753.928us     588692.262us                1     404753.928us     588692.262us
thnn_conv_transpose3d_forward         404734.862us     588680.168us                1     404734.862us     588680.168us
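
The profile shows the call ending in thnn_conv_transpose3d_forward rather than in a cuDNN kernel. One possible explanation for the allocation size, offered here as an assumption rather than something confirmed in this thread, is that this path unfolds the computation into a single "column" matrix with out_channels * kT * kH * kW rows and D * H * W columns, where D, H and W are the input's spatial sizes. The sketch below only does the arithmetic for that assumed buffer in float32; `column_buffer_gib` is a hypothetical helper name.

```python
from math import prod

def column_buffer_gib(out_channels, kernel, in_spatial, bytes_per_element=4):
    # Assumed layout: (out_channels * kT * kH * kW) rows by (D * H * W) columns.
    rows = out_channels * prod(kernel)
    cols = prod(in_spatial)
    return rows * cols * bytes_per_element / 2**30

# Input spatial sizes from the failing script and from the reduced script.
original = column_buffer_gib(64, (5, 5, 5), (8, 270, 480))
reduced = column_buffer_gib(64, (5, 5, 5), (8, 180, 180))

print(f"original input: {original:.2f} GiB")
print(f"reduced input:  {reduced:.2f} GiB")
```

Under that assumption the original input would need about 30.90 GiB, which matches the figure in the error message, while the reduced input would need about 7.72 GiB, which would fit on the 10.92 GiB card.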