Torch.cuda.empty_cache(), memory segmentation and runtime

Hello, I am working with a model whose VRAM requirements with 1080p frames make it go out of memory due to caching after the first iteration. When it does so I get the following error about memory fragmentation:

RuntimeError: CUDA out of memory. 
Tried to allocate 776.00 MiB (GPU 0; 14.76 GiB total capacity; 
11.41 GiB already allocated; 557.75 MiB free; 13.12 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

To circumvent this problem, I found out that I can simply use torch.cuda.empty_cache() at the end of the every iteration, like this:

with torch.no_grad():
    for i in range(100):
        start_time = time.time()
        output = pytorch_model(image_1, image_2)
        print(f'It took {time.time()-start_time}s')
        torch.cuda.empty_cache()

The strange thing that I found out, is that this not only allows me to run 1080p frames without going OOM, but by using this I get a speed up with lower resolution frames compared to if I did not empty the cache. I have read that this is not supposed to happen, and actually emptying the cache should slow down the process.

My guess is that what is slowing everything down here even more is the memory fragmentation, and emptying the cache, although usually decreases speed, here actually makes it faster because it solves the fragmentation? I am not entirely sure what is going on… any ideas? I am trying to speed up the model as much as possible.

This is correct, since PyTorch calls empty_cache() internally once it hits an OOM and tries to reallocate the memory. If this fails, the error is raised so your code shouldn’t make a difference.

Thank you for your answer! It shouldn’t make a difference in the sense that manually calling empty_cache shouldn’t have any effects in this case?

What I don’t undestand is why with lower resolution frames, where memory is not a problem, I get a speed up when calling empty_cache(). Why is that happening?

I also don’t know why this should be the case, since you would be forcing PyTorch to call into synchronizing cudaMalloc calls.
Could you profile your workload and check the timeline?

This is the profiling when emptying the cache

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  aten::zeros         0.01%      59.000us         0.02%     144.000us      24.000us          24 b           0 b           0 b           0 b             6  
                  aten::empty         0.26%       2.445ms         0.83%       7.814ms      26.578us         120 b         120 b       4.26 Gb       4.26 Gb           294  
                  aten::zero_         0.00%       5.000us         0.00%       5.000us       0.833us           0 b           0 b           0 b           0 b             6  
             Extraction layer         0.75%       7.012ms         3.48%      32.720ms      32.720ms          -4 b         -20 b       1.11 Gb      -4.40 Gb             1  
             aten::avg_pool2d         0.22%       2.113ms         0.77%       7.207ms     171.595us           0 b           0 b     270.55 Mb     270.55 Mb            42  
             cudaLaunchKernel         0.78%       7.334ms         0.78%       7.334ms       7.656us           0 b           0 b           0 b           0 b           958  
                   cudaMalloc         2.66%      24.999ms         2.66%      24.999ms     694.417us           0 b           0 b           0 b           0 b            36  
                 aten::conv2d         0.05%     494.000us         2.40%      22.591ms     132.111us           0 b           0 b       4.95 Gb           0 b           171  
      aten::_convolution_mode         0.06%     589.000us         2.17%      20.403ms     130.788us           0 b           0 b       4.92 Gb           0 b           156  
           aten::_convolution         0.24%       2.231ms         2.28%      21.462ms     125.509us           0 b           0 b       4.95 Gb           0 b           171  
      aten::cudnn_convolution         0.96%       9.021ms         1.56%      14.724ms      86.105us           0 b           0 b       4.95 Gb       3.79 Gb           171  
                aten::resize_         0.15%       1.428ms         1.37%      12.875ms      30.152us           0 b           0 b       6.85 Gb       6.85 Gb           427  
              cudaEventRecord         0.04%     360.000us         0.04%     360.000us       2.105us           0 b           0 b           0 b           0 b           171  
                aten::reshape         0.05%     513.000us         0.13%       1.252ms       7.322us           0 b           0 b           0 b           0 b           171  
         aten::_reshape_alias         0.08%     739.000us         0.08%     739.000us       4.322us           0 b           0 b           0 b           0 b           171  
                   aten::add_         0.23%       2.179ms         0.35%       3.255ms      19.035us           0 b           0 b           0 b           0 b           171  
             aten::leaky_relu         0.32%       3.034ms         0.68%       6.435ms      42.336us           0 b           0 b       4.50 Gb       4.50 Gb           152  
                    aten::cat         0.04%     411.000us         1.66%      15.647ms     184.082us           0 b           0 b       6.85 Gb           0 b            85  
                   aten::_cat         0.15%       1.373ms         1.62%      15.236ms     179.247us           0 b           0 b       6.85 Gb           0 b            85  
             Estimation layer         0.84%       7.926ms        45.34%     426.754ms     426.754ms          -4 b         -20 b      20.16 Mb      -6.37 Gb             1  
              cudaBindTexture         0.01%     100.000us         0.01%     100.000us       6.250us           0 b           0 b           0 b           0 b            16  
            cudaUnbindTexture         0.00%      15.000us         0.00%      15.000us       0.938us           0 b           0 b           0 b           0 b            16  
            aten::convolution         0.00%      46.000us         0.18%       1.694ms     112.933us           0 b           0 b      30.71 Mb           0 b            15  
                    aten::mul         0.08%     720.000us         0.10%     950.000us      27.941us           0 b           0 b      28.38 Mb      28.38 Mb            34  
    aten::upsample_bilinear2d         0.05%     517.000us         0.07%     670.000us      27.917us           0 b           0 b      37.61 Mb      37.61 Mb            24  
                  aten::slice         0.06%     536.000us         0.07%     690.000us       3.750us           0 b           0 b           0 b           0 b           184  
             aten::as_strided         0.02%     233.000us         0.02%     233.000us       1.031us           0 b           0 b           0 b           0 b           226  
                    aten::div         0.27%       2.576ms         0.34%       3.169ms      72.023us           0 b           0 b      38.36 Mb      38.36 Mb            44  
                     aten::to         0.01%     106.000us        49.45%     465.446ms      21.157ms           0 b           0 b      38.89 Mb           0 b            22  
               aten::_to_copy         0.03%     251.000us        49.44%     465.340ms      21.152ms           0 b           0 b      38.89 Mb           0 b            22  
          aten::empty_strided         0.03%     325.000us         0.03%     325.000us      14.773us           0 b           0 b      38.89 Mb      38.89 Mb            22  
                  aten::copy_         0.05%     459.000us        49.39%     464.841ms      17.878ms           0 b           0 b           0 b           0 b            26  
              cudaMemcpyAsync        12.29%     115.691ms        12.29%     115.691ms       5.259ms           0 b           0 b           0 b           0 b            22  
        cudaStreamSynchronize        37.05%     348.666ms        37.05%     348.666ms      15.848ms           0 b           0 b           0 b           0 b            22  
                    aten::add         0.13%       1.195ms         0.18%       1.660ms      36.087us           0 b           0 b      75.50 Mb      75.50 Mb            46  
                aten::permute         0.02%     220.000us         0.03%     299.000us       7.119us           0 b           0 b           0 b           0 b            42  
           aten::grid_sampler         0.02%     182.000us         0.47%       4.433ms     201.500us           0 b           0 b       2.22 Gb           0 b            22  
        aten::grid_sampler_2d         0.03%     320.000us         0.45%       4.251ms     193.227us           0 b           0 b       2.22 Gb           0 b            22  
                  Concatenate         0.10%     955.000us         0.51%       4.808ms       2.404ms          -8 b         -40 b       1.17 Gb     -23.56 Mb             2  
                   Warp layer         0.21%       1.975ms         8.25%      77.666ms      77.666ms          -4 b         -20 b       2.24 Gb     -74.93 Mb             1  
                 Fusion layer         0.13%       1.186ms         0.93%       8.724ms       8.724ms          -4 b         -20 b      10.55 Mb      -5.34 Gb             1  
     aten::upsample_nearest2d         0.02%     142.000us         0.06%     537.000us     134.250us           0 b           0 b     893.52 Mb     893.52 Mb             4  
        aten::constant_pad_nd         0.01%      65.000us         0.20%       1.876ms     469.000us           0 b           0 b     898.46 Mb           0 b             4  
                  aten::fill_         0.01%      53.000us         0.01%      98.000us      24.500us           0 b           0 b           0 b           0 b             4  
                 aten::narrow         0.00%      17.000us         0.00%      45.000us       5.625us           0 b           0 b           0 b           0 b             8  
                     [memory]         0.00%       0.000us         0.00%       0.000us       0.000us           0 b           0 b      -4.55 Gb      -4.55 Gb            86  
                     cudaFree        41.47%     390.344ms        41.47%     390.344ms      10.843ms           0 b           0 b           0 b           0 b            36  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 941.160ms

And this is the profiling when not emptying the cache

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  aten::zeros         0.01%      52.000us         0.03%     145.000us      24.167us          24 b           0 b           0 b           0 b             6  
                  aten::empty         0.51%       2.718ms         0.51%       2.718ms       9.245us         120 b         120 b       4.26 Gb       4.26 Gb           294  
                  aten::zero_         0.00%       8.000us         0.00%       8.000us       1.333us           0 b           0 b           0 b           0 b             6  
             Extraction layer         1.17%       6.293ms         4.23%      22.747ms      22.747ms          -4 b         -20 b       1.11 Gb      -4.40 Gb             1  
             aten::avg_pool2d         0.38%       2.027ms         0.44%       2.363ms      56.262us           0 b           0 b     269.72 Mb     269.72 Mb            42  
             cudaLaunchKernel         1.46%       7.829ms         1.46%       7.829ms       8.172us           0 b           0 b           0 b           0 b           958  
                 aten::conv2d         0.08%     423.000us         4.33%      23.271ms     136.088us           0 b           0 b       4.94 Gb           0 b           171  
      aten::_convolution_mode         0.11%     575.000us         3.86%      20.716ms     132.795us           0 b           0 b       4.91 Gb           0 b           156  
           aten::_convolution         0.44%       2.351ms         4.14%      22.230ms     130.000us           0 b           0 b       4.94 Gb           0 b           171  
      aten::cudnn_convolution         1.81%       9.703ms         2.76%      14.803ms      86.567us           0 b           0 b       4.94 Gb       3.77 Gb           171  
                aten::resize_         0.26%       1.383ms         0.26%       1.383ms       3.239us           0 b           0 b       6.85 Gb       6.85 Gb           427  
              cudaEventRecord         0.08%     413.000us         0.08%     413.000us       2.415us           0 b           0 b           0 b           0 b           171  
                aten::reshape         0.10%     556.000us         0.25%       1.348ms       7.883us           0 b           0 b           0 b           0 b           171  
         aten::_reshape_alias         0.15%     792.000us         0.15%     792.000us       4.632us           0 b           0 b           0 b           0 b           171  
                   aten::add_         0.46%       2.467ms         0.69%       3.728ms      21.801us           0 b           0 b           0 b           0 b           171  
             aten::leaky_relu         0.61%       3.277ms         0.82%       4.395ms      28.914us           0 b           0 b       4.50 Gb       4.50 Gb           152  
                    aten::cat         0.09%     471.000us         0.78%       4.187ms      49.259us           0 b           0 b       6.85 Gb           0 b            85  
                   aten::_cat         0.27%       1.438ms         0.69%       3.716ms      43.718us           0 b           0 b       6.85 Gb           0 b            85  
             Estimation layer         1.63%       8.730ms        80.54%     432.639ms     432.639ms          -4 b         -20 b      18.75 Mb      -6.36 Gb             1  
              cudaBindTexture         0.02%     117.000us         0.02%     117.000us       7.312us           0 b           0 b           0 b           0 b            16  
            cudaUnbindTexture         0.00%      18.000us         0.00%      18.000us       1.125us           0 b           0 b           0 b           0 b            16  
            aten::convolution         0.01%      43.000us         0.40%       2.135ms     142.333us           0 b           0 b      29.30 Mb           0 b            15  
                    aten::mul         0.17%     888.000us         0.22%       1.177ms      34.618us           0 b           0 b      28.49 Mb      28.49 Mb            34  
    aten::upsample_bilinear2d         0.12%     650.000us         0.16%     852.000us      35.500us           0 b           0 b      38.56 Mb      38.56 Mb            24  
                  aten::slice         0.14%     772.000us         0.19%       1.016ms       5.522us           0 b           0 b           0 b           0 b           184  
             aten::as_strided         0.06%     339.000us         0.06%     339.000us       1.500us           0 b           0 b           0 b           0 b           226  
                    aten::div         0.51%       2.759ms         0.59%       3.154ms      71.682us           0 b           0 b      37.48 Mb      37.48 Mb            44  
                     aten::to         0.02%      85.000us        88.07%     473.090ms      21.504ms           0 b           0 b      37.48 Mb           0 b            22  
               aten::_to_copy         0.06%     300.000us        88.05%     473.005ms      21.500ms           0 b           0 b      37.48 Mb           0 b            22  
          aten::empty_strided         0.07%     399.000us         0.07%     399.000us      18.136us           0 b           0 b      37.48 Mb      37.48 Mb            22  
                  aten::copy_         0.09%     489.000us        87.95%     472.422ms      18.170ms           0 b           0 b           0 b           0 b            26  
              cudaMemcpyAsync        21.71%     116.611ms        21.71%     116.611ms       5.301ms           0 b           0 b           0 b           0 b            22  
        cudaStreamSynchronize        66.14%     355.270ms        66.14%     355.270ms      16.149ms           0 b           0 b           0 b           0 b            22  
                    aten::add         0.27%       1.462ms         0.38%       2.030ms      44.130us           0 b           0 b      75.15 Mb      75.15 Mb            46  
                aten::permute         0.04%     234.000us         0.06%     329.000us       7.833us           0 b           0 b           0 b           0 b            42  
           aten::grid_sampler         0.03%     146.000us         0.21%       1.121ms      50.955us           0 b           0 b       2.22 Gb           0 b            22  
        aten::grid_sampler_2d         0.07%     358.000us         0.18%     975.000us      44.318us           0 b           0 b       2.22 Gb           0 b            22  
                  Concatenate         0.21%       1.132ms         0.56%       2.985ms       1.492ms          -8 b         -40 b       1.17 Gb     -24.04 Mb             2  
                   Warp layer         0.39%       2.116ms        13.82%      74.230ms      74.230ms          -4 b         -20 b       2.24 Gb     -75.82 Mb             1  
                 Fusion layer         0.23%       1.217ms         0.82%       4.431ms       4.431ms          -4 b         -20 b      10.55 Mb      -5.34 Gb             1  
     aten::upsample_nearest2d         0.02%     123.000us         0.03%     164.000us      41.000us           0 b           0 b     894.52 Mb     894.52 Mb             4  
        aten::constant_pad_nd         0.01%      73.000us         0.07%     396.000us      99.000us           0 b           0 b     898.30 Mb           0 b             4  
                  aten::fill_         0.01%      50.000us         0.02%      87.000us      21.750us           0 b           0 b           0 b           0 b             4  
                 aten::narrow         0.00%      20.000us         0.01%      65.000us       8.125us           0 b           0 b           0 b           0 b             8  
                     [memory]         0.00%       0.000us         0.00%       0.000us       0.000us           0 b           0 b      -4.55 Gb      -4.55 Gb            86  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 537.177ms

Here it correctly says that when not emptying the cache the model is faster (both have been warmed up). The extra time taken by the first one is because it takes time to empty the cache.

However, in reality when timing the models with this:

with torch.no_grad():
    for i in range(100):
        start_time = time.time()
        output = pytorch_model(image_1, image_2)
        # torch.cuda.empty_cache()
        print(f'It took {time.time()-start_time}s')

And I comment/uncomment the caching operation, they take about the same time (~0.95s) (slightly faster when not emptying the cache). However, the inference time of the model alone, calculated like this:

with torch.no_grad():
    for i in range(100):
        start_time = time.time()
        output = pytorch_model(image_1, image_2)
        print(f'It took {time.time()-start_time}s')
        # torch.cuda.empty_cache()

when emptying the cache I get a much faster inference speed (0.55s) compare to when I don’t empty the cache (0.95s). Does this mean that my model would run faster if I didn’t have any caching enabled? It looks like when emptying the cache the inference speed is faster, however then emptying the cache takes time. I am a bit confused.

This is not a proper way to profile the models, since CUDA operations are executed asynchronously.
You are currently not timing the actual kernels, but the Python overhead, the kernel launches, etc. if you don’t synchronize the code.

In the second code snippet you have moved the empty_cache call after stopping the (invalid) timer. Given that you are profiling at max the kernel launches, this might be expected.