Hello, I am working with a model whose VRAM requirements with 1080p frames make it run out of memory after the first iteration because of caching. When that happens I get the following error, which mentions memory fragmentation:
RuntimeError: CUDA out of memory.
Tried to allocate 776.00 MiB (GPU 0; 14.76 GiB total capacity;
11.41 GiB already allocated; 557.75 MiB free; 13.12 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
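(As I understand it, the suggestion from the error message would be applied roughly like this; I have not tuned the value, 128 is only an illustrative guess:)

import os

# The allocator config is read when the CUDA caching allocator initializes,
# so set it before importing torch (or export it in the shell instead).
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'  # 128 is just a guess

import torch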
To circumvent this problem, I found out that I can simply call torch.cuda.empty_cache() at the end of every iteration, like this:
with torch.no_grad():
    for i in range(100):
        start_time = time.time()
        output = pytorch_model(image_1, image_2)
        print(f'It took {time.time()-start_time}s')
        torch.cuda.empty_cache()
The strange thing I found is that this not only allows me to run 1080p frames without going OOM, but it also gives me a speed-up with lower-resolution frames compared to not emptying the cache. From what I have read, this is not supposed to happen; emptying the cache should actually slow the process down.
My guess is that memory fragmentation is what slows everything down here even more, and that emptying the cache, although it usually reduces speed, actually makes things faster in this case because it resolves the fragmentation? I am not entirely sure what is going on… any ideas? I am trying to speed up the model as much as possible.
That is correct: PyTorch calls empty_cache() internally once it hits an OOM and then retries the allocation. The error is only raised if that retry also fails, so your manual call shouldn’t make a difference.
Thank you for your answer! It shouldn’t make a difference in the sense that manually calling empty_cache() shouldn’t have any effect in this case?
What I don’t understand is why, with lower-resolution frames where memory is not a problem, I get a speed-up when calling empty_cache(). Why is that happening?
I also don’t know why this should be the case, since you would be forcing PyTorch to call into synchronizing cudaMalloc calls.
Could you profile your workload and check the timeline?
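In case it helps, a minimal sketch of collecting such a profile with the autograd profiler (reusing the placeholder names from your snippet) could look like this:

with torch.no_grad():
    with torch.autograd.profiler.profile(use_cuda=True, profile_memory=True) as prof:
        output = pytorch_model(image_1, image_2)
        torch.cuda.empty_cache()

# Aggregate per-op statistics into a table like the ones below.
print(prof.key_averages().table(sort_by='self_cpu_time_total'))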
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::zeros 0.01% 59.000us 0.02% 144.000us 24.000us 24 b 0 b 0 b 0 b 6
aten::empty 0.26% 2.445ms 0.83% 7.814ms 26.578us 120 b 120 b 4.26 Gb 4.26 Gb 294
aten::zero_ 0.00% 5.000us 0.00% 5.000us 0.833us 0 b 0 b 0 b 0 b 6
Extraction layer 0.75% 7.012ms 3.48% 32.720ms 32.720ms -4 b -20 b 1.11 Gb -4.40 Gb 1
aten::avg_pool2d 0.22% 2.113ms 0.77% 7.207ms 171.595us 0 b 0 b 270.55 Mb 270.55 Mb 42
cudaLaunchKernel 0.78% 7.334ms 0.78% 7.334ms 7.656us 0 b 0 b 0 b 0 b 958
cudaMalloc 2.66% 24.999ms 2.66% 24.999ms 694.417us 0 b 0 b 0 b 0 b 36
aten::conv2d 0.05% 494.000us 2.40% 22.591ms 132.111us 0 b 0 b 4.95 Gb 0 b 171
aten::_convolution_mode 0.06% 589.000us 2.17% 20.403ms 130.788us 0 b 0 b 4.92 Gb 0 b 156
aten::_convolution 0.24% 2.231ms 2.28% 21.462ms 125.509us 0 b 0 b 4.95 Gb 0 b 171
aten::cudnn_convolution 0.96% 9.021ms 1.56% 14.724ms 86.105us 0 b 0 b 4.95 Gb 3.79 Gb 171
aten::resize_ 0.15% 1.428ms 1.37% 12.875ms 30.152us 0 b 0 b 6.85 Gb 6.85 Gb 427
cudaEventRecord 0.04% 360.000us 0.04% 360.000us 2.105us 0 b 0 b 0 b 0 b 171
aten::reshape 0.05% 513.000us 0.13% 1.252ms 7.322us 0 b 0 b 0 b 0 b 171
aten::_reshape_alias 0.08% 739.000us 0.08% 739.000us 4.322us 0 b 0 b 0 b 0 b 171
aten::add_ 0.23% 2.179ms 0.35% 3.255ms 19.035us 0 b 0 b 0 b 0 b 171
aten::leaky_relu 0.32% 3.034ms 0.68% 6.435ms 42.336us 0 b 0 b 4.50 Gb 4.50 Gb 152
aten::cat 0.04% 411.000us 1.66% 15.647ms 184.082us 0 b 0 b 6.85 Gb 0 b 85
aten::_cat 0.15% 1.373ms 1.62% 15.236ms 179.247us 0 b 0 b 6.85 Gb 0 b 85
Estimation layer 0.84% 7.926ms 45.34% 426.754ms 426.754ms -4 b -20 b 20.16 Mb -6.37 Gb 1
cudaBindTexture 0.01% 100.000us 0.01% 100.000us 6.250us 0 b 0 b 0 b 0 b 16
cudaUnbindTexture 0.00% 15.000us 0.00% 15.000us 0.938us 0 b 0 b 0 b 0 b 16
aten::convolution 0.00% 46.000us 0.18% 1.694ms 112.933us 0 b 0 b 30.71 Mb 0 b 15
aten::mul 0.08% 720.000us 0.10% 950.000us 27.941us 0 b 0 b 28.38 Mb 28.38 Mb 34
aten::upsample_bilinear2d 0.05% 517.000us 0.07% 670.000us 27.917us 0 b 0 b 37.61 Mb 37.61 Mb 24
aten::slice 0.06% 536.000us 0.07% 690.000us 3.750us 0 b 0 b 0 b 0 b 184
aten::as_strided 0.02% 233.000us 0.02% 233.000us 1.031us 0 b 0 b 0 b 0 b 226
aten::div 0.27% 2.576ms 0.34% 3.169ms 72.023us 0 b 0 b 38.36 Mb 38.36 Mb 44
aten::to 0.01% 106.000us 49.45% 465.446ms 21.157ms 0 b 0 b 38.89 Mb 0 b 22
aten::_to_copy 0.03% 251.000us 49.44% 465.340ms 21.152ms 0 b 0 b 38.89 Mb 0 b 22
aten::empty_strided 0.03% 325.000us 0.03% 325.000us 14.773us 0 b 0 b 38.89 Mb 38.89 Mb 22
aten::copy_ 0.05% 459.000us 49.39% 464.841ms 17.878ms 0 b 0 b 0 b 0 b 26
cudaMemcpyAsync 12.29% 115.691ms 12.29% 115.691ms 5.259ms 0 b 0 b 0 b 0 b 22
cudaStreamSynchronize 37.05% 348.666ms 37.05% 348.666ms 15.848ms 0 b 0 b 0 b 0 b 22
aten::add 0.13% 1.195ms 0.18% 1.660ms 36.087us 0 b 0 b 75.50 Mb 75.50 Mb 46
aten::permute 0.02% 220.000us 0.03% 299.000us 7.119us 0 b 0 b 0 b 0 b 42
aten::grid_sampler 0.02% 182.000us 0.47% 4.433ms 201.500us 0 b 0 b 2.22 Gb 0 b 22
aten::grid_sampler_2d 0.03% 320.000us 0.45% 4.251ms 193.227us 0 b 0 b 2.22 Gb 0 b 22
Concatenate 0.10% 955.000us 0.51% 4.808ms 2.404ms -8 b -40 b 1.17 Gb -23.56 Mb 2
Warp layer 0.21% 1.975ms 8.25% 77.666ms 77.666ms -4 b -20 b 2.24 Gb -74.93 Mb 1
Fusion layer 0.13% 1.186ms 0.93% 8.724ms 8.724ms -4 b -20 b 10.55 Mb -5.34 Gb 1
aten::upsample_nearest2d 0.02% 142.000us 0.06% 537.000us 134.250us 0 b 0 b 893.52 Mb 893.52 Mb 4
aten::constant_pad_nd 0.01% 65.000us 0.20% 1.876ms 469.000us 0 b 0 b 898.46 Mb 0 b 4
aten::fill_ 0.01% 53.000us 0.01% 98.000us 24.500us 0 b 0 b 0 b 0 b 4
aten::narrow 0.00% 17.000us 0.00% 45.000us 5.625us 0 b 0 b 0 b 0 b 8
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us 0 b 0 b -4.55 Gb -4.55 Gb 86
cudaFree 41.47% 390.344ms 41.47% 390.344ms 10.843ms 0 b 0 b 0 b 0 b 36
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 941.160ms
And this is the profile when not emptying the cache:
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::zeros 0.01% 52.000us 0.03% 145.000us 24.167us 24 b 0 b 0 b 0 b 6
aten::empty 0.51% 2.718ms 0.51% 2.718ms 9.245us 120 b 120 b 4.26 Gb 4.26 Gb 294
aten::zero_ 0.00% 8.000us 0.00% 8.000us 1.333us 0 b 0 b 0 b 0 b 6
Extraction layer 1.17% 6.293ms 4.23% 22.747ms 22.747ms -4 b -20 b 1.11 Gb -4.40 Gb 1
aten::avg_pool2d 0.38% 2.027ms 0.44% 2.363ms 56.262us 0 b 0 b 269.72 Mb 269.72 Mb 42
cudaLaunchKernel 1.46% 7.829ms 1.46% 7.829ms 8.172us 0 b 0 b 0 b 0 b 958
aten::conv2d 0.08% 423.000us 4.33% 23.271ms 136.088us 0 b 0 b 4.94 Gb 0 b 171
aten::_convolution_mode 0.11% 575.000us 3.86% 20.716ms 132.795us 0 b 0 b 4.91 Gb 0 b 156
aten::_convolution 0.44% 2.351ms 4.14% 22.230ms 130.000us 0 b 0 b 4.94 Gb 0 b 171
aten::cudnn_convolution 1.81% 9.703ms 2.76% 14.803ms 86.567us 0 b 0 b 4.94 Gb 3.77 Gb 171
aten::resize_ 0.26% 1.383ms 0.26% 1.383ms 3.239us 0 b 0 b 6.85 Gb 6.85 Gb 427
cudaEventRecord 0.08% 413.000us 0.08% 413.000us 2.415us 0 b 0 b 0 b 0 b 171
aten::reshape 0.10% 556.000us 0.25% 1.348ms 7.883us 0 b 0 b 0 b 0 b 171
aten::_reshape_alias 0.15% 792.000us 0.15% 792.000us 4.632us 0 b 0 b 0 b 0 b 171
aten::add_ 0.46% 2.467ms 0.69% 3.728ms 21.801us 0 b 0 b 0 b 0 b 171
aten::leaky_relu 0.61% 3.277ms 0.82% 4.395ms 28.914us 0 b 0 b 4.50 Gb 4.50 Gb 152
aten::cat 0.09% 471.000us 0.78% 4.187ms 49.259us 0 b 0 b 6.85 Gb 0 b 85
aten::_cat 0.27% 1.438ms 0.69% 3.716ms 43.718us 0 b 0 b 6.85 Gb 0 b 85
Estimation layer 1.63% 8.730ms 80.54% 432.639ms 432.639ms -4 b -20 b 18.75 Mb -6.36 Gb 1
cudaBindTexture 0.02% 117.000us 0.02% 117.000us 7.312us 0 b 0 b 0 b 0 b 16
cudaUnbindTexture 0.00% 18.000us 0.00% 18.000us 1.125us 0 b 0 b 0 b 0 b 16
aten::convolution 0.01% 43.000us 0.40% 2.135ms 142.333us 0 b 0 b 29.30 Mb 0 b 15
aten::mul 0.17% 888.000us 0.22% 1.177ms 34.618us 0 b 0 b 28.49 Mb 28.49 Mb 34
aten::upsample_bilinear2d 0.12% 650.000us 0.16% 852.000us 35.500us 0 b 0 b 38.56 Mb 38.56 Mb 24
aten::slice 0.14% 772.000us 0.19% 1.016ms 5.522us 0 b 0 b 0 b 0 b 184
aten::as_strided 0.06% 339.000us 0.06% 339.000us 1.500us 0 b 0 b 0 b 0 b 226
aten::div 0.51% 2.759ms 0.59% 3.154ms 71.682us 0 b 0 b 37.48 Mb 37.48 Mb 44
aten::to 0.02% 85.000us 88.07% 473.090ms 21.504ms 0 b 0 b 37.48 Mb 0 b 22
aten::_to_copy 0.06% 300.000us 88.05% 473.005ms 21.500ms 0 b 0 b 37.48 Mb 0 b 22
aten::empty_strided 0.07% 399.000us 0.07% 399.000us 18.136us 0 b 0 b 37.48 Mb 37.48 Mb 22
aten::copy_ 0.09% 489.000us 87.95% 472.422ms 18.170ms 0 b 0 b 0 b 0 b 26
cudaMemcpyAsync 21.71% 116.611ms 21.71% 116.611ms 5.301ms 0 b 0 b 0 b 0 b 22
cudaStreamSynchronize 66.14% 355.270ms 66.14% 355.270ms 16.149ms 0 b 0 b 0 b 0 b 22
aten::add 0.27% 1.462ms 0.38% 2.030ms 44.130us 0 b 0 b 75.15 Mb 75.15 Mb 46
aten::permute 0.04% 234.000us 0.06% 329.000us 7.833us 0 b 0 b 0 b 0 b 42
aten::grid_sampler 0.03% 146.000us 0.21% 1.121ms 50.955us 0 b 0 b 2.22 Gb 0 b 22
aten::grid_sampler_2d 0.07% 358.000us 0.18% 975.000us 44.318us 0 b 0 b 2.22 Gb 0 b 22
Concatenate 0.21% 1.132ms 0.56% 2.985ms 1.492ms -8 b -40 b 1.17 Gb -24.04 Mb 2
Warp layer 0.39% 2.116ms 13.82% 74.230ms 74.230ms -4 b -20 b 2.24 Gb -75.82 Mb 1
Fusion layer 0.23% 1.217ms 0.82% 4.431ms 4.431ms -4 b -20 b 10.55 Mb -5.34 Gb 1
aten::upsample_nearest2d 0.02% 123.000us 0.03% 164.000us 41.000us 0 b 0 b 894.52 Mb 894.52 Mb 4
aten::constant_pad_nd 0.01% 73.000us 0.07% 396.000us 99.000us 0 b 0 b 898.30 Mb 0 b 4
aten::fill_ 0.01% 50.000us 0.02% 87.000us 21.750us 0 b 0 b 0 b 0 b 4
aten::narrow 0.00% 20.000us 0.01% 65.000us 8.125us 0 b 0 b 0 b 0 b 8
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us 0 b 0 b -4.55 Gb -4.55 Gb 86
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 537.177ms
Here the profiler correctly shows that the model is faster when not emptying the cache (both runs were warmed up). The extra time in the first profile is spent emptying the cache itself: note the 390 ms in cudaFree.
However, when I actually time the models like this:
with torch.no_grad():
    for i in range(100):
        start_time = time.time()
        output = pytorch_model(image_1, image_2)
        # torch.cuda.empty_cache()
        print(f'It took {time.time()-start_time}s')
and comment/uncomment the empty_cache() call, both versions take about the same time (~0.95 s; slightly faster when not emptying the cache). However, the inference time of the model alone, measured like this:
with torch.no_grad():
    for i in range(100):
        start_time = time.time()
        output = pytorch_model(image_1, image_2)
        print(f'It took {time.time()-start_time}s')
        # torch.cuda.empty_cache()
is much faster when I empty the cache (0.55 s) compared to when I don’t (0.95 s). Does this mean my model would run faster if caching were disabled entirely? It looks like inference itself is faster when emptying the cache, but then emptying the cache takes time. I am a bit confused.
This is not a proper way to profile the models, since CUDA operations are executed asynchronously.
You are currently not timing the actual kernels but the Python overhead, the kernel launches, etc., since you don’t synchronize the code.
In the second code snippet you have moved the empty_cache() call to after stopping the (invalid) timer. Given that you are at most timing the kernel launches, that result might be expected.
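A sketch of host-side timing with proper synchronization (reusing your placeholder model and inputs) could look like this:

import time
import torch

with torch.no_grad():
    for i in range(100):
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
        start_time = time.time()
        output = pytorch_model(image_1, image_2)
        torch.cuda.synchronize()  # wait for all launched kernels to complete
        print(f'It took {time.time() - start_time}s')

Alternatively, pairs of torch.cuda.Event(enable_timing=True) can record timestamps on the GPU itself, so only a single synchronization is needed before reading the elapsed time.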