Mixed precision training not working with cuDNN benchmark

Dear all,
I have upgraded torch to 1.6 to use native mixed precision training.
I am observing some strange behavior: mixed precision training does not seem to have any effect on the model's memory consumption when cudnn_benchmark=True.

System:
PyTorch 1.6
PyTorch Lightning 0.8.1
Linux 18.01
GPU: NVIDIA Tesla T4

I am using PyTorch Lightning, so I enable mixed precision training as:

Trainer(..., precision=16, amp_level='O2')
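
For reference, my understanding is that Lightning's precision=16 option wraps PyTorch's native torch.cuda.amp utilities; a minimal, self-contained sketch of that pattern together with the cudnn flag (toy Conv3d model and shapes chosen only for illustration) would look roughly like this:

import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True            # the flag in question

model = nn.Conv3d(1, 8, kernel_size=3).cuda()    # toy stand-in for my real 12M-parameter model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    x = torch.randn(2, 1, 32, 64, 64, device='cuda')
    target = torch.randn(2, 8, 30, 62, 62, device='cuda')
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # forward pass runs in mixed precision
        loss = criterion(model(x), target)
    scaler.scale(loss).backward()                # scaled backward to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()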

The following is the output of nvidia-smi with FP16 enabled:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   73C    P0    64W /  70W |  12230MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The following is with FP16 disabled:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   77C    P0    63W /  70W |  12200MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I was playing around with other parameters and found that if I disable cudnn_benchmark,
I see a ~14% memory reduction (12.2 GB → 10.5 GB), but disabling cudnn_benchmark increases my forward-pass time significantly (a rough timing sketch follows the nvidia-smi output below).
My model has 12 million trainable parameters.
I was expecting the memory to drop much further, since the model is quite big and the effect of FP16 should therefore be more evident. My model also contains quite a few 3D convolutions.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   56C    P0    66W /  70W |  10578MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
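
For completeness, this is roughly how I compared the forward-pass times with cudnn_benchmark on and off (the helper below is written just for illustration and is not part of any library):

import time
import torch

def time_forward(model, x, warmup=5, iters=20):
    # warm-up iterations let cudnn.benchmark pick its algorithms before timing starts
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()                 # wait until all launched kernels have finished
    return (time.perf_counter() - start) / iters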

What do you think is the reason for this behavior?
Thanks.

The memory usage reported by nvidia-smi shows the memory reserved by PyTorch (allocated + cached) as well as the CUDA context (and all other processes).
Setting cudnn.benchmark=True will profile different cudnn kernels for each new input shape and select the fastest one. This behavior might use more memory in the initial iterations, so the cached memory will increase. If a profiled algorithm needs a workspace that would cause an out-of-memory issue, that algorithm is skipped.
To check the allocated memory, you can use torch.cuda.memory_allocated().
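
For example, a quick way to compare the different counters (the printed values will of course depend on your setup):

import torch

x = torch.randn(1024, 1024, device='cuda')       # some allocation

print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")       # memory used by live tensors
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved")         # allocated + cached
print(torch.cuda.max_memory_allocated() / 1024**2, "MiB peak allocated")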

@ptrblck
Sorry for the late reply.
I profiled the memory allocation at the start of the training loop and am reporting the values at the start of the 3rd iteration, after which they remain constant for each setting.
If you could explain this behavior, it would be really helpful.
P.S. Just for validation, I have tried gc.collect(), deleting the input tensors and outputs, and clearing the cache, but none of these strategies change anything.
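
Concretely, the kind of cleanup I tried between iterations looks roughly like this (toy model, just to illustrate the pattern):

import gc
import torch

model = torch.nn.Linear(1024, 1024).cuda()       # toy stand-in model

for _ in range(3):
    inputs = torch.randn(64, 1024, device='cuda')
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()

    # cleanup attempts: delete the iteration's tensors, collect garbage, clear the cache
    del inputs, outputs, loss
    gc.collect()
    torch.cuda.empty_cache()                     # releases cached blocks, does not free live tensors
    print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated,",
          torch.cuda.memory_reserved() / 1024**2, "MiB reserved")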

Based on your output, the FP16 approach uses less memory during the run, while creating a higher memory peak.
As explained before, the higher memory peak might be created e.g. by the different profiled kernels when cudnn.benchmark=True is used.
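
A minimal way to see the effect would be to compare the peak allocation of the first forward pass with the flag switched on and off (toy model, only to illustrate the measurement):

import torch

torch.backends.cudnn.benchmark = True            # flip to False to compare the peaks

model = torch.nn.Conv3d(1, 8, kernel_size=3).cuda()
x = torch.randn(2, 1, 32, 64, 64, device='cuda')

torch.cuda.reset_peak_memory_stats()
with torch.cuda.amp.autocast():
    out = model(x)                               # the first call triggers the cudnn algorithm selection
torch.cuda.synchronize()
print(torch.cuda.max_memory_allocated() / 1024**2, "MiB peak allocated")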

@ptrblck
As we know, cudnn.benchmark=True is generally recommended for speed-ups, so turning it off would slow down training considerably.
Also, even if FP16 uses less memory during the run, we see that even with cudnn benchmark disabled the memory reduction is minimal. Since the peak memory consumption is what causes the OOM, GPUs will still run out of memory unless these peaks go down. What should the mitigation strategy be in this case?

No, that shouldn’t be the case, as explained before:

Could you post a setup where an OOM error is raised?