Mixed precision training reduced GPU utilization

When I use mixed precision training, GPU utilization drops a lot, as shown below:

Thu Oct  8 23:42:03 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:04:00.0 Off |                  N/A |
| 51%   53C    P2   116W / 250W |   8990MiB / 11019MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:05:00.0 Off |                  N/A |
| 58%   56C    P2   201W / 250W |   8990MiB / 11019MiB |     80%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:08:00.0 Off |                  N/A |
| 58%   56C    P2   151W / 250W |   8990MiB / 11019MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
| 58%   56C    P2   108W / 250W |   8990MiB / 11019MiB |     75%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:84:00.0 Off |                  N/A |
| 59%   56C    P2   150W / 250W |   8990MiB / 11019MiB |     77%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:85:00.0 Off |                  N/A |
| 57%   56C    P2   102W / 250W |   8990MiB / 11019MiB |     81%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  Off  | 00000000:88:00.0 Off |                  N/A |
| 53%   54C    P2   163W / 250W |   8990MiB / 11019MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 208...  Off  | 00000000:89:00.0 Off |                  N/A |
| 61%   57C    P2   141W / 250W |   8990MiB / 11019MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+

But with FP32, it was nearly 100%. What is going wrong? My environment is:

In [1]: import torch

In [2]: torch.__version__
Out[2]: '1.6.0'

In [3]: torch.version.cuda
Out[3]: '10.2'

In [4]: torch.backends.cudnn.version()
Out[4]: 7605
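(The training code isn't shown in the question; for context, a minimal sketch of native mixed precision in PyTorch 1.6 via `torch.cuda.amp` might look like the following. The model, shapes, and optimizer here are placeholders, and AMP is enabled only when a GPU is present.)

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # autocast/GradScaler are no-ops when disabled

# Placeholder model and data; the real training setup isn't shown in the post
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 128, device=device)
target = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):
    # ops inside this context may run in float16 on CUDA
    loss = nn.functional.cross_entropy(model(x), target)
scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()                # adjusts the scale factor for the next step
print(float(loss))
```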

GPU utilization doesn't correspond directly to speed, so did you profile the code and measure a speedup or slowdown?
E.g. if mixed-precision training is giving you a speedup, for instance by using Tensor Cores, your code might now suffer (more) from a potential data loading bottleneck, which would reduce GPU utilization.
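To compare the two configurations fairly, you could time the actual step duration rather than watch utilization. A rough sketch (the model and shapes are placeholders): since CUDA kernels launch asynchronously, `torch.cuda.synchronize()` is needed before reading the timer, or the measurement stops before the GPU finishes.

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder workload; substitute your real model and batch
model = nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

def mean_step_time(n=20):
    # one warm-up iteration so one-time startup cost is excluded
    model(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for pending kernels before timing
    start = time.perf_counter()
    for _ in range(n):
        model(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()  # ensure all timed work has finished
    return (time.perf_counter() - start) / n

print(f"{mean_step_time() * 1e3:.3f} ms/step")
```

Running this once with FP32 and once with the AMP-enabled step would show whether mixed precision actually made each iteration faster; if it did, the lower utilization may just mean the GPU is now waiting on the data loader.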