High CPU Usage?

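TF32 is already used by default (at least for cuDNN convolutions on Ampere or newer GPUs; whether it is also the default for matmuls depends on your PyTorch version), so the TensorCores on your device should already be utilized, which might not leave much additional performance benefit for AMP. This issue with the related double post might be interesting for you.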
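To see how much of the speedup comes from TF32 rather than AMP, you could inspect and toggle the TF32 flags before rerunning your benchmark. This is just a rough sketch: `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32` are the PyTorch flags, while the helper names `tf32_status` and `set_tf32` are made up for this example. Newer PyTorch releases may print a deprecation warning for these flags.

```python
import torch


def tf32_status():
    """Return the current TF32 settings for matmuls and cuDNN convolutions."""
    return {
        "matmul": torch.backends.cuda.matmul.allow_tf32,
        "cudnn": torch.backends.cudnn.allow_tf32,
    }


def set_tf32(enabled):
    """Enable or disable TF32 for both matmuls and cuDNN convolutions."""
    torch.backends.cuda.matmul.allow_tf32 = enabled
    torch.backends.cudnn.allow_tf32 = enabled


# Print the settings your script is currently running with.
print("TF32 settings:", tf32_status())
```

Calling `set_tf32(False)` before an FP32 run and comparing it against a run with TF32 enabled, and then against an AMP run, should show how much headroom is actually left for AMP. The flags only change behavior on GPUs that support TF32.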
In any case, you could profile your workload with a visual profiler such as Nsight Systems and check for other bottlenecks, such as data loading.
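Before setting up a full profiler run, a quick check for a data loading bottleneck could look like the sketch below. It is not a replacement for Nsight Systems, and `time_data_vs_compute`, `loader`, and `step_fn` are hypothetical names for this example. It only splits the wall-clock time into "waiting for the next batch" and "running the training step" for any iterable, such as a `DataLoader`.

```python
import time


def time_data_vs_compute(loader, step_fn, max_batches=50):
    """Measure time spent fetching batches vs. time spent in step_fn.

    Note: CUDA kernels run asynchronously, so for GPU work step_fn should
    end with torch.cuda.synchronize(); otherwise GPU time can be wrongly
    attributed to the next batch fetch.
    """
    data_s = 0.0
    step_s = 0.0
    batches = 0
    it = iter(loader)
    while batches < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is the wait for data
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward/backward/optimizer step would go here
        t2 = time.perf_counter()
        data_s += t1 - t0
        step_s += t2 - t1
        batches += 1
    return {"batches": batches, "data_s": data_s, "step_s": step_s}


if __name__ == "__main__":
    # Toy example: a deliberately slow "loader" and a trivial step.
    def slow_loader():
        for i in range(5):
            time.sleep(0.01)  # simulates slow preprocessing
            yield i

    print(time_data_vs_compute(slow_loader(), lambda batch: None))
```

If `data_s` dominates `step_s`, the GPU is most likely waiting for the CPU side (which would also fit the high CPU usage), and more workers or cheaper preprocessing might help more than AMP would.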
Also refer to the Performance Tuning Guide. If possible, you could also try out the latest CUDA and cuDNN versions by building from source, which could yield additional performance improvements.