What kind of operations are you using and which cudnn version in particular?
If you are using cudnn 7.3 and later, convolutions should use TensorCores for FP16 inputs.
GEMMs (used e.g. in linear layers), however, have a size restriction: for matrix A x matrix B, where A has size [I, J] and B has size [J, K], I, J, and K must each be a multiple of 8 to use TensorCores. This requirement applies to all cublas and cudnn versions.
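As a sketch of how to satisfy that restriction, you can round each layer dimension up to the next multiple of 8 (`pad_to_multiple` here is just a hypothetical helper for illustration, not a PyTorch API):

```python
def pad_to_multiple(dim, multiple=8):
    """Round a dimension up to the next multiple (8 for FP16 TensorCore GEMMs)."""
    return ((dim + multiple - 1) // multiple) * multiple

# e.g. a linear layer with in_features=100 and out_features=30 could instead
# be created with 104 and 32 so its GEMM dimensions are TensorCore-eligible:
in_features = pad_to_multiple(100)   # 104
out_features = pad_to_multiple(30)   # 32
```

You would then either pad the input features (e.g. with zeros) or simply choose TensorCore-friendly sizes when designing the model.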
Also, could you try adding `torch.backends.cudnn.benchmark = True` at the beginning of your script?