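I don’t know which code you are using or referring to. If you are using conv layers, set torch.backends.cudnn.benchmark = True so that cuDNN can pick the fastest algorithm for your input shapes, add some warmup iterations so that this one-time selection is not included in the measurement, and profile the code again.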
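Since I don't have your code, here is a minimal sketch of what I mean. The small conv model, the input shape, and the `time_forward` helper are placeholders I made up for illustration, so swap in your own model and data. It falls back to the CPU if no GPU is available, in which case the cuDNN flag has no effect.

```python
import time

import torch
import torch.nn as nn

# Let cuDNN benchmark the available conv algorithms and cache the fastest one
# per input shape. This only helps if the input shapes stay (mostly) constant.
torch.backends.cudnn.benchmark = True

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for your model and input.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
).to(device).eval()
x = torch.randn(8, 3, 64, 64, device=device)


def sync():
    # CUDA kernels run asynchronously, so wait for them before reading the clock.
    if device == "cuda":
        torch.cuda.synchronize()


def time_forward(model, x, warmup=10, iters=50):
    """Return the average forward time in seconds, excluding warmup iterations."""
    with torch.no_grad():
        for _ in range(warmup):
            model(x)  # not timed: covers cuDNN algorithm selection, allocations, etc.
        sync()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        sync()
    return (time.perf_counter() - start) / iters


avg_s = time_forward(model, x)
print(f"avg forward time: {avg_s * 1e3:.3f} ms")
```

Without the synchronizations you would mostly measure the kernel launch overhead on the GPU and not the actual execution time.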
You could also profile the code using the PyTorch profiler or e.g. Nsight Systems to check for potential bottlenecks.
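A rough example of the PyTorch profiler usage is below. Again, the single conv layer and the input shape are just placeholders I picked, not your workload, and the iteration counts are arbitrary.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for your model and input.
model = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(device).eval()
x = torch.randn(8, 3, 64, 64, device=device)

# Always record CPU ops; also record CUDA kernels if a GPU is used.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    # Warmup outside of the profiled region.
    for _ in range(5):
        model(x)

    with profile(activities=activities, record_shapes=True) as prof:
        for _ in range(10):
            model(x)

# Aggregate the recorded events per operator and print the most expensive ones.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

The table should show which operators dominate the runtime. For Nsight Systems you would run the unmodified script through its CLI, e.g. via `nsys profile python your_script.py` (the script name is a placeholder), and inspect the timeline in the GUI.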