Recently, I have started working on larger images (brain scans), and I found a big slowdown going from size (1, 2, 224, 224, 224) to (1, 2, 256, 256, 256). I was expecting things to be slower, but not by this amount:
I think it could be that you’re missing the use_cuda flag (see the profiler docs).
This would mean that for all CUDA ops, you only measure the time to launch the kernel, not the time for it to actually run. And operations that create a synchronization point (like a copy to the CPU) will appear to have a large runtime just because they wait for the rest of the work to actually finish on the GPU.
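For example, something like this (a minimal sketch, not your actual network; the Conv3d layer and sizes are just placeholders) records the CUDA kernel times rather than only the launch times:

```python
# Minimal sketch: profile a 3D conv with use_cuda=True so GPU kernel times are
# recorded, not just the time to launch the kernels from the CPU.
import torch
import torch.nn as nn

conv = nn.Conv3d(2, 16, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 2, 256, 256, 256, device="cuda")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    y = conv(x)

torch.cuda.synchronize()  # make sure all GPU work has finished
print(prof.key_averages().table(sort_by="cuda_time_total"))
```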
I would say this is most likely because cuDNN’s default algorithm selection changes once a dimension goes above 255.
You can set torch.backends.cudnn.benchmark = True so that cuDNN benchmarks the available algorithms and picks the fastest one for your input size/hardware. That should smooth things out.
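A rough sketch of how you might check this (again with a placeholder conv layer, not your real model):

```python
# Minimal sketch: enable cuDNN autotuning and time the forward pass with
# explicit synchronization so we measure actual GPU runtime.
import time
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # let cuDNN pick the fastest algorithm per input shape

conv = nn.Conv3d(2, 16, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 2, 256, 256, 256, device="cuda")

# First call is slower: cuDNN tries the candidate algorithms and caches the winner.
conv(x)
torch.cuda.synchronize()

start = time.time()
conv(x)
torch.cuda.synchronize()  # wait for the GPU before reading the clock
print(f"forward: {time.time() - start:.4f}s")
```

Note that the benchmarking cost is paid again whenever the input shape changes, so this helps most when your sizes stay fixed across iterations.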