Does PyTorch or CUDA have any specific optimization or something?
No, it is not mandatory.
And powers of 2 are not particularly important either.
Maybe multiples of 32, which is the warp size (the number of threads a streaming multiprocessor runs in lockstep)? But even that depends a lot on how the CUDA kernel is implemented and, in general, won't lead to any significant difference.
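To make the 32 a bit more concrete, here is a rough sketch in plain Python (no GPU needed). It assumes a hypothetical kernel that assigns one thread per element, which is only one of many possible mappings. The helper name `warp_usage` is made up for illustration and is not part of PyTorch or CUDA.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_usage(n_threads, warp_size=WARP_SIZE):
    """Return (warps needed, idle lanes in the last warp) for n_threads.

    Assumes one thread per element, which a real kernel may not do.
    """
    warps = math.ceil(n_threads / warp_size)
    idle = warps * warp_size - n_threads
    return warps, idle


for n in (64, 100, 128):
    warps, idle = warp_usage(n)
    print(f"{n} threads -> {warps} warps, {idle} idle lanes")
```

Under that assumption, a size of 100 occupies the same number of warps as 128 and leaves some lanes idle in the last warp. Whether this shows up in wall-clock time depends on the kernel, so if it matters to you, time your own model with both sizes.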