Convolution slow compared to PyCUDA

I compared PyTorch's 1D and 2D convolutions against simple PyCUDA convolution kernels in a Colaboratory notebook on a K80 GPU. In these tests, PyTorch is significantly slower than PyCUDA, even though my PyCUDA kernels are naive implementations (everything in global memory, no shared-memory tiling, etc.). As far as I can tell I am measuring correctly: the PyTorch tensors are on the GPU, cuDNN benchmarking is enabled, and I synchronize before reading the timer, so the performance gap surprises me. I would be grateful if someone could look at the (quite short and simple) notebook and tell me whether I made a mistake:
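
For reference, this is the gist of my timing procedure (a minimal sketch, not the notebook code itself; the tensor sizes here are illustrative placeholders):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.backends.cudnn.benchmark = True  # let cuDNN search for the fastest algorithm

# Illustrative sizes, not the ones from my notebook
x = torch.randn(1, 1, 512, 512, device=device)
w = torch.randn(1, 1, 5, 5, device=device)

# Warm-up runs so cuDNN's algorithm search is excluded from the measurement
for _ in range(3):
    y = torch.nn.functional.conv2d(x, w)
if device == "cuda":
    torch.cuda.synchronize()

t0 = time.perf_counter()
y = torch.nn.functional.conv2d(x, w)
if device == "cuda":
    torch.cuda.synchronize()  # wait for the kernel to finish before stopping the clock
elapsed = time.perf_counter() - t0
print(f"conv2d on {device}: {elapsed * 1e3:.3f} ms, output shape {tuple(y.shape)}")
```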