PyTorch convolution slow compared to PyCUDA

I compared PyTorch's 1D and 2D convolutions against simple PyCUDA convolution kernels in a Colaboratory GPU (K80) notebook. In these tests, PyTorch is significantly slower than PyCUDA, even though my PyCUDA kernels are naive implementations (everything in global memory, no shared-memory tiling, etc.). As far as I can tell, the methodology is sound: the PyTorch tensors are on the GPU, cuDNN benchmarking is enabled, and I synchronize before measuring the runtime, so the performance gap surprises me. I would be grateful if someone could look at the (quite short and simple) notebook and tell me whether I made a mistake: https://colab.research.google.com/drive/1oF9wH0UDGqWanmZ2B04YH8U0RxEydR9_
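For context, this is roughly what my timing procedure for the PyTorch side looks like (a minimal sketch: the tensor sizes and kernel size here are placeholders, not necessarily the ones in the notebook):

```python
import time
import torch
import torch.nn.functional as F

# Let cuDNN autotune and cache the fastest conv algorithm for these shapes.
torch.backends.cudnn.benchmark = True

# Fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 1, 1024, 1024, device=device)  # placeholder input size
w = torch.randn(1, 1, 5, 5, device=device)        # placeholder 5x5 kernel

# Warm-up: the first calls pay cuDNN's algorithm-selection cost,
# so they are excluded from the measurement.
for _ in range(3):
    y = F.conv2d(x, w)

if device == "cuda":
    torch.cuda.synchronize()  # drain pending work before starting the clock
t0 = time.perf_counter()
y = F.conv2d(x, w)
if device == "cuda":
    torch.cuda.synchronize()  # wait for the kernel to finish before reading the clock
elapsed = time.perf_counter() - t0

print(tuple(y.shape), f"{elapsed * 1e3:.3f} ms")
```

If this matches what the notebook does and the gap remains, I'm not sure what else to check on the PyTorch side.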