You can hack up a conv operation to do this.
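If you want patches of, say, 8x8, just do an 8x8x64 convolution with zero padding, where each of the 64 kernels is all zeros except for a single 1, and the 1 sits at a different position in each kernel. After you do this, the 1x1x64 vector at each output location holds the 64 values of the 8x8 patch at that location.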
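Here is a minimal sketch of the one-hot kernel idea in plain NumPy, so it runs without any deep-learning framework. The helper `correlate_valid` is a hypothetical stand-in for your framework's conv op (it computes cross-correlation, which is what most frameworks call "convolution"), and it assumes a single-channel image with no padding, so only fully interior patches come out.

```python
import numpy as np

def correlate_valid(image, kernels):
    """Cross-correlate a 2-D image with a bank of kernels, without padding.

    image:   (H, W)
    kernels: (C, kh, kw)
    returns: (C, H - kh + 1, W - kw + 1)
    """
    num_kernels, kh, kw = kernels.shape
    height, width = image.shape
    out_h, out_w = height - kh + 1, width - kw + 1
    out = np.zeros((num_kernels, out_h, out_w), dtype=image.dtype)
    # Accumulate one kernel tap at a time, vectorised over all output positions.
    for u in range(kh):
        for v in range(kw):
            weights = kernels[:, u, v][:, None, None]
            out += weights * image[u:u + out_h, v:v + out_w]
    return out

patch = 8
# 64 kernels of size 8x8; kernel c has a single 1 at flat position c.
kernels = np.eye(patch * patch).reshape(patch * patch, patch, patch)

# Toy 12x12 image (an assumption made only for this demonstration).
image = np.arange(12 * 12, dtype=np.float64).reshape(12, 12)

features = correlate_valid(image, kernels)  # shape (64, 5, 5)

# The 64 channels at output location (2, 3) are the 8x8 patch starting there.
patch_at_2_3 = features[:, 2, 3].reshape(patch, patch)
```

In a real framework you would pass the same one-hot kernel bank as the weights of its conv op. The layout of the weight tensor differs between libraries, so check it before copying the reshape above.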
This is probably very inefficient, since the kernels are maximally sparse, but unless you intend to do it iteratively it shouldn't be noticeable, and it should probably still be faster than any loop you can cook up. With some extra effort you can make a better array-programming solution, probably something involving reshapes and permutes.
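For the reshape-and-permute route, here is a rough sketch in NumPy. It assumes you want non-overlapping patches and that the image height and width are multiples of the patch size. The function name `tile_patches` is made up for illustration.

```python
import numpy as np

def tile_patches(image, patch):
    """Split a 2-D image into non-overlapping patch x patch tiles.

    image:   (H, W), with H and W divisible by `patch`
    returns: (H // patch, W // patch, patch, patch)
    """
    height, width = image.shape
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    # Split each axis into (tile index, offset inside tile) ...
    blocks = image.reshape(height // patch, patch, width // patch, patch)
    # ... then bring the two tile indices to the front.
    return blocks.transpose(0, 2, 1, 3)

image = np.arange(16 * 24).reshape(16, 24)
tiles = tile_patches(image, 8)  # shape (2, 3, 8, 8)
```

Here `tiles[r, c]` is the block `image[8*r:8*r+8, 8*c:8*c+8]`. If you need overlapping patches at every position, as the conv version produces, `numpy.lib.stride_tricks.sliding_window_view` is likely the closer match.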