How to implement a kernel-wise convolution

OK, I found the answer I needed. Since I cannot delete this topic, I'll just paste the answer here.