Hi,
Yes, exactly. That is why kernel_size only takes (h, w) and no channel dimension: the depth of each kernel always has to match the in_channels of the Conv2d layer, so it is filled in for you. And you are right again, out_channels is the number of filters, each of size (in_channels, h, w), and their outputs are stacked together along the channel dimension at the end.
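If it helps, you can check this by looking at the shapes directly. This is just a small sketch: the sizes (3 input channels, 8 filters, a 5x5 kernel, a 32x32 input) are made up for illustration, and it assumes the defaults of stride 1, no padding and groups=1.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
in_channels, out_channels = 3, 8
kernel_h, kernel_w = 5, 5

# kernel_size only gets (h, w); the channel depth comes from in_channels.
conv = nn.Conv2d(in_channels, out_channels, kernel_size=(kernel_h, kernel_w))

# One filter per output channel, each spanning all input channels:
# (out_channels, in_channels, kernel_h, kernel_w)
weight_shape = tuple(conv.weight.shape)
print(weight_shape)

# Each filter produces one feature map; the maps are stacked along dim 1.
x = torch.randn(1, in_channels, 32, 32)
output_shape = tuple(conv(x).shape)
print(output_shape)
```

The weight tensor should come out as (8, 3, 5, 5), i.e. 8 filters of size (3, 5, 5), and the output should have 8 channels, one per filter.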
Best
