Well, not really. Currently you are using a signal of shape [32, 100, 1], which corresponds to [batch_size, in_channels, len].
Each kernel in your conv layer creates an output channel, as @krishnavishalv explained, and convolves the “temporal dimension”, i.e. the len dimension.
Since len is set to 1 in your case, there won’t be much to convolve, as you basically passed a single time step with 100 channels.
Try to think about your signal as a sound source. In a simple use case you would have 2 channels (left and right) and a certain length, e.g. 1000 time steps. Your input would thus have the shape [batch_size, 2, 1000].
Now if you set up a conv layer, you would have to use in_channels=2 and an arbitrary number of out_channels. Remember, out_channels just defines the number of kernels. Each kernel is applied separately to the input, across all input channels.
The kernel size defines how much of the temporal dimension is used in a sliding-window fashion.
E.g. if you set kernel_size=5, 5 consecutive time steps will be used for the convolution at each position.
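As a rough sketch of the stereo example, here is how the shapes work out in PyTorch. The batch size of 8 and out_channels=16 are arbitrary values picked for illustration, and the layer uses the default stride and no padding.

```python
import torch
import torch.nn as nn

batch_size = 8  # arbitrary value chosen for this sketch

# Stereo signal: [batch_size, in_channels=2, len=1000]
x = torch.randn(batch_size, 2, 1000)

# 16 kernels (out_channels=16 is arbitrary), each covering 5 time steps
# of both input channels
conv = nn.Conv1d(in_channels=2, out_channels=16, kernel_size=5)
out = conv(x)

print(conv.weight.shape)  # [out_channels, in_channels, kernel_size]
print(out.shape)          # [batch_size, out_channels, len - kernel_size + 1]
```

With stride 1 and no padding, the window fits at 1000 - 5 + 1 = 996 positions, so the output has shape [8, 16, 996].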
In your use case, however, there is only a single time step, so you could just as well use a linear layer instead.
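To check this, here is a minimal sketch comparing a conv layer with a linear layer on an input of shape [32, 100, 1]. It assumes kernel_size=1 (the only size that fits a length of 1 without padding), and out_features=10 is a made-up value, since your actual layer settings aren't shown here.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 100, 1)  # [batch_size, in_channels, len]

out_features = 10  # hypothetical value, not taken from the original model
conv = nn.Conv1d(in_channels=100, out_channels=out_features, kernel_size=1)
linear = nn.Linear(100, out_features)

# Copy the conv parameters into the linear layer so both compute the same map
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1))  # [10, 100, 1] -> [10, 100]
    linear.bias.copy_(conv.bias)

out_conv = conv(x).squeeze(-1)      # [32, 10]
out_linear = linear(x.squeeze(-1))  # [32, 10]

# True if both layers agree up to floating point tolerance
same = torch.allclose(out_conv, out_linear, atol=1e-5)
print(same)
```

Both layers compute a weighted sum over the 100 channels plus a bias, so with a single time step there is nothing left for the sliding window to do.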
CS231n explains this concept really well.