Hi, I am trying to figure out how nn.Conv1d processes an input for a specific example related to audio processing in a WaveNet model. I have input data of shape (1,1,8820), which passes through an input layer (1,16,1) to output a shape of (1,16,8820). That part I understand, because you can just multiply the two matrices. The next layer is a Conv1d with kernel size=3, input channels=16, output channels=16, so the state dict shows a weight matrix of shape (16,16,3). When the input of (1,16,8820) goes through that layer, the result is another (1,16,8820). What multiplication steps occur within the layer to apply the weights to the audio data? In other words, if I wanted to apply the layer (forward calculations only) using only numpy for this example, how would I do that?
Hi @Keith72, this is what PyTorch's conv1d actually does in your case:
    import torch
    import torch.nn.functional as F

    x = torch.rand(1, 16, 8820)
    weight = torch.rand(16, 16, 3)

    # first pad zeros along the time dimension
    x = F.pad(x, [1, 1])  # shape = (1, 16, 8822)
    # unfold, so you have 8820 moving windows with size = (16, 3)
    x = x.unfold(2, 3, 1)  # shape = (1, 16, 8820, 3)
    # matrix multiplication, I use tensordot for simplicity
    y = torch.tensordot(x, weight, dims=([1, 3], [1, 2]))  # shape = (1, 8820, 16)
    # move the channel axis back in front of the time axis
    y = y.transpose(1, 2)  # shape = (1, 16, 8820)
In numpy you can simply replace tensordot with the corresponding numpy function; for unfold you can use a strided (sliding-window) view of the array.
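For reference, here is a minimal numpy-only sketch of the same steps. It assumes `numpy.lib.stride_tricks.sliding_window_view` (numpy 1.20 or newer) as the stand-in for unfold, which may not be the function originally linked above. The function name `conv1d_same_k3` and the optional `bias` argument are my own additions for illustration; the thread itself only discusses the weights.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv1d_same_k3(x, weight, bias=None):
    """Forward pass of a kernel-size-3, stride-1, zero-padded 1-D convolution.

    x:      (batch, in_channels, time)
    weight: (out_channels, in_channels, 3), same layout as the state dict
    bias:   optional (out_channels,), an assumption not covered in the thread
    """
    # Zero-pad one sample on each side of the time axis: (batch, in_channels, time + 2)
    xp = np.pad(x, ((0, 0), (0, 0), (1, 1)))
    # Moving windows along time, like unfold(2, 3, 1): (batch, in_channels, time, 3)
    windows = sliding_window_view(xp, window_shape=3, axis=2)
    # Contract the in_channels and kernel axes: (batch, time, out_channels)
    y = np.tensordot(windows, weight, axes=([1, 3], [1, 2]))
    # Put channels back in front of time: (batch, out_channels, time)
    y = y.transpose(0, 2, 1)
    if bias is not None:
        y = y + bias[None, :, None]
    return y
```

With `x` of shape (1, 16, 8820) and `weight` of shape (16, 16, 3), `conv1d_same_k3(x, weight)` should come back with shape (1, 16, 8820). Note that tensordot puts the output-channel axis last, which is why the transpose is needed.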
Thanks for the quick response! My initial implementation seems to fit using your steps, except the last step gave me a shape (16,1,8820), so I just swapped the first two dimensions. Now if I wanted to account for layer dilation, how would that work?
That can be achieved easily by using indexing:
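    # pad by `dilation` samples on each side of the time dimension
    x = torch.nn.functional.pad(x, [1 * dilation] * 2)
    # take windows of size 2 * dilation + 1, then keep every `dilation`-th element
    x = x.unfold(2, 2 * dilation + 1, 1)[..., ::dilation]  # shape = (1, 16, 8820, 3)
    ...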
In numpy you can alter the strides of the ndarray to do dilated convolution. Here's an example implementation; you can check it for details.
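In case it is useful, the stride idea can be sketched roughly as follows. This is my own minimal version and not the linked implementation: it assumes kernel size 3 and uses `numpy.lib.stride_tricks.as_strided`, and the function name `dilated_conv1d_k3` is made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def dilated_conv1d_k3(x, weight, dilation=1):
    """Kernel-size-3 dilated convolution (forward only), output length == input length.

    x:      (batch, in_channels, time)
    weight: (out_channels, in_channels, 3)
    """
    batch, in_ch, time = x.shape
    # Pad `dilation` zeros on each side of the time axis
    xp = np.pad(x, ((0, 0), (0, 0), (dilation, dilation)))
    s_b, s_c, s_t = xp.strides
    # Read-only view: window t holds xp[..., t], xp[..., t + d], xp[..., t + 2d].
    # The last stride is the time stride multiplied by the dilation.
    windows = as_strided(
        xp,
        shape=(batch, in_ch, time, 3),
        strides=(s_b, s_c, s_t, s_t * dilation),
        writeable=False,
    )
    # Contract in_channels and kernel axes: (batch, time, out_channels)
    y = np.tensordot(windows, weight, axes=([1, 3], [1, 2]))
    # Back to (batch, out_channels, time)
    return y.transpose(0, 2, 1)
```

With `dilation=1` this reduces to the plain kernel-size-3 case. `as_strided` does no bounds checking, so the shape and strides have to match the padding exactly; here the largest index touched is `time - 1 + 2 * dilation`, the last element of the padded array.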
The example helps a lot, and thank you for taking the time to explain all that.