Trying to implement a NN described in a paper. Don't understand what is meant by size


This is the first time I’m trying to implement a NN which is described in a paper. So far I just did my own thing by trial and error. I don’t fully understand how I should read the table found here:

Note that the table only contains the first few layers.

Now I want to use torch.nn.Conv1d(in_channels, out_channels, kernel_size,...) for that but I have no idea how to read the table in the image. It is clear that it is a sequential NN but what exactly is meant by size? It says the size of the 2nd layer (conv1d) is 32x40945 and the size of the 3th layer (pooling) is 32x10236. What is now in_channels, out_channels and kernel_size for the 2nd layer exactly? Note that in and out channels are integers.


It seems to be the size of the input tensor.
so it starts being a vector of 40960 elements.
Then becomes a matrix of 1x40960.
Then theyh apply a 1d conv with inchannels=1 and out_channels=32 and kernel=16
(Still if I were you I would use 15 as kernels are better to be odd.I

First a bit more context. The input is a timeseries and we want to do signal processing. We sample the signal for 4096 Hz and apparently 10 seconds. Thus the input size of 40960.

Now the size in the 2nd layer (conv1d) is 32 x 40945

What exactly describes 40945 and how do I get this number?

For me, a CNN takes an array (1d) or an image (2d) and apply some stencil/kernel to it, by doing so reduced the dimension. I can’t see why we have a difference of 15 here (40945 vs 40960).

Edit: How do you see the kernel is of size 16?

I just see it by expertise :slight_smile: but maths can be used too!

Basically when you apply a convolution the resulting signal 's size is gonna be reduced unless you pad the input. (You can google about it).
In this case the size of the output is 40945. Since it’s more or less the same size as the input, we can know the stride=1. For stride 2 the signal size would be halved.
If you check the docs, you have the mathematical expression to calculate the size:

In this case we can see the following:

  • Input signal’s size ~= output’s size → stride=1
  • Dilation=1 as they say nothing about it. They could use diletion but, if that were the case, they could just downsample the signal beforehand.
  • Padding=0 as the size of the input is similar to the size of the output. Still they could pad and use a bigger kernel size.
    Then if we apply the formula we have 40960 - 2*(16-1)+1-1= 490945.

So in short 40945 describes the number of elements in the time series after you apply the convolution.
32 describes the amount of filters used.

Okay, thanks a lot. So the size described in the image is basically C_out x L_out.

I guess that makes sense. Still think it’s a bad way of representing your NN thought.

Maybe a follow-up question. Nothing here takes into account the batch size right? (I have another thread where I try to figure out the implementation side of this)

Thanks again!

Yep, neural networks are usually described by the kernel size, padding, stride and so on… Such a strange work the one you found.

All the pytorch layers expect a batch dimension. If you have a single 1d sample, it has to be of shape NxCxT.
1D signals–> NxCxT usually c=1 for mono audio or time series, c=2 for dual audio
2D signals → NxCxHxW usuall c=1 for b&w images, c=3 for color…
fully conected (linear layers)–> Nx*xT It expects the first dimension to be batch, then you can have as many dimensions as you want (it will recursively apply the fc among them all like in a nested for loop)

So basically the, for a 1D signal, if you apply a convolution to a batched tensor NxCxT
it’s equivalent to do a for loop from n=0 to n=N. They are independent.

In the papers you don’t describe the batch dimension, still both pytorch and tensorflow use it in practice.

thanks a lot. I’ll make a new thread about how I implement the CNN with variable batch sizes because something I do is wrong and I just can’t find it in the docs.

Just as a note: the 2* is wrong, should be 1*.