Giving stride, padding, and kernel_size to the convolution

Hello,
I want to code my CRNN, and I am lost on how to choose the stride, padding, and kernel_size for nn.Conv2d.
Given an image of 224x224, how would I calculate the padding, stride, and kernel_size for each convolution layer in such a way that makes it possible to feed the result to an RNN?
Thanks.

nn.RNN expects an input tensor in the shape [seq_len, batch_size, features] in the default setup as described in the docs.
The convolution arguments are generally independent from this shape, as long as you know how you would like to reshape the last conv output.
Note that a conv layer will output an activation in the shape [batch_size, nb_filters, height, width], which you would have to reshape to the expected input shape of the RNN.
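For the spatial sizes themselves, each conv layer follows out = floor((in + 2*padding - kernel_size) / stride) + 1 per dimension. A minimal sketch of checking this (the layer settings here are just an illustration, not a recommendation):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
x = torch.randn(4, 3, 224, 224)  # [batch_size, channels, height, width]
out = conv(x)                    # (224 + 2*1 - 3) // 2 + 1 = 112
print(out.shape)                 # torch.Size([4, 16, 112, 112])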

@ptrblck thanks.
So, how would I reshape it for feeding it to the RNN? I mean, when the image is not sequenced, how would I make it sequenced in such a way that the sequence length is 1? Because if I reshape without considering the size of the tensor, I will get an error.
Thanks.

If you don’t want to create a “temporal” dimension, you could reshape the activation such that the sequence length would be 1, the batch size would stay the same, and all other activation values would be the features via:

out = ...  # assuming it has the shape [batch_size, nb_filters, height, width]
out = out.view(1, out.size(0), -1)  # [1, batch_size, nb_filters*height*width]
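As a runnable sanity check (the conv output here is simulated with a random tensor; the sizes are hypothetical):

import torch

out = torch.randn(4, 16, 112, 112)  # stand-in for a conv output [batch_size, nb_filters, height, width]
out = out.view(1, out.size(0), -1)  # [1, batch_size, nb_filters*height*width]
print(out.shape)                    # torch.Size([1, 4, 200704])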

However, without a temporal dimension, I’m not sure if an RNN would make much sense.
Could you explain your use case a bit?

I have some CAPTCHA images (which I generated myself), and I want to recognize the text inside them to experiment with simple OCR use cases.

In that case you could also try to pass the spatial dimensions (height*width) as the temporal dimension and the channels as the feature dimension.
I'm still unsure how well this would work, but it might be a reasonable approach; a minimal sketch follows.
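Something like this (the tensor sizes here are hypothetical):

import torch

out = torch.randn(8, 64, 7, 7)  # hypothetical conv output [batch_size, channels, height, width]
seq = out.flatten(2)            # [batch_size, channels, height*width]
seq = seq.permute(2, 0, 1)      # [height*width, batch_size, channels] = [seq_len, batch_size, features]
print(seq.shape)                # torch.Size([49, 8, 64])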

So, the shape for the RNN should be [channels, batch_size, width*height]?

No, in the default setup, this shape is expected: [seq_len, batch_size, features].

If you want to use the pixels as the temporal dimension, you would have to permute them to dim0.
Note that you could also set batch_first=True during the creation of the RNN, which would then expect the input in the shape [batch_size, seq_len, features].
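For example, a minimal sketch with batch_first=True (the sizes are hypothetical):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(8, 49, 64)  # [batch_size, seq_len, features]
output, h_n = rnn(x)
print(output.shape)         # torch.Size([8, 49, 128])
print(h_n.shape)            # torch.Size([1, 8, 128])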