I am currently studying SwishNet, a model for audio segmentation.

In that paper, they use strided convolutions and residual connections. The following image is from https://arxiv.org/abs/1812.00149.

After passing through a conv layer with stride=2, the output length is half of the input length.
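To make the halving concrete, here is a minimal sketch of the standard output-length formula for a 1-D convolution. The kernel size of 3 and padding of 1 are my own assumptions for illustration, not values taken from the paper, and `conv1d_output_length` is just a helper name I made up.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution along the time axis."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Assumed values for illustration only (not from the paper).
input_length = 100
output_length = conv1d_output_length(input_length, kernel_size=3, stride=2, padding=1)

print(input_length, output_length)  # 100 50
```

With these settings the residual branch would carry 100 time steps while the conv branch carries 50, which is the mismatch I am asking about.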

Here is my question:

How can the output be merged with the input (the residual connection) when their array dimensions no longer match?

G.A is just the gated activation function, so it doesn't affect the output dimension.
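As a quick check of that claim, here is a small sketch assuming G.A is the usual tanh-times-sigmoid gate applied element-wise to two equal-length branches. That is my reading of the figure, not something I have confirmed against the authors' code, and the names `gated_activation`, `filter_out` and `gate_out` are hypothetical.

```python
import math


def gated_activation(filter_out, gate_out):
    """Element-wise tanh(filter) * sigmoid(gate) over two equal-length sequences."""
    if len(filter_out) != len(gate_out):
        raise ValueError("both branches must have the same length")
    return [
        math.tanh(f) * (1.0 / (1.0 + math.exp(-g)))
        for f, g in zip(filter_out, gate_out)
    ]


# Toy values, chosen only to show that the length is preserved.
filter_out = [0.5, -1.0, 2.0, 0.0]
gate_out = [1.0, 0.0, -1.0, 3.0]
gated = gated_activation(filter_out, gate_out)

print(len(filter_out), len(gated))  # 4 4
```

Because the gate acts element by element, any change in length has to come from the strided convolution, not from G.A.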