Upsample+Conv2d vs ConvTranspose2d

Hi everyone,

There have been topics about the difference between torch.nn.Upsample and torch.nn.ConvTranspose2d, but I haven't seen anyone discuss the difference between:

1. torch.nn.Upsample + torch.nn.Conv2d
2. torch.nn.ConvTranspose2d

I am asking because I have seen U-Net implementations using 1 and GAN/autoencoder implementations using 2, but I don't really know why.
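For concreteness, here is a minimal sketch of the two alternatives (the channel counts and kernel sizes are just placeholder choices on my side):

```python
import torch.nn as nn

# 1: upsample (no parameters), then learn a conv at the higher resolution
#    (3x3 with padding 1 is a common U-Net-style choice).
up_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)

# 2: a single transposed convolution that learns to upsample directly.
deconv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
```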

If you have any hints, I would really appreciate it.


Hi Max!

Upsample plus Conv2d and ConvTranspose2d do similar
things, but they differ in some important details.

Use Upsample (without Conv2d) if you want cheaper upsampling,
but without trainable parameters, and use ConvTranspose2d if you
want the trainable parameters.
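For example (a quick sketch; the channel counts are arbitrary):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Upsample has nothing to train; ConvTranspose2d carries a full
# weight tensor (64 * 32 * 2 * 2) plus a bias (32).
print(n_params(nn.Upsample(scale_factor=2)))                          # 0
print(n_params(nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)))  # 8224
```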

Although you could add Conv2d to Upsample to get trainable
parameters and similar functionality to ConvTranspose2d, you
wouldn’t get the full benefit.

Consider ConvTranspose2d with a 2x2 kernel and a stride of 2. It, in
effect, duplicates each pixel to form a 2x2 block of identical pixels,
and then applies each of the 4 in_channels x out_channels slices in
its 2x2 height x width spatial array as a single-pixel (pointwise, 1x1)
convolution, one slice per pixel in the 2x2 block.
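You can see this by pushing a single "hot" pixel through a
ConvTranspose2d (a sketch with one channel; the kernel values are
whatever the random initialization produced):

```python
import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)

x = torch.zeros(1, 1, 2, 2)
x[0, 0, 0, 0] = 1.0  # a single "hot" pixel

print(deconv.weight.squeeze())  # the 2x2 kernel
print(deconv(x).squeeze())      # the top-left 2x2 block reproduces the
                                # kernel: each output pixel gets its own
                                # trainable weight
```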

If, instead, you use Upsample to explicitly create that 2x2 block of
identical pixels (or non-identical, but similar, interpolated pixels),
and then apply a Conv2d with a 2x2 kernel, its 4 in_channels x
out_channels pointwise convolutions get summed together, because they
are applied to (and summed across) the 4 identical pixels in the 2x2
block (or blended together, if Upsample used interpolation).
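You can see the summing concretely by aligning the 2x2 conv windows
with the duplicated blocks (nearest upsampling followed by a stride-2
conv; the stride-2 conv undoes the upsampling here, and the alignment
is chosen purely to expose the collapse). The whole pipeline reduces to
a 1x1 convolution whose weights are the sums of the 2x2 weights (a
sketch, with arbitrary channel counts):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 3, 4, 4)

up = nn.Upsample(scale_factor=2, mode='nearest')
conv = nn.Conv2d(3, 5, kernel_size=2, stride=2, bias=False)

# The equivalent 1x1 conv: the four spatial weights collapse to their sum.
conv1x1 = nn.Conv2d(3, 5, kernel_size=1, bias=False)
with torch.no_grad():
    conv1x1.weight.copy_(conv.weight.sum(dim=(2, 3), keepdim=True))

print(torch.allclose(conv(up(x)), conv1x1(x), atol=1e-6))  # True
```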

So you have the same number of parameters and the same training cost
as with ConvTranspose2d, but you've thrown away (summed together)
three-fourths of that expressiveness and trainability.

Best.

K. Frank


Hi K. Frank,

Thanks for the fast and clear response.

If I understand correctly, the two approaches can do roughly the same thing. But to get an intuition of what's different:

  • With Upsample and Conv2d you mix pixels together along the spatial dimensions.
  • Whereas if you only use ConvTranspose2d with a (2,2) kernel and a (2,2) stride, the pixels stay separated along the spatial dimensions. (This would not be the case for a kernel size larger than (2,2), but even then the pixels would not be mixed as much as with Upsample and Conv2d.) A quick check of this is sketched below.
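A quick gradient check of that intuition (a sketch; the 3x3 conv with padding 1 stands in for the usual U-Net-style choice, and counting non-zero gradients shows how many input pixels one output pixel depends on):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4, requires_grad=True)

# ConvTranspose2d, (2,2) kernel, (2,2) stride: each output pixel
# depends on exactly one input pixel.
deconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
deconv(x)[0, 0, 2, 2].backward()
print((x.grad != 0).sum().item())  # 1

x.grad = None

# Upsample + Conv2d (3x3, padding 1): output pixels blend neighbors.
up = nn.Upsample(scale_factor=2, mode='nearest')
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
conv(up(x))[0, 0, 2, 2].backward()
print((x.grad != 0).sum().item())  # 4
```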

Have a good day,
Max