Question about transposed convolution

I’m currently reading the paper Visualizing and Understanding Convolutional Networks, known as ZFNet.
In this paper, deconvolution is done by simply transposing the original filter used by the convolution layer and then convolving with it.

  1. Some people have implemented the above paper with PyTorch’s nn.ConvTranspose2d, but is that the right function? The PyTorch documentation says:

This module can be seen as the gradient of Conv2d with respect to its input. It is also known as a fractionally-strided convolution or a deconvolution (although it is not an actual deconvolution operation).

If it’s the right choice, then is the gradient of Conv2d the same as convolution with the transposed filter? (See the small check after this list.)

  2. Is there any suggested paper for understanding the ConvTranspose2d module? I could understand what this function does from this animation, but I don’t fully understand why this module is used as a (pseudo-)inverse of convolution, especially the weight-sharing part. It doesn’t seem to be the mathematical inverse of convolution, so there should be some reason for choosing this module. Was it chosen purely as a heuristic, or is there another reason? (See the matrix sketch at the end of this post.)

  3. In another version of the same paper, the authors state:

The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

Do RBMs use this kind of technique (transposed weights) in their operation?
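
To make question 1 (and the “flipping” sentence in the quote above) concrete, here is a small check I put together. It is only a sketch with shapes I picked arbitrarily (single conv, stride 1, no padding, no bias), not the paper’s actual code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 3, 8, 8, requires_grad=True)  # input to Conv2d (shapes picked arbitrarily)
w = torch.randn(5, 3, 3, 3)                      # Conv2d weight: (out_ch, in_ch, kH, kW)

z = F.conv2d(x, w)       # forward convolution, stride 1, no padding
y = torch.randn_like(z)  # pretend upstream gradient / feature map to "deconvolve"

# (a) the actual gradient of Conv2d with respect to its input
(g_autograd,) = torch.autograd.grad(z, x, grad_outputs=y)

# (b) ConvTranspose2d applied to y with the *same* weight tensor
g_transposed = F.conv_transpose2d(y, w)

# (c) plain convolution with each filter flipped vertically and horizontally,
#     in/out channel axes swapped, and "full" padding (kernel_size - 1)
g_flipped = F.conv2d(y, w.transpose(0, 1).flip(-2, -1), padding=2)

print(torch.allclose(g_autograd, g_transposed, atol=1e-5))  # True
print(torch.allclose(g_autograd, g_flipped, atol=1e-5))     # True
```

At least in this stride-1 case all three give the same tensor, which is what made me think ConvTranspose2d, the Conv2d input-gradient, and “convolve with the flipped filters” are the same operation.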

I found that [1603.07285] A guide to convolution arithmetic for deep learning mentions questions 1 and 2:

Finally, so-called transposed convolutional layers (also known as fractionally strided convolutional layers) have been employed in more and more work as of late (Zeiler et al., 2011; Zeiler and Fergus, 2014; Long et al., 2015; Radford et al., 2015; Visin et al., 2015; Im et al., 2016), and their relationship with convolutional layers has been explained with various degrees of clarity.