Pytorch equivalent of tensorflow conv2d_transpose filter tensor

The Pytorch docs give the following definition of a 2d convolutional transpose layer:

torch.nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True, dilation=1)

Tensorflow’s conv2d_transpose layer instead uses filter, which is a 4d Tensor of [height, width, output_channels, in_channels]. I’ve seen it used in networks with structures like the following:

4 × 4 × 1024 → 8 × 8 × 1024 → 16 × 16 × 512 → 32 × 32 × 256 → 64 × 64 × 128 → 128 × 128 × 64 → 256 × 256 × 3 (used for the generator of a DCGAN).

Can I use kernel_size as a substitute for filter (e.g. nn.ConvTranspose2d(256, 128, 64, 1)) in my model?

Thanks in advance!

If you would like to use a similar approach, you could use the functional API.
Using it you can pass the weight explicitly, with dimensions [in_channels, out_channels, kernel_height, kernel_width]:

import torch.nn.functional as F

# weight should have shape [in_channels, out_channels, kernel_height, kernel_width]
weight = ...
output = F.conv_transpose2d(input, weight)
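
For instance, a minimal sketch (the shapes here are made up, just to show the layout):

import torch
import torch.nn.functional as F

# Hypothetical shapes: weight is laid out as [in_channels, out_channels, kernel_height, kernel_width]
x = torch.randn(1, 256, 32, 32)
weight = torch.randn(256, 128, 4, 4)
out = F.conv_transpose2d(x, weight, stride=2, padding=1)
print(out.shape)  # torch.Size([1, 128, 64, 64])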

If you don’t need the functional API, you could use the Module you’ve already mentioned.
kernel_size is just one of the arguments you have to specify, so it’s not a substitute for filter.
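
For example (shapes made up), a Tensorflow filter of shape [height, width, output_channels, in_channels] = [4, 4, 128, 256] would roughly correspond to:

import torch.nn as nn

layer = nn.ConvTranspose2d(in_channels=256, out_channels=128, kernel_size=4)
print(layer.weight.shape)  # torch.Size([256, 128, 4, 4]), i.e. [in_channels, out_channels, kH, kW]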

Have a look at the DCGAN example to see how it’s used.

Thanks! I have been using Module, but I’m not sure how to recreate the following network structure with it, which is why I wondered about filter.

This is the GAN generator (taken from the Creative Adversarial Network paper):

" z ∈ R100 normally sampled from 0 to 1 is up-sampled to a 4× spatial extent convolutional representation with 2048 feature maps resulting in a 4 × 4 × 2048 tensor. Then a series of four fractionally-stride convolutions (in some papers, wrongly called deconvolutions). Finally, convert this high level representation into a 256 × 256 pixel image. In other words, starting from z ∈ R 100 → 4 × 4 × 1024 → 8 × 8 × 1024 → 16 × 16 × 512 →
32 × 32 × 256 → 64 × 64 × 128 → 128 × 128 × 64 → 256 × 256 × 3 (the generated image size)"

That’s the goal, although given hardware limitations I may have to change the above somehow to be compatible with 64x64 images.

Thanks again, and sorry for all the bother!

I’ve adapted the DCGAN code for your use case:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

nz = 100
x = Variable(torch.randn(1, nz, 1, 1))

class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()

        # 1x1 -> 4x4
        self.conv1 = nn.ConvTranspose2d(nz, 2048, 4, 1, 0, bias=False)
        self.bn1 = nn.BatchNorm2d(2048)
        # 4x4 -> 8x8
        self.conv2 = nn.ConvTranspose2d(2048, 1024, 4, 2, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(1024)
        # 8x8 -> 16x16
        self.conv3 = nn.ConvTranspose2d(1024, 512, 4, 2, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(512)
        # 16x16 -> 32x32
        self.conv4 = nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False)
        self.bn4 = nn.BatchNorm2d(256)
        # 32x32 -> 64x64
        self.conv5 = nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False)
        self.bn5 = nn.BatchNorm2d(128)
        # 64x64 -> 128x128
        self.conv6 = nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False)
        self.bn6 = nn.BatchNorm2d(64)
        # 128x128 -> 256x256
        self.conv7 = nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = F.relu(self.bn5(self.conv5(x)))
        x = F.relu(self.bn6(self.conv6(x)))
        x = torch.tanh(self.conv7(x))
        return x

model = Generator()
output = model(x)  # [1, 3, 256, 256]

I’ve set the channels to fixed numbers to match your example.
To make the model more flexible, you could pass in a base channel size and halve it in every layer, as sketched below.
Let me know if this works for you!
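
Something like this could work as a starting point (just a sketch; make_generator, base_channels and n_upsample are my own names, not part of the DCGAN example):

import torch
import torch.nn as nn

def make_generator(nz=100, base_channels=2048, n_upsample=6, out_channels=3):
    # 1x1 -> 4x4 with base_channels feature maps
    layers = [nn.ConvTranspose2d(nz, base_channels, 4, 1, 0, bias=False),
              nn.BatchNorm2d(base_channels),
              nn.ReLU(inplace=True)]
    channels = base_channels
    # Each iteration doubles the spatial size and halves the channels
    for _ in range(n_upsample - 1):
        layers += [nn.ConvTranspose2d(channels, channels // 2, 4, 2, 1, bias=False),
                   nn.BatchNorm2d(channels // 2),
                   nn.ReLU(inplace=True)]
        channels //= 2
    # Final upsampling layer maps to the image channels
    layers += [nn.ConvTranspose2d(channels, out_channels, 4, 2, 1, bias=False),
               nn.Tanh()]
    return nn.Sequential(*layers)

gen = make_generator()
print(gen(torch.randn(1, 100, 1, 1)).shape)  # torch.Size([1, 3, 256, 256])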

Once again, thanks so much for your willingness to help @ptrblck - it’s awesome.

I’m new to Pytorch and the forum helps loads - makes a huge difference to the user-friendliness of Pytorch.

It seems forward() returns torch.Size([64, 3, 256, 256]). These last two values seem iffy…

And unfortunately the above leads to problems when I forward a fake image to my discriminator.
I get RuntimeError: size mismatch, m1: [64 x 32768], m2: [2048 x 1024] .

The discriminator structure also follows the paper:

“The common body of convolution layers is composed of a series of six convolution layers (all with stride 2 and 1 pixel padding): conv1 (32 4 × 4 filters), conv2 (64 4 × 4 filters), conv3 (128 4 × 4 filters), conv4 (256 4 × 4 filters), conv5 (512 4 × 4 filters), conv6 (512 4 × 4 filters). Each convolutional layer is followed by a leaky rectified activation (LeakyReLU) in all the layers of the discriminator. After passing an image to the common conv D body, it will produce a feature map of size (4 × 4 × 512). The real/fake Dr head collapses the (4 × 4 × 512) by a fully connected layer to produce Dr(c|x) (probability of the image coming from the real image distribution). The multi-label probabilities Dc(ck|x) head is produced by passing the (4 × 4 × 512) into 3 fully connected layers of sizes 1024, 512, K, respectively, where K is the number of style classes.”

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        num_disc_filters = 64
        y_dim = 27
        channels = 3  # RGB input, matching the generator output
        self.conv = nn.Sequential(
            nn.Conv2d(channels, num_disc_filters // 2, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters // 2, num_disc_filters, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_disc_filters),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters, num_disc_filters * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_disc_filters * 2),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters * 2, num_disc_filters * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_disc_filters * 4),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(num_disc_filters * 4, num_disc_filters * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(num_disc_filters * 8),
            nn.LeakyReLU(0.2, inplace=True),

            # Inclusion of this last layer leads to the wonderful
            # RuntimeError: Calculated input size: (3 x 3). Kernel size: (4 x 4).
            # A problem for another day :(. Thus currently commented out
            # nn.Conv2d(num_disc_filters * 8, num_disc_filters * 8, 4, 2, 1, bias=False),
            # nn.BatchNorm2d(num_disc_filters * 8),
            # nn.LeakyReLU(0.2, inplace=True),
        )
        self.final_conv = nn.Conv2d(num_disc_filters * 8, 1, 4, 2, 1, bias=False)
        self.sig = nn.Sigmoid()
        self.fc = nn.Sequential()
        self.fc.add_module("linear_layer.{0}".format(num_disc_filters * 16), nn.Linear(num_disc_filters * 16 * 2, num_disc_filters * 16))
        self.fc.add_module("relu.{0}".format(num_disc_filters * 16), nn.LeakyReLU(0.2, inplace=True))
        self.fc.add_module("linear_layer.{0}".format(num_disc_filters * 8), nn.Linear(num_disc_filters * 16, num_disc_filters * 8))
        self.fc.add_module("relu.{0}".format(num_disc_filters * 8), nn.LeakyReLU(0.2, inplace=True))
        self.fc.add_module("linear_layer.{0}".format(num_disc_filters), nn.Linear(num_disc_filters * 8, y_dim))
        self.fc.add_module("relu.{0}".format(num_disc_filters), nn.LeakyReLU(0.2, inplace=True))
        self.fc.add_module("softmax", nn.Softmax())

    def forward(self, inp):
        x = self.conv.forward(inp)
        real = self.final_conv(x)  # its size is [64, 1, 1, 1]
        x = x.view(x.size(0), -1)  # flatten the feature map -> now has size [64, 2048]
        real_out = self.sig.forward(real)
        real_out = real_out.view(-1, 1).squeeze(1)
        style = self.fc.forward(x)  # THIS IS THE LINE THAT FAILS
        style = torch.mean(style, 1)  # mean of the style (class label) outputs to get a 64x1 tensor rather than 64 x y_dim
        return real_out, style

Any ideas? Apologies once again for bothering you so much!

I’m glad you like it here! :wink:

The returned generated image has dimensions [batch_size, channels, height, width].
As far as I know, the default format in Tensorflow stores the channels in the last position (NHWC), so the returned size should be alright.
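
Just as an illustration, if you ever want to compare a tensor with a Tensorflow one directly, you can permute between the two layouts:

import torch

img_nchw = torch.randn(64, 3, 256, 256)  # PyTorch layout: [N, C, H, W]
img_nhwc = img_nchw.permute(0, 2, 3, 1)  # Tensorflow's default layout: [N, H, W, C] -> [64, 256, 256, 3]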

Based on the error message, it seems you are flattening the tensor somewhere and feeding it into a Linear layer that expects a different size.
Could you print the shape of x in the forward method, just before passing it to self.fc?

Also, as a side note, you should call the Module directly instead of .forward(), since hooks might otherwise not be called properly.

Oh okay - so the width and height in that size being 256 is fine? Apologies if I’m a bit thick, but I would have thought 256 would be a relic of the original model using 256x256 images (and thus wrong for a 64x64 model).

x.size() returns torch.Size([64, 512, 2, 2]) before I flatten the tensor in this line: x = x.view(x.size(0), -1), which makes its shape [64, 2048].
I.e. just before passing it to self.fc in the line style = self.fc.forward(x), x.size() returns torch.Size([64, 2048]). Does this signal anything?

Also thanks for the tip about calling the Module directly, not sure why I didn’t!

Could you explain your concerns about the size?
The generator creates images of shape [3, 256, 256], so the discriminator should get the same shaped input to tell if the current image is a fake or real.
Are you somehow using a 64x64 model?

The size matches the layer, so that’s strange.
Your first error message says:

RuntimeError: size mismatch, m1: [64 x 32768], m2: [2048 x 1024]

which fits your assumption about the image size: since 256 = 4 * 64, each spatial dimension is 4 times larger, so dividing 32768 by 4 * 4 = 16 gives exactly the 2048 your first linear layer expects.
So back to the question: your generator should output [3, 256, 256] images and the discriminator should take these, right?
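
Just to make the numbers concrete (my own sanity check, assuming the five stride-2 conv layers you posted):

# Each conv (kernel 4, stride 2, padding 1) halves the spatial size,
# so five of them divide it by 2**5 = 32 before the flatten.
print((256 // 32) ** 2 * 512)  # 8 * 8 * 512 = 32768 -> what a 256x256 image produces
print((64 // 32) ** 2 * 512)   # 2 * 2 * 512 = 2048  -> what your first nn.Linear currently expects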

Ok, I’ve read through the paper and tried to change your code accordingly.

Let’s walk through the changes and see if it’s right.

The common body of convolution layers is composed of a series of six convolution layers. conv1 … to conv6

It seems in your architecture you used the “common body” only for the real/fake output and skipped the conv6 layer for the style output.
Let’s make sure both losses use all 6 conv layers.

After passing an image to the common conv D body, it will produce a feature map of size (4×4×512)

Now that we use all 6 conv layers we get the desired output.

The real/fake Dr head collapses the (4×4×512) by a fully connected layer to produce Dr(c|x) (probability of the image coming from the real image distribution)

I’ve added a self.real_fake_head as a nn.Linear layer to achieve this.
It’s defined as self.real_fake_head = nn.Linear(512*4*4, 1).

The multi-label probabilities Dc(ck|x) head is produced by passing the (4×4×512) into 3 fully connected layers of sizes 1024, 512, K, respectively, where K is the number of style classes.

I’ve changed the number of input features to 512*4*4 so that we can use these linear layers as in the paper.

Here is the full code for the discriminator. I’ve kept most of the code and changed just a bit:

import torch
import torch.nn as nn
from torch.autograd import Variable

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        num_disc_filters = 64
        y_dim = 27
        channels = 3
        self.conv = nn.Sequential(
                nn.Conv2d(channels, num_disc_filters//2, 4, 2, 1, bias=False),
                nn.LeakyReLU(0.2, inplace=True),
          
                nn.Conv2d(num_disc_filters//2, num_disc_filters, 4, 2, 1, bias=False),
                nn.BatchNorm2d(num_disc_filters),
                nn.LeakyReLU(0.2, inplace=True),
         
                nn.Conv2d(num_disc_filters, num_disc_filters * 2, 4, 2, 1, bias=False),
                nn.BatchNorm2d(num_disc_filters * 2),
                nn.LeakyReLU(0.2, inplace=True),
           
                nn.Conv2d(num_disc_filters * 2, num_disc_filters * 4, 4, 2, 1, bias=False),
                nn.BatchNorm2d(num_disc_filters * 4),
                nn.LeakyReLU(0.2, inplace=True),
    
                nn.Conv2d(num_disc_filters * 4, num_disc_filters * 8, 4, 2, 1, bias=False),
                nn.BatchNorm2d(num_disc_filters * 8),
                nn.LeakyReLU(0.2, inplace=True),
    
                #Inclusion of this last layer leads to the wonderful
                # RuntimeError: Calculated input size: (3 x 3). Kernel size: (4 x 4). 
                # a problem for another day :(. Thus currently commented out
                 #nn.Conv2d(num_disc_filters * 8, num_disc_filters * 8, 4, 2, 1, bias=False),
                #nn.BatchNorm2d(num_disc_filters * 8),
                #nn.LeakyReLU(0.2, inplace=True),
            )
        self.final_conv = nn.Conv2d(num_disc_filters * 8, 512, 4, 2, 1, bias=False)
        
        self.real_fake_head = nn.Linear(512*4*4, 1)
        
        self.sig = nn.Sigmoid()
        self.fc = nn.Sequential() 
        self.fc.add_module("linear_layer.{0}".format(num_disc_filters*16),nn.Linear(512*4*4,num_disc_filters*16))
        self.fc.add_module('relu.{0}'.format(num_disc_filters*16), nn.LeakyReLU(0.2, inplace=True))
        self.fc.add_module("linear_layer.{0}".format(num_disc_filters*8),nn.Linear(num_disc_filters*16,num_disc_filters*8))
        self.fc.add_module('relu.{0}'.format(num_disc_filters*8), nn.LeakyReLU(0.2, inplace=True))
        self.fc.add_module("linear_layer.{0}".format(num_disc_filters),nn.Linear(num_disc_filters*8,y_dim))
        self.fc.add_module('relu.{0}'.format(num_disc_filters), nn.LeakyReLU(0.2, inplace=True))
        self.fc.add_module('softmax',nn.Softmax(dim=1))
       
    def forward(self, inp):

        x = self.conv(inp)
        x = self.final_conv(x)  # now a [batch_size, 512, 4, 4] feature map
        x = x.view(x.size(0), -1)  # flatten it -> [batch_size, 512*4*4]
        real_out = self.sig(self.real_fake_head(x))
        real_out = real_out.view(-1,1).squeeze(1)
        
        style = self.fc(x)
        style = torch.mean(style, 1)  # mean of the style (class label) outputs -> [batch_size] instead of [batch_size, y_dim]
        return real_out,style


disc = Discriminator()
x = Variable(torch.randn(1, 3, 256, 256))
output = disc(x)

Sure - I want to test it out on 64x64 images first, because of training time and hardware issues etc. Apologies if I didn’t make that clear!

The discriminator above (sorry, I’m on mobile, so it’s a bit of a pain to paste it again) does by default expect 256x256.

Should I just forget about trying 64x64 first? Or is it easy to change the discriminator and generator to keep the same network structure but accommodate sizes other than 256x256 (such as 64x64)? It would be nice to be able to pass variable image sizes and not hardcode everything.

Thanks again!

Ah ok, yes, that makes perfect sense!
If you would like to use smaller images like 64x64, you could just remove the last conv6 layer and adjust the input channels and features for the other layers.

If you really want different input sizes, have a look at the WGAN example. They add extra conv layers depending on the input size.
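
Something along these lines should work (my own sketch of that idea, not the exact WGAN code; make_disc_body and its parameters are made-up names):

import torch
import torch.nn as nn

# Keep halving the spatial size until the feature map is 4x4, doubling the channels each time.
def make_disc_body(image_size=64, in_channels=3, base_filters=32):
    assert image_size % 16 == 0, "image_size should be a multiple of 16"
    layers = [nn.Conv2d(in_channels, base_filters, 4, 2, 1, bias=False),
              nn.LeakyReLU(0.2, inplace=True)]
    size, channels = image_size // 2, base_filters
    while size > 4:
        layers += [nn.Conv2d(channels, channels * 2, 4, 2, 1, bias=False),
                   nn.BatchNorm2d(channels * 2),
                   nn.LeakyReLU(0.2, inplace=True)]
        size, channels = size // 2, channels * 2
    return nn.Sequential(*layers), channels  # final feature map: [batch, channels, 4, 4]

body, channels = make_disc_body(image_size=64)
print(body(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 256, 4, 4])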

Oh okay cool - will definitely check out that WGAN example - looks super helpful.

Last question (I promise!): the GAN has a two-component loss - a real-vs-fake loss, for which I’m using the normal nn.BCELoss(), and an image style (class) classification loss, called style_criterion, for which I’m using nn.MultiLabelSoftMarginLoss() [as the goal is for the discriminator to view each class as equiprobable]. I’m also using the retain_graph=True argument in all my backward() calls (for both losses) except the last one.

Does the below seem right?

style_labels = torch.FloatTensor(self.batch_size)
style_labels = style_labels.cuda()
#batch_styles = a Tensor of shape batch_sizex1
# contains the image style (class) labels of the current batch
style_labels = Variable(style_labels.copy_(batch_styles))

gen_style_labels = torch.FloatTensor(self.batch_size)
# Dummy label as generator not trained on class labels
if self.cuda:
    gen_style_labels = gen_style_labels.cuda()
gen_style_labels = Variable(gen_style_labels)
gen_style_labels = gen_style_labels.fill_(real_label) 


# Discriminator
err_disc_style = style_criterion(output_styles, style_labels)
err_disc_style.backward(retain_graph=True) 
# later
disc_err = err_disc_real + err_disc_fake + err_disc_style

####

# Generator
err_gen_style = style_criterion(batch_labels,gen_style_labels)
err_gen_style.backward() # is the last backward() so don't need retain_graph=True
gen_err +=  err_gen_style # gen_err originally the conventional generator loss

I wrote this in line with the Tensorflow implementation of this GAN, but when I managed to (mostly) run it, it seemed to really mess with the generator: its output images stayed exactly the same across epochs (as opposed to the DCGAN version with just the regular nn.BCELoss(), whose images improved from epoch to epoch).

Any ideas? Have I made a stupid mistake or something? Unfortunately the Tensorflow implementation of this GAN is confusing and I’m not a huge Tensorflow fan, so I’d like to implement it in the far nicer Pytorch :slight_smile:

Thanks again, you are a real lifesaver!

You are welcome!

I’m not sure about the criterion.
What kind of target do you have?
Is it really a multi-label target, i.e. multiple classes can be found in a single sample image?

The target, style_labels, is a 27x1 tensor where each element (if that’s the word) is the class label [an int in the range 0-26 inclusive].

It’s not a multi-label target in that sense (i.e. an image can only have 1 class), but the Tensorflow implementation used sigmoid_cross_entropy_with_logits, whose Pytorch equivalent is the loss I’m using. The normal nn.CrossEntropyLoss would then definitely be way better - thanks for picking that up! The Tensorflow implementation might be wrong then…

Now I guess the question is whether nn.CrossEntropyLoss satisfies what the paper requires of this loss:

Maximizing the stylistic ambiguity can be achieved by maximizing the style class posterior entropy. Hence, we need to design the loss such that the generator G produces an image x ∼ p_data and, meanwhile, maximizes the entropy of p(c|x) (i.e. the style class posterior) for the generated images. However, instead of maximizing the class posterior entropy, we minimize the cross entropy between the class posterior and a uniform target distribution. Similar to entropy, which is maximized when the class posteriors (i.e., p(c|G(z))) are equiprobable, cross entropy with a uniform target distribution will be minimized when the classes are equiprobable. So both objectives will be optimal when the classes are equiprobable. However, the difference is that the cross entropy will go up sharply at the boundary, since it goes to infinity if any class posterior approaches 1 (or zero), while entropy goes to zero at this boundary condition. Therefore, using the cross entropy results in a hefty penalty if the generated image is classified to one of the classes with high probability. This in turn would generate a very large loss, and hence large gradients, if the generated images start to be classified to any of the style classes with high confidence.

It’s times like these I wish my maths was better!

Puh, CrossEntropyLoss should be fine if the target classes are equiprobable.
In other words, if you have the same number of samples for each class, the criterion should be good enough.
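
If you later want the paper’s exact “cross entropy against a uniform target distribution” for the style-ambiguity term, a sketch could look like this (my own code, not from the paper or the Tensorflow implementation; it assumes the style head returns raw logits rather than softmax outputs, so you’d drop the final nn.Softmax if you go this route):

import torch
import torch.nn.functional as F

def uniform_cross_entropy(logits):
    # Cross entropy between the predicted class posterior and a uniform distribution
    # over K classes: -(1/K) * sum_k log p(c_k | x), averaged over the batch.
    log_probs = F.log_softmax(logits, dim=1)
    return -log_probs.mean(dim=1).mean()

# Usage sketch: style_logits would be the (pre-softmax) output of the style head for generated images.
style_logits = torch.randn(64, 27)
loss = uniform_cross_entropy(style_logits)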

I would try it and see what happens.
Let me know if you manage to generate some nice images! :wink: