GANs: Two Sequential Blocks vs. a Single Concatenated Block

I have the following 2 Sequential blocks:

net = nn.Sequential(
    nn.Conv2d(1, self.complexity, 4, 2, 3),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(self.complexity, self.complexity * 2, 4, 2, 1),
    nn.BatchNorm2d(self.complexity * 2),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(self.complexity * 2, self.complexity * 4, 4, 2, 1),
    nn.BatchNorm2d(self.complexity * 4),
    nn.LeakyReLU(0.2, inplace=True),
)

and the output of the above network is passed into

classification_layer = nn.Sequential(nn.Conv2d(self.complexity * 4, 11, 4, 1, 0))

where self.complexity is 128.
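
With 28×28 MNIST inputs, the feature block produces (N, 512, 4, 4) activations and the classification layer reduces them to (N, 11, 1, 1) logits (the 10 digit classes plus one fake class for the semi-supervised setup). A quick shape check with a random tensor, as a standalone sketch with self.complexity hard-coded to 128:

import torch
import torch.nn as nn

complexity = 128  # self.complexity in my model
net = nn.Sequential(
    nn.Conv2d(1, complexity, 4, 2, 3),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(complexity, complexity * 2, 4, 2, 1),
    nn.BatchNorm2d(complexity * 2),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(complexity * 2, complexity * 4, 4, 2, 1),
    nn.BatchNorm2d(complexity * 4),
    nn.LeakyReLU(0.2, inplace=True),
)
classification_layer = nn.Sequential(nn.Conv2d(complexity * 4, 11, 4, 1, 0))

x = torch.randn(8, 1, 28, 28)            # random MNIST-sized batch
features = net(x)                        # torch.Size([8, 512, 4, 4])
logits = classification_layer(features)  # torch.Size([8, 11, 1, 1])
print(features.shape, logits.shape)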

When I combine the two into one Sequential block by doing the following:

net = nn.Sequential(
    nn.Conv2d(1, self.complexity, 4, 2, 3),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(self.complexity, self.complexity * 2, 4, 2, 1),
    nn.BatchNorm2d(self.complexity * 2),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(self.complexity * 2, self.complexity * 4, 4, 2, 1),
    nn.BatchNorm2d(self.complexity * 4),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(self.complexity * 4, 11, 4, 1, 0),
)

my network behaves differently. I am using these to train a DCGAN on the MNIST dataset for semi-supervised learning. In the first case, I do:

output = classification_layer(net(input))

and I get high accuracies (~90%) whereas in the second case, I do:

output = net(input)

and get lower accuracies (~60%).

Any idea why this could be happening? Is my assumption that these two models are equivalent correct?

I would greatly appreciate any help!

How reproducible are these findings? I.e., have you checked the accuracy and standard deviation over multiple runs (with different seeds, of course), and do you always see the discrepancy?

Both approaches should yield the same result, so I assume you were unlucky in the second method.
However, if this behavior is reproducible, could you post a code snippet (using random tensors) so that we can reproduce it, please?
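
Something along these lines would do it, where train_and_evaluate is just a placeholder for your training loop:

import torch

accuracies = []
for seed in [0, 1, 2, 3, 4]:
    torch.manual_seed(seed)     # re-seed before each run
    acc = train_and_evaluate()  # placeholder: trains the model and returns the final accuracy
    accuracies.append(acc)

accuracies = torch.tensor(accuracies)
print(f'mean: {accuracies.mean():.3f}, std: {accuracies.std():.3f}')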

Hi @ptrblck. Thank you for getting back to me. The results are indeed reproducible: I’ve run numerous experiments with the two models, and they consistently yield accuracies of ~90% and ~60%, respectively.

I’m a little confused as to which part of the code you would like me to post. Should I post the code I use to calculate the losses/accuracies? In my code the seed is always set to 1 for both models so that I can reproduce the error.

Could you rerun the code with different seeds?

We would need a code snippet to reproduce this issue, i.e. the model definition as well as the optimizer, input shapes, etc., so that we can run the code with random tensors.

Oh, sure thing, I’ll upload a file here with the code in just a few minutes.

Hi @ptrblck. I’ve extracted all the code related to the issue into a couple of files, but I’m unable to upload non-image files here. Is there somewhere else you would prefer I send them? (There are two Python files: one is a custom batch sampler that I use, and the other is the main file for my model.)

You could create a Gist on GitHub and store the code there.

@ptrblck I will do that right away. I was actually going over my log files and noticed that with the second approach (the single Sequential block) I am getting very high discriminator losses (around 6 or 7), whereas with the first approach they were in the range of 0.5-0.7. Any idea why this could be happening for a semi-supervised GAN, or would you need to see the code snippet for that?

I still think the random initialization might differ between the two approaches, since the models themselves are identical.
One thing you could also try is to store the state_dict of the first model and load it manually into the second model before training both.
Since the parameters would then be equal, it would be interesting to see whether you still observe different results.
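
A minimal, standalone sketch of that check, using the shapes from your post (the variable names are illustrative):

import torch
import torch.nn as nn

c = 128  # self.complexity

def feature_layers():
    return [
        nn.Conv2d(1, c, 4, 2, 3), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(c, c * 2, 4, 2, 1), nn.BatchNorm2d(c * 2), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(c * 2, c * 4, 4, 2, 1), nn.BatchNorm2d(c * 4), nn.LeakyReLU(0.2, inplace=True),
    ]

# first approach: two separate blocks
net1 = nn.Sequential(*feature_layers())
clf1 = nn.Sequential(nn.Conv2d(c * 4, 11, 4, 1, 0))

# second approach: one combined block (independently initialized)
net2 = nn.Sequential(*feature_layers(), nn.Conv2d(c * 4, 11, 4, 1, 0))

# copy the first model's parameters (and BatchNorm buffers) into the second;
# the module indices line up: net2[:8] <-> net1 and net2[8] <-> clf1[0]
net2[:8].load_state_dict(net1.state_dict())
net2[8].load_state_dict(clf1[0].state_dict())

# with equal parameters, both models should produce identical outputs
net1.eval(); clf1.eval(); net2.eval()
x = torch.randn(8, 1, 28, 28)
with torch.no_grad():
    print(torch.allclose(clf1(net1(x)), net2(x)))  # True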

Oh, I think that would really be an interesting way to approach the problem; I’ll give it a try for sure and update you.

I think, as you say, the models are indeed equivalent. On further inspection, I realised that I wasn’t calling .zero_grad() on the classification layer separately when training: I always called model.net.zero_grad(), followed by loss.backward() and optimizer.step(), and this led to the difference in performance. It is quite interesting, though, that not calling model.classification_layer.zero_grad() caused such a large difference. I’ll close this discussion for now and mark it resolved. Thank you for all the help. :blush:
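
For anyone who runs into the same thing: the fix is to zero the gradients of every module the optimizer updates before each backward pass, which is easiest via the optimizer itself. A sketch of the corrected training step, assuming the optimizer was built over the parameters of both modules (criterion, input and target are the usual placeholders):

# before (buggy): only the feature block was zeroed, so gradients
# kept accumulating in the classification layer across iterations
# model.net.zero_grad()

# after: zero everything the optimizer updates
optimizer.zero_grad()  # or: model.net.zero_grad(); model.classification_layer.zero_grad()
output = model.classification_layer(model.net(input))      # (N, 11, 1, 1)
loss = criterion(output.view(output.size(0), -1), target)  # flatten logits to (N, 11)
loss.backward()
optimizer.step()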