Using Parallel Layers Makes My NN Not Converge

I am fine-tuning a VGG16 network by adding multiple classifiers as the final classification layer. Based on a recommendation on this site, I created this class:

from typing import List

import torch

class Parallel(torch.nn.Module):
    def __init__(self, modules: List[torch.nn.Module]):
        super().__init__()
        self.modules = modules

    def forward(self, inputs):
        return [module(inputs) for module in self.modules]

And then I replaced the last layer in VGG16 using this command:

net = torchvision.models.vgg16_bn(pretrained=True)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
for param in net.parameters():
    param.requires_grad = True
k = 6
net.classifier._modules[str(k)] = Parallel(
    [nn.Linear(4096, classes_num).to(device) for i in range(classifiers_num)]).to(device)

I noticed that even when classifiers_num is equal to 1, the training error never decreases. However, when I replace the last layer with a plain nn.Linear, without using my Parallel class, it does decrease:

net.classifier._modules[str(k)] = nn.Linear(4096, classes_num)

I am not sure what the difference is. What am I missing?

Hi Randomnd!

Your Parallel.forward() returns a list of whatever its modules return.

So in this Parallel case _modules[str(k)] returns a list of the
outputs of the Linears, so presumably a list of tensors of shape
[nBatch, classes_num].

Note, even if classifiers_num = 1, you still return a list that contains
a single tensor, not just the tensor itself.

However, in the case without Parallel, _modules[str(k)] is simply a
Linear, and so returns a tensor of shape [nBatch, classes_num]
rather than a list.

If your code that computes a loss from the output of net does not treat
these two cases – list vs. tensor – differently, then you are likely not
computing the loss correctly in (at least) one of the cases.
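To see the difference concretely, here is a minimal, self-contained sketch. The sizes (batch of 2, classes_num = 3) are made up, and a plain list comprehension stands in for the Parallel forward:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(2, 4096)        # stand-in for a batch of VGG16 features

plain = nn.Linear(4096, 3)             # the plain-Linear last layer
single_out = plain(features)           # a tensor of shape [2, 3]

# What Parallel's forward returns, even with a single module:
# a list of tensors, not a tensor
heads = [nn.Linear(4096, 3)]
list_out = [m(features) for m in heads]

print(type(single_out).__name__)       # Tensor
print(type(list_out).__name__)         # list
```

A loss function such as nn.CrossEntropyLoss expects a tensor, so the list case needs its own handling.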


K. Frank

Thanks for the reply. My code does treat the two cases differently, but I left that part out:

# loop over the classifiers_num outputs
loss = 0.0
for k, outputs in enumerate(batch_outputs):
    loss = loss + criterion(outputs, labels_batched[:, k])

In the case of using only a Linear without a list, I wrap the output as batch_outputs = [net(data_batched)] and then execute the same piece of code above.
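Filled out with made-up stand-ins (2 heads, batch of 4, 3 classes, random tensors in place of the network outputs), the loop runs like this:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.CrossEntropyLoss()

# made-up stand-ins: classifiers_num = 2 heads, batch of 4, 3 classes
labels_batched = torch.randint(0, 3, (4, 2))           # [nBatch, classifiers_num]
batch_outputs = [torch.randn(4, 3) for _ in range(2)]  # one output tensor per head

loss = torch.tensor(0.0)
for k, outputs in enumerate(batch_outputs):
    loss = loss + criterion(outputs, labels_batched[:, k])

print(loss.item())  # a positive scalar
```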

Hi Randomnd!

Indexing labels_batched with k looks fishy to me. How are data_batched
and labels_batched supposed to line up with one another? How is
labels_batched supposed to know in advance how many elements you
have in your Parallel and, therefore, what the length of batch_outputs
will be?

Irrespective of that detail, for further debugging I would start by setting up
net with a single Linear as its last layer, and then again with a Parallel
containing a single Linear as its last layer, making sure that the two
Linears are initialized identically. (Or maybe you could just reuse the
same Linear for both cases.) Then run the same input data (one batch)
through the two versions of net and check that you get the same loss.
If that works, try performing a single backward pass, and check that you
get the same grads for the parameters of net and, in particular, the same
grad for your last-layer Linear.
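As a sketch of that last check, this reuses one Linear as both the plain last layer and the only member of the parallel list (the sizes are made up, and a bare Linear stands in for the full net):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
shared = nn.Linear(8, 3)          # one Linear reused in both setups
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 8)
labels = torch.randint(0, 3, (4,))

# plain-Linear case: tensor output
loss_plain = criterion(shared(x), labels)

# Parallel-style case: list output, one criterion term per element
parallel_out = [m(x) for m in [shared]]
loss_parallel = sum(criterion(out, labels) for out in parallel_out)

print(torch.allclose(loss_plain, loss_parallel))       # True

loss_plain.backward()
grad_plain = shared.weight.grad.clone()
shared.weight.grad = None                              # reset before second backward
shared.bias.grad = None
loss_parallel.backward()
print(torch.allclose(grad_plain, shared.weight.grad))  # True
```

If the losses or grads differ in your real setup, the discrepancy is in how the list case is handled, not in the Linear itself.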


K. Frank