Training two models with a single optimizer and loss function is not learning anything


I am trying to train a model net that creates embeddings of images and a classifier clas that discriminates the class of two embeddings. My training routine tries to learn both the embedder and the classifier in an end-to-end fashion.

At test time, the model appears not to have learned anything, and I’m not sure why.

The code that I use is shown below:

net = Embedder()

clas = Classifier()

optimizer = optim.Adam(list(net.parameters()) + list(clas.parameters()), lr=lr) 

criterion = nn.CrossEntropyLoss()

The training loop is:

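# The first argument of each call was lost in the post and is left as "...".
images1 = torch.tensor(..., dtype=torch.float)
images2 = torch.tensor(..., dtype=torch.float)
labels = torch.tensor(..., dtype=torch.long)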

# ================= FORWARD=================

e1 = net(images1)
e2 = net(images2)
rta = clas(e1, e2)

loss = criterion(rta, labels)

# ================= BACKWARD =================
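The backward part of the loop did not survive in the post. A minimal sketch of one complete step might look like the following. It assumes two toy stand-ins, ToyEmbedder and ToyClassifier (hypothetical, not the original Embedder and Classifier), and random data in place of the dataloader:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)


class ToyEmbedder(nn.Module):
    """Hypothetical stand-in for Embedder: flattens an image and projects it."""

    def __init__(self, in_features=3 * 8 * 8, emb_dim=16):
        super().__init__()
        self.fc = nn.Linear(in_features, emb_dim)

    def forward(self, x):
        return torch.relu(self.fc(x.flatten(1)))


class ToyClassifier(nn.Module):
    """Hypothetical stand-in for Classifier: scores a pair of embeddings."""

    def __init__(self, emb_dim=16, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(2 * emb_dim, num_classes)

    def forward(self, e1, e2):
        return self.fc(torch.cat((e1, e2), dim=1))


net = ToyEmbedder()
clas = ToyClassifier()
# One optimizer over the parameters of both modules, as in the post.
optimizer = optim.Adam(list(net.parameters()) + list(clas.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Random data standing in for one batch from the dataloader.
images1 = torch.randn(4, 3, 8, 8)
images2 = torch.randn(4, 3, 8, 8)
labels = torch.randint(0, 2, (4,))

before = net.fc.weight.detach().clone()

optimizer.zero_grad()   # clear gradients left over from the previous step
loss = criterion(clas(net(images1), net(images2)), labels)
loss.backward()         # gradients flow through clas into both calls of net
optimizer.step()        # update embedder and classifier together

# True if the step actually changed the embedder weights.
embedder_updated = not torch.equal(before, net.fc.weight.detach())
print(loss.item(), embedder_updated)
```

If zero_grad() is missing, gradients accumulate across iterations; if step() is missing, no parameter changes at all, so both are worth confirming in the real loop.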


I think the problem is that each embedding is computed on different forwards but I’m not sure.

I have already done a hyper-parameter search and checked that the dataloader is working fine. I also tested which optimizer to use; none of them works.

Thanks for any advice <3!

The code looks generally fine.
One additional check would be to verify that all modules receive valid gradients.

# or 
for name, param in net.named_parameters():
    print(name, param.grad)
# the same for classifier
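Printing whole gradient tensors can be hard to read, so one option is to print a single norm per parameter instead. A rough sketch, where grad_report is a hypothetical helper and the two nn.Linear layers only stand in for the real embedder and classifier:

```python
import torch
import torch.nn as nn


def grad_report(module, prefix):
    """Return {parameter name: gradient L2 norm, or None} (hypothetical helper)."""
    report = {}
    for name, param in module.named_parameters():
        # param.grad stays None if the parameter never took part in backward().
        if param.grad is None:
            report[f"{prefix}.{name}"] = None
        else:
            report[f"{prefix}.{name}"] = param.grad.norm().item()
    return report


torch.manual_seed(0)
net = nn.Linear(8, 4)    # stand-in for the embedder
clas = nn.Linear(8, 2)   # stand-in for the classifier (two concatenated embeddings)

x1, x2 = torch.randn(5, 8), torch.randn(5, 8)
labels = torch.randint(0, 2, (5,))

logits = clas(torch.cat((net(x1), net(x2)), dim=1))
nn.CrossEntropyLoss()(logits, labels).backward()

report = {**grad_report(net, "net"), **grad_report(clas, "clas")}
for name, norm in report.items():
    print(name, norm)
```

A None entry would mean the parameter is disconnected from the loss, while very small norms point toward vanishing gradients rather than a wiring problem.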

Thanks for your reply!

I found that the main problem in my code is in the definition of the network.

I checked the gradients manually and they are near 0 (around 1e-11).

My “fast” solution was to redefine the embedder and create a new joint model JointModel and it worked perfectly.

class JointModel(nn.Module):
    def __init__(self, net, classifier):
        super(JointModel, self).__init__()
        self.emb = net
        self.clas = classifier

    def forward(self, x1, x2):
        B = x1.size(0)
        X = torch.cat((x1, x2), dim=0)
        X = self.emb(X)
        x1, x2 = X[:B, ...], X[B:, ...]
        return self.clas(x1, x2)
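As a sanity check on the idea that two separate forward passes are the culprit, the sketch below compares the gradients from two forwards with those from one concatenated forward, as done in JointModel. It assumes a toy nn.Linear embedder and a placeholder pair_loss (both hypothetical). For layers that do not use batch statistics, the two variants should agree up to floating-point error, which suggests that the batching alone may not explain the difference:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Linear(6, 3)  # toy embedder without batch-dependent layers
x1, x2 = torch.randn(4, 6), torch.randn(4, 6)


def pair_loss(e1, e2):
    # Placeholder loss that depends on both embeddings.
    return (e1 * e2).sum()


# Variant A: two separate forward passes through the shared embedder.
emb.zero_grad()
pair_loss(emb(x1), emb(x2)).backward()
grad_separate = emb.weight.grad.clone()

# Variant B: one forward pass on the concatenated batch, then split.
emb.zero_grad()
B = x1.size(0)
X = emb(torch.cat((x1, x2), dim=0))
pair_loss(X[:B], X[B:]).backward()
grad_joint = emb.weight.grad.clone()

# Compare the two gradients up to floating-point tolerance.
same = torch.allclose(grad_separate, grad_joint, atol=1e-5)
print(same)
```

With nn.BatchNorm2d in training mode the two variants would generally differ, because the batch statistics are computed over different sets of samples.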

My question now is why my previous simple network had a problem with the gradients (it is an AlexNet with a few variations). I have added the code below.

Note: I already varied the initial learning rate from 1e-1 down to 1e-4.

The network was initialized with the default parameters.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class AN(nn.Module):
    def __init__(self, sobel="RGB", bn=False, init_weights=False, norm=False):
        super(AN, self).__init__()
        self.norm = norm  # L2-normalize the output embedding
        self.bn = bn      # apply instance normalization after each conv layer
        assert (sobel == "RGB") or (sobel == "Edges") or (sobel == "RGB+Edges")

        self.edges = sobel
        if sobel == "Edges" or sobel == "RGB+Edges":
            # Fixed (non-trainable) grayscale conversion followed by Sobel filters.
            grayscale = nn.Conv2d(3, 1, kernel_size=1, stride=1, padding=0)
            grayscale.weight.data.fill_(1.0 / 3.0)
            grayscale.bias.data.zero_()  # assumed: zero bias for a pure channel average
            sobel_filter = nn.Conv2d(1, 2, kernel_size=3, stride=1, padding=1)
            sobel_filter.weight.data[0, 0].copy_(
                torch.FloatTensor([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
            )
            sobel_filter.weight.data[1, 0].copy_(
                torch.FloatTensor([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])
            )
            sobel_filter.bias.data.zero_()  # assumed: zero bias for a pure Sobel response
            self.sobel = nn.Sequential(grayscale, sobel_filter)
            for p in self.sobel.parameters():
                p.requires_grad = False

        # The number of input channels depends on the chosen input representation.
        if sobel == "Edges":
            self.conv1 = nn.Conv2d(2, 64, kernel_size=11, padding=5, stride=4)
        elif sobel == "RGB+Edges":
            self.conv1 = nn.Conv2d(5, 64, kernel_size=11, padding=5, stride=4)
        else:
            self.conv1 = nn.Conv2d(3, 64, kernel_size=11, padding=5, stride=4)

        self.relu1 = nn.ReLU(inplace=True)
        self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(64, 192, kernel_size=5, padding=2)
        self.relu2 = nn.ReLU(inplace=True)
        self.maxpool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv3 = nn.Conv2d(192, 384, kernel_size=3, padding=1)
        self.relu3 = nn.ReLU(inplace=True)
        self.conv4 = nn.Conv2d(384, 256, kernel_size=3, padding=1)
        self.relu4 = nn.ReLU(inplace=True)
        self.conv5 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.relu5 = nn.ReLU(inplace=True)
        self.maxpool3 = nn.MaxPool2d(kernel_size=2, stride=2)
        if bn:
            self.bn1 = nn.InstanceNorm2d(64)
            self.bn2 = nn.InstanceNorm2d(192)
            self.bn3 = nn.InstanceNorm2d(384)
            self.bn4 = nn.InstanceNorm2d(256)
            self.bn5 = nn.InstanceNorm2d(256)

        if init_weights:
            self._initialize_weights()

    def forward(self, x):

        if self.edges == "Edges":
            x = self.sobel(x)
        elif self.edges == "RGB+Edges":
            # 2 edge channels + 3 RGB channels = 5 input channels for conv1.
            x = torch.cat((self.sobel(x), x), dim=1)

        x = self.conv1(x)
        if self.bn:
            x = self.bn1(x)
        x = self.relu1(x)
        x = self.maxpool1(x)

        x = self.conv2(x)
        if self.bn:
            x = self.bn2(x)
        x = self.relu2(x)
        x = self.maxpool2(x)

        x = self.conv3(x)
        if self.bn:
            x = self.bn3(x)
        x = self.relu3(x)

        x = self.conv4(x)
        if self.bn:
            x = self.bn4(x)
        x = self.relu4(x)

        x = self.conv5(x)
        if self.bn:
            x = self.bn5(x)
        x = self.relu5(x)
        x = self.maxpool3(x)
        # Flatten the feature maps into one embedding vector per sample.
        x = x.view(x.size(0), -1)

        if self.norm:
            x = F.normalize(x)

        return x

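    def _initialize_weights(self):
        print('Initiating network weights')
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # Normal init with std sqrt(2 / (kH * kW * out_channels)).
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                for i in range(m.out_channels):
                    m.weight.data[i].normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                # Assumed: the usual identity init (scale 1, shift 0).
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()  # assumed: zero bias, as in the conv branch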
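The conv branch of _initialize_weights draws weights from a normal distribution with standard deviation sqrt(2 / (kH * kW * out_channels)). This appears to match nn.init.kaiming_normal_ with mode="fan_out" and nonlinearity="relu". A small sketch that compares the two standard deviations on one layer (the layer sizes are taken from conv2 above, everything else is illustrative):

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(64, 192, kernel_size=5, padding=2)

# Standard deviation used by the manual loop: sqrt(2 / (kH * kW * out_channels)).
n = conv.kernel_size[0] * conv.kernel_size[1] * conv.out_channels
expected_std = math.sqrt(2.0 / n)

# Built-in initializer using fan_out and the ReLU gain sqrt(2).
nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")
nn.init.zeros_(conv.bias)

# Empirical standard deviation of the freshly initialized weights.
measured_std = conv.weight.std().item()
print(expected_std, measured_std)
```

Using the built-in initializer avoids the per-channel Python loop and the direct .data access, at the cost of initializing the whole weight tensor in one call.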

Thanks again for your reply :)

Hello Guillaume!
It’s a little bit late, but did you manage to figure out what the problem was in your first approach?

I am in a similar situation.