Network doesn't converge for a single input

I am trying to reproduce a neural network for shape and appearance disentangling [https://arxiv.org/pdf/1903.06946.pdf]. The original network was written in TensorFlow, and I want to reimplement it in PyTorch. The model looks as follows:

class Model(nn.Module):
    def __init__(self, parts=16, n_features=32):
        super(Model, self).__init__()
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.E_sigma = E(3, parts, residual_dim=64, sigma=True)
        self.E_alpha = E(1, n_features, residual_dim=64, sigma=False)
        self.decoder = Decoder(parts, n_features)

    def forward(self, x):
        sig, stack = self.E_sigma(x)
        f_xs = self.E_alpha(stack)
        alpha = get_local_part_appearances(f_xs, sig)
        mu, L_inv = get_mu_and_prec(sig, self.device)
        encoding = feat_mu_to_enc(alpha, mu, L_inv, self.device)
        reconstruction = self.decoder(encoding)
        return reconstruction

The model consists of three nn.Module submodules: E_sigma, E_alpha, and the Decoder. As an example, the encoder E looks as follows:

class E(nn.Module):
    def __init__(self, depth, n_out, residual_dim, sigma=True):
        super(E, self).__init__()
        self.sigma = sigma
        self.hg = Hourglass(depth, residual_dim)  # depth 4 has bottleneck of 4x4
        self.n_out = Conv(residual_dim, n_out, kernel_size=1, stride=1, bn=True, relu=True)
        if self.sigma:
            self.preprocess_1 = Conv(3, 64, kernel_size=6, stride=2, bn=True, relu=True)  # transform to 64 x 64 for sigma
            self.preprocess_2 = Residual(64, residual_dim)
            self.map_transform = Conv(n_out, residual_dim, 1, 1)    # channels for addition must be increased

    def forward(self, x):
        if self.sigma:
            x = self.preprocess_1(x)
            x = self.preprocess_2(x)
        out = self.hg(x)
        map = self.n_out(out)
        if self.sigma:
            map_normalized = F.softmax(map.reshape(map.size(0), map.size(1), -1), dim=2).view_as(map)
            map_transform = self.map_transform(map_normalized)
            stack = map_transform + x   # add instead of stacking: x is much larger than map_transform, so the addition has almost no impact
            return map_normalized, stack
        else:
            return map
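For context, the softmax in the sigma branch is meant to turn each of the parts channels into a normalized heat map over image locations. A minimal standalone sketch of that reshape-softmax-view pattern (using a hypothetical random tensor in place of the real feature map):

import torch
import torch.nn.functional as F

# hypothetical feature map: batch 1, 16 part channels, 64 x 64 spatial grid
part_maps = torch.randn(1, 16, 64, 64)

# flatten the spatial dimensions, softmax over them, and restore the shape,
# so every channel becomes a probability map over pixel locations
normalized = F.softmax(part_maps.reshape(part_maps.size(0), part_maps.size(1), -1), dim=2).view_as(part_maps)

print(normalized.sum(dim=(2, 3)))  # each channel sums to 1 -> tensor of ones, shape (1, 16)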

There are also three functions that transform the data. I am trying to test whether the model converges if I feed it a single input and let it train for some time:

def train():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    net = Model().to(device)
    net.train()
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
    criterion = nn.MSELoss().to(device)
    img = torch.randn(1, 3, 128, 128).to(device)
    for epoch in range(1000):
        optimizer.zero_grad()
        prediction = net(img)
        loss = criterion(prediction, img)
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print(loss.item())  # print the scalar loss every 10 steps

The model has about 12 million parameters, and unfortunately, for that single input, the loss always plateaus at about 0.5. Is that a sign that there is a problem with my architecture? What could be the reason for that?
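To narrow this down, one quick diagnostic (just a sketch, reusing net, img and device from the loop above) is to compare the statistics of the prediction with those of the target after training:

with torch.no_grad():
    prediction = net(img)
    # if the output has collapsed (e.g. all zeros) or the target lies outside
    # the output range, it shows up immediately here
    print('prediction min/max/mean:', prediction.min().item(), prediction.max().item(), prediction.mean().item())
    print('target     min/max/mean:', img.min().item(), img.max().item(), img.mean().item())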

Small update: I figured out that the network drives all output values to zero - how can that be?!

Edit: Got it - the model ends with a sigmoid activation function, while the input was a non-normalized image, so the network could never reach the target values :slight_smile:
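For anyone running into the same issue: because the decoder ends in a sigmoid, the reconstruction lives in [0, 1], so the target image has to be in that range too. A minimal fix for the test setup above (assuming the random-tensor input from train) is to sample or scale the input into [0, 1]:

# the final sigmoid squashes the reconstruction to [0, 1],
# so the target must be in the same range
img = torch.rand(1, 3, 128, 128).to(device)   # uniform in [0, 1] instead of torch.randn

# for a real image tensor, scale it instead, e.g.
# img = (img - img.min()) / (img.max() - img.min())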