Model gives the same prediction for all data in one batch

Hi.

My GraphNet predicts the same result for every event in a batch. This output is roughly the average of all labels within the batch. But I have an optimization problem, and my labels are quite distinct from one another.
A workaround is simple: changing the batch size to 1. Then my model trains just fine and reaches good accuracy. Still, I want to go back to training on batches to reduce the training time.

I have already tuned the learning rate, the width and depth of my model, the loss function, the activation function, dropout, and pooling. With batch size 1, the training results are fine.

Here is the code of my model:

import gc

import torch
import torch.nn as nn
from dgl.nn import EdgeConv, MaxPooling


class Net(nn.Module):
    def __init__(self, n_feats_fc, in_feats_g, parallel_layers, Dropout):
        super(Net, self).__init__()
        self.edge1 = EdgeConv(50, 100)
        self.edge2 = EdgeConv(100, 300)
        self.edge3 = EdgeConv(300, 600)
        self.Dropout = nn.Dropout(Dropout)
        self.pooling = MaxPooling()
        self.fc1 = nn.Linear(600, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 200)
        self.fc4 = nn.Linear(200, 200)
        self.fc_out = nn.Linear(200, 9)

    def forward(self, graph, batch_size):
        feat = torch.tanh(self.edge1(graph, graph.ndata['x']))
        feat = self.Dropout(feat)
        feat = torch.tanh(self.edge2(graph, feat))
        feat = torch.tanh(self.edge3(graph, feat))
        # feat = torch.max(feat, dim=1)[0]
        feat = self.pooling(graph, feat)  # per-graph readout over the batched graph
        feat = torch.tanh(self.fc1(feat))
        feat = self.Dropout(feat)
        feat = torch.tanh(self.fc2(feat))
        feat = torch.tanh(self.fc3(feat))
        feat = torch.tanh(self.fc4(feat))
        out = torch.clamp(self.fc_out(feat), min=-2, max=2)
        del feat
        gc.collect()
        return out, graph

I use tanh because my output is in the range [-2, 2]. I also have a custom loss function, but it didn't work with MSELoss either.

Here is my loss:

from torch.nn.modules.loss import _Loss

class MyLoss(_Loss):
    __constants__ = ['reduction']

    def __init__(self):
        super(MyLoss, self).__init__()

    def forward(self, pred, tru):
        loss = torch.tensor([0])
        for p, t in zip(pred, tru):
            l = ....  # loss calculation
            loss = torch.add(loss, l.item())
        loss.requires_grad = True
        return l

Maybe it is a backpropagation problem, but I don't understand why it doesn't work.
I hope someone can help me.

Could you check whether the parameters of your model get valid gradients?
The custom loss function seems to detach the calculated loss via l.item(), so your model shouldn't get any gradients at all.
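A minimal sketch of a version that keeps the computation graph intact (per_sample_loss is a hypothetical stand-in for your elided per-sample calculation):

import torch
from torch.nn.modules.loss import _Loss

class MyLoss(_Loss):
    def forward(self, pred, tru):
        # Accumulate the per-sample losses as tensors so autograd keeps
        # tracking the graph; no .item() and no manual requires_grad needed.
        losses = [per_sample_loss(p, t) for p, t in zip(pred, tru)]
        return torch.stack(losses).mean()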

Thanks for the quick response. I fixed the loss function and checked it with torch.autograd.gradcheck(), so the loss function should work fine now. Still, the predictions are the same across one batch.
I predict 9 floats, which should be in the range [-2, 2].
Here is one prediction with batch size 2:
tensor([[ 0.0959, -0.1443, 2.0000, 0.0764, -0.0829, 1.6607, -0.1028, -0.0421, 0.9389],
[ 0.1076, -0.0852, 2.0000, 0.0933, -0.0665, 1.6058, -0.1053, -0.0624, 0.8904]], grad_fn=)

Is there any way to encode in the loss function that the predictions within one batch have to be different?
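For reference, the gradcheck call I ran looks roughly like this (a sketch using MSELoss as a stand-in, since the body of my custom loss is omitted above; gradcheck wants double-precision inputs with requires_grad=True):

import torch
from torch.autograd import gradcheck

loss_fn = torch.nn.MSELoss()
pred = torch.randn(2, 9, dtype=torch.double, requires_grad=True)
tru = torch.randn(2, 9, dtype=torch.double)
print(gradcheck(loss_fn, (pred, tru), eps=1e-6, atol=1e-4))  # True if analytic and numeric gradients match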

The grad_fn seems to be empty, so have you verified that the model parameters get valid gradients?
You could use this quick check after calling backward():

for name, param in model.named_parameters():
    # param.grad is None if the parameter never received a gradient
    print(name, param.grad.abs().sum() if param.grad is not None else None)

Thanks.
Here is the output from the quick check:

edge1.theta.weight tensor(843070.)
edge1.theta.bias tensor(19107.6387)
edge1.phi.weight tensor(799883.0625)
edge1.phi.bias tensor(19107.6387)
edge2.theta.weight tensor(3371751.)
edge2.theta.bias tensor(21775.7812)
edge2.phi.weight tensor(2772996.7500)
edge2.phi.bias tensor(21775.7812)
edge3.theta.weight tensor(8904098.)
edge3.theta.bias tensor(46138.9766)
edge3.phi.weight tensor(9298737.)
edge3.phi.bias tensor(46138.9766)
fc1.weight tensor(18413108.)
fc1.bias tensor(46069.2031)
fc2.weight tensor(5211814.)
fc2.bias tensor(7669.1782)
fc3.weight tensor(7649343.)
fc3.bias tensor(5121.9238)
fc4.weight tensor(8092313.)
fc4.bias tensor(5976.3550)
fc_out.weight tensor(1983731.)

These values change during training.

Thanks for the update. It seems your model's parameters get valid gradients.
I would recommend playing around with some hyperparameters, such as lowering the learning rate.
If that doesn't help, try to scale the problem down and overfit a small data sample (e.g. only 10 samples), as in the sketch below. Once your model is able to do so, you could try to scale it up again carefully.
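A minimal overfitting check could look like this (a sketch; model, loss_fn, and the fixed small_input/small_labels batch are assumed stand-ins for your own setup):

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    optimizer.zero_grad()
    out = model(small_input)           # forward pass on the same small batch every step
    loss = loss_fn(out, small_labels)  # should approach zero if the model can fit at all
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())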

Hi - I also have the same issue. I am using a CNN in a regression task to predict the age of a person from an image of their face. Essentially, what happens is that the network predicts the mean of the ground-truth labels for a given batch.

For example, when overfitting the model on a single batch with batch_size = 3 and ground-truth labels [0.5, 0.7, 0.3], the network predicts [0.5, 0.5, 0.5] once the loss has converged.

This looked to me like the model was underfitting the data, so I progressively added more and more linear layers to the regression head of the CNN (with a ResNet feature extractor). This did not change the outcome.

I have tried different optimizers and learning rates in the range 1e-9 to 1e-1, and this did not seem to help either.

Another post suggested using a mean squared log-scaled error loss rather than plain MSE; this did not help either (see "Training a neural network for regression always predicts the mean" on Cross Validated).

I have implemented regression problems with CNNs in TensorFlow before and did not encounter the same issues. As far as I can tell, there is no reason that this should not work in PyTorch too!

Any further suggestions?


I just wanted to say thank you for the post, and I am seeing the same problem in an unrelated age prediction task (fish). The prediction changes for each batch, but within a batch every prediction is the same.

Hi Mike,
If I remember correctly, my problem in this case was with the mini-batch training. Since I am using graph convolutions, I need to use batched graphs and batched data. In my opinion, the documentation does not help much on how to use them. Do you also use graph convolutions and mini-batch training?

Did you check your predictions for batch size 1?
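For reference, the batched-graph pattern looks roughly like this (a sketch assuming DGL; graphs, labels, model, and loss_fn stand in for my own setup):

import dgl
import torch

batched_graph = dgl.batch(graphs)           # merge the per-event graphs into one disconnected graph
out, _ = model(batched_graph, len(graphs))  # MaxPooling does a per-graph readout,
                                            # so out has shape (num_graphs, 9)
loss = loss_fn(out, torch.stack(labels))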

Ah, I did not see convergence with batch size 1, but I also have a very different setup: a basic ResNet solving a regression problem (with a different head) predicting age.

I would drastically decrease the complexity of your model to check for convergence; from there you can increase the complexity again. If you cannot see any convergence, you probably have issues in the loss function or backpropagation.


I got a similar issue with a regression problem using a CNN. It works in Keras, but not in PyTorch. I followed the other suggestions such as tuning the learning rate, batch size, normalization, etc.

It turns out PyTorch's default initialization scales like LeCun init (it uses Kaiming uniform with a=sqrt(5)). For ReLU it is better to use Kaiming He initialization explicitly, and that solved the problem for me.

self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)
nn.init.kaiming_normal_(self.conv1.weight)  # He (Kaiming) initialization for the ReLU conv layer

Also refer to: "Don't Trust PyTorch to Initialize Your Variables" (Aditya Rana Blog)
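One way to apply this model-wide (a sketch; it assumes ReLU activations throughout and a model variable holding your network):

import torch.nn as nn

def init_weights(m):
    # He init for every conv and linear layer; zero the biases
    if isinstance(m, (nn.Conv3d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_weights)  # recursively applies init_weights to every submodule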
