Binary Cross Entropy in PyTorch vs Keras

Hello,

I am trying to recreate a Keras model in PyTorch. Both use MobileNetV2, and it is a multi-class, multi-label problem, so I am optimizing the model with binary cross entropy. In Keras this is implemented with model.compile(..., loss='binary_crossentropy', ...), and in PyTorch I have implemented the same thing with torch.nn.BCEWithLogitsLoss(), sending it logits instead of sigmoid-activated outputs. Although both models converge very similarly in terms of loss (both losses go down to 0.05 after 10 epochs), the PyTorch model is not giving good predictions; its predictions are not as confident as the Keras model’s. Investigating this, I realized that the Keras model produces a very strong logit at the index of a positive label, whereas the PyTorch model’s logit at that index is very small, so the sigmoid output is not as strong. For example, for a particular sample that can be classified into any of 54 classes, the output is:

output = 
tensor([[-1.2380, -2.3283, -2.3025, -2.1275, -2.1020, -2.3684, -3.4669, -3.4503,
         -2.1905, -1.8565, -3.4215, -3.5318, -3.5715, -4.3836, -4.5215, -6.2270,
         -3.8660, -3.7280, -4.6043, -4.7601, -9.5219, -9.4969, -9.4392, -8.0596,
         -6.0773, -5.7972, -4.2495, -4.4533, -4.2641, -4.1068, -4.9987, -4.9321,
         -7.9726, -7.4475, -4.8016, -5.6634, -6.3762, -6.0103, -6.7561, -3.3259,
         -3.8778, -6.7682, -6.5663, -4.0945, -3.0747, -5.5408, -5.6429, -5.9659,
         -5.8574, -7.6435, -7.8895, -6.6514, -6.5506, -5.0583]],
       device='cuda:0')

for which the target is:

target = 
tensor([[0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.]])

You can see why the loss is still very low: 48 of the 54 targets are 0 and the model’s logits are strongly negative almost everywhere, so with reduction='mean' in PyTorch the many easy negatives dominate the average, producing a very small loss even though the predictions at the positive labels are not great.
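
To see this concretely, here is a minimal sketch using reduction='none' (the logit values are illustrative, not the exact tensor above, but the positive indices match the target above):

import torch

# 54 classes: confident negatives everywhere, weak logits at the positives
logits = torch.full((1, 54), -5.0)
pos_idx = [2, 7, 17, 39, 46, 48]        # positive labels in the target above
logits[0, pos_idx] = -2.0               # still negative at the positive labels

target = torch.zeros(1, 54)
target[0, pos_idx] = 1.0

per_element = torch.nn.BCEWithLogitsLoss(reduction='none')(logits, target)
print(per_element.mean().item())               # ~0.24: looks healthy
print(per_element[0, pos_idx].mean().item())   # ~2.13: positives are badly wrong
print(per_element[target == 0].mean().item())  # ~0.007: easy negatives hide it

The 48 near-zero terms drag the mean down, so the scalar loss looks fine while every positive label is still predicted below 0.5.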

These are the things that I have tried, with no success at improving the PyTorch model:

  1. Changing the reduction to reduction='sum'.
  2. It was also suggested that I use pos_weight with torch.nn.BCEWithLogitsLoss(), but the same model in Keras does not use any pos_weight and still produces good predictions.
  3. I have tested the outcome of 'binary_crossentropy' and torch.nn.BCEWithLogitsLoss() for the same inputs and targets, and they produce the same loss value, so I don’t think there is an implementation difference between the two (see the sketch after this list).
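
For point 3, the comparison I ran looked roughly like this (a sketch assuming tf.keras; note that Keras’s BinaryCrossentropy needs from_logits=True when fed raw logits, which matches what BCEWithLogitsLoss expects):

import numpy as np
import torch
import tensorflow as tf

# Same random logits and targets through both implementations
logits = np.random.randn(1, 54).astype(np.float32)
target = np.zeros((1, 54), dtype=np.float32)
target[0, [2, 7, 17, 39, 46, 48]] = 1.0

pt_loss = torch.nn.BCEWithLogitsLoss()(torch.from_numpy(logits),
                                       torch.from_numpy(target))
tf_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)(target, logits)

print(pt_loss.item(), tf_loss.numpy())  # they agree to float precision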

Does anyone have any other suggestions, or things to consider, when moving a model from Keras to PyTorch?

I’m a bit confused about the equal losses in PyTorch and Keras while the PyTorch predictions are apparently “less confident”. Could you store a good Keras output and calculate the loss for it in PyTorch (and vice versa)?
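
Something like this, as a rough sketch (the .npy file names are placeholders; if your Keras model ends in a sigmoid, its stored outputs are probabilities rather than logits, and you would use torch.nn.BCELoss instead):

import numpy as np
import torch

# Score a stored batch of Keras outputs with the PyTorch loss
keras_out = np.load('keras_logits.npy')  # dumped from the Keras side
targets = np.load('targets.npy')

loss = torch.nn.BCEWithLogitsLoss()(
    torch.from_numpy(keras_out).float(),
    torch.from_numpy(targets).float(),
)
print(loss.item())  # compare with the Keras loss for the same batch

If the Keras outputs get a clearly lower PyTorch loss than your own model’s outputs, the two training losses were not actually measuring the same thing.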