Hello,
I am trying to recreate a Keras model in PyTorch. Both use MobileNetV2, and the problem is multi-class, multi-label, so I am optimizing with binary cross entropy. In Keras this is implemented with model.compile(..., loss='binary_crossentropy', ...), and in PyTorch I have implemented the same thing with torch.nn.BCEWithLogitsLoss(), feeding it raw logits instead of sigmoid-activated outputs. Although both models converge very similarly in terms of loss (both losses go down to about 0.05 after 10 epochs), the PyTorch model does not give good predictions: its predictions are not as confident as the Keras model's. Investigating this, I realized that the Keras model produces a very strong logit at the index of a positive label, whereas the PyTorch model's logit at that index is very small, so the sigmoid output is not as strong. For example, for a particular sample that can be classified into 54 classes, the output is:
output =
tensor([[-1.2380, -2.3283, -2.3025, -2.1275, -2.1020, -2.3684, -3.4669, -3.4503,
-2.1905, -1.8565, -3.4215, -3.5318, -3.5715, -4.3836, -4.5215, -6.2270,
-3.8660, -3.7280, -4.6043, -4.7601, -9.5219, -9.4969, -9.4392, -8.0596,
-6.0773, -5.7972, -4.2495, -4.4533, -4.2641, -4.1068, -4.9987, -4.9321,
-7.9726, -7.4475, -4.8016, -5.6634, -6.3762, -6.0103, -6.7561, -3.3259,
-3.8778, -6.7682, -6.5663, -4.0945, -3.0747, -5.5408, -5.6429, -5.9659,
-5.8574, -7.6435, -7.8895, -6.6514, -6.5506, -5.0583]],
device='cuda:0')
for which the target is:
target =
tensor([[0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.]])
You can see why the loss is still very low: most of the model's outputs are strongly negative (which is correct for the many negative labels), and with reduction='mean' PyTorch averages over all 54 classes, so the loss comes out very small even though the predictions for the positive labels are not great.
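To make that dilution concrete, here is a small self-contained sketch. The logit values are made up for illustration (48 confident negatives around -6 and 6 weak positives around -2, roughly like the output above), not the actual model outputs:

```python
import math

def bce_with_logits(x, y):
    # Numerically stable binary cross entropy on a raw logit x,
    # equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# Illustrative numbers: 48 confident negatives and 6 weak positives.
logits = [-6.0] * 48 + [-2.0] * 6
targets = [0.0] * 48 + [1.0] * 6

per_positive = bce_with_logits(-2.0, 1.0)
mean_loss = sum(bce_with_logits(x, y) for x, y in zip(logits, targets)) / len(logits)

print(f"sigmoid(-2)  = {1.0 / (1.0 + math.exp(2.0)):.3f}")  # ~0.12: a weak positive prediction
print(f"per-positive = {per_positive:.3f}")                  # ~2.13: each positive is still costly
print(f"mean over 54 = {mean_loss:.3f}")                     # ~0.24: diluted by the easy negatives
```

So the mean loss is almost an order of magnitude smaller than the loss on any single positive label, purely because the easy negatives dominate the average.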
These are the things I have tried, with no success at improving the PyTorch model:
- changing the reduction to reduction='sum'
- using pos_weight with torch.nn.BCEWithLogitsLoss(), as was suggested to me; however, the same model in Keras does not use any pos_weight and still generates good predictions
- comparing the outcome of 'binary_crossentropy' and torch.nn.BCEWithLogitsLoss() for the same inputs and targets; they produce the same loss value, so I don't think there is an implementation difference between the two
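On that last point, the equivalence makes sense even without either framework: Keras's binary_crossentropy applied to sigmoid probabilities and BCEWithLogitsLoss applied to raw logits are the same formula. A plain-Python sketch of the check (the values are illustrative and the function names are mine, not the libraries'):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def keras_style_bce(p, y, eps=1e-7):
    # Binary cross entropy on a probability p, the way Keras applies
    # it after the sigmoid activation (with clipping for stability).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def pytorch_style_bce_with_logits(x, y):
    # The numerically stable formulation used on raw logits.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

logits = [-2.3, 0.7, -4.1, 1.5]
targets = [1.0, 0.0, 0.0, 1.0]

keras_vals = [keras_style_bce(sigmoid(x), y) for x, y in zip(logits, targets)]
torch_vals = [pytorch_style_bce_with_logits(x, y) for x, y in zip(logits, targets)]

for a, b in zip(keras_vals, torch_vals):
    print(f"{a:.6f}  {b:.6f}")  # the two columns match
```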
Does anyone have any other suggestions or things to consider when moving a model from Keras to PyTorch?