[Line graph of loss]

OK, that explains the jumps indeed :). For the training itself, normalizing the loss by a constant (here `len(xA)` and `len(data_loader.dataset)`) shouldn't matter. Sure, it affects the absolute values of the loss, and it scales the gradients uniformly (which is equivalent to rescaling the learning rate), but it doesn't change their direction, so training behaves the same.
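You can check this with a quick toy example (just a sketch with made-up shapes, not your model): dividing the loss by a constant divides every gradient entry by the same constant, nothing else changes.

```python
import torch

# Toy linear model: scaling the loss by a constant scales the gradients
# by the same constant, i.e. it only rescales the effective learning rate.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(5, 3)
y = torch.randn(5)

loss = ((x @ w - y) ** 2).sum()
loss.backward()
g = w.grad.clone()

w.grad = None  # reset before the second backward pass
scaled_loss = ((x @ w - y) ** 2).sum() / 100.0  # e.g. dividing by a dataset size
scaled_loss.backward()
g_scaled = w.grad.clone()

print(torch.allclose(g / 100.0, g_scaled))  # True: same gradient, just rescaled
```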

I’m not aware of “Sesame Networks” – do you mean a Siamese network? I get the idea behind this architecture, but I’ve only dabbled with it once for some practice.
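Just so we’re talking about the same thing, a minimal Siamese setup in PyTorch could look roughly like this (only a sketch; the dimensions and the idea of feeding in averaged word vectors are my assumptions, not from your code). The key point is that both branches share the same weights:

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """One shared encoder applied to both inputs (the 'twin' branches)."""
    def __init__(self, in_dim=300, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, a, b):
        # The same weights encode both inputs
        return self.net(a), self.net(b)

encoder = SiameseEncoder()
a = torch.randn(4, 300)  # e.g. averaged word vectors of sentence A (batch of 4)
b = torch.randn(4, 300)  # e.g. averaged word vectors of sentence B
emb_a, emb_b = encoder(a, b)

# One similarity score per pair; typically trained with a contrastive loss
sim = nn.functional.cosine_similarity(emb_a, emb_b)
print(sim.shape)  # torch.Size([4])
```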

When you say similarity between two sentences, I assume you refer to semantic similarity – well, otherwise you wouldn’t need a Deep Learning model, I guess :). I would argue that this is a very challenging task (well, like most NLP tasks really) since language is extremely flexible and can be very subtle. Even a small change in a sentence can completely alter its meaning – e.g., inserting a single “not” flips the sentiment.

That’s why I’ve never tried data augmentation with text. With images it seems straightforward: rotating, slightly shifting, cropping, etc. will still show the same object for classification. I don’t see how to directly map this to text. But that’s just my thought, and I’m anything but an expert!