OK, that explains the jumps indeed :). For the training, any normalization of the loss by a constant (here len(x) and len(data_loader.dataset)) shouldn’t matter. Sure, it changes the absolute values of the loss, but it only scales the gradients by that same constant factor, which the learning rate can absorb.
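To make that concrete, here’s a tiny sketch (the tensors `w`, `x` and the constant `N` are made up for illustration) showing that dividing the loss by a constant just rescales the gradients by that factor:

```python
import torch

# Made-up toy example: dividing the loss by a constant N scales the
# gradients by exactly 1/N -- a fixed learning rate can absorb that.
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

loss = ((w * x).sum() ** 2)
loss.backward()
grad_plain = w.grad.clone()

w.grad.zero_()
N = 100  # stand-in for e.g. len(data_loader.dataset)
loss_scaled = ((w * x).sum() ** 2) / N  # same computation, scaled by a constant
loss_scaled.backward()

print(torch.allclose(grad_plain / N, w.grad))  # True: gradients only shrink by N
```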
I’m not aware of Sesame Networks; do you mean a Siamese Network? I get the idea behind this architecture, but I’ve only dabbled with it once for a bit of practice.
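For what it’s worth, this is roughly how I picture the Siamese idea in PyTorch. It’s only a rough, untested sketch with made-up layer sizes (not a reference implementation): one shared encoder processes both sentences and you compare the resulting embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Sketch of a Siamese setup: one shared encoder, two inputs,
    similarity computed on the embeddings (details are made up)."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        emb = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.encoder(emb)     # hidden: (1, batch, hidden_dim)
        return hidden.squeeze(0)          # (batch, hidden_dim)

    def forward(self, sent_a, sent_b):
        # The *same* weights encode both sentences -- that's the Siamese part.
        vec_a = self.encode(sent_a)
        vec_b = self.encode(sent_b)
        return F.cosine_similarity(vec_a, vec_b)  # similarity in [-1, 1]

# Toy usage with random token ids
model = SiameseEncoder()
a = torch.randint(0, 10000, (4, 12))  # 4 sentence pairs, 12 tokens each
b = torch.randint(0, 10000, (4, 12))
print(model(a, b).shape)  # torch.Size([4])
```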
When you say similarity between two sentences, I assume you refer to semantic similarity – well, otherwise you wouldn’t need a Deep Learning model, I guess :). I would argue that this is a very challenging task (well, like most NLP tasks really) since language is extremely flexible and can be very subtle. Even a small change in a sentence can completely alter its meaning.
That’s why I never tried data augmentation with text. With images it seems straightforward: rotating, slight shifting, cropping, etc. will still show the same object for classification. I don’t see how to map this directly to text. But that’s just my thought, and I’m anything but an expert!!!
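Just to show what I mean on the image side, something like this torchvision pipeline (parameters picked arbitrarily) keeps the label intact for free:

```python
from torchvision import transforms

# Typical label-preserving image augmentations: the object class doesn't change.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # slight rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),        # random crop
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # slight shifting
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# For text I don't see an equally cheap, label-preserving equivalent --
# even swapping a single word can flip the meaning of a sentence.
```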