Weight loss/learning rate across sequences in same batch

Say I want to run a forward pass through the network with a batch size of 256, and before backprop I want to weight the losses across the input/target sequences in the batch. Is that possible, and what would be best practice?

It’s not a solution for me to do preprocessing and collect input/target sequences that will share the same loss weight/learning rate and run them together in the same batch, because which sequence gets backpropagated with a higher learning rate is determined by comparison with the other sequences in the same batch, after the forward pass.

Is this a catch-22 that isn’t possible because the gradients don’t care about the individual losses inside the same batch? There’s no way to separate the gradients of different sequences in the same batch from each other, right? Do I have to run the batch once under no_grad, filter out/exchange sequences from the batch based on the result, and then run it again with grad enabled to simulate and approximate this?
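For reference, the two-pass workaround described above might look roughly like this (a sketch only: the model, the sizes, and the "keep the hardest half" selection rule are all made-up placeholders):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 5)            # stand-in model (assumption)
x = torch.randn(256, 8)                  # batch of 256 inputs
y = torch.randint(0, 5, (256,))          # matching targets

# Pass 1: measure per-sequence losses without building a graph.
with torch.no_grad():
    per_seq = F.cross_entropy(model(x), y, reduction='none')

# Filter the batch based on the result, e.g. keep the 128 highest-loss
# sequences (the selection criterion here is purely illustrative).
keep = per_seq.topk(128).indices

# Pass 2: forward and backward only on the selected sequences.
loss = F.cross_entropy(model(x[keep]), y[keep])
loss.backward()
```

This does require two forward passes over (part of) the batch, which is exactly the overhead the `reduction='none'` approach avoids.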

You could use an unreduced loss via reduction='none' when creating the criterion, multiply your weight tensor with the loss tensor, and then reduce it before calling backward.

Thanks, I think I understand the logic but fail to understand what to do with the loss tensor before calling backward

In my forward function I have this now:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), reduction='none')

…here I perform my matrix mul on the loss tensor …

After that, should I just do loss = loss.mean() before calling loss.backward()? I don’t understand how that adds a weighted bias to the gradients, but if it works I will go for it, as simple as it sounds…

Or should I call loss[n].backward() for each sequence in the batch? Would that make any difference?
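A quick sanity check (with made-up sizes and weights) confirms that multiplying the unreduced loss by per-sequence weights before .mean() scales each sequence's gradient by its own weight, while leaving the others untouched:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 5)
targets = torch.randint(0, 5, (3,))

# Baseline: plain mean of the unreduced loss.
a = logits.clone().requires_grad_(True)
F.cross_entropy(a, targets, reduction='none').mean().backward()

# Same batch, but sequence 0 weighted 3x before reducing.
b = logits.clone().requires_grad_(True)
w = torch.tensor([3.0, 1.0, 1.0])
(F.cross_entropy(b, targets, reduction='none') * w).mean().backward()

# Row 0's gradient is exactly 3x the baseline; rows 1-2 are unchanged.
assert torch.allclose(b.grad[0], 3 * a.grad[0])
assert torch.allclose(b.grad[1:], a.grad[1:])
```

Calling loss[n].backward() per sequence (with retain_graph for all but the last call) accumulates the same gradients, just less efficiently, since each logits row only feeds its own loss entry.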

EDIT: Oh, never mind. I figured that was an unnecessary question to ask when the answer was one test and therefore ~2 minutes away :slight_smile: