I have a question that I’d like to confirm.
My loss function is defined on the aggregated output of all batches. That is, I don’t have a loss function to evaluate the output of each batch. I aggregate the outputs of all batches to generate the final output, and I apply a cross-entropy loss to that final output.
Does backpropagation work properly with loss.backward() and optimizer.step()?
To answer your question: yes, backpropagation works properly in the scenario you described.
But I also agree with @tom, it could require a considerable amount of memory.
To reduce the amount of required memory, you could use the gradient accumulation technique.
The good news is that it’s quite simple to implement in PyTorch: call loss.backward() as many times as you want (in your case, once for each batch you want to accumulate), then call optimizer.step() when you want to update the model weights. If you’re going to repeat this process, don’t forget to call model.zero_grad() after the step.
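A minimal sketch of that accumulation loop, assuming a loss can be computed for each mini-batch. The toy model, the random data, and the name accum_steps are made up for illustration. Dividing each loss by accum_steps is an optional extra, not part of the description above, so that the accumulated gradient matches the mean over all batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and data, for illustration only.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accum_steps = 4  # assumed number of mini-batches per weight update
batches = [
    (torch.randn(8, 4), torch.randint(0, 2, (8,)))
    for _ in range(accum_steps)
]

weights_before = model.weight.detach().clone()

optimizer.zero_grad()
for inputs, targets in batches:
    loss = criterion(model(inputs), targets)
    # Each backward() call adds into the .grad buffers of the parameters.
    # Scaling keeps the total equal to the gradient of the mean loss.
    (loss / accum_steps).backward()

optimizer.step()       # single update from the accumulated gradients
optimizer.zero_grad()  # clear gradients before the next accumulation cycle

weights_changed = not torch.equal(weights_before, model.weight.detach())
```

Because each backward() call frees the graph of its own batch, only one batch's activations need to be held in memory at a time.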
Sorry, I think my question was not clear enough. Let’s say I have a mini-batch of size B * C * H * W, so my output has size B * Cp * Hp * Wp. Regular cross-entropy gives an output of size B * 1 (one error per image), and we then take, for example, the average of these errors as the final error for backpropagation.
In my case, I want to first average the OUTPUT batch (B * Cp * Hp * Wp) over all images in the current mini-batch (OUTPUT.mean(dim=0)), so that OUTPUT becomes Cp * Hp * Wp. Finally, I apply my cross-entropy loss to this averaged output.
The reason is that I don’t have supervision for each image in the mini-batch. In my case, Cp=2, Hp=1, Wp=1.
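For reference, the averaging described above could be sketched roughly as follows. The small network, the input sizes, and the single label bag_label are assumptions for illustration, not the actual setup. Since mean(dim=0) is an ordinary differentiable operation, autograd should propagate the gradient of the single loss back through every image in the mini-batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

B, C, H, W = 8, 3, 16, 16  # assumed input sizes
Cp = 2                     # Cp=2, Hp=1, Wp=1 as in the post

# Hypothetical network mapping B x C x H x W to B x Cp x 1 x 1.
model = nn.Sequential(
    nn.Conv2d(C, Cp, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(B, C, H, W)
bag_label = torch.tensor([1])  # one label for the whole mini-batch (assumed)

output = model(images)        # B x Cp x 1 x 1
pooled = output.mean(dim=0)   # Cp x 1 x 1, averaged over the images
logits = pooled.view(1, Cp)   # cross_entropy expects (N, classes); here N = 1
loss = F.cross_entropy(logits, bag_label)

optimizer.zero_grad()
loss.backward()               # gradients flow through the mean to all images
optimizer.step()
```

Note that this averages the raw outputs (logits) before the softmax inside cross_entropy; averaging per-image probabilities instead would be a different loss.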