This line seems to be about 30% slower than doing running_loss += loss.data with running_loss initialized as a float tensor on the GPU, and I believe this is because running_loss = 0 sits on the CPU while loss sits on the GPU. Is there any downside to setting:
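Presumably something along these lines; a minimal sketch with assumed names (running_loss, loss, device), contrasting a plain Python accumulator with a CUDA-tensor accumulator:

```python
import torch

device = torch.device("cuda")

# Option A: Python float accumulator. Each += has to pull the scalar back to the
# CPU (loss.item(), or loss.data[0] in older PyTorch), which synchronizes with
# the GPU every iteration.
running_loss = 0.0
# running_loss += loss.item()

# Option B: keep the accumulator on the GPU. The += is just another queued
# kernel; the single synchronizing read-back can wait until the end of the epoch.
running_loss = torch.zeros(1, device=device)
# running_loss += loss.detach()          # loss.data in older PyTorch
# epoch_loss = running_loss.item() / num_batches   # one sync per epoch
```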
There's no downside that I can see.
However, the runtime of that particular line doesn't matter compared to the overall compute time; it's probably < 0.1% of the time spent in forward/backward/step() etc.
If you are seeing it take a larger chunk, that's possibly because the CUDA API is asynchronous, but that operation in particular is a synchronizing point, so the time reported for it includes waiting for previously launched kernels to finish.
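To see this, here is a rough timing sketch (model, criterion, inputs, targets are assumed names, not from the thread): with torch.cuda.synchronize() placed around each region, the forward/backward cost stops being billed to whichever line happens to synchronize first.

```python
import time
import torch

def timed(label, fn):
    """Time a GPU-involving step fairly by draining queued work before and after."""
    torch.cuda.synchronize()              # finish anything already queued
    start = time.perf_counter()
    result = fn()
    torch.cuda.synchronize()              # wait for the work fn() launched
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return result

# model, criterion, inputs, targets are assumed to be defined elsewhere.
# Without the explicit synchronize calls, forward() and backward() return almost
# immediately (they only enqueue kernels), and the first synchronizing op -- e.g.
# reading the loss back to the CPU -- appears to take all of that time instead.
loss = timed("forward", lambda: criterion(model(inputs), targets))
timed("backward", loss.backward)
timed("accumulate", lambda: loss.detach().item())
```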
So I'm noticing that when training a ResNet-152 on batches of 20, the total speedup per batch is about 30%. Does this signal that I've set something up wrong elsewhere in my code? Is there a way to avoid the synchronization step?
So you're saying that the reported elapsed time is off, but that the actual time is the same? I see a noticeable difference just by observing the epochs pass; it feels about 30% longer. Or will avoiding the synchronization step actually speed up the training?
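As a hedged sketch of one way to avoid the per-iteration synchronization (the loop and variable names are assumptions): keep the accumulator as a CUDA tensor and read it back with .item() only once per epoch. The GPU does the same amount of work either way; what changes is that the Python loop no longer stalls every step waiting for a scalar to copy back.

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, device):
    # Hypothetical training loop; the point is the GPU-resident accumulator.
    running_loss = torch.zeros(1, device=device)    # lives on the GPU
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.detach()   # enqueued on the GPU, no host sync here
    # Single synchronization point: copy the accumulated scalar back once.
    return (running_loss / len(loader)).item()
```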