Tutorial line potentially slow?

In the classifier tutorial listed here: http://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

There is a line in the loop that has this:

running_loss += loss.data[0]

This line seems to be about 30% slower than doing running_loss += loss.data when running_loss is initialized as a float tensor, and I believe this is because running_loss = 0 sits on the CPU while loss sits on the GPU. Is there any downside to setting:

running_loss = torch.cuda.FloatTensor(np.zeros(1))

And having the line changed as shown above?
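For reference, here is a minimal sketch of the GPU-resident accumulator idea from the question, written with current PyTorch idioms (torch.zeros(..., device=...) rather than torch.cuda.FloatTensor, and loss.item() in place of the old loss.data[0]); the loop body is a stand-in for a real forward pass:

```python
import torch

# Keep the running loss on the same device as the loss itself, so the
# in-place += does not force a device-to-host copy every iteration.
device = "cuda" if torch.cuda.is_available() else "cpu"

running_loss = torch.zeros(1, device=device)  # accumulator stays on the device
for step in range(5):
    loss = torch.tensor(0.5, device=device)   # stand-in for criterion(output, target)
    running_loss += loss.detach()             # no per-step synchronization

# One synchronizing transfer at the end, when the number is actually needed
total = float(running_loss)
print(total)
```

The one-transfer-at-the-end pattern is the whole point: the cost the question is asking about is paid once per epoch instead of once per batch.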

there’s no downside that i can see.
However, the runtime of that particular line shouldn’t matter compared to the overall compute time. it’s probably < 0.1% of the time spent in forward / backward / step() etc.

If you are seeing it take a larger chunk, that’s likely because the CUDA API is asynchronous, and that operation in particular is a synchronizing point.
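To illustrate the point above: CUDA kernel launches return immediately, but reading a scalar back to the CPU (loss.data[0] in the old API, loss.item() today) blocks until all queued GPU work finishes, so naive timing charges that queued work to the readback line. A hedged sketch, which only exercises the GPU path when one is available:

```python
import time

import torch

# On a GPU, the matmul below is merely *enqueued*; the readback is what
# actually waits for it. On CPU every op is synchronous, so we just note that.
if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    t0 = time.time()
    y = x @ x                  # returns almost instantly (asynchronous launch)
    launch = time.time() - t0
    t0 = time.time()
    val = y[0, 0].item()       # synchronizing point: blocks until the matmul is done
    readback = time.time() - t0
    print(f"launch {launch:.4f}s, readback {readback:.4f}s")
else:
    print("no GPU available; on CPU every op is synchronous")
```

On a GPU machine the readback time dwarfs the launch time, which is exactly why the accumulation line can look expensive in a profile.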

So I’m noticing that when using a 152 resnet training on batches of 20 that the total speed up per batch is about 30%. Does this signal that I’ve set something up wrong elsewhere in my code? Is there a way to avoid the synchronization step?

i think the 30% speedup you’re noticing might just be the timing being off.
How exactly are you profiling your code to measure that speedup?

I surround the epoch with time.time() calls and compare using loss.data[0] vs. keeping the running loss in GPU tensor form.

before each time.time() call, add torch.cuda.synchronize(). Otherwise your CUDA timings will be off.
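The suggested pattern can be wrapped in a small helper, sketched here (the timed helper is my own name, not a PyTorch API); synchronizing before reading the clock ensures queued kernels are not misattributed to later lines:

```python
import time

import torch

def timed(fn):
    """Time fn(), synchronizing with the GPU (if any) on both sides."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # drain work queued before we start the clock
    t0 = time.time()
    out = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for fn's own queued kernels to finish
    return out, time.time() - t0

# Works on CPU too, where the synchronize calls are simply skipped
out, secs = timed(lambda: torch.randn(256, 256) @ torch.randn(256, 256))
print(f"{secs:.6f}s")
```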

So you’re saying that the reported elapsed time is off, but the actual time is the same? I see a noticeable difference just by watching the epochs pass; it feels about 30% longer. Or will the synchronize step actually speed up the training?

hmm okay, if the wall-clock epoch time really is 30% longer, then maybe it is a genuine 30% overhead. But that’s super weird. i’ll take a look when i get time.