This line seems to be about 30% slower than doing running_loss += loss.data with running_loss initialized as a float tensor on the GPU, and I believe this is because running_loss = 0 sits on the CPU while loss sits on the GPU. Is there any downside to setting:
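Presumably something along these lines; a minimal sketch with assumed names (running_loss, loss, device), contrasting a plain Python accumulator with a CUDA-tensor accumulator:

```python
import torch

device = torch.device("cuda")

# Option A: Python float accumulator. Each += has to pull the scalar back to the
# CPU (loss.item(), or loss.data[0] in older PyTorch), which synchronizes with
# the GPU every iteration.
running_loss = 0.0
# running_loss += loss.item()

# Option B: keep the accumulator on the GPU. The += is just another queued
# kernel; the single synchronizing read-back can wait until the end of the epoch.
running_loss = torch.zeros(1, device=device)
# running_loss += loss.detach()          # loss.data in older PyTorch
# epoch_loss = running_loss.item() / num_batches   # one sync per epoch
```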
There's no downside that I can see.
However, the runtime of that particular line doesn't matter compared to the overall compute time; it's probably < 0.1% of the time spent in forward/backward/step() etc.
If you are seeing it take a larger chunk, that's possibly because the CUDA API is asynchronous, but that operation in particular is a synchronizing point, so the time reported for it includes waiting for previously launched kernels to finish.
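To see this, here is a rough timing sketch (model, criterion, inputs, targets are assumed names, not from the thread): with torch.cuda.synchronize() placed around each region, the forward/backward cost stops being billed to whichever line happens to synchronize first.

```python
import time
import torch

def timed(label, fn):
    """Time a GPU-involving step fairly by draining queued work before and after."""
    torch.cuda.synchronize()              # finish anything already queued
    start = time.perf_counter()
    result = fn()
    torch.cuda.synchronize()              # wait for the work fn() launched
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return result

# model, criterion, inputs, targets are assumed to be defined elsewhere.
# Without the explicit synchronize calls, forward() and backward() return almost
# immediately (they only enqueue kernels), and the first synchronizing op -- e.g.
# reading the loss back to the CPU -- appears to take all of that time instead.
loss = timed("forward", lambda: criterion(model(inputs), targets))
timed("backward", loss.backward)
timed("accumulate", lambda: loss.detach().item())
```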
So I'm noticing that when training a ResNet-152 on batches of 20, the total speedup per batch is about 30%. Does this signal that I've set something up wrong elsewhere in my code? Is there a way to avoid the synchronization step?
So you're saying that the reported elapsed time is off, but that the actual time is the same? I see a noticeable difference just by observing the epochs pass; it feels about 30% longer. Or will avoiding the synchronization step actually speed up the training?
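As a hedged sketch of one way to avoid the per-iteration synchronization (the loop and variable names are assumptions): keep the accumulator as a CUDA tensor and read it back with .item() only once per epoch. The GPU does the same amount of work either way; what changes is that the Python loop no longer stalls every step waiting for a scalar to copy back.

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, device):
    # Hypothetical training loop; the point is the GPU-resident accumulator.
    running_loss = torch.zeros(1, device=device)    # lives on the GPU
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.detach()   # enqueued on the GPU, no host sync here
    # Single synchronization point: copy the accumulated scalar back once.
    return (running_loss / len(loader)).item()
```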