Questions after following transfer learning tutorial

I followed the tutorial at https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#further-learning. Everything ran ok, but I have some questions I hope the community can help me with. If this belongs in a different topic, apologies!

  1. The tutorial states that on a GPU it takes less than a minute to train (15-25 minutes on a CPU). I ran it on an NVIDIA GTX 1060 6GB (ASUS Strix). Not the fastest card of course, but I still expected it to finish in 2-3 minutes. Instead it took around 8.5-11.5 minutes, depending on the batch size I set.
    I suspected this might be due to my old HDD, which was threatening to fail at any minute, so I replaced it with a much faster SSD. That cut the training time by 40-50%, to around 5 minutes on average. Is that a training time that would be expected with my GPU?

  2. With either the old HDD or the new SSD, I observed that the GPU was very under-utilised. Is this to be expected with the example in the tutorial? If so, how might one go about increasing the utilisation? If someone could give or link to some code, that would be great.

  3. I tried out different batch sizes, both to observe the running time and to see how it affected convergence (the loader setup I was changing is sketched below). I had thought that the bigger the batch size, the faster a model would converge and the faster each epoch would run. Is that usually true? In this case I found that increasing the batch size in steps, all the way up to the full dataset (it is fairly small, so it fit on my GPU), resulted in significantly worse convergence and actually took longer. The batch size used in the tutorial is 4. I tried 2 and found it performed worse again. So it seems there is something special about a batch size of 4, though there should not be, AFAIK. Can anyone give some insight into why this would be?
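
For reference, this is roughly the loader setup from the tutorial that I was experimenting with (transforms simplified here; the actual script also applies augmentation and per-channel normalisation):

```python
import torch
from torchvision import datasets, transforms

# Simplified transforms; the tutorial additionally uses RandomResizedCrop /
# RandomHorizontalFlip for training and normalisation for both splits.
data_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# The hymenoptera (ants/bees) dataset used by the tutorial.
image_datasets = {
    x: datasets.ImageFolder(f'data/hymenoptera_data/{x}', data_transforms)
    for x in ['train', 'val']
}

# batch_size=4 is the tutorial default; this is the value I was varying.
dataloaders = {
    x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                   shuffle=True, num_workers=4)
    for x in ['train', 'val']
}
```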

  1. Depending on where the bottleneck in your current system is, ~5 minutes might be expected. I just reran the script on a V100 server (which is of course quite a bit more powerful) and it finished in 34s.

  2. This post gives a good overview of potential bottlenecks and their workarounds; I've also added a minimal data loading sketch after this list.

  3. This effect is explained e.g. in Revisiting Small Batch Training for Deep Neural Networks, which claims that the best performance is consistently obtained for batch sizes between 2 and 32.
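
To give a concrete starting point for point 2, here is a minimal sketch of the usual data loading tweaks, assuming the `image_datasets` dict and training loop from the tutorial (the worker count of 8 is just an example and should be tuned to your CPU):

```python
import torch
from torch.utils.data import DataLoader

batch_size = 4  # tutorial default; larger batches raise GPU utilisation,
                # but see the paper above regarding convergence

# More workers keep the GPU fed; pin_memory enables faster, asynchronous
# host-to-device copies.
dataloaders = {
    x: DataLoader(image_datasets[x], batch_size=batch_size,
                  shuffle=(x == 'train'), num_workers=8, pin_memory=True)
    for x in ['train', 'val']
}

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

for inputs, labels in dataloaders['train']:
    # non_blocking=True only has an effect together with pin_memory=True
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward / backward / optimizer step as in the tutorial
```

If the GPU is still mostly idle after this, the bottleneck is likely the data pipeline (disk reads or JPEG decoding) rather than the model itself, which would also explain why the SSD swap helped so much.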