Hi there, I’m testing the speed-up of ResNet on TF and PyTorch.
In TF, training typically converges within 80k steps (i.e. 80k batches). With batch-size=128, that works out to roughly 205 epochs in PyTorch.
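For anyone who wants to check the step-to-epoch conversion, here is a minimal sketch. It assumes a 50,000-image training set (the CIFAR-10 size). I did not state the dataset size above, so treat `train_size` as an assumption; the helper name `steps_to_epochs` is made up for illustration.

```python
# Convert a step (batch) budget into the equivalent number of epochs.
# ASSUMPTION: 50,000 training images (CIFAR-10 size); not stated in the post.
def steps_to_epochs(steps, batch_size, train_size):
    """Each step consumes one batch, so samples seen = steps * batch_size."""
    return steps * batch_size / train_size


epochs = steps_to_epochs(steps=80_000, batch_size=128, train_size=50_000)
print(f"{epochs:.1f} epochs")  # 204.8 epochs
```

With a different training-set size the epoch count changes proportionally, so the ~205 figure only holds under that assumption.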
One interesting thing: in TF I can finish 80k steps in about 6 hours, but in PyTorch, running 200 epochs took me around 13 hours. That would grow to around 20 hours if I want to test 300 epochs.
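To put the two timings on the same footing, here is a rough per-step comparison. It is only a sketch under assumptions: a 50,000-image training set (CIFAR-10 size) and a data loader that keeps the last partial batch. Neither is stated above, and the helper name `seconds_per_step` is made up for illustration.

```python
import math

# ASSUMPTIONS (not stated in the post): 50,000 training images,
# and the last partial batch of each epoch is kept rather than dropped.
TRAIN_SIZE = 50_000
BATCH_SIZE = 128


def seconds_per_step(hours, steps):
    """Average wall-clock seconds spent on one training step."""
    return hours * 3600 / steps


# TF: 80k steps in about 6 hours.
tf_sps = seconds_per_step(6, 80_000)

# PyTorch: 200 epochs in about 13 hours.
steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
pt_sps = seconds_per_step(13, 200 * steps_per_epoch)

# Linear extrapolation of the PyTorch run to 300 epochs.
hours_300 = 13 * 300 / 200

print(f"TF:      {tf_sps:.2f} s/step")
print(f"PyTorch: {pt_sps:.2f} s/step ({pt_sps / tf_sps:.1f}x slower)")
print(f"300 epochs in PyTorch: about {hours_300} hours")
```

Under those assumptions the PyTorch run is a bit more than twice as slow per step, which is the gap I'm asking about.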
I thought PyTorch should be much faster than TF. Does anyone know a solution to this? BTW, I’m using an EC2 g2.2xlarge instance.
Here is the ResNet18 PyTorch implementation
Here is the ResNet20 TF implementation