PyTorch ResNet Slower than TensorFlow?

Hi there, I’m comparing the training speed of ResNet on TF and PyTorch.

In TF, the model typically converges within 80k steps (i.e., 80k batches). With batch size 128, that works out to roughly 205 epochs in PyTorch.
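The steps-to-epochs conversion works out as follows, assuming a 50,000-image training set (as in CIFAR-10; that assumption is what makes ~205 epochs come out):

```python
# Convert a TF step (batch) count into epochs.
# Assumption: 50,000 training images, as in CIFAR-10.
steps = 80_000
batch_size = 128
train_set_size = 50_000

epochs = steps * batch_size / train_set_size
print(epochs)  # 204.8, i.e. roughly 205 epochs
```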

One interesting thing: in TF I can finish the 80k steps in about 6 hours, but in PyTorch, running 200 epochs took me around 13 hours, and that would grow to around 20 hours if I wanted to test 300 epochs.

I thought PyTorch was supposed to be much faster than TF. Does anyone know the solution to this? BTW, I’m running on an EC2 g2.2xlarge.

Here is the ResNet18 PyTorch implementation
Here is the ResNet20 TF implementation

Runs with different batch sizes are not comparable. A smaller batch size means more frequent updates, and each update takes time, so of course training will take longer.
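To put numbers on that: assuming a fixed 50,000-image training set (a CIFAR-10-style assumption, not taken from the posted code), the number of optimizer updates per epoch scales inversely with batch size:

```python
import math

# Optimizer updates per epoch on a 50,000-image training set (assumption):
# halving the batch size roughly doubles the update count, and each update costs time.
for bs in (64, 128):
    print(bs, math.ceil(50_000 / bs))
# 64 -> 782 updates per epoch
# 128 -> 391 updates per epoch
```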

I think both of them are using 128 as the batch size.
And when I ran the experiment, I also made sure they were the same.
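If both runs really use batch size 128, the gap is more likely in the input pipeline or convolution setup than in the math itself. Two standard PyTorch settings that often help on fixed-size inputs like CIFAR are `torch.backends.cudnn.benchmark = True` and a multi-worker `DataLoader` with pinned memory. Here is a minimal timing sketch, not a diagnosis of your particular run; the dataset and model are toy stand-ins, not the posted ResNet code:

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real CIFAR-10 loader and ResNet18 model (assumptions,
# only to keep the sketch self-contained and runnable).
data = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))
loader = DataLoader(data, batch_size=128, shuffle=True,
                    num_workers=0,  # in real training, set >0 to overlap loading with compute
                    pin_memory=torch.cuda.is_available())

# With fixed-size inputs, let cuDNN benchmark and cache the fastest conv algorithms.
torch.backends.cudnn.benchmark = True

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

start = time.perf_counter()
for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
if device == "cuda":
    torch.cuda.synchronize()  # GPU kernels are async; sync before reading the clock
elapsed = time.perf_counter() - start
print(f"one pass over the toy set: {elapsed:.2f}s")
```

Timing one full pass like this, once for the data loading alone and once with the forward/backward included, is a quick way to tell whether the bottleneck is the input pipeline or the GPU compute.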