Convergence speed difference compared to Torch7

I implemented the same autoencoder with a BCE criterion in both PyTorch and Torch7.

When I trained them using SGD with momentum, the convergence speeds were almost the same for roughly the first 2000 iterations. After that, the PyTorch version converges noticeably more slowly than the Torch7 one.
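For context, the PyTorch side looks essentially like this minimal sketch (the model, data, and hyperparameters here are placeholders, not my actual code):

import torch
import torch.nn as nn
import torch.optim as optim

# placeholder model; my real model is a convolutional autoencoder
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(),
                      nn.Linear(64, 784), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for step in range(2000):
    x = torch.rand(32, 784)        # stand-in for a real batch in [0, 1]
    optimizer.zero_grad()
    loss = criterion(model(x), x)  # reconstruct the input under BCE
    loss.backward()
    optimizer.step()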

Has anyone compared the convergence performance of the two?

We’ve compared convergence on a variety of tasks: word language models, convnets for classification, GANs, etc.
Must be something subtle.

I also tried RMSprop with batch normalization, and the same behavior was observed.

Below is the modified RMSprop update used in Torch7:

-- running average of squared gradients: square_avg = alpha*square_avg + (1-alpha)*grad*grad
square_avg:mul(alpha)
square_avg:addcmul(1.0-alpha, grad, grad)
-- denominator: elementwise sqrt of (square_avg + eps), written into avg
avg:sqrt(square_avg+eps)
-- weight decay, then params = params - lr * grad / avg
params:mul(1-lr*weight_decay):addcdiv(-lr, grad, avg)

And below is the equivalent RMSprop in PyTorch:

# running average of squared gradients
square_avg.mul_(alpha).addcmul_(1 - alpha, grad, grad)
# denominator: sqrt of (square_avg + eps)
avg = square_avg.add(group['eps']).sqrt()
# weight decay, then the RMSprop step
p.data.mul_(1-group['lr']*weight_decay).addcdiv_(-group['lr'], grad, avg)

Are they equivalent?
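For concreteness, a one-step check along these lines (with made-up values) should show whether the two updates do the same arithmetic; I'd still like to know if the in-place calls differ in some subtle way. Note that recent PyTorch versions want the scalar passed as value= in addcmul_/addcdiv_:

import torch

alpha, lr, weight_decay, eps = 0.99, 1e-3, 1e-4, 1e-8
torch.manual_seed(0)
p = torch.randn(5)
grad = torch.randn(5)
square_avg = torch.rand(5)

# reference update written with out-of-place ops, mirroring the Lua formula
sa_ref = square_avg * alpha + (1 - alpha) * grad * grad
avg_ref = (sa_ref + eps).sqrt()
p_ref = p * (1 - lr * weight_decay) - lr * grad / avg_ref

# in-place update as in the PyTorch snippet above
sa = square_avg.clone()
p_new = p.clone()
sa.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)
avg = sa.add(eps).sqrt()
p_new.mul_(1 - lr * weight_decay).addcdiv_(grad, avg, value=-lr)

print(torch.allclose(p_ref, p_new))  # expect True if the updates match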

Also, my PyTorch code didn’t work at all before version 0.19 because of some runtime errors. Could that be related to this convergence difference?


I can’t say for sure that they’re equivalent, because I can’t see where avg comes from in the Lua version, but they look the same to me. As @smth said, we’ve tested convergence on many tasks, and we can’t help much with the limited information we have. I’d suggest looking for NaNs or infs in the gradients. You might also want to compare the gradients you’re getting from the two versions.
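Something along these lines is enough to flag bad gradients (a rough sketch against a recent PyTorch API; report_bad_grads is just a made-up helper name, and on older versions you'd go through .data and the (g != g) trick instead of torch.isnan):

import torch

def report_bad_grads(model):
    # print any parameter whose gradient contains NaN or inf
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad
        if torch.isnan(g).any() or torch.isinf(g).any():
            print('NaN/inf gradient in', name)

# call right after loss.backward(), before optimizer.step()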

Thanks for your reply.

Is it possible for the convergence to just become slower if some gradients become NaNs or infs?

My model is just an autoencoder consisting of convolution, max pooling, max unpooling, ReLU, and sigmoid layers.
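Concretely, the model is shaped roughly like this (ConvAE is a made-up name, and the channel counts and kernel sizes are placeholders, not my exact configuration):

import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, return_indices=True)  # keep indices for unpooling
        self.unpool = nn.MaxUnpool2d(2)
        self.deconv = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        h, idx = self.pool(h)
        h = self.unpool(h, idx)
        return torch.sigmoid(self.deconv(h))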

Anyway, to compare the gradients, I found that they can be saved in a flattened form with

torch.cat([g.grad.view(-1) for g in model.parameters()], 0)

No, but these values can destabilize or break the weights of the model. It would be better to use g.grad.data to get a flattened tensor of all gradients instead of a Variable.
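For example, something like the following (a rough sketch; flat_grads is just a made-up helper name) lets you dump one flat gradient tensor per step and diff it against the Torch7 run:

import torch

def flat_grads(model):
    # concatenate every parameter's gradient into a single 1-D tensor
    return torch.cat([p.grad.data.view(-1) for p in model.parameters()], 0)

# after loss.backward():
# g = flat_grads(model)
# torch.save(g, 'pytorch_grads_step0.pt')
# print(g.abs().max())   # quick look at the gradient magnitude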