I implemented the same autoencoder with a BCE criterion in both PyTorch and Torch7.
When I trained both with SGD with momentum, the convergence speeds were almost the same for roughly the first 2000 iterations. After that, though, PyTorch converges noticeably more slowly than Torch7.
Has anyone compared the convergence behavior of the two frameworks?
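For concreteness, here is a minimal sketch of the kind of setup I mean on the PyTorch side (the architecture, hyperparameters, and dummy data are placeholders, not my exact code):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Illustrative autoencoder; sizes and hyperparameters are assumptions.
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),  # outputs in (0, 1) for BCE
)
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy data standing in for the real loader.
loader = [torch.rand(64, 784) for _ in range(100)]

for x in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), x)  # reconstruct the input
    loss.backward()
    optimizer.step()
```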
I can’t say for sure that they’re equivalent, because I can’t see where `avg` comes from in the Lua version, but they look the same to me. As @smth said, we’ve tested convergence on many tasks, and we can’t help much with the limited information we have here. I’d suggest looking for NaNs or infs in the gradients. You might also want to compare the gradients you’re getting from the two versions.
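Something along these lines should catch non-finite gradients after a backward pass (a rough sketch, where `model` stands in for your autoencoder):

```python
import torch

def check_grads(model):
    # Scan every parameter's gradient for NaNs or infs after
    # loss.backward() has been called.
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.data
        if not torch.isfinite(g).all():
            print('non-finite gradient in', name)
```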
No, but NaN or inf values can destabilize or corrupt the weights of the model. When collecting all the gradients into a single flattened tensor for comparison, it would be better to use `g.grad.data`, which gives you the raw gradient tensor rather than a Variable.
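One way to do that comparison (a sketch under the assumption that both frameworks are run on the same batch; the file name is made up):

```python
import torch

def flatten_grads(model):
    """Concatenate all parameter gradients into one 1-D tensor."""
    return torch.cat([p.grad.data.reshape(-1)
                      for p in model.parameters() if p.grad is not None])

# After a backward pass on the same batch in both frameworks:
# torch.save(flatten_grads(model), 'pytorch_grads.pt')
# On the Torch7 side, the flattened gradParameters from getParameters()
# can be saved the same way and the two vectors diffed offline.
```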