Performance: PyTorch vs Theano/TF

Training a simple fully connected net with two hidden layers seems to be much faster in Theano and TF than in PyTorch.
Could someone share a fast single-GPU MNIST example in PyTorch?
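
For reference, this is roughly the kind of setup I am timing. It is only a sketch: the hidden size, batch size, learning rate, and the random MNIST-shaped tensors (used in place of the real dataset so the script runs on its own) are assumptions for illustration, and `make_model` and `train_epoch` are just names I picked.

```python
import time

import torch
import torch.nn as nn


def make_model(hidden=256):
    # 784 -> hidden -> hidden -> 10; the hidden size is an assumed value.
    return nn.Sequential(
        nn.Linear(784, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 10),
    )


def train_epoch(model, opt, x, y, batch_size=128):
    # x and y are expected to already live on the same device as the model,
    # so slicing a mini-batch does not trigger a host-to-device copy.
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    total = torch.zeros((), device=x.device)
    n_batches = 0
    for i in range(0, x.size(0), batch_size):
        xb, yb = x[i:i + batch_size], y[i:i + batch_size]
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        # Accumulate on the device; calling .item() every step would
        # force a GPU sync per iteration.
        total += loss.detach()
        n_batches += 1
    return (total / n_batches).item()


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    torch.manual_seed(0)

    # Assumption: random data with MNIST shapes (60000 x 784, 10 classes)
    # stands in for the real dataset.
    x = torch.randn(60000, 784, device=device)
    y = torch.randint(0, 10, (60000,), device=device)

    model = make_model().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # CUDA kernels run asynchronously, so synchronize around the timer.
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    mean_loss = train_epoch(model, opt, x, y)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    print(f"device={device} mean_loss={mean_loss:.4f} epoch_time={elapsed:.3f}s")
```

With a net this small, per-step overhead (host-to-device copies, a Python-side data loader, reading the loss back every iteration) may account for much of the epoch time, which is why the sketch keeps the whole dataset on the device and reads the loss back only once per epoch.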

Edit: I think the problem is that I’m on Windows and using an unofficial port of PyTorch.