I’m porting a TensorFlow project to PyTorch, and after running both I realized that TensorFlow trains faster than PyTorch, which seems odd to me.
I isolated the two projects in two separate notebooks, each containing all the classes and functions needed to train the model.
Keep in mind that this is just a snippet of a bigger project (one that uses genetic algorithms), which is why the CustomModel/TFModelConvert module needs to receive a list of arbitrary Resblock modules.
In the TensorFlow notebook, you will see that training for 5 epochs took 58 seconds.
In the PyTorch notebook, training with the same layers for the same number of epochs took 122 seconds.
My question is: what am I doing wrong, or what can I change, so that PyTorch is not slower than TensorFlow?
The conv doesn’t need to be trained; it previously had a dirac_ initialization because I only want to reduce the input without increasing the number of parameters. I could use MaxPooling, but then I would need to create a zero-padding tensor, which would be expensive.
But even if I wanted to train conv, how would I initialize it in __init__ if k_size has to be calculated from the input shape? Is there a way to do this in PyTorch?
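For reference, here is a minimal sketch of the kind of frozen dirac_ downsampling conv described above, assuming a 1x1 kernel with stride 2 (the original kernel size and stride are not shown). It also shows that the same output can be obtained with strided slicing plus F.pad on the channel dimension, which is one possible reading of the "zero pad tensor" concern. The helper name make_frozen_dirac_downsample is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_frozen_dirac_downsample(in_channels, out_channels, stride=2):
    # Hypothetical helper: 1x1 conv whose weights copy input channel i to output
    # channel i; extra output channels (out_channels > in_channels) stay zero.
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False)
    nn.init.dirac_(conv.weight)
    conv.weight.requires_grad_(False)  # not trained, so the optimizer never updates it
    return conv

x = torch.randn(2, 4, 16, 16)
conv = make_frozen_dirac_downsample(4, 8)
y_conv = conv(x)

# Same result without a conv: strided subsampling, then zero-pad the channel dim.
# F.pad pads from the last dim backwards: (W_left, W_right, H_left, H_right, C_left, C_right).
y_pad = F.pad(x[:, :, ::2, ::2], (0, 0, 0, 0, 0, 8 - 4))
print(y_conv.shape)  # torch.Size([2, 8, 8, 8])
```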
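One common way to handle this, sketched under assumptions: pass the expected sizes to the constructor, and obtain them by running a single dummy sample through the preceding layers. FixedDownsample and infer_output_shape are hypothetical names, the 32x32 input and 8x8 target are placeholders, and k_size = in_size // out_size assumes the sizes divide evenly.

```python
import torch
import torch.nn as nn

class FixedDownsample(nn.Module):
    # Hypothetical module: derives k_size from sizes known at construction time.
    def __init__(self, in_channels, out_channels, in_size, out_size):
        super().__init__()
        # Assumes in_size is an exact multiple of out_size (non-overlapping windows).
        k_size = in_size // out_size
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=k_size,
                              stride=k_size, bias=False)

    def forward(self, x):
        return self.conv(x)

def infer_output_shape(module, input_shape):
    # Run one dummy sample through the preceding layers to get (C, H, W).
    with torch.no_grad():
        return tuple(module(torch.zeros(1, *input_shape)).shape[1:])

# Placeholder layers standing in for the list of Resblock modules.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
c, h, w = infer_output_shape(backbone, (3, 32, 32))
down = FixedDownsample(c, 32, in_size=h, out_size=8)
model = nn.Sequential(backbone, down)
out = model(torch.randn(4, 3, 32, 32))
print(out.shape)  # torch.Size([4, 32, 8, 8])
```

If the preceding layers contain BatchNorm, calling .eval() before the dummy pass avoids updating its running statistics with the zero input. PyTorch's lazy modules such as nn.LazyConv2d only infer in_channels, not the kernel size, so they do not cover this case directly.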
To test, I just removed everything from the forward function and wrote this:
Since the conv is not trained at all, you could use adaptive pooling layers instead; I don’t understand how your current approach of using new random filters could work.
In any case, since you still see a slowdown, you would need to profile the code using the native profiler or e.g. Nsight Systems to check where the bottlenecks are.
In general, I would also recommend reading through the Performance Guide.
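A minimal sketch of the adaptive pooling suggestion: the layer has no parameters and you specify only the output size, so no kernel size has to be computed from the input shape. The 8x8 target and 16 channels are placeholders.

```python
import torch
import torch.nn as nn

# Parameter-free downsampling to a fixed output size; no k_size needed.
pool = nn.AdaptiveMaxPool2d((8, 8))  # nn.AdaptiveAvgPool2d((8, 8)) works the same way

shapes = []
for size in (32, 24, 16):
    x = torch.randn(2, 16, size, size)
    shapes.append(tuple(pool(x).shape))
    print(size, shapes[-1])  # output is (2, 16, 8, 8) for every input size
```

Pooling keeps the channel count unchanged, so if the block also needs more output channels, those still have to be added separately.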
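A minimal sketch of profiling a few training steps with the native profiler (torch.profiler). The model, batch, and optimizer are placeholders, not the original project; record_function adds a named range so the training step shows up as its own row.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder model and data.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 3, 32, 32, device=device)
y = torch.randint(0, 10, (8,), device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)  # also record GPU kernels

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        with record_function("train_step"):  # named range in the profiler output
            opt.zero_grad(set_to_none=True)
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

# Sorted by total CPU time; on GPU, sorting by device/CUDA time is usually more telling.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```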
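A few of the general recommendations from that guide, applied to a generic training loop, could look like the sketch below: enabling torch.backends.cudnn.benchmark for fixed input shapes, zero_grad(set_to_none=True), pinned memory with non_blocking copies, and avoiding a per-step .item() call, which forces a host-device sync. The data, model, and sizes are placeholders, not the original project.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
# Let cuDNN benchmark conv algorithms; helps when input shapes stay fixed.
torch.backends.cudnn.benchmark = True

# Placeholder data and model.
dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=0,  # raise (e.g. 2-4) to load batches in background workers
                    pin_memory=(device == "cuda"))  # page-locked memory enables async copies

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_epoch():
    model.train()
    running = torch.zeros((), device=device)
    for xb, yb in loader:
        xb = xb.to(device, non_blocking=True)
        yb = yb.to(device, non_blocking=True)
        opt.zero_grad(set_to_none=True)  # drop grads instead of zero-filling them
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        running += loss.detach()  # stays on device; .item() here would sync every step
    return (running / len(loader)).item()  # a single sync per epoch

epoch_loss = train_epoch()
```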
I have already changed the conv to a pooling layer since the goal is to downsample only.
Thanks for the article! I changed a few things and the timing improved a bit.
In a PyTorch group, someone reminded me that PyTorch uses dynamic computation graphs, while TensorFlow still builds static computation graphs in the background and can apply optimizations to them during training. This is probably why TensorFlow is slightly faster after the first epoch.
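On the graph point: PyTorch 2.0 and later provide torch.compile, which captures the model into graphs and optimizes them, roughly the counterpart of TensorFlow tracing a function into a graph. A minimal sketch follows, assuming PyTorch >= 2.0; the small model is a placeholder, maybe_compile is a hypothetical helper, and the fallback to eager mode is there because the default backend can fail on machines without a working compiler toolchain. As with TensorFlow, the first calls are slower because they include compilation, so compare timings after warm-up.

```python
import torch
import torch.nn as nn

def maybe_compile(model, example_input):
    # Hypothetical helper. torch.compile exists in PyTorch >= 2.0; compilation is lazy.
    if not hasattr(torch, "compile"):
        return model
    compiled = torch.compile(model)
    try:
        compiled(example_input)  # trigger compilation once (slow first call)
        return compiled
    except Exception:
        # e.g. no usable compiler toolchain for the default backend: stay in eager mode
        return model

# Placeholder model; the compiled module shares its parameters.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
x = torch.randn(4, 3, 32, 32)
fast_model = maybe_compile(model, x)
out = fast_model(x)
```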