PyTorch is (much) slower than TensorFlow

I am new to PyTorch, and while trying it out I noticed that familiar models run slower than in TensorFlow.

To illustrate the behavior, I built a very simple fully connected model (a single layer), which is available here:

TensorFlow runs three times faster than PyTorch.
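For reference, the PyTorch side is essentially just a single linear layer trained in a plain loop; a minimal sketch (the sizes here are placeholders, the real numbers are in the linked script):

```python
import torch
import torch.nn as nn

# Minimal sketch of the benchmark: a single fully connected layer.
# Input/output sizes and batch size are placeholders, not the real values.
model = nn.Linear(in_features=32, out_features=1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(64, 32)  # dummy batch
y = torch.randn(64, 1)

for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```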

Is such a ratio expected? Am I missing something in PyTorch?

Your workload is tiny, since the model contains only a single small linear layer, so you are most likely seeing the overhead of the kernel launches, the data loading, etc.
You could try torch.compile with mode='reduce-overhead', which internally uses CUDA Graphs, but I doubt it will help much since you are only using a single layer in the end.
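A minimal sketch of what I mean (assuming a GPU; the model here is just a stand-in for yours):

```python
import torch
import torch.nn as nn

# Sketch: a stand-in single-layer model, compiled in reduce-overhead mode,
# which uses CUDA Graphs under the hood to cut per-kernel launch cost.
model = nn.Linear(32, 1).cuda()
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(64, 32, device="cuda")

# The first few calls trigger compilation and graph capture,
# so warm up before timing anything.
for _ in range(3):
    out = compiled_model(x)
```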

Where/when does this overhead happen? If it happens every epoch (why would it?), that could explain it. However, if it happens only once, I hardly see how the impact can be so severe over 20 epochs…

I tried compilation with different flags, but it only made things (a bit) worse…

P.S.
Is there any chance that the developers could look into this and perhaps find a way to optimize PyTorch for small problems as well?

The data loading is performed by the DataLoader, which yields a batch in each iteration. If the data loading time exceeds the model's compute time (which could easily be the case for a single tiny matmul), you will see this overhead in every iteration.
Kernel launch overhead applies to every kernel launch, which is triggered after PyTorch has gone through its dispatching logic. The more tiny kernels you schedule, the more visible this overhead becomes, and CUDA Graphs can help here.
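One way to check this is to time the two parts separately; a rough sketch (the dataset, sizes, and model are stand-ins for yours):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Rough sketch: time data loading and compute separately for one epoch.
dataset = TensorDataset(torch.randn(10_000, 32), torch.randn(10_000, 1))
loader = DataLoader(dataset, batch_size=64)
model = torch.nn.Linear(32, 1)

load_time = compute_time = 0.0
t0 = time.perf_counter()
for x, y in loader:
    t1 = time.perf_counter()
    load_time += t1 - t0
    model.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    # On the GPU you would need torch.cuda.synchronize() here for
    # accurate timings; on the CPU perf_counter is enough.
    t0 = time.perf_counter()
    compute_time += t0 - t1
print(f"data loading: {load_time:.3f}s, compute: {compute_time:.3f}s")
```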

CUDA Graphs were added for exactly this: Accelerating PyTorch with CUDA Graphs | PyTorch
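The capture/replay pattern from that post looks roughly like this (a sketch, CUDA only; the model and shapes are placeholders):

```python
import torch

# Sketch of the CUDA Graphs capture/replay pattern (CUDA only).
# Inputs must live in fixed ("static") tensors that are reused on replay.
model = torch.nn.Linear(32, 1).cuda()
static_input = torch.randn(64, 32, device="cuda")

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static input, then launch the whole graph
# with a single CPU-side call instead of one launch per kernel.
static_input.copy_(torch.randn(64, 32, device="cuda"))
g.replay()
```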

Data loading should be identical in both cases – it is an in-memory NumPy array.
I'm not sure CUDA Graphs will be beneficial on the CPU… will they?