PyTorch is (much) slower than TensorFlow

I am new to PyTorch, and while trying it out I noticed that familiar models run slower than in TensorFlow.

To illustrate the behavior, I built a very simple fully connected model (a single layer), which is available here:

TensorFlow runs three times faster than PyTorch.
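For reference, the PyTorch side is essentially just a single linear layer trained in a plain loop; a minimal sketch (the sizes here are placeholders, the real numbers are in the linked script):

```python
import torch
import torch.nn as nn

# Minimal sketch of the benchmark: a single fully connected layer.
# Input/output sizes and batch size are placeholders, not the real values.
model = nn.Linear(in_features=32, out_features=1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(64, 32)  # dummy batch
y = torch.randn(64, 1)

for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```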

Is such a ratio expected? Am I missing something in PyTorch?

Your workload is tiny, since the model contains only a single small linear layer, so you are most likely seeing the overhead of the kernel launches, the data loading, etc.
You could try torch.compile with mode='reduce-overhead', which internally uses CUDA Graphs, but I doubt it will help much since you are only using a single layer in the end.
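A minimal sketch of what I mean (assuming a GPU; the model here is just a stand-in for yours):

```python
import torch
import torch.nn as nn

# Sketch: a stand-in single-layer model, compiled in reduce-overhead mode,
# which uses CUDA Graphs under the hood to cut per-kernel launch cost.
model = nn.Linear(32, 1).cuda()
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(64, 32, device="cuda")

# The first few calls trigger compilation and graph capture,
# so warm up before timing anything.
for _ in range(3):
    out = compiled_model(x)
```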

Where/when does this overhead happen? If it happens every epoch (why would it?), that could explain it. However, if it happens only once, I hardly see how the impact can be so severe over 20 epochs…

I tried compilation with different flags, but it only made things (a bit) worse…

P.S.
Is there any chance that the developers could look into this and perhaps find a way to optimize PyTorch for small problems as well?

The data loading is performed by the DataLoader, which yields a batch in each iteration. If the data loading time exceeds the model's compute time (which could easily be the case for a single tiny matmul), you will see this overhead in every iteration.
Kernel launch overhead applies to every kernel launch, which is triggered after PyTorch has gone through its dispatching logic. The more tiny kernels you schedule, the more visible this overhead becomes, and CUDA Graphs can help here.
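One way to check this is to time the two parts separately; a rough sketch (the dataset, sizes, and model are stand-ins for yours):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Rough sketch: time data loading and compute separately for one epoch.
dataset = TensorDataset(torch.randn(10_000, 32), torch.randn(10_000, 1))
loader = DataLoader(dataset, batch_size=64)
model = torch.nn.Linear(32, 1)

load_time = compute_time = 0.0
t0 = time.perf_counter()
for x, y in loader:
    t1 = time.perf_counter()
    load_time += t1 - t0
    model.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    # On the GPU you would need torch.cuda.synchronize() here for
    # accurate timings; on the CPU perf_counter is enough.
    t0 = time.perf_counter()
    compute_time += t0 - t1
print(f"data loading: {load_time:.3f}s, compute: {compute_time:.3f}s")
```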

CUDA Graphs were added for exactly this: Accelerating PyTorch with CUDA Graphs | PyTorch
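The capture/replay pattern from that post looks roughly like this (a sketch, CUDA only; the model and shapes are placeholders):

```python
import torch

# Sketch of the CUDA Graphs capture/replay pattern (CUDA only).
# Inputs must live in fixed ("static") tensors that are reused on replay.
model = torch.nn.Linear(32, 1).cuda()
static_input = torch.randn(64, 32, device="cuda")

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static input, then launch the whole graph
# with a single CPU-side call instead of one launch per kernel.
static_input.copy_(torch.randn(64, 32, device="cuda"))
g.replay()
```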

Data loading should be identical in both cases – it is an in-memory NumPy array.
I'm not sure CUDA Graphs will be beneficial on the CPU… will they?