TorchData performance

Hi there!

I was just looking for some information.
TensorFlow has a tf.data API which compiles data pipelines into static graphs, producing highly optimized pipelines that avoid Python's overhead.

What does TorchData do? Is it simply a Python wrapper, or does it offer a real performance benefit compared to TensorFlow?

What about a comparison with NVIDIA DALI?

Thanks!

NVIDIA DALI is faster. It basically uses queues, runs everything in the background, etc.
They have also optimized typical data-pipeline operations (like GPU decoding, cropping, and resizing).
It's limited, but if it fits your needs it's the fastest option.

On the other hand, PyTorch has its policy of "be flexible". The PyTorch DataLoader just launches multiprocessing workers (at least the last time I checked) and relies on the user's skills to improve speed. It basically provides boilerplate code to make batches, convert things to tensors, and so on. You can debug the code since it's pure Python.
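To make the "boilerplate for batching" point concrete, here is a minimal stdlib-only sketch of what a DataLoader-style pipeline does: a pool of workers loads individual samples, and the main loop collates them into batches. All names here are illustrative, not the real `torch.utils.data` API, and threads stand in for the worker processes the real DataLoader spawns.

```python
# Hypothetical sketch of a DataLoader-like pipeline (not the torch API).
from concurrent.futures import ThreadPoolExecutor

def load_sample(index):
    # Stand-in for Dataset.__getitem__: pretend each sample is (index, 2*index).
    return (index, index * 2)

def collate(samples):
    # Stand-in for a default collate function: group per-sample fields together.
    return tuple(zip(*samples))

def simple_loader(indices, batch_size=4, num_workers=2):
    # Workers fetch samples concurrently (the real DataLoader uses processes).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        samples = list(pool.map(load_sample, indices))
    # The main loop slices samples into batches and collates each one.
    for start in range(0, len(samples), batch_size):
        yield collate(samples[start:start + batch_size])

batches = list(simple_loader(range(8), batch_size=4))
print(batches[0])  # ((0, 1, 2, 3), (0, 2, 4, 6))
```

Everything custom (decoding, augmentation, prefetch depth) is left to the user, which is exactly the flexibility/performance trade-off described above.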

What is better?
Well, that's very user-dependent. I tried DALI and it's wonderful when your pipeline is straightforward. It's painful when you want to inspect the inner ops or write your own.

But in the end, benchmarks don't show that TF is faster than PyTorch despite everything being compiled, so I wouldn't say it's better at the moment.

Thanks for the reply!
It is true that DALI gives a huge performance boost over TF.
I guess it is possible to implement custom augmentations after converting the DALI dataset to a TF Dataset, by calling .map() on it.
Do you have any source for the benchmarks you were talking about?

It's difficult to keep track of the 2022 results.
For example, here you can check the results:

and both forward times are roughly the same.

So in the end it's about the dataloaders, and I don't really think PyTorch is much worse on that side. TensorFlow just provides tools that are optimized for people who are not experts in a given field. To achieve the same performance with PyTorch you do need to dig into different libraries and understand which one to use in which case. For example, using librosa is much slower than scipy, but more versatile. Loading video is not straightforward, and you need very good libraries to avoid bottlenecking the dataloading.

In that sense, PyTorch has always been a bit green, as it provides fewer tools out of the box. In the end, that side relies on the programmer. A good one will do a good job.