PyTorch performance concern

Since PyTorch is driven from Python, the code inevitably carries some Python overhead compared with equivalent C++ code.

For example, with the data loading utilities from PyTorch, we can write a data loader for our data, be it text, images, etc.
But what I observed is that this Python data loader is unbelievably slow compared with my C++ loader. Since my dataset is huge, I cannot load it all into memory, so I have to iterate over it chunk by chunk, epoch after epoch.
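For context, a minimal sketch of the chunk-by-chunk pattern I mean (the parsing logic is a stand-in; the real format is different):

```python
import torch
from torch.utils.data import Dataset

def iter_chunks(lines, chunk_size):
    """Yield successive lists of `chunk_size` lines from an iterable,
    so only one chunk has to live in memory at a time."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # last, possibly smaller, chunk
        yield chunk

class ChunkedTextDataset(Dataset):
    """Wraps one in-memory chunk; a new instance is built per chunk."""
    def __init__(self, lines):
        self.lines = lines

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # stand-in parser: split a whitespace-separated line into floats
        return torch.tensor([float(x) for x in self.lines[idx].split()])
```

Each epoch I then call something like `iter_chunks(open(path), chunk_size)` and build a DataLoader per chunk.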

The slow data loading in Python is intolerable, and it makes me suspect that even if I fully loaded the data into memory, the Python training code would still add substantial overhead.

By the way, in order to speed up data loading, should I discard this Python data loader and write a Python wrapper over my C++ loader?

Have you tried a few settings for the DataLoader, e.g. changing num_workers, using pin_memory etc.?
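For reference, those settings look roughly like this (the dataset and batch size here are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset standing in for the real one
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2,    # load batches in background worker processes
    pin_memory=True,  # put batches in page-locked memory for faster GPU copies
)

for inputs, targets in loader:
    pass  # training step would go here
```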

If so, what kind of data do you have and how large is the difference in loading time?

Not yet, I will give it a try.

But I still wonder: to match my C++ loading performance, would the Python DataLoader need more than one worker, or even advanced options such as pin_memory?

Well, I don’t know what kind of data you are dealing with or how you are loading it.
Depending on the type of data, you could probably tune the Python code further.

Using multiple processes to load your data gives you the advantage that the next batch can already be waiting while the GPU finishes its current operation.
I’m not sure how you are dealing with this in your C++ code, but if it’s sequential, you’ll have to wait for the data loading to finish before you can push the batch onto the GPU, which might erase your speed advantage.

Also, with pin_memory=True you’ll avoid an unnecessary copy from pageable to pinned memory when transferring the data to the GPU.
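Pinned memory pays off when combined with an asynchronous copy; a small sketch, guarded so it also runs without a GPU:

```python
import torch

batch = torch.randn(64, 16)  # stands in for a batch from the DataLoader

if torch.cuda.is_available():
    # Pinned (page-locked) memory enables asynchronous host-to-device copies;
    # non_blocking=True then lets the copy overlap with GPU computation.
    batch = batch.pin_memory()
    batch = batch.to("cuda", non_blocking=True)
```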


Well, I couldn’t help asking this: my textual data is structured, and whether the loading happens in C++ or in a PyTorch Dataset, the loaded lines have to be parsed.
As tested, single-threaded C++ loading is much faster (about 10x) than the Python data loader with a single worker. Maybe increasing num_workers could help some, but do you think it’s advisable to just plug the C++ loader (Python-wrapped) into my pipeline?

Maybe it’ll give you some performance advantage, but you should definitely check whether it’s actually the bottleneck in your application.
It’s always annoying to fine-tune some method when the bottleneck is somewhere else.
You could have a look at torch.utils.bottleneck to check your model code.

To time your DataLoader you could have a look at the imagenet example.

Also, how are you currently loading your text data in PyTorch? It might help to save it as an HDF5 file.
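One possible shape for the HDF5 route, assuming h5py is installed (the "features" dataset name is made up); the file is opened lazily so that each DataLoader worker gets its own handle:

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    """Reads pre-parsed samples from an HDF5 file, so the text is
    parsed once up front instead of on every epoch."""
    def __init__(self, path):
        self.path = path
        with h5py.File(path, "r") as f:
            self.length = len(f["features"])
        self.file = None  # opened lazily, per worker process

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        return torch.from_numpy(self.file["features"][idx])
```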


Is it acceptable for you to implement a C++ parser that plugs into a Python DataLoader? If so, have a look at How to use dataset larger than memory?. The ChunkDataset API might help you, and Python bindings are on the way.
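As a rough sketch of that idea: the heavy per-line parsing lives in a compiled extension (here faked with a plain Python function, since the real binding would come from pybind11 or a torch C++ extension), and the Dataset only dispatches to it:

```python
import torch
from torch.utils.data import Dataset, DataLoader

def fast_parse(line):
    """Placeholder for a compiled C++ parser binding; this stub
    only mimics its interface (str -> tensor)."""
    return torch.tensor([float(x) for x in line.split()])

class WrappedParserDataset(Dataset):
    def __init__(self, lines):
        self.lines = lines

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # all per-sample work is delegated to the (compiled) parser
        return fast_parse(self.lines[idx])

lines = ["1 2 3", "4 5 6"]
loader = DataLoader(WrappedParserDataset(lines), batch_size=2)
```

This way the multi-worker prefetching of the DataLoader is kept, while the parsing hot path runs in C++.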