Serialization overhead of multiprocessing

I’m using a DataLoader with num_workers > 0. I noticed that even with a small number of workers, the main process becomes the bottleneck: it can’t consume the data fast enough. A quick look suggests that the problem is the serialization overhead between processes. Interestingly, the main process is CPU-bound, not I/O-bound.
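For reference, this is roughly how the main process could be profiled to see where the CPU time goes. It is a minimal sketch using only the standard library. `profile_iteration` is a hypothetical helper name, and it accepts any iterable so that a DataLoader can be passed in directly.

```python
import cProfile
import io
import pstats


def profile_iteration(iterable, max_batches=100, top=15):
    """Profile the consumer side of an iterable (e.g. a DataLoader).

    Returns the number of items consumed and a text report sorted by
    cumulative time. Hypothetical helper, not part of any library.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    count = 0
    for _ in iterable:
        # Only fetch the item; the cost of interest is in getting it here.
        count += 1
        if count >= max_batches:
            break
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return count, out.getvalue()
```

Calling `profile_iteration(loader)`, with `loader` standing in for the DataLoader, and printing the second return value lists the main-process functions by cumulative time. Worker processes are not covered, since cProfile only observes the process it runs in.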

Are there any options to reduce the overhead of serialization? E.g. is there an option to use Apache Arrow for zero-copy data transport?

From what I’ve read, memory is already shared between the processes, so it’s not clear to me what the main process is spending its time on. Is it just deserializing the tensor handles?
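To make the "handle versus data" distinction concrete, here is a rough standard-library analogy. It is only a sketch and not PyTorch's actual mechanism: it assumes the payload lives in a `multiprocessing.shared_memory` block and that only the block's name and size are pickled. `handle_vs_copy` is a hypothetical helper name.

```python
import pickle
from multiprocessing import shared_memory


def handle_vs_copy(nbytes=1024 * 1024):
    """Compare pickling a shared-memory handle with pickling the data itself.

    Returns (pickled handle size, pickled full-copy size, first 8 bytes
    read back through the handle). Illustrative analogy only.
    """
    payload = b"\x01" * nbytes
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    try:
        # Put the data into shared memory once.
        shm.buf[:nbytes] = payload

        # What would travel between processes: a tiny (name, size) handle.
        handle = pickle.dumps((shm.name, nbytes))
        # What would travel without shared memory: the whole payload.
        full_copy = pickle.dumps(payload)

        # Receiver side: unpickle the handle and re-attach to the block.
        name, size = pickle.loads(handle)
        attached = shared_memory.SharedMemory(name=name)
        try:
            first_bytes = bytes(attached.buf[:8])
        finally:
            attached.close()
    finally:
        shm.close()
        shm.unlink()
    return len(handle), len(full_copy), first_bytes
```

In this analogy the handle stays a few dozen bytes no matter how large the payload is, so the receiving side pays a roughly fixed cost per object rather than per byte. If the same holds for tensor handles, a batch made of many small tensors could still be expensive to receive even though no tensor data is copied.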