Serialization overhead of multiprocessing

I’m using a DataLoader with num_workers > 0. I noticed that even with a small number of workers, the main process becomes the bottleneck: it can’t absorb the data as fast as the workers produce it. A quick glance suggests the problem is the serialization overhead between processes. Interestingly enough, the main process is CPU-bound, not I/O-bound.
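
For reference, here is a minimal setup that shows the pattern (the `RandomDataset` below is a synthetic stand-in, not my real dataset):

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    """Synthetic stand-in: each item is an image-sized float tensor."""

    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)


if __name__ == "__main__":
    loader = DataLoader(RandomDataset(), batch_size=64, num_workers=4)
    start = time.perf_counter()
    for _ in loader:
        pass  # the main process does nothing except receive batches
    print(f"drained in {time.perf_counter() - start:.1f}s")
```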

Are there any options to reduce the serialization overhead? For example, is there a way to use Apache Arrow for zero-copy data transport?
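
One thing I’m experimenting with is converting everything to tensors inside the worker, on the theory that plain Python objects pay the full pickling cost while CPU tensors are passed through shared memory. A sketch of the two `__getitem__` variants I’m comparing (the function names are just for illustration):

```python
import torch

# Two hypothetical __getitem__ bodies, to illustrate what crosses the
# worker -> main-process boundary.

def getitem_as_python_objects(idx):
    # Plain Python containers are pickled in full; the main process
    # pays the complete deserialization cost for every item.
    return {"features": [float(i) for i in range(100_000)], "label": idx}

def getitem_as_tensor(idx):
    # A CPU tensor is moved into shared memory by the worker; only a
    # small handle is pickled, so the main process should do far less
    # work per item.
    return {"features": torch.arange(100_000, dtype=torch.float32), "label": idx}
```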

Per https://pytorch.org/docs/master/notes/multiprocessing.html, tensor memory is already shared between processes, so it’s not clear to me what the main process is spending its time on. Is it just deserializing the tensor handles?
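
In case it’s useful, this is how I’d check where the main-process time goes: profile only the consuming loop (this reuses `RandomDataset` from the sketch above):

```python
import cProfile
import pstats

from torch.utils.data import DataLoader

def drain(loader):
    # Iterating the loader is the main process's side of the pipeline:
    # receiving batches from the worker queue and rebuilding them.
    for _ in loader:
        pass

if __name__ == "__main__":
    loader = DataLoader(RandomDataset(), batch_size=64, num_workers=4)
    profiler = cProfile.Profile()
    profiler.enable()
    drain(loader)
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```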