Creating a custom GPU DataLoader

I was wondering whether there is a good guide on best practices for creating my own DataLoader. To be clear: I’m not looking to create a custom Dataset, nor a DataLoader subclass that adds features. I’m looking to build, from the ground up, my own DataLoader replacement.

For single-process loading, this is easy. But for multi-process loading, it gets extremely hairy, and I keep running into issues. Here are a few examples:

  • Sending big tensors across pipes is extremely slow. (I eventually discovered that I needed to use torch.multiprocessing.Queue instead of Python’s built-in Queue.)
  • Calling .pin_memory() in a worker process causes torch to initialize CUDA and load the CUDA libraries in that process, which carries a sizable GPU memory overhead per worker; as a result, I run out of GPU memory whenever I have more than a few workers. (Haven’t solved this one yet…)
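
To make the two issues above concrete, here is a minimal sketch of the layout I think I need: workers produce CPU tensors only and hand them over through a torch.multiprocessing queue, and pinning plus the host-to-device copy happen in the main process (as far as I can tell, the stock DataLoader also pins in the main process rather than in the workers). The names worker_loop and iterate_batches, the fake batches, and the "fork" start method (so Linux or similar) are my own assumptions for illustration, not anything taken from torch.

```python
import torch
import torch.multiprocessing as mp


def worker_loop(worker_id, num_workers, num_batches, batch_size, out_queue, done_event):
    # Hypothetical worker: it only ever creates CPU tensors, so it never initializes CUDA.
    torch.set_num_threads(1)
    for batch_idx in range(worker_id, num_batches, num_workers):
        # Stand-in for real loading: a batch filled with its own index.
        batch = torch.full((batch_size, 4), float(batch_idx))
        out_queue.put((batch_idx, batch))
    out_queue.put(None)  # sentinel: this worker has no more batches
    # Stay alive until the consumer has received everything; the shared-memory
    # handles are handed over by this process when the consumer unpickles.
    done_event.wait()


def iterate_batches(num_batches=8, batch_size=2, num_workers=2, timeout=60.0):
    # Assumption: "fork" is available (Linux). With "spawn", the caller would
    # need an `if __name__ == "__main__":` guard around the entry point.
    ctx = mp.get_context("fork")
    queue = ctx.Queue(maxsize=2 * num_workers)
    done_event = ctx.Event()
    workers = [
        ctx.Process(
            target=worker_loop,
            args=(i, num_workers, num_batches, batch_size, queue, done_event),
            daemon=True,
        )
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()

    use_cuda = torch.cuda.is_available()
    finished = 0
    try:
        while finished < num_workers:
            item = queue.get(timeout=timeout)
            if item is None:
                finished += 1
                continue
            batch_idx, batch = item
            if use_cuda:
                # Pin and copy in the main process, which owns the CUDA context.
                batch = batch.pin_memory().cuda(non_blocking=True)
            yield batch_idx, batch
    finally:
        done_event.set()
        for w in workers:
            w.join(timeout=5)
            if w.is_alive():
                w.terminate()
```

Batches arrive in whatever order the workers finish, so something like `for idx, batch in iterate_batches(num_workers=4): ...` would need its own reordering if order matters. The sketch also leaves out error propagation from workers, which a real replacement would need.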

The torch DataLoader class doesn’t have these issues, but the code is quite impenetrable, and it has not been very helpful in guiding me toward what I need to do. Are there any simple tutorials for how to do this? Or could anybody offer me some guidance?