Dataloader with num_workers = 4 slower than default?

Hello,

I am using a DataLoader with a custom dataset and a custom batch sampler. I just tried setting num_workers=4 to see whether it would speed things up, but it dramatically slows things down instead. I am on Windows and running from a Jupyter notebook.
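For context, this is roughly how I build the loader (the dataset, sampler, and collate names below are placeholders for my actual classes):

```python
from torch.utils.data import DataLoader

# Placeholder objects standing in for my real dataset / sampler / collate_fn
loader = DataLoader(
    my_dataset,                      # custom Dataset
    batch_sampler=my_batch_sampler,  # custom batch sampler
    collate_fn=my_collate_fn,
    num_workers=4,                   # this is the setting that slows everything down
)
```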

In my DataLoader, I use a collate_fn that creates tensors directly on the GPU, e.g. torch.tensor(..., device='cuda'). Could this be the reason? Is there excessive copying between CUDA and CPU, or something like that?
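Simplified, the collate_fn does something like this (the sample structure and names are placeholders, not my exact code):

```python
import torch

def my_collate_fn(samples):
    # samples: list of (feature, label) pairs from the custom dataset
    # Batch tensors are created directly on the GPU inside the collate_fn
    features = torch.tensor([s[0] for s in samples], device='cuda')
    labels = torch.tensor([s[1] for s in samples], device='cuda')
    return features, labels
```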

Any ideas?

Best, JZ