Loading large samples with DataLoader

Hello folks,
I’m training a network with very large input samples, each a tensor of shape (48, 10, 448, 448), and I’m seeing a big drop in throughput when using num_workers > 0.
I’m not entirely sure this is correct, but I suspect the tensors are being copied from the worker processes to the main process and that this copy is a huge bottleneck. I’ve looked around online but haven’t found any discussions on this topic. Is there a way to avoid the copy, or a recommendation on how to mitigate the problem some other way? I appreciate the help.
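For reference, here is a minimal sketch of my setup. `LargeSampleDataset` is a hypothetical stand-in for my real dataset (which reads from disk); the random tensors just reproduce the sample size:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LargeSampleDataset(Dataset):
    """Stand-in dataset: each item is one large float tensor.

    With the default shape (48, 10, 448, 448), a single float32
    sample is roughly 385 MB.
    """

    def __init__(self, length, sample_shape=(48, 10, 448, 448)):
        self.length = length
        self.sample_shape = sample_shape

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # The real code loads from disk; random data stands in here.
        return torch.randn(self.sample_shape)

# Throughput is fine with num_workers=0, but drops once workers are used:
loader = DataLoader(LargeSampleDataset(100), batch_size=1, num_workers=4)
```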