Returning a Tensor from a multiprocess Dataset is extremely slow

Returning a Tensor from a Dataset is orders of magnitude slower (60 times in the example below) than returning something like a numpy array when the DataLoader uses num_workers > 0.
This happens even if the Tensor shares memory with a numpy array (e.g. one created with torch.from_numpy), but it does not seem to happen when no workers are used.

How can I deal with this issue? (The code is run on Linux.)

A minimal example:

import torch
import numpy as np

class MyDataSet(torch.utils.data.Dataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 100000

    def __getitem__(self, idx):
        arr = np.arange(250)
        tensor = torch.arange(250)

        # return arr # 1s
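        # return torch.from_numpy(arr) # also slow, despite sharing memory with arr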
        return tensor # 60s

def collate_wrapper(batch):
    # identity collate: keep the batch as a plain list of samples
    return batch

ds = MyDataSet()
data_loader = torch.utils.data.DataLoader(
    ds,
    num_workers=2,
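    # with num_workers=0 the slowdown does not occur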
    shuffle=False,
    batch_size=64,
    collate_fn=collate_wrapper,
    prefetch_factor=1,
)

# iterate once over the dataset and count the batches
c = 0
for _ in data_loader:
    c += 1
print(c)
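
The 1s / 60s figures in the comments come from timing this loop with a simple wall-clock timer, something like the sketch below (time.perf_counter is just one way to measure; any wall-clock timer works):

import time

start = time.perf_counter()
c = 0
for _ in data_loader:
    c += 1
print(c, f"{time.perf_counter() - start:.1f}s")  # ~1s returning arr, ~60s returning tensor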