DataLoader is slow when sampling from a TensorDataset

It looks like the standard DataLoader is pretty slow for small datasets that fit entirely in (CPU/GPU) memory. As an example, consider the following:

import torch
from torch.utils.data import DataLoader, TensorDataset

data_matrix = torch.randn(2000, 1000, 4)
# 2000 samples, each with 1000x3 features and a 1000-length target
dataset = TensorDataset(data_matrix[..., :-1], data_matrix[..., -1])
dataloader = DataLoader(dataset, batch_size=128, shuffle=True, pin_memory=True)

When I measure the time to draw a single batch from the dataloader, it takes around 3-5 ms:

%%time
data_iter = iter(dataloader)
features_loader, targets_loader = next(data_iter)
print(features_loader.shape, targets_loader.shape)

But indexing the underlying tensor directly is much faster, at 1-2 ms:

%%time
indices = torch.randint(0, 2000, (128,))
features_direct, targets_direct = data_matrix[indices][..., :-1], data_matrix[indices][..., -1]
print(features_direct.shape, targets_direct.shape)

Is there a way to write a custom dataloader that samples as fast? Increasing num_workers in the DataLoader makes it much worse (more than 40 ms). I would like to keep using the DataLoader class because it integrates more easily with other workflows.
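
One idea I have been experimenting with is to disable automatic batching and pass a BatchSampler as the sampler, so that the dataset is asked for a whole batch of indices at once instead of 128 separate __getitem__ calls. This is only a rough sketch (fast_loader is just my name for it, and I am assuming TensorDataset handles a list of indices in a single __getitem__ call):

import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, TensorDataset

data_matrix = torch.randn(2000, 1000, 4)
dataset = TensorDataset(data_matrix[..., :-1], data_matrix[..., -1])

# The sampler yields lists of 128 indices, so the dataset slices its tensors
# with one advanced-indexing operation per batch.
fast_loader = DataLoader(
    dataset,
    sampler=BatchSampler(RandomSampler(dataset), batch_size=128, drop_last=False),
    batch_size=None,  # disable automatic batching; batches arrive pre-assembled
    pin_memory=True,
)

features, targets = next(iter(fast_loader))
print(features.shape, targets.shape)  # torch.Size([128, 1000, 3]) torch.Size([128, 1000])

This keeps the DataLoader interface, but I am not sure whether this is the intended way to do it or whether there is something cleaner.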