Torch.stack 10x slower than numpy.stack in default_collate

Hi,
I have a Dataset class that holds all of its data in a NumPy array in memory and returns a tuple of two NumPy arrays from __getitem__.
The data item returned by __getitem__ has shape (3, 244, 244) and the target item is a 1-dimensional array.
I’m using the default DataLoader with a batch size of 64, which in turn uses default_collate to arrange the batches. When I use numpy.stack instead of torch.stack inside default_collate, I see a speedup of more than 10x (from ~1 s to ~0.1 s per batch) in the collate function.
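
For reference, here is a minimal sketch of the kind of setup I mean (class and variable names are made up for illustration, not my actual code):

import numpy as np
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    # Hypothetical stand-in for my Dataset: everything lives in NumPy arrays in memory.
    def __init__(self, data: np.ndarray, targets: np.ndarray):
        self.data = data        # shape (N, 3, 244, 244)
        self.targets = targets  # one 1-D target array per sample

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Returns a tuple of two NumPy arrays, as described above
        return self.data[idx], self.targets[idx]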

Specifically, I changed line 68 in utils/data/_utils/collate.py from

return default_collate([torch.as_tensor(b) for b in batch])

to

return torch.as_tensor(numpy.stack(batch))
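
(The same change can also be tried without patching the library, by passing a custom collate_fn to the DataLoader; a rough sketch with hypothetical names, not my exact code:)

import numpy as np
import torch
from torch.utils.data import DataLoader

def numpy_stack_collate(batch):
    # batch is a list of (data, target) tuples of NumPy arrays
    data = torch.as_tensor(np.stack([sample[0] for sample in batch]))
    targets = torch.as_tensor(np.stack([sample[1] for sample in batch]))
    return data, targets

# e.g. loader = DataLoader(dataset, batch_size=64, collate_fn=numpy_stack_collate)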

Why is the torch.stack path so much slower?

I think some performance difference between these two implementations is expected: converting each element to a tensor before stacking adds a second Python-level loop over the batch, which will be slower than doing everything in a single native call like numpy.stack. However, I can’t reproduce the 10x difference (it might be that my environment or timing method is somewhat different).

If you can post a small code snippet that shows the 10x difference, it would be easier to check whether something else is going on; but as it stands, I would expect converting to a tensor after stacking in NumPy to be the faster of the two.
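
For what it’s worth, this is roughly how I tried to time the two variants (shapes taken from your description; a sketch only, so the numbers may not match your setup):

import time

import numpy as np
import torch

# Fake batch matching the description: 64 samples of shape (3, 244, 244)
batch = [np.random.randn(3, 244, 244).astype(np.float32) for _ in range(64)]

def collate_torch(batch):
    # roughly what default_collate ends up doing for a list of NumPy arrays
    return torch.stack([torch.as_tensor(b) for b in batch], 0)

def collate_numpy(batch):
    # the proposed alternative: stack in NumPy, convert once
    return torch.as_tensor(np.stack(batch))

for fn in (collate_torch, collate_numpy):
    start = time.perf_counter()
    for _ in range(100):
        fn(batch)
    print(fn.__name__, (time.perf_counter() - start) / 100, "s per batch")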