Benchmarking PyTorch Tensor vs. NumPy Array Train Batch Loading

Question
Why is loading PyTorch tensors (stored as .pt files) from disk faster than loading NumPy arrays (stored as binary .npy files) when iterating over a Dataset class directly, but slower when going through a DataLoader? Quantified results are below.

Context
My company currently loads train batch files from disk, and asked me to profile the performance of loading data stored in NumPy (.npy) vs. PyTorch tensor (.pt) file formats.

Experiment Details

  • 1000 batch files, 10000 rows x 200 columns, generated randomly and saved to disk in npy, parquet, and pt formats

  • Round 1: Iterate through (load + convert to PyTorch tensor) each set of 1000 batch files 30 times (num_epochs=30) using a custom Dataset class. Measure the average per-epoch load time

  • Round 2: Repeat round 1 using a PyTorch DataLoader with default settings (num_workers=0)
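Since the original code is not shown here, the two rounds above can be sketched roughly as follows. This is only an illustration under assumptions: the names BatchFileDataset, write_batches, time_epochs and run_benchmark are hypothetical, the parquet format is left out, and the sizes in the usage example are deliberately small so the script finishes quickly.

```python
import os
import tempfile
import time

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class BatchFileDataset(Dataset):
    """Hypothetical Dataset: one item is one batch file loaded from disk."""

    def __init__(self, paths, file_format):
        self.paths = paths
        self.file_format = file_format  # "numpy" or "tensor"

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        if self.file_format == "numpy":
            # Load the .npy file, then wrap it as a tensor (no extra copy).
            return torch.from_numpy(np.load(path))
        return torch.load(path)  # .pt file already holds a tensor


def write_batches(directory, file_format, num_files, rows, cols):
    """Generate random batches and save them in the requested format."""
    paths = []
    for i in range(num_files):
        data = np.random.rand(rows, cols).astype(np.float32)
        if file_format == "numpy":
            path = os.path.join(directory, f"batch_{i}.npy")
            np.save(path, data)
        else:
            path = os.path.join(directory, f"batch_{i}.pt")
            torch.save(torch.from_numpy(data), path)
        paths.append(path)
    return paths


def time_epochs(run_epoch, num_epochs):
    """Call run_epoch num_epochs times; return (mean, std) of wall time."""
    times = []
    for _ in range(num_epochs):
        start = time.perf_counter()
        run_epoch()
        times.append(time.perf_counter() - start)
    return float(np.mean(times)), float(np.std(times))


def run_benchmark(file_format, num_files, rows, cols, num_epochs):
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_batches(tmp, file_format, num_files, rows, cols)
        dataset = BatchFileDataset(paths, file_format)
        # Defaults: batch_size=1, num_workers=0, pin_memory=False.
        loader = DataLoader(dataset)

        def dataset_epoch():  # Round 1: index the Dataset directly
            for i in range(len(dataset)):
                dataset[i]

        def loader_epoch():  # Round 2: iterate through the DataLoader
            for _batch in loader:
                pass

        return {
            "dataset": time_epochs(dataset_epoch, num_epochs),
            "dataloader": time_epochs(loader_epoch, num_epochs),
        }


if __name__ == "__main__":
    for fmt in ("numpy", "tensor"):
        print(fmt, run_benchmark(fmt, num_files=5, rows=100, cols=20, num_epochs=3))
```

To approximate the experiment described above, the call would use num_files=1000, rows=10000, cols=200 and num_epochs=30; note that this writes several gigabytes to the temporary directory.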

Results:

testing file format 'numpy'...
(Dataset) per-Epoch Avg: 7.84 (std: 0.1635)
(Dataloader) per-Epoch Avg: 8.3208 (std: 0.1341)

testing file format 'tensor'...
(Dataset) per-Epoch Avg: 6.7154 (std: 0.0997)
(Dataloader) per-Epoch Avg: 8.5963 (std: 0.0877)

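We thought the significant overhead associated with the tensor/DataLoader combination might be attributable to memory pinning (the DataLoader's pin_memory option, which copies loaded tensors into page-locked host memory to speed up later transfers to the GPU), but that option is disabled by default.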
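One quick way to confirm which defaults are in effect is to inspect the attributes on a freshly constructed DataLoader. A minimal sketch, where the tiny TensorDataset is only a stand-in for the real batch-file dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; any map-style dataset would do here.
loader = DataLoader(TensorDataset(torch.zeros(2, 3)))

# Settings used when nothing is passed explicitly.
print("pin_memory: ", loader.pin_memory)
print("num_workers:", loader.num_workers)
print("batch_size: ", loader.batch_size)
```

With no arguments overridden, this reports pin_memory=False, num_workers=0 and batch_size=1, so pinning is not involved in the Round 2 numbers.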

Code


When you use a DataLoader with the default collate_fn, it converts every NumPy array to a tensor before returning it to you. So I think that’s why the DataLoader is a bit slower.
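The default collate behaviour can be observed directly by calling default_collate on a one-element batch, which is what a DataLoader with batch_size=1 hands to it. The sketch below uses small made-up samples. It also shows that inputs that are already tensors get stacked into a new tensor, which involves a copy; whether that copy accounts for the timing gap in the question is not established here.

```python
import numpy as np
import torch
from torch.utils.data.dataloader import default_collate

# Small made-up samples standing in for one loaded batch file each.
array_sample = np.zeros((4, 3), dtype=np.float32)
tensor_sample = torch.zeros(4, 3)

# A DataLoader with batch_size=1 passes a one-element list to collate_fn.
from_array = default_collate([array_sample])
from_tensor = default_collate([tensor_sample])

# NumPy input comes back as a tensor with a new leading batch dimension.
print(type(from_array).__name__, tuple(from_array.shape))

# Tensor input is stacked as well, so the result has its own storage.
print(tuple(from_tensor.shape))
print("shares memory with input:", from_tensor.data_ptr() == tensor_sample.data_ptr())
```

If the extra batch dimension and copy are unwanted, passing batch_size=None to the DataLoader disables automatic batching, so each item from the Dataset is returned without being stacked.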