Benchmarking PyTorch Tensor vs. NumPy Array Train Batch Loading

JakeColor · April 16, 2021, 2:33pm

Question
Why is loading PyTorch tensors (stored as .pt files) from disk faster than loading NumPy arrays (stored as binary .npy files) when iterating over a Dataset class directly, but slower when the same Dataset is wrapped in a DataLoader? Quantified results are below.

Context
My company currently loads train batch files from disk, and asked me to profile the performance of loading data stored in NumPy (.npy) vs. PyTorch tensor (.pt) file formats.

Experiment Details

1000 batch files, 10000 rows x 200 columns, generated randomly and saved to disk in npy, parquet, and pt formats
Round 1: Iterate through each set of 1000 batch files (load each file and convert it to a PyTorch tensor) 30 times (num_epochs=30) using a custom Dataset class. Measure the average per-epoch load time.
Round 2: Repeat round 1 using a PyTorch DataLoader with default settings (num_workers=0).
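
For readers without the gist open, here is a minimal sketch of the kind of Dataset the two rounds describe, assuming one batch per file. The names `BatchFileDataset`, `make_files`, and the `fmt` argument are hypothetical and are not taken from the gist; the file sizes are shrunk so the sketch runs quickly.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class BatchFileDataset(Dataset):
    """Hypothetical dataset: each item is one whole batch file loaded from disk."""

    def __init__(self, paths, fmt):
        self.paths = list(paths)
        self.fmt = fmt  # "numpy" for .npy files, "tensor" for .pt files

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        if self.fmt == "numpy":
            # Load the array, then wrap it as a tensor without copying.
            return torch.from_numpy(np.load(path))
        # .pt files already contain a serialized tensor.
        return torch.load(path)


def make_files(directory, n_files=3, rows=100, cols=20):
    """Write the same random batches in both formats; return the two path lists."""
    rng = np.random.default_rng(0)
    npy_paths, pt_paths = [], []
    for i in range(n_files):
        arr = rng.random((rows, cols), dtype=np.float32)
        npy_path = os.path.join(directory, f"batch_{i}.npy")
        pt_path = os.path.join(directory, f"batch_{i}.pt")
        np.save(npy_path, arr)
        torch.save(torch.from_numpy(arr), pt_path)
        npy_paths.append(npy_path)
        pt_paths.append(pt_path)
    return npy_paths, pt_paths
```

Round 1 then corresponds to looping over `BatchFileDataset(...)` directly, and round 2 to looping over `DataLoader(BatchFileDataset(...))`.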

Results (seconds per epoch, measured with time.time()):

testing file format 'numpy'...
(Dataset) per-Epoch Avg: 7.84 (std: 0.1635)
(Dataloader) per-Epoch Avg: 8.3208 (std: 0.1341)

testing file format 'tensor'...
(Dataset) per-Epoch Avg: 6.7154 (std: 0.0997)
(Dataloader) per-Epoch Avg: 8.5963 (std: 0.0877)

We thought the significant overhead associated with the tensor/DataLoader combination might be attributable to memory pinning of the loaded tensors (the DataLoader's pin_memory option), but that option is disabled by default.
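
A quick way to confirm the defaults mentioned above is to inspect a DataLoader built with no extra arguments. This is only a sketch: the one-element list stands in for a real dataset. Note that `pin_memory` refers to page-locked host (CPU) memory used to speed up later copies to the GPU, not to GPU memory itself.

```python
import torch
from torch.utils.data import DataLoader

# A plain list works as a map-style dataset (it has __len__ and __getitem__).
loader = DataLoader([torch.zeros(2, 2)])

print("pin_memory:", loader.pin_memory)    # default is False
print("num_workers:", loader.num_workers)  # default is 0 (load in the main process)
print("batch_size:", loader.batch_size)    # default is 1 (automatic batching is on)
```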

Code

https://gist.github.com/JakeColor/bb403a096f3c198f31c5a55f420ea738

benchmark_iteration_speed.py

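import time


def run_one_epoch(data, conn):
    # Time one full pass over `data`; each batch is loaded and then discarded.
    epoch_start = time.time()
    for batch in data:
        pass
    epoch_end = time.time()

    # Report the elapsed seconds through `conn` (presumably one end of a pipe).
    conn.send(round(epoch_end - epoch_start, 6))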


def measure_data_load_speed(data, num_epochs):

(The excerpt is truncated here; the full file is in the gist linked above.)

Le_Tr_ng_Giang · April 17, 2023, 3:17am

When you use a DataLoader with the default collate_fn, it converts every NumPy array to a tensor before returning it to you. So I think that's why the DataLoader is a bit slower.
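
The collate behavior can be checked with a small sketch. With automatic batching on (the default `batch_size=1`), the default collate function stacks the items into a new batch tensor, which adds a leading dimension and copies the data even when the items are already tensors. That may be part of why the tensor case also slows down, though this sketch does not measure timing. Passing `batch_size=None` disables automatic batching, so items are only converted, not stacked. `PreloadedBatches` is a hypothetical stand-in for the dataset in the gist.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class PreloadedBatches(Dataset):
    """Hypothetical dataset that returns in-memory items unchanged."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


tensor_items = [torch.zeros(4, 3) for _ in range(2)]
numpy_items = [np.zeros((4, 3), dtype=np.float32) for _ in range(2)]

# Default settings: each item is collated into a batch of size 1.
batched_tensor = next(iter(DataLoader(PreloadedBatches(tensor_items))))
batched_numpy = next(iter(DataLoader(PreloadedBatches(numpy_items))))

# batch_size=None: no stacking; NumPy arrays are still converted to tensors.
unbatched_tensor = next(iter(DataLoader(PreloadedBatches(tensor_items), batch_size=None)))
unbatched_numpy = next(iter(DataLoader(PreloadedBatches(numpy_items), batch_size=None)))

print(batched_tensor.shape, unbatched_tensor.shape)
# A different data pointer means the stacked batch is a copy of the item.
print(batched_tensor.data_ptr() == tensor_items[0].data_ptr())
```

If each file already holds a full batch, `batch_size=None` is one option worth benchmarking against the default.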