Why is DataLoader faster than simply calling torch.cat() on a Dataset?

I have a Dataset named unlabeled_set and a corresponding DataLoader named unlabeled_loader. To get batched data, I know that I can do the following with the DataLoader:

unlabeled_loader = DataLoader(unlabeled_set, batch_size=batch_size, shuffle=True)

for img, _ in tqdm(unlabeled_loader):
    out = model(img.to(device))

which runs at 1.2 it/s (that is, 0.83 s/it) according to tqdm.

However, if I just iterate over the dataset and use torch.cat() to build the batches myself, like this:

for i in tqdm(range(53)):
    inp = torch.as_tensor([])
    for j in range(128):
        inp = torch.cat((inp, torch.unsqueeze(unlabeled_set[i * 128 + j][0], 0)), 0)
    out = model(inp.to(device))

where 128 is the batch_size and 53 is dataset_len // 128. This version runs at 1.55 s/it, which is much slower.

So my question is: why is the DataLoader faster, and how can I modify the second approach to speed it up?

Not sure what your batch size is, but that sounds incredibly slow. Are you using Windows?

If so, Windows has some issues with parallelizing workers, as mentioned here:

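For reference, worker parallelism in DataLoader is controlled by its num_workers argument. A minimal sketch of how you might tune it (the worker count below is only an illustrative value, not a recommendation):

import torch
from torch.utils.data import DataLoader

# Sketch: num_workers=0 keeps loading in the main process (often the safe
# choice on Windows); larger values enable worker parallelism on Linux/macOS.
# pin_memory=True can speed up host-to-GPU copies when using CUDA.
unlabeled_loader = DataLoader(
    unlabeled_set,
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,      # illustrative; try 0 first on Windows
    pin_memory=True,
)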
If your dataset fits into CPU memory or onto another GPU, I suggest putting all of your data into memory. With this loader, I went from several seconds per batch to a fraction of a second. Here is an example:

import torch
import numpy as np
from numpy.random import default_rng


class FastLoader:
    def __init__(self, dataset, labels, batch_size, device, testing=False):
        self.length = dataset.size()[0]
        self.dataset = dataset
        self.labels = labels
        self.indexer = np.arange(self.length)
        self.testing = testing
        self.batch_size = batch_size
        # Number of batches, rounding up so a final partial batch is included.
        self.max_idx = (self.length + batch_size - 1) // batch_size
        # Size of the final (possibly partial) batch.
        self.fin_batch_len = self.length - (self.max_idx - 1) * batch_size

        self.device = device

        if not testing:
            self.shuffler()

    def shuffler(self):
        # Reshuffle the index permutation in place (call between epochs).
        rng = default_rng()
        rng.shuffle(self.indexer)

    def __len__(self):
        # Number of samples, not number of batches.
        return self.length

    def get_batch(self, idx):
        if idx == self.max_idx - 1:
            # The last batch may be shorter than batch_size.
            mini_idx = self.indexer[idx * self.batch_size:idx * self.batch_size + self.fin_batch_len]
        else:
            mini_idx = self.indexer[idx * self.batch_size:(idx + 1) * self.batch_size]

        # Fancy indexing pulls the whole batch out of the in-memory tensor at once.
        data = self.dataset[mini_idx, ...]
        labels = self.labels[mini_idx, ...]

        # Preprocessing goes here; make sure any new tensors are created on self.device.

        return data, labels


# Usage example
device = torch.device("cpu")
A = torch.rand((10000, 3, 32, 32), device=device)
labels = torch.rand(10000, device=device)
trainloader = FastLoader(A, labels, batch_size=128, device=device)

for idx in range(trainloader.max_idx):
    batch_data, batch_labels = trainloader.get_batch(idx)
    print(batch_data.size(), batch_labels.size())

# You can reshuffle the trainloader indices between epochs with:
trainloader.shuffler()

With the above, all you need to feed it is a single tensor holding the whole dataset, with the same trailing dims as each batch, e.g. (num_samples, channels, height, width).
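If your data currently lives in a map-style Dataset like the unlabeled_set from the question, one way to build that tensor is to stack every sample once up front. A minimal sketch, assuming every item is an (image, label) pair, all images share the same shape, and the whole set fits on device:

# Sketch: materialize a Dataset as one tensor for FastLoader.
all_imgs = torch.stack([unlabeled_set[i][0] for i in range(len(unlabeled_set))]).to(device)
all_labels = torch.tensor([unlabeled_set[i][1] for i in range(len(unlabeled_set))], device=device)
trainloader = FastLoader(all_imgs, all_labels, batch_size=128, device=device)

You pay the stacking cost once, and every subsequent epoch batches by pure tensor indexing.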

import glob
import torch
import pandas as pd


def data2gputens(path, device, data_dim):
    # Load every CSV under `path` into one float32 tensor on `device`.
    start_column, stop_column = 1, 6
    chunks = []
    for fname in glob.glob(path + "*.csv"):
        data_file = pd.read_csv(fname, header=None)
        chunks.append(torch.tensor(data_file.iloc[:, start_column:stop_column].values,
                                   device=device, requires_grad=False,
                                   dtype=torch.float32).view(-1, *data_dim))
    # Concatenate once at the end rather than inside the loop.
    if not chunks:
        return torch.empty((0, *data_dim), device=device, dtype=torch.float32)
    return torch.cat(chunks)

# Usage of data2gputens:
directory = "data/"
device = torch.device("cpu")
data_dim = (95, 5)
train_dataset = data2gputens(directory, device, data_dim)

Of course, you'll have to adjust the above definition to your data's shape and storage format, whether the samples are images, sequences, etc.; a sketch for an image folder is below.
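For example, a minimal sketch of an image-folder variant (it assumes PIL and torchvision are installed, that all images share the same size, and that the *.png glob is only illustrative):

import glob
import torch
from PIL import Image
import torchvision.transforms.functional as TF


def imgs2gputens(path, device):
    # Sketch: load a folder of same-sized images into one (N, C, H, W) tensor.
    chunks = []
    for fname in glob.glob(path + "*.png"):
        img = Image.open(fname).convert("RGB")
        chunks.append(TF.to_tensor(img))  # (C, H, W), float32 in [0, 1]
    return torch.stack(chunks).to(device)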

inp = torch.cat((inp, ...)) will slow down your code, since you are concatenating onto the same tensor in every iteration. Append the data to a list instead, and create the batch tensor only after all samples of the current batch have been appended to it.
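Applied to the loop from the question, that could look like this (a sketch; torch.stack builds the batch from a list of equally shaped samples in a single allocation):

# Sketch of the fix: collect samples in a Python list, then build the
# batch tensor once per batch instead of growing it with torch.cat.
for i in tqdm(range(53)):
    samples = [unlabeled_set[i * 128 + j][0] for j in range(128)]
    inp = torch.stack(samples)  # (128, C, H, W) in one allocation
    out = model(inp.to(device))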

Thanks a lot. This really helps with my confusion.