Why is DataLoader faster than simply calling torch.cat() on a Dataset?

I have a Dataset named unlabeled_set and a corresponding DataLoader named unlabeled_loader. To get batched data, I know that I can do the following with the DataLoader:

unlabeled_loader = DataLoader(unlabeled_set, batch_size=batch_size, shuffle=True)

for img, _ in tqdm(unlabeled_loader):
    out = model(img.to(device))

which runs at 1.2 it/s (that is, 0.83 s/it) according to tqdm.

However, if I just iterate over the dataset and use torch.cat() to build the batches myself, like this:

for i in tqdm(range(53)):
    inp = torch.as_tensor([])
    for j in range(128):
        inp = torch.cat((inp, torch.unsqueeze(unlabeled_set[i * 128 + j][0], 0)), 0)
    out = model(inp.to(device))

where 128 is the batch_size and 53 is dataset_len // 128. This version runs at 1.55 s/it, which is much slower.

So my question is: why is the DataLoader faster, and how can I modify the second approach to speed it up?

Not sure what your batch size is, but that sounds incredibly slow. Are you using Windows?

If so, Windows has some issues with parallelizing workers, as mentioned here:

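For reference, worker parallelism in DataLoader is controlled by its num_workers argument. A minimal sketch of how you might tune it (the worker count below is only an illustrative value, not a recommendation):

import torch
from torch.utils.data import DataLoader

# Sketch: num_workers=0 keeps loading in the main process (often the safe
# choice on Windows); larger values enable worker parallelism on Linux/macOS.
# pin_memory=True can speed up host-to-GPU copies when using CUDA.
unlabeled_loader = DataLoader(
    unlabeled_set,
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,      # illustrative; try 0 first on Windows
    pin_memory=True,
)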
If your dataset fits into CPU memory or onto another GPU, I suggest putting all of your data into memory. With this loader, I went from several seconds per batch to a fraction of a second. Here is an example:

import torch
import numpy as np
from numpy.random import default_rng


class FastLoader:
    def __init__(self, dataset, labels, batch_size, device, testing=False):
        self.length = dataset.size()[0]
        self.dataset = dataset
        self.labels = labels
        self.indexer = np.arange(self.length)
        self.testing = testing
        self.batch_size = batch_size
        # Number of batches, rounding up so a final partial batch is included.
        self.max_idx = (self.length + batch_size - 1) // batch_size
        # Size of the final (possibly partial) batch.
        self.fin_batch_len = self.length - (self.max_idx - 1) * batch_size

        self.device = device

        if not testing:
            self.shuffler()

    def shuffler(self):
        # Reshuffle the index permutation in place (call between epochs).
        rng = default_rng()
        rng.shuffle(self.indexer)

    def __len__(self):
        # Number of samples, not number of batches.
        return self.length

    def get_batch(self, idx):
        if idx == self.max_idx - 1:
            # The last batch may be shorter than batch_size.
            mini_idx = self.indexer[idx * self.batch_size:idx * self.batch_size + self.fin_batch_len]
        else:
            mini_idx = self.indexer[idx * self.batch_size:(idx + 1) * self.batch_size]

        # Fancy indexing pulls the whole batch out of the in-memory tensor at once.
        data = self.dataset[mini_idx, ...]
        labels = self.labels[mini_idx, ...]

        # Preprocessing goes here; make sure any new tensors are created on self.device.

        return data, labels


# Usage example
device = torch.device("cpu")
A = torch.rand((10000, 3, 32, 32), device=device)
labels = torch.rand(10000, device=device)
trainloader = FastLoader(A, labels, batch_size=128, device=device)

for idx in range(trainloader.max_idx):
    batch_data, batch_labels = trainloader.get_batch(idx)
    print(batch_data.size(), batch_labels.size())

# You can reshuffle the trainloader indices between epochs with:
trainloader.shuffler()

With the above, all you need to feed it is a single tensor holding the whole dataset, with the same trailing dims as each batch, e.g. (num_samples, channels, height, width).
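If your data currently lives in a map-style Dataset like the unlabeled_set from the question, one way to build that tensor is to stack every sample once up front. A minimal sketch, assuming every item is an (image, label) pair, all images share the same shape, and the whole set fits on device:

# Sketch: materialize a Dataset as one tensor for FastLoader.
all_imgs = torch.stack([unlabeled_set[i][0] for i in range(len(unlabeled_set))]).to(device)
all_labels = torch.tensor([unlabeled_set[i][1] for i in range(len(unlabeled_set))], device=device)
trainloader = FastLoader(all_imgs, all_labels, batch_size=128, device=device)

You pay the stacking cost once, and every subsequent epoch batches by pure tensor indexing.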

import glob
import torch
import pandas as pd


def data2gputens(path, device, data_dim):
    # Load every CSV under `path` into one float32 tensor on `device`.
    start_column, stop_column = 1, 6
    chunks = []
    for fname in glob.glob(path + "*.csv"):
        data_file = pd.read_csv(fname, header=None)
        chunks.append(torch.tensor(data_file.iloc[:, start_column:stop_column].values,
                                   device=device, requires_grad=False,
                                   dtype=torch.float32).view(-1, *data_dim))
    # Concatenate once at the end rather than inside the loop.
    if not chunks:
        return torch.empty((0, *data_dim), device=device, dtype=torch.float32)
    return torch.cat(chunks)

# Usage of data2gputens:
directory = "data/"
device = torch.device("cpu")
data_dim = (95, 5)
train_dataset = data2gputens(directory, device, data_dim)

Of course, you'll have to adjust the above definition to your data's shape and storage format, whether the samples are images, sequences, etc.; a sketch for an image folder is below.
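For example, a minimal sketch of an image-folder variant (it assumes PIL and torchvision are installed, that all images share the same size, and that the *.png glob is only illustrative):

import glob
import torch
from PIL import Image
import torchvision.transforms.functional as TF


def imgs2gputens(path, device):
    # Sketch: load a folder of same-sized images into one (N, C, H, W) tensor.
    chunks = []
    for fname in glob.glob(path + "*.png"):
        img = Image.open(fname).convert("RGB")
        chunks.append(TF.to_tensor(img))  # (C, H, W), float32 in [0, 1]
    return torch.stack(chunks).to(device)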

inp = torch.cat((inp, ...)) will slow down your code, since you are concatenating onto the same tensor in every iteration. Append the data to a list instead, and create the batch tensor only after all samples of the current batch have been appended to it.
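Applied to the loop from the question, that could look like this (a sketch; torch.stack builds the batch from a list of equally shaped samples in a single allocation):

# Sketch of the fix: collect samples in a Python list, then build the
# batch tensor once per batch instead of growing it with torch.cat.
for i in tqdm(range(53)):
    samples = [unlabeled_set[i * 128 + j][0] for j in range(128)]
    inp = torch.stack(samples)  # (128, C, H, W) in one allocation
    out = model(inp.to(device))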

Thanks a lot. This really helps with my confusion.