DataLoader is more than 4x slower

Hi,

I have a codebase that uses a custom iterator over a numpy array. The iterator extends torch.utils.data.IterableDataset and implements the __next__ and __iter__ methods of the superclass.

I am trying to switch to a Dataset + DataLoader implementation of the same setup. A simplified example is below.

    import numpy as np
    from torch.utils.data import Dataset

    class NumpyArrayDataset(Dataset):
        def __init__(self) -> None:
            # the whole dataset lives in memory as one numpy array
            self.data = np.random.randint(low=[0, 0, 0], high=[1226, 1226, 29517], size=(66038, 3))

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            return self.data[idx]

And then I use this in my code with:

    dataset = NumpyArrayDataset()
    dataloader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        shuffle=True
    )

In the IterableDataset scenario, each batch is pushed onto the device by constructing a tensor from the numpy array directly on the GPU. In the Dataset scenario, since the DataLoader automatically converts numpy arrays into tensors, I simply push each batch onto the device with .to(device) after retrieving it.

This switch alone causes a more than 3x drop in training speed, from around 632,214 samples/second to 182,106 samples/second.

The training job runs in a DataParallel setup on a node with 8 V100 GPUs. Admittedly, GPU utilization is pretty low (2% on the main GPU and around 1% on the rest).

I tried experimenting with pin_memory=True and various num_workers values, but that results in much, much worse speed degradation.

The goal is to migrate to DDP, and DDP requires a DataLoader with a Sampler, hence my effort to switch.

Any ideas on what might be the cause of this?

The Dataset and DataLoader classes are most appropriate for data that is accessed from your hard drive. But there are issues on Windows operating systems when setting num_workers ≠ 0.

If you can load all of your data into RAM or, better yet, onto a GPU, you can use something like this:
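A rough sketch of that idea (the dtype, batch size, and device are just placeholders, and it assumes the whole array fits in device memory):

    import numpy as np
    import torch

    class InMemoryBatcher:
        """Keeps the full array as one tensor on the device and slices batches from it."""
        def __init__(self, data: np.ndarray, batch_size: int, device: str = "cuda"):
            # one host-to-device transfer, done up front
            self.data = torch.as_tensor(data, dtype=torch.int64).to(device)
            self.batch_size = batch_size

        def __iter__(self):
            # shuffle on the device each epoch, then yield slices of the permutation
            perm = torch.randperm(len(self.data), device=self.data.device)
            for start in range(0, len(perm), self.batch_size):
                yield self.data[perm[start:start + self.batch_size]]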

Hi,

In both scenarios I have the entire dataset loaded into memory. The code is running on a Linux machine. In the training loop, I push every batch onto a GPU. The only difference is that with the custom iterator I build a tensor directly on the GPU, since the iterator returns a slice of a numpy array:

    source = torch.tensor(data[:, 0], dtype=torch.int64, device=device, requires_grad=False)
    target = torch.tensor(data[:, 1], dtype=torch.int64, device=device, requires_grad=False)

With the Dataset/DataLoader setup, I only push it onto the GPU, since PyTorch has already converted the numpy array into a tensor:

    source = data[:, 0].to(device)
    target = data[:, 1].to(device)

I am just curious what it is in the DataLoader that is causing such a crazy slowdown. I tried running the model on a single GPU (no DataParallel) and the difference is even more striking: the iterator runs at 2,784,729 samples/second, while the Dataset/DataLoader approach runs at 206,568 samples/second. Pinning memory or changing the number of workers only slows things down further.

Edit: Batching appears to be the culprit. In the iterator implementation, next() returns a whole batch, while in the DataLoader approach batching happens “behind the scenes”. I tried dropping the batch size to 1 in both scenarios and the speeds are roughly comparable (the DataLoader is still a tad slower at 4,663,136 samples/second, versus 4,861,653 samples/second for the iterator).

Any ideas on what could be causing such a slowdown when batching in the DataLoader?

Alternatively, can I use a custom iterator that returns a batch from its next() method with DDP? How would I go about implementing the Sampler?

The main advantage of the DataLoader is loading files from a hard drive. It does so iteratively, because the work can be spread across CPU cores with the num_workers argument.

If I’m not mistaken, the vanilla data loader fetches samples one at a time (one __getitem__ call per index) and then collates them into a batch, which is likely where the per-batch overhead comes from.
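One way around that per-sample fetching, if I read the DataLoader docs right, is to disable automatic batching and hand the dataset a whole list of indices at once, so the numpy slice happens once per batch (the batch size below is just a placeholder):

    from torch.utils.data import BatchSampler, DataLoader, RandomSampler

    dataset = NumpyArrayDataset()
    dataloader = DataLoader(
        dataset=dataset,
        batch_size=None,  # disable automatic batching / per-sample collation
        sampler=BatchSampler(RandomSampler(dataset), batch_size=8192, drop_last=False),
    )
    # __getitem__ now receives a list of indices, so self.data[idx] returns the whole
    # batch in one numpy slice, which the loader converts to a single tensor.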

Regarding DDP, in the example given here, there is no need to split the data for each GPU. You just pass the training data into the model and PyTorch handles the distribution behind the scenes.

On a side note, you may get an additional speedup if you keep your custom slicing on the GPU as PyTorch tensors. PyTorch tensors on the GPU can be indexed by NumPy arrays from the CPU, so you can also take advantage of NumPy’s shuffling on the index. This can be especially beneficial if your data-loading pipeline applies any calculation-heavy augmentations or transforms.
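For example (a small sketch; the array shape and batch size are just placeholders):

    import numpy as np
    import torch

    # keep the whole dataset resident on the GPU
    data_gpu = torch.as_tensor(np.random.randint(0, 29517, size=(66038, 3))).to("cuda")

    idx = np.arange(len(data_gpu))
    np.random.shuffle(idx)            # shuffle the index on the CPU with NumPy
    batch = data_gpu[idx[:8192]]      # index the GPU-resident tensor with the NumPy index array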

I see, thanks! That explains why num_workers does not help in my scenario since all the data is already in memory.

From this post it looks like DDP will not shard the data, and each process will iterate over the entire dataset, if you don’t pass a DistributedSampler :frowning:
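For reference, the usual DistributedSampler pattern looks roughly like this (assuming the process group is already initialized; num_epochs and batch_size are placeholders):

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    sampler = DistributedSampler(dataset, shuffle=True)  # shards the indices across the DDP ranks
    dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # so each epoch gets a different shuffle across ranks
        for batch in dataloader:
            ...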