Hi,
I have a codebase that uses a custom iterator over a numpy array. The iterator extends torch.utils.data.IterableDataset and implements the __next__ and __iter__ methods of the superclass.
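For reference, a minimal sketch of what that iterator looks like; the class name and the batching details here are placeholders, not the actual code:

import numpy as np
from torch.utils.data import IterableDataset

class NumpyArrayIterable(IterableDataset):
    # Hypothetical reconstruction: iterates over a numpy array in fixed-size batches.
    def __init__(self, data: np.ndarray, batch_size: int):
        self.data = data
        self.batch_size = batch_size

    def __iter__(self):
        self.pos = 0
        return self

    def __next__(self):
        if self.pos >= len(self.data):
            raise StopIteration
        batch = self.data[self.pos : self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch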
I am trying to switch to a Dataset + DataLoader implementation of the same setup. A simplified example is below.
import numpy as np
from torch.utils.data import DataLoader, Dataset

class NumpyArrayDataset(Dataset):
    def __init__(self) -> None:
        # 66,038 rows of 3 random integers each.
        self.data = np.random.randint(low=[0, 0, 0], high=[1226, 1226, 29517], size=(66038, 3))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
And then I use this in my code with:
dataset = NumpyArrayDataset()
dataloader = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
)
In the IterableDataset scenario, each batch is pushed onto the device by constructing a tensor from the numpy array directly on the GPU. In the Dataset scenario, since PyTorch's default collate automatically returns tensors when the data is a numpy array, I simply push the data onto the device with .to(device) after retrieving each batch.
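Roughly, the two transfer patterns look like this (a sketch; device, np_batch, and the loop variable are placeholders):

import torch

device = torch.device("cuda:0")

# IterableDataset scenario: the tensor is constructed from the numpy
# batch directly on the GPU.
batch_gpu = torch.tensor(np_batch, device=device)

# Dataset + DataLoader scenario: the default collate already returns a
# CPU tensor, which is then moved to the GPU.
for batch in dataloader:
    batch = batch.to(device)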
This switch alone causes a more than 3x drop in training speed, going from around 632,214 samples/second down to 182,106 samples/second.
The training job runs in a DataParallel setup on a node with 8 V100 GPUs. Admittedly, GPU utilization is pretty low (2% on the main GPU and around 1% on the rest).
I tried experimenting with pin_memory=True and various num_workers values, which results in much, much worse speed degradation.
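For concreteness, the variants I tried look roughly like this (num_workers varied across runs):

dataloader = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
    pin_memory=True,
    num_workers=4,  # tried several values here
)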
The goal is to migrate to DDP, and DDP requires a DataLoader with a DistributedSampler, hence my efforts to switch.
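That is, the target setup would be something along these lines (a sketch; it assumes the distributed process group has already been initialized in each worker):

from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset, shuffle=True)
dataloader = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    sampler=sampler,  # shuffle must not be set when a sampler is passed
)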
Any ideas on what might be the cause of this?