Optimize Dataloader Speed


I have been trying for some time to optimize my PyTorch dataloader so that the CPU does not bottleneck my training. As a first step, I switched from Windows 11 to Ubuntu 22.04 LTS, because creating multiprocessing workers takes much longer on Windows than on Ubuntu. I immediately saw that worker creation does in fact have much lower overhead on Ubuntu, and the overall time per epoch has also dropped. However, I am still not sure whether everything is optimally tuned.

Some information about my data-loading structure: I wrote a custom dataset, a custom batch sampler, and a custom dataloader class.

The custom dataset (shortened version shown here) loads all my training data into memory and the getitem method returns samples directly from the in-memory-stored numpy array:

class CustomDataset(Dataset):
    def __init__(self,df):
        # some preprocessing steps
        # ...

        # convert the pd.df to np.array held in-memory
        self.data = df.values
    def __getitem__(self,idx):
        # raise IndexError once idx runs past the end (relies on __len__, not shown here)
        if idx >= len(self): raise IndexError

        # compute data indices for slicing
        # idx0, idx1
        # features 
        values  = self.data[:,:,idx0:idx1]
        return values

The dataset should be ok like this, I guess. Next, my batch sampler class is written like this:

class CustomSampler():
    # mode shuffle: return batches of size batch_size until length of entire dataset is reached
    # mode sorted:  return ordered batches of size batch_size in steps of horizon until end of dataset is reached

    def __init__(self,dataset,batch_size,epoch_size):
        self.length = len(dataset)
        self.batch_size = batch_size
        self.epoch_size = epoch_size

    def __len__(self):
        return self.epoch_size

    def __iter__(self):
        yield from self.get_random_index()

    def get_random_index(self):
        for _ in range(len(self)):

            # first, draw a random end point in [batch_size, len(dataset) - 1] (high is exclusive in torch.randint)
            end = torch.randint(low=self.batch_size, high=self.length, size=(), dtype=torch.int64)

            # then draw batch_size indices (with replacement) in [0, end]
            yield torch.randint(low=0,high=end+1,size=(self.batch_size,))

The batch_sampler’s epoch_size is the number of batches that the dataloader returns per epoch. For each individual batch, a random end point “end” is drawn first. Then batch_size indices in [0,end] are drawn and passed to the dataset via the dataloader.
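To see how indices are distributed under this two-stage draw, the scheme can be mimicked without torch. This is only a sketch: `draw_batch` and `index_histogram` are hypothetical helper names, and Python's `random` module stands in for `torch.randint`.

```python
import random
from collections import Counter


def draw_batch(length, batch_size, rng):
    """Mimic one batch of the two-stage draw using the stdlib RNG."""
    # end point in [batch_size, length - 1]; the upper bound is exclusive
    end = rng.randrange(batch_size, length)
    # batch_size indices in [0, end], drawn with replacement
    return [rng.randrange(0, end + 1) for _ in range(batch_size)]


def index_histogram(length, batch_size, num_batches, seed=0):
    """Count how often each dataset index is drawn over many batches."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_batches):
        counts.update(draw_batch(length, batch_size, rng))
    return counts
```

Because every index must lie at or below a random end point, low indices are eligible for more batches than high ones, so the draw is not uniform over the dataset. Whether that matters depends on what the sampler is meant to achieve.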

Finally, the dataloader init looks something like this:

class CustomDataLoader(torch.utils.data.DataLoader):
    def __init__(self,dataset,batch_size,epoch_size=None,num_workers=0,prefetch_factor=2):

        # at first create the batch_sampler 
        batch_sampler = CustomSampler(dataset,batch_size,epoch_size)

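        # next, create dataloader with shuffle = False (will be overwritten by the sampler state)
        # assumption: the leading arguments below are a guess; only persistent_workers
        # and prefetch_factor are given, and pin_memory=True follows the text further down
        super().__init__(
            dataset,
            batch_sampler=batch_sampler,
            shuffle=False,
            num_workers=num_workers,
            pin_memory=True,
            persistent_workers=True if num_workers>0 else False,
            prefetch_factor=prefetch_factor if num_workers>0 else 2
        )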

This looks like a standard dataloader with standard sampling. The reason I wrote custom classes for the batch_sampler and dataloader is that the batch_sampler can have several states, for which I did not show the code here. In the basic state, the batch_sampler (shown here) acts like a normal sampler that returns shuffled indices between [0,length(dataset)].

I found that using pin_memory, persistent_workers and a tuned prefetch_factor yields a significant speedup (compared to pin_memory = False and persistent_workers = False). To find the best settings, I ran a small grid search over (num_workers,prefetch_factor). The average times per epoch (test run: 100 epochs) are shown in the table below:


The rows show the value of num_workers (nw) and the columns the value of prefetch_factor (pf). As can be seen, all settings yield approximately the same time per epoch, except num_workers=0 (not sure why it is lower?). This leads me to believe that my CPU is not actually bottlenecking the training procedure, as otherwise there should be a local minimum for some setting with num_workers>0.
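A grid search of this kind could be sketched roughly as follows. `make_loader` and `run_epoch` are hypothetical callables (one builds the dataloader for a given setting, the other runs one epoch over it), so this shows the measurement pattern rather than my exact script.

```python
import itertools
import time


def grid_search(make_loader, run_epoch, worker_counts, prefetch_factors, num_epochs=5):
    """Return {(num_workers, prefetch_factor): average seconds per epoch}."""
    results = {}
    for nw, pf in itertools.product(worker_counts, prefetch_factors):
        loader = make_loader(num_workers=nw, prefetch_factor=pf)
        # warm-up epoch so that worker start-up is not part of the timing
        run_epoch(loader)
        start = time.perf_counter()
        for _ in range(num_epochs):
            run_epoch(loader)
        results[(nw, pf)] = (time.perf_counter() - start) / num_epochs
    return results
```

Note that prefetch_factor only takes effect when num_workers>0, so differences along the nw=0 row reflect run-to-run noise rather than the setting itself.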

While doing the individual 100-epoch runs to compute the figures in the table (from left to right, top to bottom), I monitored the GPU stats using nvidia-smi dmon -d 2 -f log.txt. This saves stats such as memory usage, GPU utilization, temperature, etc. to a file in 2-second steps. The results for GPU utilization are shown here:

The periods where utilization drops to zero are the actual pauses between the runs. One can see that after the num_workers=0 runs, the utilization stays at a steady 80+% for all subsequent runs, which suggests that the GPU is well utilized and possibly not bottlenecked by the CPU (right?). (Note: a similar plot is obtained for memory utilization, which is at approximately 75% all the time, but BELOW 20% for the first runs with num_workers=0.) What still puzzles me is that the first runs (nw=0), although they have lower utilization, yield the shortest average times per epoch. I really don’t get that. If this is real, the parallelization seems to impair performance rather than improve it. This is quite inconsistent. Does anybody have an idea what could be the cause? Are the dataloader settings OK?
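Instead of reading the numbers off a plot, the log can also be summarized per run. The sketch below assumes that the log contains a header line starting with "#" that names the columns (including `sm` for GPU utilization) and that data rows are whitespace-separated. The exact layout may differ between driver versions, and `mean_utilization` is a hypothetical helper name.

```python
def mean_utilization(lines, column="sm"):
    """Average one column of a dmon-style log; returns None if no data rows are found."""
    col = None
    values = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("#"):
            # header row: drop the leading "#" and look up the column position
            names = fields[1:] if fields[0] == "#" else [fields[0][1:]] + fields[1:]
            if column in names:
                col = names.index(column)
            continue
        if col is None or col >= len(fields):
            continue
        try:
            values.append(float(fields[col]))
        except ValueError:
            # placeholders such as "-" are skipped
            continue
    return sum(values) / len(values) if values else None
```

Splitting the log at the idle gaps between runs and calling this once per segment would give one utilization figure per (nw, pf) setting.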

I hope that someone can help me clarify this. Thanks!
Best, JZ

How is the epoch time being measured here? In general I would also experiment with increasing the number of workers until there is no speedup observed.
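One way to separate data-loading time from compute time is to iterate over the dataloader without doing any training work. A minimal sketch, assuming `loader` is any iterable of batches (`time_loader_only` is a hypothetical helper name):

```python
import time


def time_loader_only(loader, num_epochs=3):
    """Return (average seconds per epoch, batches per epoch) for pure batch fetching."""
    durations = []
    num_batches = 0
    for _ in range(num_epochs):
        start = time.perf_counter()
        num_batches = 0
        for _batch in loader:
            num_batches += 1
        durations.append(time.perf_counter() - start)
    average = sum(durations) / len(durations) if durations else 0.0
    return average, num_batches
```

If this number is far below the full epoch time, data loading is probably not the bottleneck. When timing the full training loop on a GPU instead, keep in mind that CUDA kernels run asynchronously, so calling torch.cuda.synchronize() before reading the clock avoids attributing GPU time to the wrong place.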