Dataloader slows down when training


I have built a custom dataset for medical images saved as numpy arrays (.npy). Each dataset uses pandas to load a csv with the paths to the files (2 source images and 1 target segmentation map). The arrays are loaded with np.load. The problem is that when I start up the loader it loads the arrays almost instantly (about 0.01s per array), but about 300 iterations in it takes much longer (0.4-0.5s per array). So my initial loading time per batch is close to 0, but after 300 iterations it varies between 5-8s per batch.

Several hypotheses I have are:

  • A memory leak. This seems unlikely because RAM usage does not increase.
  • A hardware issue. This seems unlikely because the issue would show up much earlier.
  • Some library is causing this issue in combination with PyTorch multiprocessing.

Settings of the dataloader in which the dataset is wrapped
num_workers = 2 (setting this to a higher or lower number does not solve this problem)
pin_memory = True
batch_size = 8
shuffle = True

Pytorch Version: 1.5.1
OS: Windows 10
GPU: Nvidia Quadro P6000
RAM: 64 GB


Loop (simplified, stripped the deep learning stuff, just dataloading)

def train(dct: Dict[str, Any]) -> bool:
    """
    Training loop for UNet 3D + slices2D context method of input
    :param dct:
    Finished: bool
    """

    # make pytorch deterministic

    # enable automatic garbage collection
    # gc.enable()

    # setup network
    network_config = setup_segmentation_network(dct)

    # get dataset parameters
    train_dataset, test_dataset, loader_dct = get_datasets_and_parameters(

    # setup loaders
    train_loader = DataLoader(
            pin_memory = True

    test_loader = DataLoader(
    # create Summarywriter and logger
    train_writer = SummaryWriter(dct.tensorboard_dir + os.sep + 'train')
    val_writer = SummaryWriter(dct.tensorboard_dir + os.sep + 'evaluation')
    evaluation_writer = EvaluationWriter(dct)

    for epoch in range(dct.epochs+1):

        for batch_idx, batch_samples in enumerate(tqdm(train_loader, desc='Epoch {}'.format(epoch))):

            start = time.time()
            end = time.time()
            print('Iter: {} \t Loading Time {}'.format(batch_idx, end - start))



            if batch_idx > 0 and batch_idx % 10 ==0:

            del batch_samples#, loss


Dataloader (Simplified to just load the data, no augmentation or other preprocessing)

from torch.utils.data import Dataset

import pandas as pd
import numpy as np
from typing import Dict, Any
import time
import gc

from ...helpers.initialize_params import set_preprocessing_func_params_from_dict, \
    set_preprocessing_func_params_from_list, set_augmentation_params

from ...helpers.apply_preprocessing import preprocess
from ...helpers.apply_function_list import apply_function_list
from ...helpers.correct_image_dimensions import correct_format

class MultiModalScanNPYDataset(Dataset):
    """Loader for multi modal input i.e. CT + NCCT, or different MR sequences"""

    def __init__(self, csv_file, params):
        super(MultiModalScanNPYDataset, self).__init__()
        print('____________ INITIALIZED ___________')
        # load the csv file
        df = pd.read_csv(csv_file)
        self.paths = df.loc[:, ~df.columns.str.contains('^Unnamed')]

        # get the column names for the source and target volumes
        self.source_names = params.source_names
        self.target_name = params.target_name

        # set names of how each of the source and target names is returned by dataloader
        self.source_out_names = self.source_names
        self.target_out_name = self.target_name

        if 'source_out_names' in params:
            self.source_out_names = params.source_out_names

        if 'target_out_name' in params:
            self.target_out_name = params.target_out_name

        self.preprocessing_funcs = None
        if 'preprocessing_funcs' in params.preprocessing:
            self.preprocessing_funcs = params.preprocessing.preprocessing_funcs

            if isinstance(params.preprocessing.preprocessing_funcs, dict):
                self.preprocessing_funcs = set_preprocessing_func_params_from_dict(

            elif isinstance(params.preprocessing.preprocessing_funcs, list):
                self.preprocessing_funcs = set_preprocessing_func_params_from_list(

        # if augmenting, initialize augmentation objects
        self.augmentation = False
        self.augmentation_objs = None
        if 'augmentation_objs' in params:
            self.augmentation_objs = set_augmentation_params(

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):

        # resample random data augmentation parameters
        if self.augmentation_objs and self.augmentation:
            for augmentation_object in self.augmentation_objs:

        # load and preprocess source images
        out_images = {}
        for out_name, name in zip(self.source_out_names, self.source_names):
            st = time.time()
            volume = np.load(self.paths.iloc[[idx]][name].values[0]).astype(np.float32)

            # TODO: fix preprocess function, now just puts images into dict
            out_images = preprocess(
                volume, out_images, out_name,
                name, self.preprocessing_funcs)

            # get both the patient_id and slice number
            out_images['patient_id'], out_images['slice_number'] = patient_id.split('_')

        # apply data augmentation on source images
        target_segmentation = np.load(self.paths.iloc[[idx]][self.target_name].values[0]).astype(np.int16)

        out_images[self.target_out_name] = target_segmentation

        del target_segmentation, volume

        corrected = correct_format(out_images)

        return corrected

    def get_sample(self, item):
        return self.__getitem__(item)

    def set_augmentation(self):
        # toggle augmentation on or off
        if self.augmentation:
            self.augmentation = False
        else:
            self.augmentation = True

Thank you for your help!


Well, you can check the performance with no workers.
Anyway, it seems strange. You should check your hard disk's temperature and performance. Cheap hard disks can suffer a drop in performance under sustained load.

Thank you Juan for your response. Without workers (that is, loading in the main process), the same issue occurs.

As for potential issues with the hard drive: I have just checked the status of the hard drive and the read/write speeds, and these seem to be fine. In addition, I transferred about 1 TB of data from the hard drive a few days ago, and I was able to do this at around 30 MB/s over long periods of time. So I don't think it is a hardware issue.

So can you try plain data loading? That is, just iterating over the dataloader and nothing else, to rule out that the issue is related to other code (IO, GPU bottleneck, etc.).

Well, the code I have provided does that right now.

I have just iterated over the dataloader. About 300-350 iterations in, it goes from practically 0 seconds to 5-8 seconds per batch.
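For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of timing only the wait for each batch, with nothing else inside the timed region. The helper names `timed_batches` and `slow_batches` are made up for illustration and are not part of PyTorch. The code works on any iterable, so it does not assume anything about the dataset above.

```python
import time
from typing import Any, Iterable, Iterator, List, Tuple


def timed_batches(loader: Iterable[Any]) -> Iterator[Tuple[int, float, Any]]:
    """Yield (index, seconds spent waiting for the batch, batch)."""
    iterator = iter(loader)
    idx = 0
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)  # the only work inside the timed region
        except StopIteration:
            return
        waited = time.perf_counter() - start
        yield idx, waited, batch
        idx += 1


def slow_batches(loader: Iterable[Any], threshold_s: float) -> List[Tuple[int, float]]:
    """Return (index, wait time) for every batch that took longer than threshold_s."""
    return [(i, t) for i, t, _ in timed_batches(loader) if t > threshold_s]
```

With a PyTorch DataLoader, something like `slow_batches(train_loader, threshold_s=1.0)` should list the batches that stalled for more than a second. Keep in mind that with num_workers > 0 the batches are prefetched in background processes, so the measured wait is the time the main process was stalled, not the full cost of loading a batch.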

Sorry, then I have no clue.
Numpy is really stable and widely used for data loading.
The last thing I would suggest is to try without PyTorch's dataloader, although in the simplest case it's just an iterator.


This issue seems to have been caused by a fragmented hard drive. Windows has automatic defragmentation built in, but it only runs periodically.

The preprocessing I used (turning the volumes into stacks of slices) caused the hard drive to become fragmented. After manually defragmenting the hard drive, the loading speed was constant.
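One way to tell a disk-level slowdown apart from a NumPy or PyTorch problem is to time raw byte reads of the same files, bypassing np.load and the dataloader entirely. The following is only a sketch of that idea using the standard library. The function names `time_raw_reads` and `throughput_mb_per_s` are hypothetical, and the list of paths is assumed to come from the csv.

```python
import time
from pathlib import Path
from typing import Iterable, List, Tuple


def time_raw_reads(paths: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Read each file fully as bytes; return (path, size in bytes, seconds)."""
    results = []
    for path in paths:
        start = time.perf_counter()
        data = Path(path).read_bytes()  # plain disk read, no array parsing
        elapsed = time.perf_counter() - start
        results.append((str(path), len(data), elapsed))
    return results


def throughput_mb_per_s(results: List[Tuple[str, int, float]]) -> float:
    """Overall read throughput in MiB/s across all timed files."""
    total_bytes = sum(size for _, size, _ in results)
    total_time = sum(seconds for _, _, seconds in results)
    if total_time <= 0:
        return float("inf")
    return total_bytes / (1024 * 1024) / total_time
```

If the per-file times from `time_raw_reads` climb in the same way as the batch times, the bottleneck is most likely below Python, for example in the file system or the drive. Files read recently may be served from the operating system's cache, so a second pass over the same files can look much faster than the first.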
