Same GPU, same pytorch training script. Training time sometimes fast, sometimes slow

Hi, I have a problem when training PyTorch models. It is a simple face classification task trained with cross-entropy loss. The images of each person are in the same folder. There are 320,000 images in 7,000 folders.

I run the same Python script on the same machine (Ubuntu 16.04, one GTX 1080 Ti, PyTorch 1.8, Python 3.6, CUDA 10.2).

The following are screenshots of different runs:
Fast run: (about 5 iters/s)

Slow run: (about 1 iters/s)

I assume this is an I/O problem when loading the data, since the process state D means uninterruptible sleep (usually I/O).

Here is the main code I use to load the data:
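One way to check the I/O assumption is to measure how long the training loop waits for each batch compared to how long the training step itself takes. The sketch below is only an illustration: `time_data_wait` and `slow_loader` are hypothetical names, not part of PyTorch, and the loader here is a stand-in generator rather than a real `DataLoader`.

```python
import time


def time_data_wait(batches, step_fn, max_iters=100):
    """Return (seconds spent waiting for batches, seconds spent in step_fn)."""
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    for _ in range(max_iters):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the loader fetches data
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward/backward/optimizer step would go here
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


if __name__ == "__main__":
    def slow_loader(n, delay):
        # Stand-in for a DataLoader whose batches arrive slowly.
        for i in range(n):
            time.sleep(delay)
            yield i

    wait, compute = time_data_wait(slow_loader(20, 0.01), lambda b: None)
    print(f"waiting on data: {wait:.3f}s, compute: {compute:.3f}s")
```

With a real model on the GPU, `step_fn` should end with `torch.cuda.synchronize()`, otherwise the asynchronous CUDA kernels make the compute time look shorter than it is. If the wait time dominates in the slow runs, the bottleneck is in data loading rather than in the model.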

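import numpy as np
import torchvision
import torchvision.transforms as T
from sklearn.model_selection import train_test_split
from torch.utils import data

train_transforms = T.Compose([
        T.ToTensor(),  # Normalize works on tensors, not PIL images
        T.Normalize(mean=[0.5], std=[0.5]),
])

train_dataset = torchvision.datasets.ImageFolder(data_dir, transform=train_transforms)

# train_size is the fraction of the whole training dataset to use
if args.train_size != 1:
    # stratified split: each class keeps the same proportion in both parts
    train_idx, val_idx = train_test_split(np.arange(len(train_dataset.targets)),
                                          train_size=args.train_size,
                                          random_state=args.random_state,
                                          stratify=train_dataset.targets)
    train_dataset = data.Subset(train_dataset, train_idx)

# shuffle=True: data is re-shuffled at every epoch
trainloader = data.DataLoader(train_dataset, batch_size=args.train_batch_size, shuffle=True, num_workers=args.num_workers)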

I am confused about why this happens, since I am not running any other programs or applications at the same time.
I have tried using the data_prefetcher from an example on the official apex website, but it did not solve the problem.

Can anyone give me some insight into how to solve this problem? Thanks a lot.

I have spent 10 days on this and did not find a solution. It turns out that “shuffle=True” causes the problem. I tried several things:
(1) using a data prefetcher
(2) creating the dataset in the h5py dataset format, which is also too slow
(3) creating a new dataloader before each epoch and using a random index when creating it, which still did not work
(4) using the random index as the index into the h5py file

ds = FaceLoader(H5Data, train_index)
trainloader = DataLoader(ds,

The training is okay on 1/4 of the dataset (320K images) but gets slower and slower on the whole dataset (12000K).
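Since shuffling changes the order in which files are read from disk, a rough way to see its effect is to time one pass over the image files in sorted order and one pass in shuffled order. This is only a sketch under assumptions: `read_all` and `compare_read_orders` are hypothetical helper names, sorted path order is only an approximation of the on-disk layout, and the demo uses small temporary files instead of the real dataset.

```python
import os
import random
import tempfile
import time


def read_all(paths):
    """Read every file fully; return (total bytes read, elapsed seconds)."""
    total = 0
    start = time.perf_counter()
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    return total, time.perf_counter() - start


def compare_read_orders(paths, seed=0):
    """Time one pass in sorted order and one pass in shuffled order."""
    ordered = sorted(paths)
    shuffled = ordered[:]
    random.Random(seed).shuffle(shuffled)
    seq_bytes, seq_s = read_all(ordered)
    rnd_bytes, rnd_s = read_all(shuffled)
    assert seq_bytes == rnd_bytes  # both passes read the same data
    return seq_s, rnd_s


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(50):
            p = os.path.join(d, f"img_{i:03d}.bin")
            with open(p, "wb") as f:
                f.write(os.urandom(4096))
            paths.append(p)
        seq_s, rnd_s = compare_read_orders(paths)
        print(f"sorted order: {seq_s:.4f}s, shuffled order: {rnd_s:.4f}s")
```

The numbers are only meaningful when the files are not already in the OS page cache, for example on a dataset much larger than RAM or after dropping the cache (on Linux, as root: `sync; echo 3 > /proc/sys/vm/drop_caches`). Otherwise the second pass is served from memory and both orders look equally fast. That caching may also be why a smaller subset trains without slowing down, although this is a guess rather than something measured here.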

My CPU has 4 cores and 8 logical processors; the GPU details come from nvidia-smi.

If someone has had the same issue and solved it, please let me know how you solved it.
Any help or suggestion would be appreciated.

I finally solved the problem by installing an SSD.

