Same GPU, same pytorch training script. Training time sometimes fast, sometimes slow

Lizhen_Ji · October 21, 2021, 12:06pm

Hi, I have a problem when training a PyTorch program. It is a simple face classification task trained with cross-entropy loss. The images of each person are in the same folder. There are 320000 images in 7000 folders.

I run the same Python script on the same machine (ubuntu-16.04, one GTX-1080ti, pytorch-1.8, python-3.6, cuda-10.2).

The following are screenshots of different runs:
Fast run (about 5 iters/s):

Slow run (about 1 iter/s):

I assume this is an IO problem in data loading, since the process state D means uninterruptible sleep (usually IO).

Here is the main code I use to load the data:


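# Imports this snippet appears to rely on (assumed; not shown in the original post)
import numpy as np
import torch
import torchvision
import torchvision.transforms as T
from sklearn.model_selection import train_test_split
from torch.utils import data

train_transforms = T.Compose([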
        T.Grayscale(), 
        T.RandomCrop(args.input_shape[1:]),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
        T.Normalize(mean=[0.5], std=[0.5]),
])

train_dataset = torchvision.datasets.ImageFolder(data_dir, transform = train_transforms)

# train_size is the fraction of the whole training dataset to use
if args.train_size != 1:
    train_idx, val_idx = train_test_split(np.arange(len(train_dataset.targets)),
                                            train_size=args.train_size,
                                            shuffle=True,
                                            random_state=args.random_state,
                                            stratify=train_dataset.targets)  # stratified: split in proportion to the class labels
    train_dataset = torch.utils.data.Subset(train_dataset, train_idx)
    trainloader = data.DataLoader(train_dataset,
                                    batch_size=args.train_batch_size,
                                    shuffle=True,  # data re-shuffled at every epoch
                                    num_workers=args.num_workers)
else:
    trainloader = data.DataLoader(train_dataset, batch_size=args.train_batch_size, shuffle=True, num_workers=args.num_workers)

I am confused about why this happens, since I am not running any other programs or applications at the same time.
I have tried using the data_prefetcher from an example on the official apex website, but it did not solve the problem.

Can anyone give me some insight into how to solve this problem? Thanks a lot.
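
One way to check the IO assumption above is to time how long each iteration waits for the next batch versus how long the training step itself takes. The following is only a minimal sketch: `split_iteration_time` and `train_step` are hypothetical names, not part of PyTorch or of the original script, and `loader` can be any iterable such as the `trainloader` above. On a GPU, `train_step` would need to end with `torch.cuda.synchronize()`, otherwise asynchronous CUDA work is not counted in the step time.

```python
import time


def split_iteration_time(loader, train_step, max_iters=100):
    """Return (seconds spent waiting for batches, seconds spent in train_step).

    `loader` is any iterable of batches; `train_step` is a hypothetical
    callable that runs forward/backward/optimizer step on one batch.
    """
    data_s = 0.0
    step_s = 0.0
    batches = iter(loader)
    for _ in range(max_iters):
        t0 = time.perf_counter()
        try:
            batch = next(batches)  # blocks while the workers read from disk
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        data_s += t1 - t0
        step_s += t2 - t1
    return data_s, step_s
```

If the first number dominates in the slow runs but not in the fast ones, the bottleneck is likely in loading the data rather than in the model.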

Lizhen_Ji · November 3, 2021, 1:15am

It took me 10 days and I still did not find a solution to this problem. It turns out that “shuffle=True” causes the problem. I tried several things:
(1) using a data prefetcher.
(2) creating the dataset in the h5py dataset format, which is also too slow.
(3) creating a new dataloader before each epoch and using a random index when creating it. This still did not work.
(4) using the random index as the index into the h5py file:

np.random.seed(i) 
np.random.shuffle(train_index)
ds = FaceLoader(H5Data, train_index)
trainloader = DataLoader(ds,
                batch_size=args.train_batch_size,
                shuffle=False,
                num_workers=args.num_workers, 
                pin_memory=True,
                drop_last=True,
                persistent_workers=True)

Training is fine on 1/4 of the dataset (320K images) but gets slower and slower on the whole dataset (12000K).
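
A possible middle ground between `shuffle=True` and no shuffling, not something tried in this thread, is to shuffle in blocks so that reads stay closer together on disk. The sketch below assumes that neighbouring indices are stored next to each other (for example rows of a single HDF5 file); `block_shuffled_indices` and its parameters are hypothetical names. It trades randomness for locality, and with an `ImageFolder`-style ordering neighbouring indices belong to the same person, so the block size would need to be much larger than the batch size.

```python
import random


def block_shuffled_indices(num_items, block_size, seed):
    """Return a permutation of range(num_items) built from shuffled blocks.

    The index range is cut into contiguous blocks; the order of the blocks
    is shuffled and the indices inside each block are shuffled as well, so
    every index appears exactly once.
    """
    rng = random.Random(seed)
    starts = list(range(0, num_items, block_size))
    rng.shuffle(starts)  # random order of blocks
    order = []
    for start in starts:
        block = list(range(start, min(start + block_size, num_items)))
        rng.shuffle(block)  # random order inside one block
        order.extend(block)
    return order
```

The resulting list could then be passed as `sampler=` to `DataLoader` (with `shuffle` left at `False`), using a different `seed` for each epoch.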

My CPU has 4 cores and 8 logical processors, and my GPU is the GTX-1080ti mentioned above (as reported by nvidia-smi).

If someone has had the same issue and solved it, please let me know how you solved it.
Any help or suggestion would be appreciated.

Lizhen_Ji · November 8, 2021, 9:18am

I finally solved the problem by installing an SSD.

Claudia_Giardina · May 10, 2023, 11:52am

Hello, how do you know that setting shuffle to True is what causes the problem?
I am seeing really big variations in execution time from one epoch to another and I would like to understand the cause.
Why would changing to an SSD be a solution? Can someone explain it further?
Thanks

ptrblck · May 11, 2023, 9:05am

Random read accesses on a spinning disk are painfully slow, which is why updating to an SSD could help.
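
To see this effect on a particular drive, one could time the same number of block reads done sequentially and at random offsets. This is a rough sketch, not a rigorous benchmark: `time_reads` is a hypothetical helper, and the OS page cache will hide most of the difference unless the file is much larger than RAM or has not been read recently.

```python
import os
import random
import time


def time_reads(path, block_size=4096, num_reads=1000, seed=0):
    """Time block reads of `path` in sequential and in random order.

    Returns (sequential_seconds, random_seconds,
             sequential_bytes, random_bytes).
    """
    size = os.path.getsize(path)
    num_blocks = max(size // block_size, 1)
    n = min(num_reads, num_blocks)
    # First n blocks in file order.
    seq_offsets = [i * block_size for i in range(n)]
    # n distinct block offsets picked at random across the whole file.
    all_offsets = range(0, num_blocks * block_size, block_size)
    rnd_offsets = random.Random(seed).sample(all_offsets, n)

    def run(offsets):
        total = 0
        # buffering=0 avoids Python-level read-ahead; the OS cache still applies.
        with open(path, "rb", buffering=0) as f:
            t0 = time.perf_counter()
            for off in offsets:
                f.seek(off)
                total += len(f.read(block_size))
            return time.perf_counter() - t0, total

    seq_s, seq_bytes = run(seq_offsets)
    rnd_s, rnd_bytes = run(rnd_offsets)
    return seq_s, rnd_s, seq_bytes, rnd_bytes
```

On a spinning disk with a cold cache, the random pass would be expected to take much longer than the sequential one, because each read may need a head seek; on an SSD the gap is usually far smaller.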

Claudia_Giardina · May 11, 2023, 2:40pm

Thanks for the clarification!