PyTorch DataLoader freezes when num_workers > 0

I am facing exactly the same issue on Windows 10: DataLoader freezes randomly when num_workers > 0 (Multiple threads train models on different GPUs in separate threads) · Issue #15808 · pytorch/pytorch · GitHub

I am using an Anaconda virtual environment with:

  • Python 3.8.5
  • PyTorch 1.7.0
  • CUDA 11.0
  • cuDNN 8004
  • GPU: RTX 3060 Ti
  • CUDA available: Yes

Related post: multiprocessing - PyTorch Dataloader hangs when num_workers > 0 - Stack Overflow

and: Training freezes when using DataLoader with num_workers > 0 - #5 by mobassir94

I have had no help with this problem for the last 9 days, so I am re-posting about it here!


I had a similar issue a month ago.
I solved the problem as follows.
Make sure your dataloader is created under if __name__ == '__main__':

if __name__ == '__main__':
    dataloader = ...
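A slightly fuller sketch of the same pattern (the toy dataset and the make_loader name are hypothetical; the point is that worker processes are only created under the guard):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader():
    # Hypothetical toy dataset standing in for your real Dataset.
    data = TensorDataset(torch.zeros(8, 3), torch.zeros(8, dtype=torch.long))
    return DataLoader(data, batch_size=4, num_workers=2)

if __name__ == '__main__':
    # Worker processes are only spawned here, inside the guard.
    for xb, yb in make_loader():
        print(xb.shape)  # torch.Size([4, 3])
```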

I am calling the dataloader like this and it doesn’t work:

if __name__ == '__main__':
     # for training only, need nightly build pytorch

    folds = StratifiedKFold(n_splits=CFG['fold_num'], shuffle=True, random_state=CFG['seed']).split(np.arange(train.shape[0]), train.label.values)
    for fold, (trn_idx, val_idx) in enumerate(folds):

        print('Training with {} started'.format(fold))

        print(len(trn_idx), len(val_idx))
        train_loader, val_loader = prepare_dataloader(train, trn_idx, val_idx, data_root='c:/cassava-leaf-disease-classification/train_images/')

        device = torch.device(CFG['device'])
        model = CassvaImgClassifier(CFG['model_arch'], train.label.nunique(), pretrained=True).to(device)

        model.avg_pool = GeM()
        print("model loaded")
        scaler = GradScaler()   
        optimizer = torch.optim.Adam(model.parameters(), lr=CFG['lr'])
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, gamma=0.1, step_size=CFG['epochs']-1)
        acc_max = 0.
        count = 0

        for epoch in range(CFG['epochs']):
            # freeze the first 4 backbone parameter tensors for the first
            # 4 epochs, then train the full network
            freeze = epoch < 4
            for i, param in enumerate(model.parameters()):
                if i < 4:
                    param.requires_grad = not freeze

            train_one_epoch(epoch, model, optimizer, train_loader, device, scheduler=scheduler, schd_batch_update=False)

            with torch.no_grad():
                val_acc = valid_one_epoch(epoch, model, val_loader, device, scheduler=None, schd_loss_update=False)
                if (val_acc > acc_max):
                    acc_max = val_acc
                    torch.save(model.state_dict(), 'E:/{}_fold_{}_{}_ValAcc_{}'.format(CFG['model_arch'], fold, epoch, val_acc))
        del model, optimizer, train_loader, val_loader, scaler, scheduler
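As an aside, the backbone-freezing step can be factored into a small helper so the intent is explicit (a sketch; like the code above, it assumes the first four parameter tensors belong to the backbone):

```python
import torch.nn as nn

def set_backbone_frozen(model: nn.Module, frozen: bool, n_params: int = 4) -> None:
    # Toggle requires_grad on the first n_params parameter tensors,
    # assumed here to be the backbone; later tensors stay trainable.
    for i, param in enumerate(model.parameters()):
        if i < n_params:
            param.requires_grad = not frozen
```

The epoch loop can then simply call set_backbone_frozen(model, frozen=epoch < 4) once per epoch.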

Sorry, I mean the DataLoader call with batch_size=32, num_workers=8 should be under if __name__ == '__main__'.

I am calling prepare_dataloader, and inside prepare_dataloader the DataLoader is created with batch_size=32, num_workers=8.

So yes, the call with batch_size=32, num_workers=8 is already under if __name__ == '__main__'.

It is not working at all. Besides, I have no idea why moving the dataloader code under if __name__ == '__main__' should have any relationship to this problem.
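For what it's worth, the connection is this: on Windows, Python multiprocessing uses the "spawn" start method, so every DataLoader worker process re-imports your main script. Without the if __name__ == '__main__' guard, the top-level training code runs again inside each worker, which can recurse or deadlock. The same mechanism can be demonstrated with plain multiprocessing, no PyTorch involved (the double function is just a hypothetical stand-in for per-item worker code):

```python
import multiprocessing as mp

def double(x):
    # Stand-in for the work a DataLoader worker does per item.
    return x * 2

if __name__ == '__main__':
    # Force the "spawn" start method to mimic Windows on any platform.
    ctx = mp.get_context('spawn')
    with ctx.Pool(processes=2) as pool:
        print(pool.map(double, [1, 2, 3]))  # [2, 4, 6]
```

If the pool creation were not guarded, each spawned worker would re-run it on import. A quick diagnostic on Windows is to set num_workers=0 and see whether the freeze disappears; if it does, the hang is in worker startup rather than in the Dataset itself.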