What is the correct way to run validation with DistributedDataParallel?

I’m trying to train on multiple GPUs with DistributedDataParallel.

    for epoch in range(args.start_epoch, args.num_epoch):
        train(train_loader, model, scheduler, optimizer, epoch, args)

        if (epoch + 1) % 1 == 0:  # validate every epoch
            print('test args...', args)
            validation(valid_dataset, model, epoch, args)

        state = {
            'epoch': epoch,
            'parser': args,
            'state_dict': get_state_dict(model)
        }  # snippet truncated here

This code gives the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/jake/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/jake/Gits/EfficientDet.Pytorch/train.py", line 341, in main_worker
    test(valid_dataset, model, epoch, args)
  File "/home/jake/Gits/EfficientDet.Pytorch/train.py", line 164, in test
    evaluate(dataset, model)
  File "/home/jake/Gits/EfficientDet.Pytorch/eval.py", line 190, in evaluate
    generator, retinanet, score_threshold=score_threshold, max_detections=max_detections, save_path=save_path)
  File "/home/jake/Gits/EfficientDet.Pytorch/eval.py", line 103, in _get_detections
    2, 0, 1).cuda().float().unsqueeze(dim=0))
  File "/home/jake/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jake/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/jake/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jake/Gits/EfficientDet.Pytorch/models/efficientdet.py", line 61, in forward
    inputs, annotations = inputs
ValueError: not enough values to unpack (expected 2, got 1)

It looks like each GPU (process 0 and process 1) runs validation one after the other, but somehow the second pass is not recognized as validation, and the forward call fails with this error.
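The last traceback frame seems to pin down the mechanics: `efficientdet.py` always unpacks two values from its forward input, but the eval path passes only the image tensor. A minimal, PyTorch-free reproduction of that unpacking mismatch (the names are illustrative, not from the repo):

```python
def forward(inputs):
    # Mimics models/efficientdet.py line 61: forward always
    # expects an (images, annotations) pair.
    images, annotations = inputs
    return images, annotations


# Training path: a 2-tuple unpacks fine.
forward(("images", "annotations"))

# Eval path: only a single batched image is passed, so unpacking
# two names from a length-1 sequence fails.
try:
    forward(["images"])
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 1)
```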