I know this is a weird use case, but I need to reset my DDP wrapper in the middle of a training run, i.e. I call:
model = myModel()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
...
model = model.module
...
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
When I do this I get the error:
RuntimeError: Tensors must be CUDA and dense
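For context, here is a trimmed-down, self-contained sketch of what the run looks like (assuming the usual torchrun launch so LOCAL_RANK is set; myModel here is a stand-in for my real network, and the second wrap is where the real run raises the error):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class myModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 8)  # stand-in for the real layers

    def forward(self, x):
        return self.fc(x)

dist.init_process_group(backend='nccl')  # MASTER_ADDR etc. come from torchrun
gpu = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(gpu)

model = myModel().to(gpu)
model = DDP(model, device_ids=[gpu], find_unused_parameters=True)
# ... train for a while ...
model = model.module  # unwrap to get the plain module back
# ... mid-run changes to the model ...
model = DDP(model, device_ids=[gpu], find_unused_parameters=True)  # real run errors here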
I have read that this error occurs when tensors are sparse, but I don’t believe any of mine are. I changed the second call to:
foundOne = False
for name, param in model.named_parameters():
    if param.is_sparse:
        print(name)
        foundOne = True
print('Found a sparse parameter: %d' % foundOne)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
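Since the error message says the tensors must be CUDA and dense, and DDP also broadcasts buffers at construction, a more thorough version of that check would also cover buffers and device placement (sketch; model here is the unwrapped module):

import itertools

# Flag anything DDP's broadcast would reject: sparse layout, or a tensor not on a CUDA device.
for name, t in itertools.chain(model.named_parameters(), model.named_buffers()):
    if t.is_sparse or not t.is_cuda:
        print(name, t.device, 'sparse' if t.is_sparse else 'dense')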
and it does not find any sparse tensors. This also fails even with only a single GPU, though the same code runs properly with regular DataParallel, or with no parallel wrapper at all and just sending the model .to(cuda).
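For comparison, these are the variants that do run cleanly for me (sketch):

# No wrapper at all, just the model on the GPU -- works:
model = myModel().to('cuda')

# Regular DataParallel -- also works:
model = torch.nn.DataParallel(myModel().to('cuda'))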