Restarting a DistributedDataParallel

I know this is a weird use case, but I need to reset my DDP in the middle of a training run, i.e. I call:

model = myModel()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
...
model = model.module
...
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)

When I do this, I get the error:

RuntimeError: Tensors must be CUDA and dense

I have read that this error occurs when tensors are sparse, but I don’t believe any of mine are. I changed the second call to:

foundOne = False
for name, param in model.named_parameters():
    if param.is_sparse:
        print(name)  # report any sparse parameter by name
        foundOne = True
print('Found a sparse parameter: %s' % foundOne)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)

and it does not find any sparse tensors. The re-wrap also fails even with only a single GPU, although the code runs fine with regular DataParallel, or with no DataParallel at all and just sending the model .to('cuda').

Turns out the error, which listed two possible causes, was pointing at the other one: the tensors weren’t on CUDA, not that they were sparse. :man_facepalming:

Problem solved by adding

model = model.to('cuda')

before the second call to DDP.
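
For anyone hitting the same thing, here is a minimal sketch of the full reset sequence. It assumes the process group has already been initialized with torch.distributed.init_process_group, and myModel and args.gpu are just the placeholders from the snippets above:

import torch

# initial wrap: the module should already live on the target CUDA device
model = myModel().to('cuda:%d' % args.gpu)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)

# ... train for a while ...

# unwrap to get the plain nn.Module back (e.g. to modify it or reload weights)
model = model.module

# ... do whatever requires the bare module ...

# move everything back onto the GPU before re-wrapping; if any parameter or
# buffer is left on the CPU, DDP raises "Tensors must be CUDA and dense"
model = model.to('cuda:%d' % args.gpu)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)

With device_ids set, DDP expects all parameters and buffers to be dense CUDA tensors at construction time, so the same fix applies if the module was rebuilt or had layers swapped out in between.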