Load DDP model trained with 8 gpus on only 2 gpus?

I see. But that should not be the case, since both the model and the data are moved to args.local_rank. Anyway, I did what you suggested and also changed the test batch size to 1024. Here’s the outcome:

Rank 0 test on device cuda:0
Rank 1 test on device cuda:1
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 0 device: 1
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 0 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 1 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 2 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 3 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 4 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 5 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 6 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 7 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 8 device: 0
after data=data.to(device,), before output=model(data) in test function,  batch_idx: 9 device: 0
Test set: Average loss: 0.0275, Accuracy: 9913/10000 (99.13%)

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 1 (while checking arguments for cudnn_convolution)
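The error means the convolution's weight tensor and its input tensor ended up on different devices: rank 1's input is on cuda:1 (as the log shows), but its model weights are apparently on cuda:0. A common cause when loading a checkpoint trained on 8 GPUs is that `torch.load` restores each tensor to the device it was saved from, so every rank gets weights on cuda:0 unless `map_location` remaps them. Below is a minimal, hedged sketch of one way to load such a checkpoint per rank; `Net`, `load_ddp_checkpoint`, and the checkpoint path are hypothetical stand-ins, not the actual training script:

```python
import torch
import torch.nn as nn
from collections import OrderedDict

# Hypothetical stand-in for the real network (yours has conv layers,
# per the cudnn_convolution error).
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 4, kernel_size=3)

    def forward(self, x):
        return self.conv(x)

def load_ddp_checkpoint(path, local_rank):
    # Remap every saved tensor onto THIS rank's device; without
    # map_location, torch.load restores tensors to the device they
    # were saved from (typically cuda:0), causing the mismatch above.
    device = torch.device(
        f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu"
    )
    state = torch.load(path, map_location=device)
    # A state_dict saved from a DDP-wrapped model prefixes every key
    # with "module."; strip it to load into a plain nn.Module.
    state = OrderedDict(
        (k.replace("module.", "", 1), v) for k, v in state.items()
    )
    model = Net().to(device)
    model.load_state_dict(state)
    return model, device

# Then in test(), move each batch to the SAME device as the weights:
#     data = data.to(device)
#     output = model(data)
```

With `map_location` set per rank, rank 1's weights land on cuda:1 alongside its data, so the input/weight device check in cudnn_convolution should pass.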