Checkpoint in Multi GPU

Is there any difference between, saving checkpoint when training with a single GPU and saving checkpoint with 2 GPUs?

Example: If I use DataParallel to train on 2 GPUs, if I save checkpoint after each epoch, which parameters will be saved? GPU1 info saved or GPU-2 info saved in checkpoint ?

How can I check this while training?


I have the same doubts here; any guidance would be greatly appreciated!

nn.DataParallel will reduce all parameters to the model on the default device, so you could directly store model.module.state_dict().
If you are using DistributedDataParallel, you would have to make sure that only one rank stores the checkpoint, as otherwise multiple processes might write to the same file and corrupt it.
Here is an example of how to do so.
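For reference, a minimal sketch of that pattern (assuming the process group has already been initialized and that ddp_model and the checkpoint path are placeholders) could look like this:

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has been called and ddp_model is a
# torch.nn.parallel.DistributedDataParallel instance.
if dist.get_rank() == 0:
    # Only rank 0 writes the file, so no two processes write to the same path.
    torch.save(ddp_model.module.state_dict(), 'checkpoint.pt')
dist.barrier()  # make the other ranks wait until the checkpoint has been written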

CC @Janine

Thanks @ptrblck. In nn.DataParallel, if we use 2 GPUs, the checkpoint saves the data from the default device (i.e. GPU-0), right? The GPU-1 data is not needed?

Yes, that’s the case. nn.DataParallel creates new model replicas in each iteration, so when you store the state_dict outside of the forward or backward pass, there should only be a single model on the initially specified device.

Yes, I verified this with an experiment.

If we use two GPUs to train a model with DataParallel(), which portion of the saved data is identical between GPU-0 and GPU-1? Are the weights, gradients, learning rate, etc. the same across the 2 GPUs?

I think only the data from GPU-0 (not from GPU-1) is saved, is that correct?

@ptrblck,

In the case of DataParallel():
If we use two GPUs, the model is initially replicated on both GPUs, so the weights and biases are the same on both. After the backward pass, the weights are updated on GPU-0 (the default device), but does GPU-1 hold the same updated values?

The replicas are recreated in each forward pass by nn.DataParallel (which is also why it’s slower than DistributedDataParallel).
Your code will most likely just use the single model, as seen here:

import torch
import torch.nn as nn

model = MyModel()
model = nn.DataParallel(model)
model.to('cuda:0') # push to the default device

output = model(data) # DataParallel will automatically create the replicas
loss = ...
loss.backward() # DataParallel will automatically call the backward pass on all models and reduce the gradients
optimizer.step()

torch.save(model.module.state_dict(), path) # use the default model

You can find more information about the underlying workflow in DataParallel in this blog post.

In the case of DataParallel() with 2 GPUs, both GPUs hold the same weights. Is that correct?

Model parameters on the multiple GPUs used by DataParallel and DistributedDataParallel are the same (unless a GPU communication glitch happens).

You can just save the parameters on GPU 0 and load them later.

import torch
from torch.nn.parallel import DataParallel, DistributedDataParallel

  • saving
if isinstance(model, (DataParallel, DistributedDataParallel)):
    torch.save(model.module.state_dict(), model_save_name)
else:
    torch.save(model.state_dict(), model_save_name)
  • loading
state_dict = torch.load(model_save_name, map_location=current_gpu_device)
if isinstance(model, (DataParallel, DistributedDataParallel)):
    model.module.load_state_dict(state_dict)
else:
    model.load_state_dict(state_dict)

In the case of DataParallel():
If we use two GPUs, we know the model is replicated on both GPUs, but while saving the model in a checkpoint, does the checkpoint save both replicated models?

I believe the model present on the default device (GPU-0) is what gets saved in the checkpoint, right?

Yes, for DataParallel, if you save with torch.save(model.state_dict(), path), it will save the parameters on GPU 0.

But the parameters will then be saved under a module. prefix, which cannot be loaded directly into a plain (non-DataParallel) model. That’s why I suggest the code above, which makes saving/loading compatible with both the plain nn.Module format and the nn.DataParallel format.

If you use DistributedDataParallel, you should only save the parameters from local rank 0. Otherwise, each DDP process will try to save the model from its own GPU and they will overwrite each other.
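For completeness, a sketch of the matching load on each rank (assuming the checkpoint was written by rank 0 as above, a single-node setup, and that checkpoint.pt is a placeholder file name):

import torch
import torch.distributed as dist

dist.barrier()  # make sure rank 0 has finished writing before anyone reads
local_rank = dist.get_rank() % torch.cuda.device_count()
# Remap the GPU-0 tensors stored in the checkpoint onto this rank's GPU.
map_location = {'cuda:0': f'cuda:{local_rank}'}
state_dict = torch.load('checkpoint.pt', map_location=map_location)
ddp_model.module.load_state_dict(state_dict)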

Thanks for the detailed info.

Going a bit deeper: we know the models are replicated in DataParallel(), and the weights of both models are the same.

Where, or in which part of the code, are the model weights updated? (In parallel_apply() the models are replicated.)

Which model’s weights are saved in the checkpoint: the weights of the model on GPU-0 or the weights of the model on GPU-1?

It would be a good idea to review the blog post suggested by ptrblck above.

In DataParallel, parallel_apply does not perform any parameter update or synchronization.
The replicate function is what synchronizes the model parameters across GPUs; see the DataParallel forward implementation.

As I noted, the weights on GPU 0 are the ones that get saved.
You can almost consider the weights on the other GPUs to not exist persistently: parameters on the other GPUs are only created just before their forward computation by the replicate function above, and replicate is called on every forward call.
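As a quick sanity check (a minimal sketch using a toy nn.Linear instead of a real network), you can verify that the persistent parameters live only on the default device outside of the forward pass:

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(10, 10)).to('cuda:0')
out = model(torch.randn(8, 10, device='cuda:0'))  # replicas only exist inside this call

# Outside of forward/backward only the GPU-0 copy remains, and its state_dict
# is what ends up in the checkpoint.
print({p.device for p in model.module.parameters()})  # {device(type='cuda', index=0)}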

Can’t we load an nn.DataParallel-format checkpoint into an nn.Module-format model?
Is this an issue with PyTorch checkpointing?

No, it is not an issue.
nn.DataParallel keeps the wrapped model’s parameters under self.module.
For example, assume your original single-GPU model had a self.conv layer.
In your DataParallel model it moves to self.module.conv, so every key in the state_dict gets a module. prefix.
That’s why I recommend saving self.module.state_dict(), as in the example code above.
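If you already have a checkpoint that was saved from the wrapped model (so the keys carry the module. prefix), a common workaround, sketched here with a hypothetical checkpoint.pt file name and an unwrapped plain_model, is to strip the prefix before loading:

import torch

state_dict = torch.load('checkpoint.pt', map_location='cpu')
# Drop the 'module.' prefix that nn.DataParallel adds to every parameter key.
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in state_dict.items()}
plain_model.load_state_dict(state_dict)  # plain_model is the unwrapped nn.Module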

Thanks @seungjun for the detailed info.