Checkpoint in Multi GPU

Is there any difference between, saving checkpoint when training with a single GPU and saving checkpoint with 2 GPUs?

Example: If I use DataParallel to train on 2 GPUs, if I save checkpoint after each epoch, which parameters will be saved? GPU1 info saved or GPU-2 info saved in checkpoint ?

How can I check this while training?


I have the same doubts here; any guidance would be greatly appreciated!

nn.DataParallel will reduce all parameters to the model on the default device, so you could directly store model.module.state_dict().
If you are using DistributedDataParallel, you would have to make sure that only one rank stores the checkpoint, as otherwise multiple processes might write to the same file and corrupt it.
Here is an example of how to do so.
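For reference, a minimal sketch of that pattern (assuming the process group has already been initialized and that ddp_model and the checkpoint path are placeholders) could look like this:

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has been called and ddp_model is a
# torch.nn.parallel.DistributedDataParallel instance.
if dist.get_rank() == 0:
    # Only rank 0 writes the file, so no two processes write to the same path.
    torch.save(ddp_model.module.state_dict(), 'checkpoint.pt')
dist.barrier()  # make the other ranks wait until the checkpoint has been written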

CC @Janine

Thanks @ptrblck. In nn.DataParallel, if we use 2 GPUs, the checkpoint saves the data from the default device (i.e. GPU-0), right? The GPU-1 data is not needed?

Yes, that’s the case. nn.DataParallel creates new model replicas in each iteration, so when you store the state_dict outside of the forward or backward pass, there should only be a single model on the initially specified device.

Yes, I verified this with an experiment.

If we use two GPUs to train a model with DataParallel(), which portion of the saved data is identical between GPU-0 and GPU-1? Are the weights, gradients, learning rate, etc. the same across the 2 GPUs?

I think only the data from GPU-0 (not from GPU-1) is saved, is that correct?

@ptrblck,

In the case of DataParallel():
If we use two GPUs, the model is initially replicated on both GPUs, so the weights and biases are the same on both. After the backward pass, the weights are updated on GPU-0 (the default device), but does GPU-1 hold the same updated values?

The replicas are recreated in each forward pass by nn.DataParallel (which is also why it’s slower than DistributedDataParallel).
Your code will most likely just use the single model, as seen here:

import torch
import torch.nn as nn

model = MyModel()
model = nn.DataParallel(model)
model.to('cuda:0') # push to the default device

output = model(data) # DataParallel will automatically create the replicas
loss = ...
loss.backward() # DataParallel will automatically call the backward pass on all models and reduce the gradients
optimizer.step()

torch.save(model.module.state_dict(), path) # use the default model

You can find more information about the underlying workflow in DataParallel in this blog post.

In the case of DataParallel() with 2 GPUs, both GPUs hold the same weights. Is that correct?

Model parameters on the multiple GPUs used by DataParallel and DistributedDataParallel are the same (unless a GPU communication glitch happens).

You can just save the parameters on GPU 0 and load them later.

import torch
from torch.nn.parallel import DataParallel, DistributedDataParallel

  • saving
if isinstance(model, (DataParallel, DistributedDataParallel)):
    torch.save(model.module.state_dict(), model_save_name)
else:
    torch.save(model.state_dict(), model_save_name)
  • loading
state_dict = torch.load(model_save_name, map_location=current_gpu_device)
if isinstance(model, (DataParallel, DistributedDataParallel)):
    model.module.load_state_dict(state_dict)
else:
    model.load_state_dict(state_dict)

In the case of DataParallel():
If we use two GPUs, we know the model is replicated on both GPUs, but while saving the model in a checkpoint, does the checkpoint save both replicated models?

I believe the model present on the default device (GPU-0) is what gets saved in the checkpoint, right?

Yes, for DataParallel, if you save with torch.save(model.state_dict(), path), it will save the parameters on GPU 0.

But the parameters will then be saved under a module. prefix, which cannot be loaded directly into a plain (non-DataParallel) model. That’s why I suggest the code above, which makes saving/loading compatible with both the plain nn.Module format and the nn.DataParallel format.

If you use DistributedDataParallel, you should only save the parameters from local rank 0. Otherwise, each DDP process will try to save the model from its own GPU and they will overwrite each other.
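For completeness, a sketch of the matching load on each rank (assuming the checkpoint was written by rank 0 as above, a single-node setup, and that checkpoint.pt is a placeholder file name):

import torch
import torch.distributed as dist

dist.barrier()  # make sure rank 0 has finished writing before anyone reads
local_rank = dist.get_rank() % torch.cuda.device_count()
# Remap the GPU-0 tensors stored in the checkpoint onto this rank's GPU.
map_location = {'cuda:0': f'cuda:{local_rank}'}
state_dict = torch.load('checkpoint.pt', map_location=map_location)
ddp_model.module.load_state_dict(state_dict)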

Thanks for the detailed info.

Going a bit deeper: we know the models are replicated in DataParallel(), and the weights of both models are the same.

Where, or in which part of the code, are the model weights updated? (In parallel_apply() the models are replicated.)

Which model’s weights are saved in the checkpoint: the weights of the model on GPU-0 or the weights of the model on GPU-1?

It would be a good idea to review the blog post suggested by ptrblck above.

In DataParallel, parallel_apply does not perform any parameter update or synchronization.
The replicate function is what synchronizes the model parameters across GPUs; see the DataParallel forward implementation.

As I noted, the weights on GPU 0 are the ones that get saved.
You can almost consider the weights on the other GPUs to not exist persistently: parameters on the other GPUs are only created just before their forward computation by the replicate function above, and replicate is called on every forward call.
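As a quick sanity check (a minimal sketch using a toy nn.Linear instead of a real network), you can verify that the persistent parameters live only on the default device outside of the forward pass:

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(10, 10)).to('cuda:0')
out = model(torch.randn(8, 10, device='cuda:0'))  # replicas only exist inside this call

# Outside of forward/backward only the GPU-0 copy remains, and its state_dict
# is what ends up in the checkpoint.
print({p.device for p in model.module.parameters()})  # {device(type='cuda', index=0)}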

Can’t we load an nn.DataParallel-format checkpoint into an nn.Module-format model?
Is this an issue with PyTorch checkpointing?

No, it is not an issue.
nn.DataParallel keeps the wrapped model’s parameters under self.module.
For example, assume your original single-GPU model had a self.conv layer.
In your DataParallel model it moves to self.module.conv, so every key in the state_dict gets a module. prefix.
That’s why I recommend saving self.module.state_dict(), as in the example code above.
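If you already have a checkpoint that was saved from the wrapped model (so the keys carry the module. prefix), a common workaround, sketched here with a hypothetical checkpoint.pt file name and an unwrapped plain_model, is to strip the prefix before loading:

import torch

state_dict = torch.load('checkpoint.pt', map_location='cpu')
# Drop the 'module.' prefix that nn.DataParallel adds to every parameter key.
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in state_dict.items()}
plain_model.load_state_dict(state_dict)  # plain_model is the unwrapped nn.Module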

Thanks @seungjun for the detailed info.