Checkpoint inside forward pass

I am using DataParallel() with 2 GPUs.

If I save a checkpoint inside the forward pass, the checkpoint contains the weights from both GPUs (these weights are the same). However, if I save the checkpoint outside the forward pass, it only contains the weights from the default device.

Which is the better place to save the checkpoint: inside the forward pass or outside?

Hi,

All these weights should have exactly the same values, so I would recommend saving outside the forward pass so that you only save a single copy of them.
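
A minimal sketch of saving outside the forward pass (the `Net` module and the `checkpoint.pt` path are placeholders for illustration, not from this thread):

```python
import torch
import torch.nn as nn

# A tiny stand-in module; replace with your own model.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = nn.DataParallel(Net().cuda())

# ... training happens here ...

# Save once, outside the forward pass. model.module unwraps the
# DataParallel wrapper, so the checkpoint holds a single copy of the weights.
torch.save(model.module.state_dict(), "checkpoint.pt")
```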

Yes, if I save the checkpoint outside the forward pass, it only saves the weights from the single default device.

But if I save the checkpoint inside the forward pass, it saves the weights from all GPUs (which are the same). I believe this is an issue in PyTorch.

If it is an issue, I would like to work on it and try to avoid duplicate weights inside the checkpoint.

Hi,

The whole point of DataParallel is to run copies of your module on different GPUs, each with a different part of the input, so it is expected that the forward is called multiple times.
If you do something in the forward that has side effects (like saving a file), it is expected that this will not behave the same way as it would without DataParallel.

You can guard that part of the code so it only runs on GPU 0 (see the sketch below), but that would be quite fragile. In general, you should not do checkpointing inside the forward.
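
For reference, such a guard might look roughly like this sketch (module name and file path are made up for illustration); the preferred approach is still to save outside the forward:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        out = self.fc(x)
        # Fragile guard: only the replica whose input lives on cuda:0 saves.
        # Prefer moving torch.save() out of forward() entirely.
        if x.device == torch.device("cuda:0"):
            torch.save(self.state_dict(), "checkpoint.pt")
        return out
```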

I agree.
Is there any particular location where we should keep the checkpoint?
I am wondering whether there are any use cases where we would keep the checkpoint inside the forward pass.

From my experience, we usually checkpoint in the main training loop, either at each epoch or after a fixed number of batches.
Checkpointing after every batch would most likely slow down your training quite significantly.
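
A minimal sketch of per-epoch checkpointing in the training loop, assuming a `train_loader` and the `Net` module from the earlier sketch (both placeholders):

```python
import torch
import torch.nn as nn

model = nn.DataParallel(Net().cuda())          # Net as in the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
num_epochs = 10

for epoch in range(num_epochs):
    for inputs, targets in train_loader:       # train_loader is assumed to exist
        optimizer.zero_grad()
        loss = criterion(model(inputs.cuda()), targets.cuda())
        loss.backward()
        optimizer.step()

    # Checkpoint once per epoch, in the training loop rather than in forward().
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.module.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, f"checkpoint_epoch_{epoch}.pt")
```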