If I save a checkpoint inside the forward pass, it saves the weights from both GPUs (the weights are identical). However, if I save the checkpoint outside the forward pass, it saves only the weights from the default device.
Which is the better place to save the checkpoint: inside the forward pass or outside it?
The whole point of DataParallel is to run copies of your module on different GPUs, each with a different part of the input. So it is expected that the forward is called multiple times, once per replica.
So if you do something in the forward that has side effects (like saving a file), then it is expected that this will not behave the same way as without DataParallel.
You can guard that part of the code so it only runs on GPU 0, but that would be quite fragile. In general, you should not do checkpointing inside the forward.
From my experience we usually checkpoint in the main training loop either at each epoch or after a fixed number of batches.
Checkpointing on every batch will most likely slow down your training quite significantly.
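
For illustration, here is a minimal sketch of checkpointing once per epoch from the main training loop rather than from the forward. The toy `nn.Linear` model, the random data, and the helper names `save_checkpoint` and `train` are assumptions made up for this example, not something from the original question. The helper unwraps `nn.DataParallel` through its `.module` attribute so the saved keys carry no `module.` prefix.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(model, optimizer, epoch, path):
    # Hypothetical helper: unwrap nn.DataParallel so the state dict keys
    # are the same whether or not the model was wrapped.
    to_save = model.module if isinstance(model, nn.DataParallel) else model
    torch.save(
        {
            "epoch": epoch,
            "model_state": to_save.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def train(num_epochs=2, ckpt_dir=None):
    ckpt_dir = ckpt_dir or tempfile.mkdtemp()
    model = nn.Linear(4, 2)  # toy model, stands in for your real network
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model.cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()
    device = next(model.parameters()).device

    saved = []
    for epoch in range(num_epochs):
        for _ in range(3):  # toy batches of random data
            x = torch.randn(8, 4, device=device)
            y = torch.randn(8, 2, device=device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        # Checkpoint here, in the training loop, once per epoch.
        # This runs exactly once regardless of how many GPUs are used.
        path = os.path.join(ckpt_dir, f"epoch_{epoch}.pt")
        save_checkpoint(model, optimizer, epoch, path)
        saved.append(path)
    return saved
```

Calling `train()` writes one file per epoch into a temporary directory and returns the paths. To checkpoint every fixed number of batches instead, the same `save_checkpoint` call could be moved into the inner loop behind a counter check.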