DDP is affected by code modification

I was training a model using DDP and wanted to train multiple instances with slightly different configurations. However, when I checked out another branch, the already-running instance suddenly reported an error and stopped. The error only existed in the newly checked-out branch, which indicates the running processes actually read the modified code. Is that supposed to happen, and how exactly does this work? I don't quite understand…

I saw there's a similar question, but unfortunately no one had an answer.

The behavior seems to depend on the OS, or rather on the method used to start the worker processes (fork vs. spawn), as described here.
