I was training a model with DDP and wanted to launch multiple instances with slightly different configurations. However, when I checked out another branch, the already-running instance suddenly reported an error and stopped. The error only exists in the newly checked-out branch, which indicates the running process actually read the modified code. Is this supposed to happen, and how exactly does it work? I don't quite understand…
I saw a similar question, but unfortunately it never got an answer.