I had a few models training with distributed PyTorch (DistributedSampler + DistributedDataParallel + multiprocessing). A few days into training, I changed part of the data transformation code: I renamed a file and updated all the necessary imports.
After I made this change, all of the training runs suddenly crashed when initializing the next epoch, each with an error message along the lines of “No module named __”.
What’s weird is that this module is only loaded when each process is initialized, and the training loop is confined to each spawned process. So I’m not sure why renaming this module caused my code to crash mid-training. Is this a common issue with multiprocessing? Am I misunderstanding something here?
PS. In case it helps to know which module it was…
The module I changed was a file named transformations.transforms. I renamed it to transformations.single_transforms since the original name seemed to interfere with torchvision.transforms. As noted above, the transformations module is loaded only once in the code, just before the dataset is loaded.
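For reference, the renamed import sits in exactly one place, roughly like this (the module path is the real one, but get_transforms and the surrounding code are placeholders). In this stripped-down snippet the package doesn’t exist on its own, so calling the function raises the same class of error I saw:

```python
def build_dataset():
    # The transforms import happens here and only here, once,
    # just before the dataset is constructed.
    from transformations.single_transforms import get_transforms  # placeholder name
    return get_transforms()

err = None
try:
    build_dataset()
except ModuleNotFoundError as e:
    err = e
print(err)  # the same kind of "No module named ..." error as in my logs
```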
Also, the training didn’t crash as soon as I made this change - it crashed after finishing one more epoch, which is also weird…
Thanks in advance!