Training fails mid-run when code is changed for distributed training


I had a few models training with distributed PyTorch (DistributedSampler + DistributedDataParallel + multiprocessing). A few days into training, I changed part of the data transformation code: I renamed a file and updated all the necessary imports.

After I changed this part of the code, all of the training runs suddenly crashed when initializing the next epoch, with error messages along the lines of “No module named __”.

What’s weird is that this module is only loaded when initializing each process, and the training loop is confined within each spawned process. Thus, I’m not sure why changing the name of this module caused my code to crash mid-training. Is this a common issue in multiprocessing? Am I misunderstanding something here?

PS. In case it helps to know which module it was…
The module I changed was a file named transformations.transforms. I changed it to transformations.single_transforms since it seemed to be interfering with torchvision.transforms. Loading the transformations happens only once in the code, just before the dataset is loaded.
Also, the training didn’t crash as soon as I made this change; it crashed after finishing one more epoch of training, which is also weird…

Thanks in advance!


Do you have some code that we can examine? I put together a very simple program with multiprocessing and tried changing one of the module names during execution, but there was no crash. Are you potentially loading the module within the Dataset or DataLoader classes? Since the program is compiled into Python bytecode and only then interpreted, I’m not sure why changing the Python source mid-execution would affect the program; it’s just the bytecode being interpreted. In fact, even deleting the bytecode during execution shouldn’t make a difference, since the program has already been loaded into memory. Any thoughts on what could cause this behavior @mrshenli?

As an aside, I would recommend checkpointing (and potentially torchelastic) so you don’t lose training progress for long-running jobs.
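To second my own suggestion, here is a minimal per-epoch checkpointing sketch; the function names and dict keys are illustrative, not from your code base:

```python
import torch

def save_checkpoint(model, optimizer, epoch, path):
    # With DDP, call this from rank 0 only to avoid concurrent writes.
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path):
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    optimizer.load_state_dict(ckpt['optimizer_state'])
    return ckpt['epoch'] + 1  # epoch to resume from
```

Saving at every epoch boundary means a mid-run crash like yours costs at most one epoch of progress.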

So my entire code base is actually quite large, but here are some necessary details.

As I said, the module import error occurred after changing transformations.transforms to transformations.single_transforms. This module is only directly imported in transformations.__init__ where my transform_factory code resides.

The multiprocessing code is structured as follows. As usual, I have some relevant setup in the main function, where I spawn the main_process function.

In the main_process function, the code is structured as:

  1. Init process group
  2. Obtain transformations (through transform_factory - probably the module in question)
  3. Create Datasets, Distributed samplers, and Dataloaders
  4. Create distributed model, loss fns, optimizers, etc.
  5. Initialize the trainer class
  6. Run training

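The steps above look roughly like the following single-process, CPU-only toy sketch (the gloo backend, TensorDataset, the lambda transform, and the inline loop are stand-ins I'm using for the real transform_factory, datasets, and trainer class):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main_process(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29501')

    # 1. Init process group ('gloo' so the sketch also runs on CPU)
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    # 2. Obtain transformations (stand-in for the transform_factory call,
    #    which is what imports the transformations package)
    transform = lambda x: x / 255.0

    # 3. Dataset, DistributedSampler, DataLoader
    dataset = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    # 4. Distributed model, loss fn, optimizer
    model = DDP(torch.nn.Linear(4, 2))
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # 5./6. Trainer loop, one epoch shown
    sampler.set_epoch(0)
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(transform(x)), y)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()
```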
The last step, running training, is basically just a nested for loop where I run one round of training followed by one round of evaluation. Simplified example:

    def run(self):
        for epoch in range(self.num_epochs):
            if self.train_sampler is not None:
                # reshuffle the DistributedSampler each epoch
                self.train_sampler.set_epoch(epoch)
            for phase in ['train', 'val']:
                if phase == 'train':
                    train_results = self.train_one_epoch(epoch)
                else:
                    val_results = self.validate(epoch)

In no part of the training do I reference or try to import from transformations.single_transforms. And as you said, even if I did, it shouldn’t matter because of the Python bytecode.
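To illustrate the point about already-loaded modules, inside a running process `import` is a cache lookup first, so renaming a source file on disk cannot break code that has already imported it:

```python
import sys

# Once a module is loaded, it lives in the sys.modules cache:
import json
assert 'json' in sys.modules
cached = sys.modules['json']

# A repeated import is a cache hit; the filesystem is not consulted,
# so deleting or renaming the source file would go unnoticed here.
import json
assert json is cached
```

A freshly started process, on the other hand, begins with an empty cache and must import everything from disk by name, which is where a rename becomes fatal.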

Some other details that may help…

  1. I’ve been using PyTorch for around 3 years now, mostly using DataParallel, and I’ve never encountered this issue before. I only switched to Distributed training a few weeks ago, and it’s my first time seeing this problem.
  2. The training doesn’t bug out as soon as the change is made. In fact, it will finish its current phase of training (until the DataLoader is finished iterating), then die during the transition from train --> evaluation or vice versa.

Finally, thanks for the suggestion. I do in fact checkpoint every epoch, so I can resume training. I just wanted to get to the bottom of this because I can’t understand why the code would crash mid-training. I talked to one of my colleagues about this, and he said that he has experienced something similar in distributed training. He also noted that changing the model structure would also cause the code to crash, but I haven’t checked this for myself.

This happens when num_workers > 0: every newly spawned worker process will re-run the module files, and by default workers are torn down and re-created at each epoch boundary, which matches when your crash occurs.
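For instance, the Dataset (including its transform objects) is pickled and sent to each fresh worker, and a pickle stores only the defining module path and class name, which the worker must re-import by name. A minimal illustration (the `Normalize` class here is a made-up stand-in for one of your transforms):

```python
import pickle

class Normalize:
    """Toy transform standing in for a class from the renamed module."""
    def __call__(self, x):
        return x / 255.0

payload = pickle.dumps(Normalize())

# The pickle stream records the class by *name*, not by value, so the
# class name appears as text in the bytes; unpickling in a fresh worker
# re-imports the defining module under exactly that name.
assert b'Normalize' in payload
```

If the module that name points to has been renamed on disk, the unpickle in the new worker raises the "No module named ..." error you saw.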

Well, I use DDP to speed up my training, so I’m not sure that reducing num_workers to 0, and thus slowing down the data loading, would really be a solution :frowning: