PyTorch Lightning model freezes after some epochs

I’m training a model on image and text input pairs from Flickr30k. Nothing special, just a ResNet18 for the images and an Embedding + GRU network for the text.

I’m training the model with PyTorch Lightning on two GPUs with the DDP strategy, 16-bit precision, a batch size of 512, and 8 workers in total. I defined a ModelCheckpoint callback that keeps the 5 best checkpoints and an EarlyStopping callback. Both callbacks monitor val_loss.
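For reference, a minimal sketch of the Trainer and callback configuration described above (exact argument values such as the EarlyStopping patience are assumptions, not the original code):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# Both callbacks monitor val_loss as described above; patience is a guess.
checkpoint_cb = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=5)
early_stop_cb = EarlyStopping(monitor="val_loss", mode="min", patience=3)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,           # two GPUs
    strategy="ddp",      # DistributedDataParallel
    precision=16,        # 16-bit mixed precision
    callbacks=[checkpoint_cb, early_stop_cb],
)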

def train_dataloader(self) -> DataLoader:
    return DataLoader(
        self.train_data,
        batch_size=self.batch_size,
        num_workers=self.conf.data.num_workers,
        pin_memory=True,
        shuffle=True,
        drop_last=True,
        collate_fn=collate_fn,
    )

def val_dataloader(self) -> DataLoader:
    return DataLoader(
        self.val_data,
        batch_size=self.batch_size,
        num_workers=self.conf.data.num_workers,
        pin_memory=True,
        drop_last=True,
        collate_fn=collate_fn,
    )
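The collate_fn isn’t shown in the post; a hypothetical version for image/caption pairs might stack the image tensors and pad the variable-length token sequences, roughly like this (the batch layout is an assumption, not the actual function from the post):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # batch is assumed to be a list of (image_tensor, token_id_tensor) pairs
    images, captions = zip(*batch)
    images = torch.stack(images)                        # (B, C, H, W)
    lengths = torch.tensor([len(c) for c in captions])  # original caption lengths
    captions = pad_sequence(captions, batch_first=True, padding_value=0)
    return images, captions, lengths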

The problem is that after starting the sixth epoch, the model appears to have entered an infinite loop. There is a 100% used GPU and CPU usage has dropped dramatically. I can’t understand where the problem may be. Some idea?

Make sure you have enough CPU cores to handle all the workers. You could try reducing the number of workers to half the number of CPU cores.
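As a rough sketch of that advice (the halving heuristic and the core-count check are illustrative, not from the thread), keep in mind that with DDP each process builds its own DataLoader, so the effective worker count is num_workers times the number of GPU processes:

import os

# Cores actually available to this process (Linux); fall back to os.cpu_count() elsewhere.
cpu_cores = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else os.cpu_count()
num_ddp_processes = 2  # one per GPU in the setup above
workers_per_loader = max(1, cpu_cores // 2 // num_ddp_processes)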

@Mohbat_Tharani I have the same issue on both GPU and CPU, and I hard-coded everything to:

# create our PyTorch Lightning trainer to actually train the model
trainer = pl.Trainer(
    max_epochs=10,
    gradient_clip_val=100,
    log_every_n_steps=5,
    accelerator='cpu',
    num_processes=1,
    max_time={'minutes': 2},
    limit_train_batches=2
)

Not only does it freeze after 3 epochs, it also doesn’t time out after 2 minutes!

Maybe empty the CUDA cache after some steps/epochs: torch.cuda.empty_cache()
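If you want to try that, here is a small sketch of a callback that clears the cache at the end of every training epoch (the callback itself is an illustrative example, not from the thread):

import torch
import pytorch_lightning as pl

class EmptyCacheCallback(pl.Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        # Release cached memory blocks back to the driver after each training epoch.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

# trainer = pl.Trainer(..., callbacks=[EmptyCacheCallback()])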