I’m training a model on image and text input pairs from Flickr30k. Nothing special, just a ResNet18 for the images and an Embedding + GRU network for the text.
I’m training the model with PyTorch Lightning on two GPUs with the DDP strategy, 16-bit precision, a batch size of 512, and 8 workers in total. I defined a ModelCheckpoint callback that keeps the 5 best checkpoints and an EarlyStopping callback. Both callbacks monitor val_loss.
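For reference, the setup described above would look roughly like the sketch below. It is not the exact training script: the helper names `trainer_settings` and `build_trainer` are made up for illustration, `mode="min"` and the default EarlyStopping patience are assumptions, and the `accelerator`/`devices` arguments assume a reasonably recent Lightning release.

```python
from typing import Any, Dict

MONITOR = "val_loss"  # metric both callbacks watch


def trainer_settings() -> Dict[str, Any]:
    """Trainer arguments matching the description: 2 GPUs, DDP, 16-bit precision."""
    return {
        "accelerator": "gpu",
        "devices": 2,
        "strategy": "ddp",
        "precision": 16,
    }


def build_trainer():
    """Build the Trainer with both callbacks (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without Lightning installed.
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

    # Keep the 5 checkpoints with the lowest val_loss (mode="min" is an assumption).
    checkpoint = ModelCheckpoint(monitor=MONITOR, mode="min", save_top_k=5)
    # Default patience is used here; the real value is not stated in this post.
    early_stop = EarlyStopping(monitor=MONITOR, mode="min")
    return pl.Trainer(callbacks=[checkpoint, early_stop], **trainer_settings())
```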
def train_dataloader(self) -> DataLoader:
    return DataLoader(
        self.train_data,
        batch_size=self.batch_size,
        num_workers=self.conf.data.num_workers,
        pin_memory=True,
        shuffle=True,
        drop_last=True,
        collate_fn=collate_fn,
    )

def val_dataloader(self) -> DataLoader:
    return DataLoader(
        self.val_data,
        batch_size=self.batch_size,
        num_workers=self.conf.data.num_workers,
        pin_memory=True,
        drop_last=True,
        collate_fn=collate_fn,
    )
The problem is that after the sixth epoch starts, training appears to hang, as if the model had entered an infinite loop. GPU utilization sits at 100% while CPU usage drops dramatically. I can’t figure out where the problem might be. Any ideas?
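One way to see where each process is stuck, sketched here using only the standard-library `faulthandler` module, is to have Python periodically dump the stack of every thread. The helper names and the 10-minute interval are assumptions for illustration. Under DDP it would have to be called in every rank, for example at the top of the training script.

```python
import faulthandler
import sys


def enable_hang_dumps(interval_s: float = 600.0, file=sys.stderr) -> None:
    """Write all thread stacks to `file` every `interval_s` seconds until cancelled."""
    # `file` must be a real file object (it needs a file descriptor).
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)


def disable_hang_dumps() -> None:
    """Stop the periodic dumps started by enable_hang_dumps()."""
    faulthandler.cancel_dump_traceback_later()
```

If the dumps from the two ranks show them waiting in different places, that would suggest the processes have fallen out of step rather than the model itself looping.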