UCF101 training pauses periodically - bottleneck

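import torchvision
from torch.utils.data import DataLoader

# Excerpt from inside a class, hence the references to `self`.
num_cpu = 12
self.train_dataset = torchvision.datasets.UCF101(
    UCF101_ROOT_PATH,
    UCF101_ANNO_PATH,
    frames_per_clip=12,
    step_between_clips=100,
    num_workers=num_cpu,  # workers used while indexing the videos into clips
    train=True,
    transform=train_transforms,
    fold=self.fold,
)

dataloader = DataLoader(
    self.train_dataset,
    batch_size=self.batch_size,
    num_workers=num_cpu,  # worker processes that load batches during training
    collate_fn=custom_collate,
    shuffle=True,
    pin_memory=True,
)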

Training freezes every 50 epochs: one CPU core sits at 100% (even though I have 12 cores) and GPU-Util shows 0% in nvidia-smi.
After some time, training resumes. This behavior repeats every 50 epochs.

Did you explicitly move the data to the CUDA device?

I use PyTorch Lightning, and it takes care of moving the data to the CUDA device.

Well, I don’t have any experience with PyTorch Lightning, but it seems the problem occurs while your data is being transferred to the GPU. Is your network architecture very large? If so, try a smaller one and see whether the same problem still occurs.
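
One way to check where the time goes is to measure how long each batch takes to arrive from the loader, separately from the time spent in the training step. Below is a minimal sketch of that idea. The names `timed_batches` and `find_stalls` are hypothetical helpers, not part of PyTorch or PyTorch Lightning, and the one-second threshold is an arbitrary assumption. It works on any iterable, so a `DataLoader` can be passed in directly.

```python
import time
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def timed_batches(loader: Iterable[T]) -> Iterator[Tuple[float, T]]:
    """Yield (seconds_waited, batch) for every batch of any iterable.

    Only the time spent waiting for the loader is measured. Whatever the
    caller does with the batch between iterations is excluded.
    """
    start = time.perf_counter()
    # Creating the iterator is included in the first measurement, since a
    # loader may do setup work here (for example starting worker processes).
    iterator = iter(loader)
    while True:
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield time.perf_counter() - start, batch
        # Execution resumes here once the caller asks for the next batch,
        # so the caller's own processing time is not counted.
        start = time.perf_counter()


def find_stalls(loader: Iterable[T], threshold_s: float = 1.0) -> List[Tuple[int, float]]:
    """Return (batch_index, wait_seconds) for fetches slower than threshold_s."""
    stalls = []
    for index, (waited, _batch) in enumerate(timed_batches(loader)):
        if waited > threshold_s:
            stalls.append((index, waited))
    return stalls
```

If `find_stalls(dataloader)` reports long waits at regular intervals, the pause is probably on the data-loading side. If the waits stay short while the GPU still idles, the device transfer or something else in the training loop is the more likely suspect.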

It was as you said: a lot of data was being moved between devices. The culprit was PyTorch Lightning’s LightningDataModule.