Dataloading of video sequences takes much longer with DistributedSampler

Hi,
I work with videos and have a dataloader that fetches short 3-frame sequences for training and full 150-frame sequences for validation. The training loader is usually fast (although in the DDP regime the DataLoader can take a while to initialize), but the validation loader is much slower than its non-DDP counterpart: non-DDP dataloading takes ~1.5 s per sequence, while with DDP it takes ~4 min. Most of those 4 minutes are spent on dataloader initialization, during which everything is frozen. Once iteration starts, each step in the loop is also slower than in the non-DDP setting, but not dramatically so. Adding more GPUs in a single-node setup makes the problem worse.
Here is my setup:

train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
train_loader = DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    collate_fn=collate_fn,
    num_workers=args.num_workers,
)
val_sampler = DistributedSampler(val_dataset, num_replicas=world_size, rank=rank)
val_batch_size = min(max(1, args.num_workers), args.batch_size)
val_loader = DataLoader(
    val_dataset,
    batch_size=val_batch_size,
    sampler=val_sampler,
    collate_fn=collate_fn,
    num_workers=args.num_workers,
)

Here args.num_workers=4. In the non-DDP setup, the val_loader looks like this:

val_loader = DataLoader(
    val_dataset,
    batch_size=val_batch_size,
    shuffle=False,
    collate_fn=collate_fn,
    num_workers=args.num_workers,
)

Every epoch I iterate over each loader once.
I tried persistent_workers=True, pin_memory=True, decreasing val_batch_size, setting num_workers=0, and NCCL_P2P_DISABLE=1, but none of these improved the dataloading speed for the validation set. While I could change the way I load validation data, I would like to know whether the current setup can be optimized.
My questions are: how can I further optimize my setup to reduce dataloading freezes during validation, and how should dataloaders be written for distributed mode in general?

I would assume copying the Dataset might take the majority of the time. Are you eagerly pre-loading all samples or are you using lazy loading?

Samples are lazy-loaded within __getitem__. At initialization, the dataset only reads several small files from disk, each up to a few KB in size.
I also experimented with multithreading within __getitem__ for loading long sequences. It gives a substantial speedup in other settings, but not in the case described above.
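Roughly, a simplified sketch of what that per-sequence multithreading looks like (the class, _load_frame, and the file layout here are placeholders, not the exact code):

from concurrent.futures import ThreadPoolExecutor

import torch
from torch.utils.data import Dataset

class VideoSequenceDataset(Dataset):
    def __init__(self, sequences):
        # sequences: list of lists of per-frame file paths (metadata only, no pixel data)
        self.sequences = sequences

    def _load_frame(self, path):
        # Placeholder for the real per-frame read/decode step
        with open(path, "rb") as f:
            raw = f.read()
        return torch.frombuffer(bytearray(raw), dtype=torch.uint8)

    def __getitem__(self, idx):
        paths = self.sequences[idx]
        # Read all frames of one sequence concurrently so the disk waits overlap
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(self._load_frame, paths))

    def __len__(self):
        return len(self.sequences)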

Try profiling this section, or, for debugging, replace the actual disk reads with fake or static data to see whether the disk reads are what slows down the worker initialization.
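For example, a minimal way to test this (the shapes and sizes below are made up; keep your real collate_fn and DistributedSampler if you want a closer comparison):

import time

import torch
from torch.utils.data import DataLoader, Dataset

class FakeVideoDataset(Dataset):
    # Returns the same pre-built 150-frame sequence for every index,
    # so any remaining slowness cannot come from disk reads.
    def __init__(self, num_sequences=100, seq_len=150):
        self.num_sequences = num_sequences
        self.sample = torch.zeros(seq_len, 3, 224, 224)  # placeholder frame size

    def __len__(self):
        return self.num_sequences

    def __getitem__(self, idx):
        return self.sample

fake_loader = DataLoader(FakeVideoDataset(), batch_size=1, num_workers=4)

t0 = time.perf_counter()
it = iter(fake_loader)   # worker processes start here
_ = next(it)             # first batch arrives here
print(f"init + first batch: {time.perf_counter() - t0:.2f}s")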