Hi,
I work with videos and have a dataloader that fetches short 3-frame sequences for training and full 150-frame sequences for validation. Training loading is usually fast, although in the DDP regime the DataLoader can take a while to initialize. Validation loading, however, is much slower than in the non-DDP version: non-DDP dataloading takes ~1.5 s per sequence, while under DDP it takes ~4 min. Most of those 4 min are spent on DataLoader initialization, during which everything is frozen. Once iteration starts, each step in the loop is also slower than in the non-DDP setting, but not dramatically so. Adding more GPUs in a single-node setup makes the problem worse.
Here is my setup:
train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
train_loader = DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    collate_fn=collate_fn,
    num_workers=args.num_workers,
)
val_sampler = DistributedSampler(val_dataset, num_replicas=world_size, rank=rank)
val_batch_size = min(max(1, args.num_workers), args.batch_size)
val_loader = DataLoader(
    val_dataset,
    batch_size=val_batch_size,
    sampler=val_sampler,
    collate_fn=collate_fn,
    num_workers=args.num_workers,
)
Here, args.num_workers=4.
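(So val_batch_size = min(max(1, 4), args.batch_size) = min(4, args.batch_size), i.e. 4 as long as args.batch_size >= 4.)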
In the non-DDP setup, the val_loader looks like this:
val_loader = DataLoader(
    val_dataset,
    batch_size=val_batch_size,
    shuffle=False,
    collate_fn=collate_fn,
    num_workers=args.num_workers,
)
Every epoch I iterate over each loader once.
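For reference, here is roughly how the time splits during the validation pass (a simplified sketch, not my exact code; rank and val_loader are as above):

import time

# Creating the iterator spawns the worker processes; the first next()
# blocks until a worker has produced a batch. Under DDP, almost all of
# the ~4 min is spent here.
t0 = time.time()
val_iter = iter(val_loader)
batch = next(val_iter)
print(f"rank {rank}: worker start-up + first batch: {time.time() - t0:.1f}s")

# Subsequent batches are slower than in the non-DDP run, but not by much.
t0 = time.time()
for step, batch in enumerate(val_iter, start=1):
    print(f"rank {rank}: batch {step} fetched after {time.time() - t0:.1f}s")
    t0 = time.time()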
I tried persistent_workers=True, pin_memory=True, decreasing val_batch_size, setting num_workers=0, and NCCL_P2P_DISABLE=1, but none of these improved the dataloading speed for the validation set. While I could change the way I load validation data, I would like to know whether the current setup itself can be optimized.
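One of the variants I tried looked roughly like this (simplified sketch, not the exact code):

val_loader = DataLoader(
    val_dataset,
    batch_size=val_batch_size,       # also tried smaller values
    sampler=val_sampler,
    collate_fn=collate_fn,
    num_workers=args.num_workers,    # also tried num_workers=0 (with persistent_workers removed)
    persistent_workers=True,
    pin_memory=True,
)
# The run was also repeated with NCCL_P2P_DISABLE=1 set in the environment.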
My questions: how can I further optimize this setup to reduce the dataloading freezes during validation, and how should dataloaders be written for distributed mode in general?