Hi. There is some strange behavior of the validation loader in my training loop, and I can't pinpoint what causes it. Basically, I'm running a standard training loop with validation every N steps:
model.train()
for epoch in range(max(0, last_epoch), training_epochs):
    for i, batch in enumerate(train_loader):
        audio_clean, audio_ns = batch
        audio_clean = audio_clean.squeeze(0)
        audio_ns = audio_ns.squeeze(0)
        mel_clean = torch.log(cumulative_laplace_norm(melspec(audio_clean)) + 1e-6).permute(0, 2, 1).to(device)
        mel_ns = torch.log(cumulative_laplace_norm(melspec(audio_ns)) + 1e-6).permute(0, 2, 1).to(device)

        # train step
        optim_g.zero_grad()
        mel_output = model(mel_ns)
        g_loss = mel_MSE(mel_clean, mel_output)
        g_loss.backward()
        optim_g.step()

        if rank == 0:
            # validation
            if steps % a.validation_interval == 0:
                model.eval()
                val_loss_tot_tot = 0
                with torch.no_grad():
                    for j, batch in enumerate(validation_loader):
                        audio_clean, audio_ns = batch
                        mel_clean = torch.log(cumulative_laplace_norm(melspec(audio_clean)) + 1e-6).permute(0, 2, 1).to(device)
                        mel_ns = torch.log(cumulative_laplace_norm(melspec(audio_ns)) + 1e-6).permute(0, 2, 1).to(device)
                        mel_output = model(mel_ns)
                        val_loss = mel_MSE(mel_clean, mel_output)
                        val_loss_tot_tot += val_loss.item()
                    val_loss_tot = val_loss_tot_tot / (j + 1)
                model.train()

        if scheduler_g:
            scheduler_g.step()
        steps += 1
The train sampler and the validation sampler are identical (different datasets, obviously, but with the same structure):
train_sampler = RandomSampler(trainset)
train_loader = DataLoader(
    trainset,
    num_workers=24,
    shuffle=False,
    sampler=train_sampler,
    batch_size=32,
    pin_memory=True,
    drop_last=True,
    prefetch_factor=4,
    persistent_workers=True,
)
val_sampler = RandomSampler(validset)
validation_loader = DataLoader(
    validset,
    num_workers=24,
    shuffle=False,
    sampler=val_sampler,
    batch_size=32,
    pin_memory=True,
    drop_last=True,
    prefetch_factor=4,
    persistent_workers=True,
)
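Just so the numbers are explicit (this is my reading of the DataLoader docs, and the snippet below is only an illustration, not part of my training script): each worker prefetches prefetch_factor batches, so each of these loaders can keep up to 24 * 4 = 96 batches in flight, and with persistent_workers=True both worker pools stay alive between validation rounds.

import multiprocessing as mp

# Illustrative check only, not in my actual script: after touching both loaders
# once, persistent_workers=True should keep both pools alive, i.e. roughly
# 2 * 24 = 48 child processes for this single run.
next(iter(train_loader))
next(iter(validation_loader))
print("live DataLoader worker processes:", len(mp.active_children()))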
However, my validation behaves weirdly. It samples exactly num_workers batches from the loader, then hangs for almost a minute, then samples another num_workers batches and hangs, and so on. This does not happen during training; sampling there goes smoothly the whole time. Changing num_workers only changes how quickly the val loader reaches the hang. persistent_workers and prefetch_factor don't affect this at all.
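To make the symptom concrete, here is the kind of bare timing loop I have in mind (an illustrative sketch, time_batches is not a function from my script): it pulls batches straight from validation_loader with no model and no .to(device) involved, and it is where I would expect to see num_workers fast fetches followed by a long stall.

import time

# Illustrative sketch, not part of my training script: fetch batches straight
# from the loader and print how long each fetch takes, with no model involved.
def time_batches(loader, max_batches=64):
    it = iter(loader)
    for j in range(max_batches):
        t0 = time.perf_counter()
        try:
            next(it)
        except StopIteration:
            break
        print(f"batch {j}: {time.perf_counter() - t0:.2f}s")

time_batches(validation_loader)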
I'm NOT running DDP; this is running on a single GPU.
I could blame the fact that I'm running several trainings per GPU on the same machine (CPU bottlenecking), but this happens even when only a single training is up.
I assume it's not connected to .to(device), since the loader hangs at the very beginning of the validation cycle.
Screenshot of GPU activity during validation from btop:
Worth noting that my dataloader is quite heavy and operates on the CPU (I'm working with audio files, and there is a bunch of augmentations that aren't possible with GPU functions). But again, it's completely fine during the training phase.
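If it helps narrow things down, this is the kind of comparison I can run to check whether per-sample CPU cost differs between the two datasets, bypassing the DataLoader and its workers entirely (again an illustrative sketch, time_samples is not part of my script):

import time

# Illustrative sketch, not in my actual script: time raw __getitem__ calls on
# both datasets, no DataLoader / workers involved, to compare per-sample cost.
def time_samples(ds, n=32):
    n = min(n, len(ds))
    t0 = time.perf_counter()
    for i in range(n):
        _ = ds[i]
    return (time.perf_counter() - t0) / n

print("trainset, s/sample:", time_samples(trainset))
print("validset, s/sample:", time_samples(validset))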
Am I missing something, or is this purely a problem with the dataloader?