When I tried DDP (DistributedDataParallel) to train a model, I found it was not difficult. For testing, though, I found several different code examples. The simplest one tests the model only on GPU 0 while all the other processes wait. But with that approach I got an error like this:
[E ProcessGroupNCCL.cpp:566] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806103 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
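As far as I can tell, this is roughly the pattern that failed for me (a sketch; test() and test_loader are my own helpers, not shown here):

    # Sketch of the "evaluate only on rank 0" pattern that timed out for me.
    # test() is my own helper that loops over the whole test set.
    if dist.get_rank() == 0:
        acc = test(test_loader, model)   # rank 0 is busy here for a long time
    # Meanwhile the other ranks continue into the next loss.backward(),
    # and its gradient all-reduce waits for rank 0 until the 30-minute
    # NCCL watchdog (Timeout(ms)=1800000) kills the job.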
Besides, testing only on the main GPU wastes the computing resources of the other GPUs. So I changed my code like this:
n_iter = 0
best_acc = 0.0
for epoch_id in range(max_epoch):
    train_loader.sampler.set_epoch(epoch_id)  # reshuffle each rank's shard every epoch
    for it, (input, category) in enumerate(train_loader):
        model.train()
        optim.zero_grad()
        logits = model(input)
        loss = criterion(logits, category)
        loss.backward()
        optim.step()
        n_iter += 1
        if n_iter % 1000 == 0:
            dist.barrier()                    # keep all ranks in sync before testing
            acc = test(test_loader, model)    # dist.reduce inside test()
            if dist.get_rank() == 0:
                if acc > best_acc:
                    best_acc = acc
                    save(model)
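For reference, my test() looks roughly like this (a sketch under my assumptions: test_loader uses a DistributedSampler so every rank evaluates a disjoint shard, and device is the current rank's GPU):

    import torch
    import torch.distributed as dist

    @torch.no_grad()
    def test(test_loader, model):
        model.eval()
        # Per-rank correct/total counts, accumulated on this rank's GPU.
        # 'device' is assumed to be this rank's CUDA device.
        correct = torch.zeros(1, device=device)
        total = torch.zeros(1, device=device)
        for input, category in test_loader:
            logits = model(input.to(device))
            pred = logits.argmax(dim=1)
            correct += (pred == category.to(device)).sum()
            total += category.numel()
        # Sum the counts from all ranks onto rank 0; the returned accuracy
        # is therefore only meaningful on rank 0.
        dist.reduce(correct, dst=0, op=dist.ReduceOp.SUM)
        dist.reduce(total, dst=0, op=dist.ReduceOp.SUM)
        return (correct / total).item()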
I don’t want to test the model at the end of each epoch because the dataset is really large, so I use a counter to test the model every 1000 iterations. I call dist.barrier() to keep the multiple processes in sync before testing, and I use dist.reduce() to collect the results.
My question is: is this the right way to test a model with DDP? When my code enters the test() function, are the model weights in the different processes the same or not? And am I using dist.barrier() in the right way?