A question about testing a model in DDP

When I used DDP to train a model, I found it was not difficult. For testing, however, I found several different code examples. The simplest is to test the model only on GPU 0 and stop all other processes. But when I tried that, I got an error like this:

[E ProcessGroupNCCL.cpp:566] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806103 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

I also think testing only on the main GPU is a waste of computing resources, so I changed my code like this:

    global_step = 0  # counts iterations across epochs; avoids shadowing the built-in iter()
    for epoch_id in range(max_epoch):
        for it, (input, category) in enumerate(train_loader):
            logits = model(input)
            loss = criterion(logits, category)

            global_step += 1
            if global_step % 1000 == 0:
                acc = test(test_loader, model)  # dist.reduce here
                if dist.get_rank() == 0:
                    if acc > best_acc:

I don’t want to test the model at the end of each epoch because the dataset is really large, so I use a counter to test the model every 1000 iterations. I call dist.barrier() to keep the processes in sync for testing, and I use dist.reduce() to collect the results.
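For reference, here is a minimal sketch of what a test function of this kind might look like. It is not the code from the post: the name `evaluate`, the `device` argument, and the assumption of a classification model that returns logits are all hypothetical, and it uses `dist.all_reduce` rather than `dist.reduce` so that every rank ends up with the same accuracy.

```python
import torch
import torch.distributed as dist


@torch.no_grad()
def evaluate(model, loader, device):
    """Return classification accuracy summed over all ranks.

    Hypothetical helper: every rank must call it, because the
    all_reduce calls below are collective operations.
    """
    was_training = model.training
    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        preds = model(inputs).argmax(dim=1)
        correct += (preds == targets).sum()
        total += targets.numel()
    # Only aggregate when a process group exists, so the same
    # function also works in a plain single-process run.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
        dist.all_reduce(total, op=dist.ReduceOp.SUM)
    model.train(was_training)  # restore the previous mode
    return (correct / total).item()
```

Because each rank has to take part in the collective calls before any of them gets a result, the ranks already wait for each other at that point.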

My question is: is this the right way to test a model in DDP? When my code enters the test() function, are the model weights in the different processes the same or not? Am I using dist.barrier() in the right way?

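I think your example should work as expected whether or not there is a dist.barrier(). DDP synchronizes gradients across processes during loss.backward(), so every process applies the same update and holds the same weights when it enters test().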
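If you want to confirm this yourself, one option is to compare each rank's parameters with rank 0's. The sketch below is only an illustration: `weights_in_sync` is a hypothetical helper name, and it assumes all parameters live on one device per process.

```python
import torch
import torch.distributed as dist


def weights_in_sync(model):
    """Return True if every rank holds the same parameters as rank 0.

    Hypothetical debugging helper; all ranks must call it together.
    """
    # Flatten all parameters into one vector so a single broadcast suffices.
    flat = torch.cat([p.detach().float().reshape(-1) for p in model.parameters()])
    if not (dist.is_available() and dist.is_initialized()):
        return True  # single process: nothing to compare against
    reference = flat.clone()
    dist.broadcast(reference, src=0)  # every rank receives rank 0's copy
    same = torch.tensor([float(torch.equal(reference, flat))], device=flat.device)
    # MIN over ranks: the result is 1.0 only if all ranks matched.
    dist.all_reduce(same, op=dist.ReduceOp.MIN)
    return bool(same.item())
```

Calling it right before test() on all ranks, for example `assert weights_in_sync(model)`, should pass in a normal DDP run.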

Testing/inference with DDP is somewhat trickier than training. If you use DistributedSampler to scatter your data, you should make sure the number of test samples is divisible by the number of GPUs; otherwise the results might be incorrect.
Say there are 100 samples in your test set and 8 GPUs (100 % 8 = 4). The DistributedSampler will repeat part of the data to expand it to 104 samples (104 % 8 = 0) so that the data can be split evenly across the GPUs. The repeated samples are then counted twice in the aggregated metric.
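To make the padding concrete, here is a pure-Python sketch of the 100-item, 8-GPU case. It mimics how DistributedSampler assigns indices when shuffle=False and drop_last=False; `shard_indices` and `num_real_samples` are hypothetical names, and the sketch assumes the dataset has at least as many items as there are GPUs.

```python
import math
from collections import Counter


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would receive (no shuffling, no drop_last)."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # pad by wrapping around
    return indices[rank:total:world_size]


def num_real_samples(dataset_len, world_size, rank):
    """How many of this rank's items are real data rather than padding."""
    per_rank = math.ceil(dataset_len / world_size)
    # The i-th item of this rank sits at position rank + i * world_size
    # in the padded list; positions >= dataset_len are padding.
    return sum(1 for i in range(per_rank) if rank + i * world_size < dataset_len)


shards = [shard_indices(100, 8, r) for r in range(8)]
seen = Counter(i for shard in shards for i in shard)
duplicated = sorted(i for i, n in seen.items() if n > 1)

print(sum(len(s) for s in shards))  # 104
print(duplicated)                   # [0, 1, 2, 3]
print([num_real_samples(100, 8, r) for r in range(8)])
# [13, 13, 13, 13, 12, 12, 12, 12]
```

If you cannot make the test set size divisible, one possible workaround along these lines is to have each rank count only its first `num_real_samples(...)` items when shuffling is off, so the padded copies do not enter the metric.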

Got it. I noticed this difference between training and testing, and I chose the number of GPUs and batch_size so that the test data divides evenly. It has worked well so far. Thank you!