DistributedDataParallel barrier doesn't work as expected during evaluation

@Euruson I think I’ve figured out the problem here. You are still using DDP for the validation phase even though it runs on only one rank. Even though you don’t run the backward pass during the eval phase, DDP’s forward pass might still invoke some collective operations (e.g. syncing buffers, or syncing indices when it rebuilds buckets the first time). As a result, your collective ops are mismatched: some of the collective ops from DDP’s forward pass on rank 0 match up with the barrier() call on rank 1, which lets rank 1 leave the barrier before rank 0 has reached its own barrier() call.
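To make the mismatch concrete, here is a toy model in plain Python, not real DDP code. It assumes collectives are paired across ranks purely by the order in which each rank issues them. The op names (`train_forward_sync`, `grad_allreduce`, `val_forward_sync`) are made-up labels for illustration, not actual DDP internals.

```python
from itertools import zip_longest


def pair_collectives(rank0_ops, rank1_ops):
    """Pair the i-th collective issued on rank 0 with the i-th on rank 1."""
    return list(zip_longest(rank0_ops, rank1_ops, fillvalue=None))


def mismatches(rank0_ops, rank1_ops):
    """Return the pairs whose two sides are not the same kind of collective."""
    return [(a, b) for a, b in pair_collectives(rank0_ops, rank1_ops) if a != b]


# Hypothetical sequence for one epoch when rank 0 validates through the DDP wrapper.
rank0_with_ddp_val = ["train_forward_sync", "grad_allreduce", "val_forward_sync", "barrier"]
rank1 = ["train_forward_sync", "grad_allreduce", "barrier"]

# Same epoch when rank 0 validates with the unwrapped module (no extra collective).
rank0_with_plain_val = ["train_forward_sync", "grad_allreduce", "barrier"]

broken = mismatches(rank0_with_ddp_val, rank1)
fixed = mismatches(rank0_with_plain_val, rank1)

print("DDP forward in val:", broken)
print("plain module in val:", fixed)
```

In the first case rank 1’s barrier is paired with a collective from rank 0’s validation forward pass, and rank 0’s own barrier has no partner. In the second case every collective lines up.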

If you make the following code change, your script seems to work as expected:

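if phase == "val":
    # Validation runs on one rank only: call the wrapped module directly
    # so DDP's forward pass (and its collective ops) is skipped.
    outputs = model.module(inputs)
else:
    # Training runs on every rank: go through the DDP wrapper as usual.
    outputs = model(inputs)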

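model.module retrieves the underlying non-replicated model, which you can use for validation. Calling it directly bypasses DDP’s forward pass, so no collective ops are issued during validation. With this change, the output on my local machine is as follows: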

19:39:05 071604     | Rank:0 - Epoch 0/24
19:39:05 071607     | Rank:1 - Epoch 0/24
19:39:05 071672     | ----------
19:39:08 620338     | Rank: 1 - train Loss: 0.4468 Acc: 0.7787
19:39:08 620479     | Rank:1 waiting before the barrier
19:39:08 651507     | Rank: 0 - train Loss: 0.5222 Acc: 0.7623
19:39:10 524626     | Rank: 0 - val Loss: 0.2312 Acc: 0.9281
19:39:10 524726     | Rank:0 waiting before the barrier
19:39:10 524973     | Rank:0 left the barrier
19:39:10 524994     | Rank:1 left the barrier
19:39:10 525106     | Rank:1 - Epoch 1/24
19:39:10 525123     | Rank:0 - Epoch 1/24
19:39:10 525156     | ----------
19:39:13 735254     | Rank: 1 - train Loss: 0.3994 Acc: 0.8197
19:39:13 735366     | Rank:1 waiting before the barrier
19:39:13 739752     | Rank: 0 - train Loss: 0.4128 Acc: 0.8197
19:39:15 298398     | Rank: 0 - val Loss: 0.2100 Acc: 0.9216
19:39:15 298483     | Rank:0 waiting before the barrier
19:39:15 298672     | Rank:0 left the barrier
19:39:15 298702     | Rank:0 - Epoch 2/24
19:39:15 298716     | ----------
19:39:15 298728     | Rank:1 left the barrier
19:39:15 298811     | Rank:1 - Epoch 2/24
19:39:18 586375     | Rank: 0 - train Loss: 0.4336 Acc: 0.8156
19:39:18 605651     | Rank: 1 - train Loss: 0.3094 Acc: 0.8893
19:39:18 605791     | Rank:1 waiting before the barrier
19:39:20 199963     | Rank: 0 - val Loss: 0.2205 Acc: 0.9216
19:39:20 200061     | Rank:0 waiting before the barrier
19:39:20 200296     | Rank:0 left the barrier
19:39:20 200329     | Rank:0 - Epoch 3/24
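If the same phase switch is needed in several places, it could be wrapped in a small helper. This is only a sketch: `unwrap_ddp`, `forward_for_phase`, and `FakeWrapper` are hypothetical names, and `FakeWrapper` is a stand-in that mimics the `.module` attribute so the example runs without a process group. The helper assumes that anything exposing a `.module` attribute is a DDP-style wrapper, which would misfire on a plain model that happens to have a submodule with that name.

```python
def unwrap_ddp(model):
    """Return model.module if the model looks like a DDP-style wrapper, else the model."""
    return getattr(model, "module", model)


def forward_for_phase(model, inputs, phase):
    """Run the forward pass, bypassing the wrapper during validation."""
    if phase == "val":
        return unwrap_ddp(model)(inputs)
    return model(inputs)


class FakeWrapper:
    """Stand-in for a DDP wrapper: counts how often its own forward path runs."""

    def __init__(self, module):
        self.module = module
        self.wrapped_calls = 0

    def __call__(self, inputs):
        self.wrapped_calls += 1  # a real DDP forward may issue collectives here
        return self.module(inputs)


def double(x):
    """Trivial stand-in for the underlying model."""
    return 2 * x


wrapper = FakeWrapper(double)
train_out = forward_for_phase(wrapper, 3, "train")  # goes through the wrapper
val_out = forward_for_phase(wrapper, 3, "val")      # calls wrapper.module directly
print(train_out, val_out, wrapper.wrapped_calls)
```

Both phases produce the same output, but only the train call passes through the wrapper’s forward path.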