epignatelli (Eduardo Pignatelli) | August 1, 2020, 6:11am | #1
What is the correct way to update a class variable once the model has been wrapped in DistributedDataParallel?
In the case below, if we take a snapshot of self.k for each model on each GPU at the same time, we can get different results. Any idea why that happens and how to solve it? I guess loss would be different across the models, so the callback would fire on some replicas but not others?
import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules and buffers
        self.fc = torch.nn.Linear(128, 128)
        # buffers must be tensors, not plain Python scalars
        self.register_buffer("k", torch.tensor(0))
        # update the buffer in place when the callback fires
        self.callback = lambda loss, k: k.add_(1)

    def forward(self, x):
        print(self.k)
        return self.fc(x)

    def training_step(self, x, y):
        y_hat = self(x)
        # the raw linear outputs are logits, so use the with_logits variant
        loss = torch.nn.functional.binary_cross_entropy_with_logits(y_hat, y)
        loss.backward()
        if loss > 1.:
            self.callback(loss, self.k)
mrshenli (Shen Li) | August 3, 2020, 2:43pm | #2
Hey @epignatelli,
When did you take the snapshot of self.k? DDP broadcasts all buffers from the rank 0 process to the other processes right before calling Module.forward; see the code below. So, given the above code, the buffer should be consistent across all processes by the time self.callback is launched.
# module buffer sync
if self.broadcast_buffers and len(self.modules_buffers[0]) > 0:
    # Synchronize buffers across processes.
    # The process with rank 0 is considered the authoritative copy.
    self._distributed_broadcast_coalesced(
        self.modules_buffers[0],
        self.broadcast_bucket_size)
    # only do intra-node buffer sync for replicated single-device
    # CUDA modules
    if self.device_ids and len(self.device_ids) > 1:
        # intra-node buffer sync
        result = comm.broadcast_coalesced(
            self.modules_buffers[0],
            self.device_ids,
            self.broadcast_bucket_size)
        for tensors, module_buffers in zip(result[1:],
                                           self.modules_buffers[1:]):
            for tensor, buffer in zip(tensors, module_buffers):
                buffer.set_(tensor)
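
To make the timing concrete, here is a minimal sketch (not from the original posts) that assumes a process group is already initialized, e.g. launched via torchrun with the gloo backend, and reuses the Model class from the question. A buffer mutated locally after forward looks different across ranks until the next forward call re-broadcasts rank 0's copy:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical demo function, illustrative only. Assumes
# dist.init_process_group has already been called on every rank.
def demo():
    rank = dist.get_rank()
    model = DDP(Model())  # broadcast_buffers=True is the default

    x = torch.randn(4, 128)
    model(x)                     # buffers synced from rank 0 here
    model.module.k.add_(rank)    # local, unsynchronized update
    print(rank, model.module.k)  # snapshots can now differ across ranks
    model(x)                     # rank 0's k is broadcast again
    print(rank, model.module.k)  # consistent once more

If you instead want each rank to keep its own value of k, you can construct DDP with broadcast_buffers=False, at the cost of losing the automatic sync for all buffers (including things like BatchNorm running statistics).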
mrshenli (Shen Li) | August 3, 2020, 2:44pm | #3
BTW, could you please add a "distributed" tag to distributed-training related questions, so that the people working on it can get back to you promptly? Thanks!