When I call the forward function of my model, it looks like buffer values are being overwritten. I have wrapped the model in DDP, and my training loop contains the following block:
print('1')
print(id(net.module.layer1.subModule1.subModules2[0].buffer1))
print(id(net.module.layer2.subModule1.subModules2[0].buffer1))
print(net.module.layer1.subModule1.subModules2[0].buffer1)
print(net.module.layer2.subModule1.subModules2[0].buffer1)
output = net(inputs)
Then, in my network definition, I have:
def forward(self, x):
    print('start')
    print(id(self.layer1.subModule1.subModules2[0].buffer1))
    print(id(self.layer2.subModule1.subModules2[0].buffer1))
    print(self.layer1.subModule1.subModules2[0].buffer1)
    print(self.layer2.subModule1.subModules2[0].buffer1)
This prints:
1
132397489023616
132397489023696
tensor([1], device='cuda:1')
tensor([1], device='cuda:1')
start
132397489023616
132397489023696
tensor([1233784], device='cuda:1')
tensor([0], device='cuda:1')
So clearly something about calling forward is overwriting the values. What could be causing this? It happens pretty consistently right at this line, and it's always on the first batch but not always in the first epoch, so I don't think it's just a random race condition from the other thread. I have seen other forum threads mention that overwriting is expected behavior, but the value is set to 1 at the beginning and is never changed anywhere, so these random values are not being directly copied over.
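A minimal sketch of how the comparison above could be extended to every buffer at once, rather than two hand-picked ones. The helper names `snapshot_buffers` and `changed_buffers` and the buffer names in the stand-in data are hypothetical; the helpers only assume that each buffer exposes `.tolist()`, as PyTorch tensors do.

```python
def snapshot_buffers(named_buffers):
    """Copy every buffer's values into plain Python lists, keyed by name."""
    snap = {}
    for name, buf in named_buffers:
        # Tensors expose .tolist(); plain sequences are copied as lists.
        snap[name] = buf.tolist() if hasattr(buf, "tolist") else list(buf)
    return snap


def changed_buffers(before, after):
    """Return the sorted names whose values differ between two snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))


# Stand-in data mirroring the values printed above (names are hypothetical).
before = snapshot_buffers([("layer1.buffer1", [1]), ("layer2.buffer1", [1])])
after = snapshot_buffers([("layer1.buffer1", [1233784]), ("layer2.buffer1", [0])])
print(changed_buffers(before, after))
```

In the training loop this would presumably be `before = snapshot_buffers(net.module.named_buffers())` right before `output = net(inputs)`, with `changed_buffers(before, snapshot_buffers(self.named_buffers()))` as the first statement of `forward`. Printing the process rank alongside the result may help: if only non-zero ranks report changes, the buffer synchronization that DDP performs at the start of forward (controlled by the `broadcast_buffers` constructor argument, which defaults to `True`) would be one candidate to rule out.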