I’ve come across a very peculiar bug in my code. Although I’ve been able to bypass it by adding a
`torch.cuda.synchronize()` in my loop, I really can’t understand why this synchronization is needed to get reliable outputs.
The real code I use is a bit complicated, so I’ve whipped up a dummy version of it that shows the general process:
```python
class PeculiarBug:
    def __init__(self, model1, model2, dataloader):
        # Two models, both on GPU. Iterating over dataloader returns images cast to GPU.
        self.model1 = model1
        self.model2 = model2
        self.dataloader = dataloader
        # HookManager adds forward hooks to intermediate layers of the input model.
        # Intermediate layer outputs are appended to a list within the HookManager.
        self.hook_manager1 = HookManager(self.model1)
        self.hook_manager2 = HookManager(self.model2)
        self.A = torch.zeros(10, 10).cuda()
        self.B = torch.zeros(10, 1).cuda()
        self.C = torch.zeros(1, 10).cuda()

    def calculate(self):
        tempA = torch.zeros(10, 10).cuda()
        tempB = torch.zeros(10, 1).cuda()
        tempC = torch.zeros(1, 10).cuda()
        for imgs in self.dataloader:
            self.model1(imgs)
            self.model2(imgs)
            # Get list of intermediate features from the hook managers
            layers1: List[Tensor] = self.hook_manager1.get_features()
            layers2: List[Tensor] = self.hook_manager2.get_features()
            # Calculate the tempA matrix. All matrix operations on GPU.
            # Updates self.A in place via Tensor.__iadd__
            # (e.g., something like self.A += tempA)
            self.calculate_A(layers1, layers2, tempA)
            # Calculate the tempB and tempC matrices. All operations are on GPU,
            # and updates to self.B and self.C are in place via Tensor.__iadd__
            # (e.g., something like self.B += tempB)
            self.calculate_BC(layers1, layers2, tempB, tempC)
            # self.A, B, and C have been updated in the functions.
            # Fill tempA, tempB, tempC back with zeros
            tempA.fill_(0)
            tempB.fill_(0)
            tempC.fill_(0)
            # Clear hook features
            self.hook_manager1.clear_features()
            self.hook_manager2.clear_features()
        return self.A / (self.B * self.C)

    # Example code for the calculate_BC function
    def calculate_BC(self, layers1, layers2, tempB, tempC):
        # tempB is changed in place.
        for start_idx in range(0, 50, group_size):
            end_idx = min(start_idx + group_size, 50)
            X = torch.stack([layers1[i] for i in range(start_idx, end_idx)], dim=0)
            tempB[0, start_idx:end_idx] += some_metric(X, X)
        self.B += tempB
        # similar operation for tempC and self.C
        ...
```
Given two identical models as inputs, the square matrix returned by
`calculate()` must have ones on the diagonal. HOWEVER, running this code at full speed returns a matrix with values < 1 on the diagonal. Interestingly, this bug shows the following properties:
- The bug seems to depend on how fast the loop runs. For example, it appears when using ResNet18 as the two models, but not when using ResNet50 (presumably because the forward pass is slower on ResNet50). Furthermore, adding a breakpoint inside the for loop and slowly stepping through it returns valid results. BUT, quickly stepping over breakpoints messes up the results again.
- `torch.cuda.synchronize()` can be added on any line within the for loop to bypass this issue (yes, I’ve checked every line).
- Following up on the previous point, it doesn’t even have to be `torch.cuda.synchronize()` specifically; for example, adding something like `if (self.A / (self.B * self.C)).diagonal().sum() < 50: breakpoint()` will not trigger the breakpoint and will return valid results. (In this case, the comparison operator seems to act as a synchronization point.) This makes it incredibly hard to debug, since trying to pinpoint the bug leads to the bug not appearing at all. It’s very akin to the “observer effect” in physics, where trying to observe the process affects the process itself.
- I have tried adding the assertions `len(layers1) == len(layers2)` and `len(layers1) == 10`, which do not raise an `AssertionError`. (Adding these assertions does not fix the bug.)
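The “observer effect” above matches how CUDA’s asynchronous execution model behaves: kernels are enqueued on a stream and the host races ahead, while any operation that needs the value on the CPU (a comparison, `.item()`, a print) blocks until the queued work finishes. Here is a rough host-side analogy in plain Python (no GPU needed; `FakeStream` and all names are invented for illustration, using a worker thread in place of the device):

```python
import queue
import threading

class FakeStream:
    """Toy model of a CUDA stream: ops run asynchronously, in FIFO order."""
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            op = self.q.get()
            op()                # execute the queued "kernel"
            self.q.task_done()

    def launch(self, op):
        self.q.put(op)          # returns immediately, like a kernel launch

    def synchronize(self):
        self.q.join()           # blocks until the queue drains, like torch.cuda.synchronize()

stream = FakeStream()
result = []

# The host enqueues work and races ahead; `result` may still be empty here.
stream.launch(lambda: result.append(42))

# Reading the value on the host forces a sync, the way a Python comparison
# or `.item()` on a CUDA tensor would.
stream.synchronize()
assert result == [42]
```

The point of the analogy is only that any host-side read is an implicit synchronization point, which is why inserting the `breakpoint()` check changed the observed behavior.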
In essence, it seems like `self.B` and `self.C` are not being updated in a timely manner (or even skipped completely?). I have suspected that the `self.B += tempB` and `tempB.fill_(0)` operations may be executing in the wrong order, but this doesn’t make sense theoretically, since they are both tensor operations and should be queued up in the correct order. Also, if `self.hook_manager1.clear_features()` were to run before any of the calculation functions, the code should raise other errors, such as a dimension mismatch.
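For what it’s worth, the intuition in the last paragraph is sound for a single CUDA stream: work submitted to one stream executes in FIFO order, so an accumulate enqueued before a `fill_(0)` also runs before it on the device. A minimal host-side sketch of that ordering guarantee, using a plain Python work queue in place of a GPU (all names are invented for illustration):

```python
import queue
import threading

work = queue.Queue()

def worker():
    # Single worker draining a FIFO queue, like one CUDA stream.
    while True:
        op = work.get()
        op()
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

B = [0.0]
tempB = [5.0]

# Enqueue the two ops in the same order the host submits them,
# like two kernels launched on the same stream.
work.put(lambda: B.__setitem__(0, B[0] + tempB[0]))  # analogue of self.B += tempB
work.put(lambda: tempB.__setitem__(0, 0.0))          # analogue of tempB.fill_(0)

work.join()  # drain the queue, like torch.cuda.synchronize()

# FIFO ordering means the accumulate saw tempB before it was zeroed.
assert B == [5.0] and tempB == [0.0]
```

So if a reordering is happening, it would have to involve work that is not all on the same stream (or not all on the device), rather than two ops queued on one stream swapping places.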
Has anyone experienced something similar, or can anyone provide insights into what may be going wrong? Adding synchronization points does solve the issue, but since I can’t pinpoint the root cause, I don’t know if it’s just a temporary fix.
Thanks for reading this long post!