I’ve come across a very peculiar bug in my code. Although I’ve been able to bypass it by adding a
`torch.cuda.synchronize()` in my loop, I really can’t understand why this synchronization is needed to get reliable outputs.
The real code I use is a bit complicated, so I’ve whipped up a dummy version of it that shows the general process:
```python
class PeculiarBug:
    def __init__(self, model1, model2, dataloader):
        # Two models, both on GPU. Iterating over dataloader returns images cast to GPU.
        self.model1 = model1
        self.model2 = model2
        self.dataloader = dataloader
        # HookManager adds forward hooks to intermediate layers of the input model.
        # Intermediate layer outputs are appended to a list within the HookManager.
        self.hook_manager1 = HookManager(self.model1)
        self.hook_manager2 = HookManager(self.model2)
        self.A = torch.zeros(10, 10).cuda()
        self.B = torch.zeros(10, 1).cuda()
        self.C = torch.zeros(1, 10).cuda()

    def calculate(self):
        tempA = torch.zeros(10, 10).cuda()
        tempB = torch.zeros(10, 1).cuda()
        tempC = torch.zeros(1, 10).cuda()
        for imgs in self.dataloader:
            self.model1(imgs)
            self.model2(imgs)
            # Get list of intermediate features from the hook managers
            layers1: List[Tensor] = self.hook_manager1.get_features()
            layers2: List[Tensor] = self.hook_manager2.get_features()
            # Calculate the tempA matrix. All matrix operations on GPU.
            # Updates self.A in place via Tensor.__iadd__
            # (e.g., something like self.A += tempA)
            self.calculate_A(layers1, layers2, tempA)
            # Calculate the tempB and tempC matrices. All operations are on GPU,
            # and updates to self.B and self.C are in place via Tensor.__iadd__
            # (e.g., something like self.B += tempB)
            self.calculate_BC(layers1, layers2, tempB, tempC)
            # self.A, B, and C have been updated in the functions.
            # Fill tempA, tempB, tempC back with zeros
            tempA.fill_(0)
            tempB.fill_(0)
            tempC.fill_(0)
            # Clear hook features
            self.hook_manager1.clear_features()
            self.hook_manager2.clear_features()
        return self.A / (self.B * self.C)

    # Example code for the calculate_BC function
    def calculate_BC(self, layers1, layers2, tempB, tempC):
        # tempB is changed in place.
        for start_idx in range(0, 50, group_size):
            end_idx = min(start_idx + group_size, 50)
            X = torch.stack([layers1[i] for i in range(start_idx, end_idx)], dim=0)
            tempB[0, start_idx:end_idx] += some_metric(X, X)
        self.B += tempB
        # similar operation for tempC and self.C
        ...
```
Given two identical models as inputs, the square matrix returned by
`calculate()` must have ones on the diagonal. HOWEVER, running this code at full speed returns a matrix with values < 1 on the diagonal. Interestingly, this bug shows the following properties:
- The bug seems to depend on how fast the loop runs. For example, it appears when using ResNet18 as the two models, but not when using ResNet50 (presumably because the forward pass is slower on ResNet50). Furthermore, adding a breakpoint inside the for loop and slowly stepping through it returns valid results. BUT, quickly stepping over breakpoints messes up the results again.
- `torch.cuda.synchronize()` can be added on any line within the for loop to bypass this issue (yes, I’ve checked every line).
- Following up on the previous point, it doesn’t even have to be `torch.cuda.synchronize()` specifically; for example, adding something like `if (self.A / (self.B * self.C)).diagonal().sum() < 50: breakpoint()` will not trigger the breakpoint and will return valid results. (In this case, the comparison operator seems to act as a synchronization point.) This makes it incredibly hard to debug, since trying to pinpoint the bug leads to the bug not appearing at all. It’s very akin to the “observer effect” in physics, where trying to observe the process affects the process itself.
- I have tried adding the assertions `len(layers1) == len(layers2)` and `len(layers1) == 10`, which do not raise an `AssertionError`. (Adding these assertions does not fix the bug.)
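The “observer effect” above matches how CUDA’s asynchronous execution model behaves: kernels are enqueued on a stream and the host races ahead, while any operation that needs the value on the CPU (a comparison, `.item()`, a print) blocks until the queued work finishes. Here is a rough host-side analogy in plain Python (no GPU needed; `FakeStream` and all names are invented for illustration, using a worker thread in place of the device):

```python
import queue
import threading

class FakeStream:
    """Toy model of a CUDA stream: ops run asynchronously, in FIFO order."""
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            op = self.q.get()
            op()                # execute the queued "kernel"
            self.q.task_done()

    def launch(self, op):
        self.q.put(op)          # returns immediately, like a kernel launch

    def synchronize(self):
        self.q.join()           # blocks until the queue drains, like torch.cuda.synchronize()

stream = FakeStream()
result = []

# The host enqueues work and races ahead; `result` may still be empty here.
stream.launch(lambda: result.append(42))

# Reading the value on the host forces a sync, the way a Python comparison
# or `.item()` on a CUDA tensor would.
stream.synchronize()
assert result == [42]
```

The point of the analogy is only that any host-side read is an implicit synchronization point, which is why inserting the `breakpoint()` check changed the observed behavior.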
In essence, it seems like `self.B` and `self.C` are not being updated in a timely manner (or even skipped completely?). I have suspected that the `self.B += tempB` and `tempB.fill_(0)` operations may be executing in the wrong order, but this doesn’t make sense theoretically, since they are both tensor operations and should be queued up in the correct order. Also, if `self.hook_manager1.clear_features()` were to run before any of the calculation functions, the code should raise other errors, such as a dimension mismatch.
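For what it’s worth, the intuition in the last paragraph is sound for a single CUDA stream: work submitted to one stream executes in FIFO order, so an accumulate enqueued before a `fill_(0)` also runs before it on the device. A minimal host-side sketch of that ordering guarantee, using a plain Python work queue in place of a GPU (all names are invented for illustration):

```python
import queue
import threading

work = queue.Queue()

def worker():
    # Single worker draining a FIFO queue, like one CUDA stream.
    while True:
        op = work.get()
        op()
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

B = [0.0]
tempB = [5.0]

# Enqueue the two ops in the same order the host submits them,
# like two kernels launched on the same stream.
work.put(lambda: B.__setitem__(0, B[0] + tempB[0]))  # analogue of self.B += tempB
work.put(lambda: tempB.__setitem__(0, 0.0))          # analogue of tempB.fill_(0)

work.join()  # drain the queue, like torch.cuda.synchronize()

# FIFO ordering means the accumulate saw tempB before it was zeroed.
assert B == [5.0] and tempB == [0.0]
```

So if a reordering is happening, it would have to involve work that is not all on the same stream (or not all on the device), rather than two ops queued on one stream swapping places.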
Has anyone experienced something similar, or can anyone provide insights into what may be going wrong? Adding synchronization points does solve the issue, but since I can’t pinpoint the root cause, I don’t know if it’s just a temporary fix.
Thanks for reading this long post!