Timing isn't working correctly

Hi, my code runs extremely slowly, so I was checking each step to find the reason. However, there seems to be something wrong with the timing in PyTorch:

t1 = torch.cuda.Event(enable_timing=True)
t2 = torch.cuda.Event(enable_timing=True)
t3 = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
t1.record()

mask_loss_2 = nn.functional.binary_cross_entropy(out_mask, target_mask)

torch.cuda.synchronize()
t2.record()

# self.mask_loss = nn.BCELoss(reduction='mean')
mask_loss = self.mask_loss(out_mask, target_mask)

torch.cuda.synchronize()
t3.record()

if show_time:
    torch.cuda.synchronize()
    print(" Loss time ", t1.elapsed_time(t2), t2.elapsed_time(t3))

and the results are

Loss time  594.671630859375 4.7564802169799805

But then I switched the order of the two steps:

torch.cuda.synchronize()
t1.record()

mask_loss = self.mask_loss(out_mask, target_mask)

torch.cuda.synchronize()
t2.record()

mask_loss_2 = nn.functional.binary_cross_entropy(out_mask, target_mask)

torch.cuda.synchronize()
t3.record()

and the results are

Loss time  580.7462158203125 4.7769598960876465

So it seems the first BCE loss call took much longer than the second in both cases.
I also feel this timing isn't accurate, because I'm pretty sure the major time consumption happens in some complex operations after this step. Do you know why this is happening?

Thanks!

I would recommend adding some warmup workloads to avoid accidentally profiling the CUDA context creation and general startup time, as seen here:

import torch

# warmup: trigger CUDA context creation and lazy initialization
x = torch.randn(1024, 1024, device='cuda')
out = torch.matmul(x, x)

# create events
t1 = torch.cuda.Event(enable_timing=True)
t2 = torch.cuda.Event(enable_timing=True)
t3 = torch.cuda.Event(enable_timing=True)

# profile a single matmul
t1.record()
out = torch.matmul(x, x)

# profile 10 matmuls
t2.record()
for _ in range(10):
    out = torch.matmul(x, x)

# end timer
t3.record()

# synchronize and print results
torch.cuda.synchronize()
print(" Loss time ", t1.elapsed_time(t2), t2.elapsed_time(t3))

Swapping the two workloads (the single matmul and the matmul loop) results in the expected swapped timings.
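
If you don't want to manage the warmup and synchronization manually, torch.utils.benchmark.Timer takes care of both (it performs warmup iterations and synchronizes CUDA when needed). A minimal sketch using the same placeholder workload:

import torch
import torch.utils.benchmark as benchmark

x = torch.randn(1024, 1024, device='cuda')

# Timer warms up the statement and synchronizes CUDA around the timed region
timer = benchmark.Timer(
    stmt='x @ x',
    globals={'x': x},
)
print(timer.timeit(100))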