How does torch.cuda.synchronize() behave?

Thanks so much for picking this up, @ptrblck! That clarifies some things for me. A related follow-up: in various blogs and snippets showing how to time a model forward pass correctly, I’ve seen a pattern like this:

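import torch

total_time = 0.0  # accumulated time in milliseconds
# CUDA events with timing enabled, used as start/end markers
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
with torch.inference_mode():
    starter.record()          # mark the start on the current stream
    model.inference(inp)
    ender.record()            # mark the end on the current stream
    torch.cuda.synchronize()  # block the host until all queued GPU work is done
total_time += starter.elapsed_time(ender)  # elapsed_time() returns milliseconds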

That said, at least once I’ve also seen it with ender.record() and torch.cuda.synchronize() swapped, and I can see why someone might do that: they may worry that ender.record() could run before the inference is done unless synchronization happens first, following the logic in this snippet. What I’m unsure about is whether ender.record() behaves differently from time.time() in this respect. Long question short: which order is correct, and why?
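
To make the comparison concrete, here is a minimal sketch of the two orderings I mean. The helper name time_forward and the sync_before_end flag are made up for illustration, and I call model(inp) instead of model.inference(inp) on the assumption that the model is a plain nn.Module. The extra ender.synchronize() is my own addition, so that elapsed_time() is only read once the end event has completed in either variant. The helper returns None on a machine without CUDA.

```python
import torch


def time_forward(model, inp, sync_before_end=False):
    """Time one forward pass with CUDA events; returns milliseconds, or None without CUDA.

    sync_before_end=False: ender.record() first, then torch.cuda.synchronize()
    sync_before_end=True:  torch.cuda.synchronize() first, then ender.record()
    """
    if not torch.cuda.is_available():
        return None  # CUDA events need a GPU; nothing to measure here

    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)

    with torch.inference_mode():
        starter.record()  # start marker on the current stream
        model(inp)
        if sync_before_end:
            torch.cuda.synchronize()  # host waits for the GPU, then records
            ender.record()
        else:
            ender.record()            # end marker queued on the current stream
            torch.cuda.synchronize()  # host waits for all queued work

    # Assumption for this sketch: wait on the end event explicitly so that
    # elapsed_time() is never called on an event that has not completed yet.
    ender.synchronize()
    return starter.elapsed_time(ender)
```

Calling it with sync_before_end=False and then True on the same model and input (after a few warm-up passes) should show whether the two orderings report different times.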