Does optimizer.step() call torch.cuda.synchronize()?

I want to ask a question which might seem a bit stupid. During training, each iteration needs the updated model after optimizer.step() is called. But don't we need to call torch.cuda.synchronize() at the end of each iteration to ensure that the weights have been updated, especially in the case of distributed training? Or is it called implicitly or internally somewhere? I looked at the optimizer code but did not find a synchronization call anywhere.


import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Simple model, loss, and optimizer just for illustration
model = nn.Linear(10, 1).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

iterations = 100

for i in range(iterations):
    model.train()  # Set model to training mode

    # Generate random data: 32 samples, each with 10 features
    inputs = torch.randn(32, 10, device=device)
    targets = torch.randn(32, 1, device=device)  # Random target values

    # Zero the gradients from the previous step
    optimizer.zero_grad()

    # Forward pass
    outputs = model(inputs)

    # Compute loss
    loss = criterion(outputs, targets)

    # Backward pass (calculate gradients)
    loss.backward()

    # Update model weights
    optimizer.step()

    # Synchronize -- is this needed?
    torch.cuda.synchronize()

No, you don’t need to synchronize your code. PyTorch enqueues the compute kernels onto the default CUDA stream, which makes sure the kernels are executed in order.
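As a minimal sketch (assuming a CUDA device is available; shapes and iteration counts are arbitrary), you can see this launch-and-execute-later behavior by timing how quickly the Python side returns compared to how long the GPU actually needs:

import time
import torch

x = torch.randn(4096, 4096, device="cuda")

torch.cuda.synchronize()           # start from an idle GPU
t0 = time.perf_counter()
for _ in range(50):
    x = x @ x                      # enqueue 50 large matmuls on the default stream
t1 = time.perf_counter()           # returns almost immediately: kernels are only enqueued

torch.cuda.synchronize()           # now actually wait for the GPU to finish
t2 = time.perf_counter()

print(f"launch time: {t1 - t0:.4f}s, time after sync: {t2 - t0:.4f}s")

The loop body finishes long before the matmuls do; the explicit synchronize only matters here because we want a meaningful wall-clock measurement.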

Thank you for the reply. I understand we do not have to call it explicitly, but it needs to happen somewhere, right? If so, where? Or am I mistaken?

No, you don’t need to synchronize the code at all unless you want to read an output value. In that case PyTorch will synchronize for you and wait until the GPU is done with all compute kernels.
You can think of the general workflow (without implicit or explicit synchronizations) as the CPU scheduling CUDA kernels using data pointers only. There is no need to synchronize anything, since all kernels are launched on the same CUDA stream and are thus guaranteed not to cause any race conditions. The optimizer.step() method will then enqueue the CUDA kernels needed for the actual weight update (again into the same stream). These kernels cannot be executed before the other, already enqueued kernels have finished.
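As a minimal sketch of such an implicit synchronization point (assuming a CUDA device), reading a result back to the CPU, e.g. via .item() or .cpu(), forces PyTorch to wait for the GPU:

import torch

a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")

c = a @ b          # enqueued on the default stream, the call returns immediately
d = c.sum()        # still only enqueued

value = d.item()   # copies the scalar to the CPU -> PyTorch waits for the GPU here
print(value)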

The story differs if you are using custom CUDA streams, as you would then be responsible for the synchronizations, since multiple kernels can overlap on the GPU.
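A minimal sketch of that case (the stream handling here is illustrative, not taken from an actual training loop): with a side stream you have to order the work against the default stream yourself, e.g. via wait_stream:

import torch

side_stream = torch.cuda.Stream()

x = torch.randn(1024, 1024, device="cuda")
y = x @ x                                     # runs on the default stream

# Make the side stream wait for the default stream before consuming y,
# otherwise the kernels could race.
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    z = y @ y                                 # now safe to use y on the side stream

# Before the default stream (or the CPU) uses z, wait for the side stream.
torch.cuda.current_stream().wait_stream(side_stream)
print(z.sum().item())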
