I want to ask a question that might seem a bit stupid. During training, each iteration needs the updated model after optimizer.step() is called. But don't we need to call torch.cuda.synchronize() at the end of each iteration to ensure that the weights have been updated, or is it called internally somewhere, especially in the case of distributed training? Or is it called implicitly? I looked at the optimizer code but did not find a synchronization call anywhere.
import torch
import torch.nn as nn

# Minimal setup (the original snippet does not show these definitions)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
iterations = 100

for i in range(iterations):
    model.train()  # Set model to training mode
    # Generate random data: 32 samples, each with 10 features
    inputs = torch.randn(32, 10, device=device)
    targets = torch.randn(32, 1, device=device)  # Random target values
    # Zero the gradients from the previous step
    optimizer.zero_grad()
    # Forward pass
    outputs = model(inputs)
    # Compute loss
    loss = criterion(outputs, targets)
    # Backward pass (calculate gradients)
    loss.backward()
    # Update model weights
    optimizer.step()
    # synchronize -- is this needed?
    torch.cuda.synchronize()
No, you don’t need to synchronize your code. PyTorch enqueues the compute kernels on the default CUDA stream, which makes sure the kernels are executed in order.
No, you don’t need to synchronize the code at all unless you want to read an output value. In that case PyTorch will synchronize for you and wait until the GPU is done with all pending compute kernels.
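For example, a minimal sketch of this implicit synchronization (the tensor sizes are arbitrary): reading a GPU scalar back to the CPU, e.g. via .item(), forces PyTorch to wait for the kernels that produce that value.

import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x                 # kernel is enqueued, the call returns immediately
    value = y.sum().item()    # copying the scalar to the CPU implicitly synchronizes
    print(value)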
You can think about the general workflow (without implicit or explicit synchronizations) as the CPU scheduling CUDA kernels using data pointers only. There is no need to synchronize anything, since all kernels are launched on the same CUDA stream and are thus guaranteed not to cause any race conditions. The optimizer.step() method will then enqueue the CUDA kernels needed for the actual weight update (again into the same stream). These kernels cannot start executing before the previously enqueued kernels have finished.
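You can see the asynchronous launches and the in-order execution with a small timing sketch (matrix size and iteration count are arbitrary): the launch loop returns almost immediately, while torch.cuda.synchronize() blocks the CPU until every enqueued kernel has finished.

import time
import torch

if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")
    t0 = time.perf_counter()
    for _ in range(10):
        a = a @ a                  # 10 matmuls enqueued on the default stream
    t1 = time.perf_counter()       # reached almost immediately: launches are async
    torch.cuda.synchronize()       # block until all enqueued kernels are done
    t2 = time.perf_counter()
    print(f"launch time: {t1 - t0:.4f}s, time incl. GPU work: {t2 - t0:.4f}s")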
The story differs if you are using custom CUDA streams, as you would then be responsible for the synchronizations, since kernels on different streams can overlap on the GPU.
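A sketch of what that responsibility looks like (tensor sizes are arbitrary): with a side stream you have to order the work yourself, e.g. via Stream.wait_stream(), before tensors produced on one stream are used on another.

import torch

if torch.cuda.is_available():
    side = torch.cuda.Stream()
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x                                      # enqueued on the default stream
    side.wait_stream(torch.cuda.current_stream())  # side stream waits until y is ready
    with torch.cuda.stream(side):
        z = y @ y                                  # safe: runs on the side stream after y
    torch.cuda.current_stream().wait_stream(side)  # default stream waits before reading z
    print(z.sum().item())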