How can I measure how much time one of the GPUs is straggling?

When I am training a job with multiple GPUs on a single server, I suspect that one of the GPUs is running somewhat slower than the others.

For example, I suspect that 3 of the 4 GPUs finish the forward and backward passes in 1 s while the remaining one finishes in (1 + X) s. During X, the 3 faster GPUs are idle, waiting with no GPU utilization. Only after X does PyTorch start to synchronize the model weights/parameters across the 4 GPUs.

How can I measure X? Thanks!

You could use e.g. Nsight Systems to create a timeline of the run and see the activity of each GPU.
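To make the per-GPU phases easier to read in the timeline, you could annotate the training step with NVTX ranges. Here is a minimal, untested sketch (names like `model`, `loader`, `criterion`, and `optimizer` are placeholders for your own training code):

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer):
    for inputs, targets in loader:
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)

        # Named ranges show up on each GPU's row in the Nsight Systems timeline
        torch.cuda.nvtx.range_push("forward")
        loss = criterion(model(inputs), targets)
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("backward")  # DDP also launches the gradient all-reduce here
        loss.backward()
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("optimizer_step")
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.nvtx.range_pop()
```

Then launch the job under Nsight Systems, e.g. (assuming a single-node 4-GPU launch via `torchrun` and that `train.py` is your training script; depending on your nsys version you might need additional options to follow the worker processes):

```
nsys profile -o straggler_report --trace=cuda,nvtx \
    torchrun --nproc_per_node=4 train.py
```

In the resulting timeline, the NCCL all-reduce kernels on the faster GPUs will typically appear stretched, because they include the time spent waiting for the straggler, which should roughly correspond to your X.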