Correctly profiling the whole loading-training-writing pipeline

I am training my networks on two different machines.

They have nearly the same GPU: the first has a Tesla V100 16GB with NVLink, the second a Tesla V100 32GB with PCIe. The main differences are in the CPUs and storage, yet the same script runs much slower on the second machine (2x or even more in some cases).

I would like to understand where the bottleneck is on the second machine. In particular, I want to measure precisely how much time each of these operations takes: data loading, the CPU->GPU transfer, the GPU computation as a whole (forward + backward + update), the GPU->CPU transfer, and data writing (e.g. dumping the output images).

Can I rely on simply measuring the time between the calls in my script?
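To make the question concrete, here is a minimal sketch of the kind of per-stage wall-clock timing I mean (the model, loader, and optimizer names are placeholders, not my actual script):

```python
import time
import torch
import torch.nn.functional as F

def naive_profile(model, loader, optimizer, device):
    # Naive approach: take a timestamp between consecutive calls,
    # with no explicit CUDA synchronization anywhere.
    end = time.perf_counter()
    for batch_idx, (inputs, targets) in enumerate(loader):
        t_load = time.perf_counter()                 # data loading (loader yield)

        inputs = inputs.to(device)                   # CPU -> GPU transfer
        targets = targets.to(device)
        t_h2d = time.perf_counter()

        optimizer.zero_grad()
        outputs = model(inputs)                      # forward
        loss = F.mse_loss(outputs, targets)
        loss.backward()                              # backward
        optimizer.step()                             # update
        t_gpu = time.perf_counter()

        results = outputs.detach().cpu()             # GPU -> CPU transfer
        t_d2h = time.perf_counter()

        torch.save(results, f"out_{batch_idx}.pt")   # writing the outputs
        t_write = time.perf_counter()

        print(f"load={t_load - end:.4f}s  h2d={t_h2d - t_load:.4f}s  "
              f"gpu={t_gpu - t_h2d:.4f}s  d2h={t_d2h - t_gpu:.4f}s  "
              f"write={t_write - t_d2h:.4f}s")
        end = time.perf_counter()
```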

I have already tried this approach, and it suggests that the GPU computation on the second machine is much slower, but this makes no sense to me since the specs say the two GPUs should deliver almost the same performance.

Is it possible that the asynchronous execution on CUDA streams is giving me misleading results?
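For comparison, this is the kind of synchronization-aware timing I suspect might be needed instead, using CUDA events (again a sketch with placeholder names, not my actual code):

```python
import torch

def timed_gpu_step(model, criterion, optimizer, inputs, targets):
    # CUDA events record timestamps on the GPU stream itself, so the
    # measurement should not be skewed by asynchronous kernel launches.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()                             # mark start on the current stream
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)   # forward
    loss.backward()                            # backward
    optimizer.step()                           # update
    end.record()                               # mark end on the current stream

    torch.cuda.synchronize()                   # wait for both events to complete
    return start.elapsed_time(end) / 1000.0    # elapsed_time() is in milliseconds
```

Is this (or calling `torch.cuda.synchronize()` before every timestamp) the right way to get trustworthy per-stage numbers, or should I use a proper profiler for the whole pipeline?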