CPU vs GPU timing of CUDA operations

It seems the high CPU times were due to the asynchronous execution of CPU and GPU instructions (see also this reply).

To get better values, I ran:

CUDA_LAUNCH_BLOCKING=1 python3 profileNetwork.py

Now the CPU times for the functions reported above are almost the same as their GPU times.

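If I understand correctly, CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous: the CPU waits for each kernel to finish before it continues. The waiting time is therefore charged to the operation that launched the kernel, instead of piling up on whichever later CPU operation happens to wait for a result from the GPU.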
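
For timing a single region rather than the whole script, explicit synchronization around the region should give a similar effect. The following is only a minimal sketch, assuming the script uses PyTorch; time_op is a hypothetical helper name, and the matrix size is arbitrary.

```python
import time


def time_op(fn, sync=None, repeats=10):
    """Return the mean wall-clock seconds per call of fn.

    If sync is given, it is called once before the timer starts (to drain
    work queued earlier) and once before the timer stops (to wait for the
    work queued by fn).
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    try:
        import torch  # assumption: PyTorch is the framework in use
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        x = torch.randn(4096, 4096, device="cuda")

        def matmul():
            return x @ x

        # Without sync, mostly the cost of queuing the kernels is measured.
        print("without sync:", time_op(matmul))
        # With sync, the measurement includes the GPU execution time.
        print("with sync:   ", time_op(matmul, sync=torch.cuda.synchronize))
```

Unlike CUDA_LAUNCH_BLOCKING=1, this only blocks at the two synchronization points, so the rest of the program keeps running asynchronously.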