It seems the high CPU times were due to the asynchronous execution of CPU and GPU instructions (see also this reply).
To get better values, I ran:
CUDA_LAUNCH_BLOCKING=1 python3 profileNetwork.py
Now the CPU times for the functions reported above are almost the same as their GPU times.
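If I understand correctly, CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous: the CPU waits for each kernel to finish before moving on. Without it, launches return immediately, and the time the CPU later spends waiting for the GPU is charged to whichever function happens to need a result first, which inflates that function's reported CPU time. With blocking launches, each function's CPU time covers only its own kernels, so it ends up close to its GPU time.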
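To illustrate the effect, here is a toy model in plain Python. It does not touch CUDA at all: a worker thread stands in for the GPU, and the names `fake_kernel`, `profile`, and `KERNEL_SECONDS` are made up for this sketch. It only shows how the waiting time moves between the "launch" step and the "read the result" step, as I understand the mechanism.

```python
import time
from concurrent.futures import ThreadPoolExecutor

KERNEL_SECONDS = 0.2  # hypothetical duration of the simulated "GPU kernel"


def fake_kernel():
    """Stand-in for GPU work; it just sleeps on the worker thread."""
    time.sleep(KERNEL_SECONDS)
    return 42


def profile(blocking):
    """Return (launch_time, read_time) in seconds, as seen by the CPU thread."""
    with ThreadPoolExecutor(max_workers=1) as device:
        t0 = time.perf_counter()
        future = device.submit(fake_kernel)  # "launch": returns immediately
        if blocking:
            future.result()  # blocking mode: wait inside the launch itself
        t1 = time.perf_counter()
        future.result()  # later function that needs the result
        t2 = time.perf_counter()
    return t1 - t0, t2 - t1


async_launch, async_read = profile(blocking=False)
sync_launch, sync_read = profile(blocking=True)

print(f"async:    launch={async_launch:.3f}s  read={async_read:.3f}s")
print(f"blocking: launch={sync_launch:.3f}s  read={sync_read:.3f}s")
```

In the asynchronous case almost all of the time shows up in the read step, even though the launch caused the work. In the blocking case the time is attributed to the launch, which is the analogue of what the profiler reports with CUDA_LAUNCH_BLOCKING=1.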