Is GPU synchronization a speed overhead?

Hi! I defined a network and moved it to CUDA. I fed in an image and got the result. I found that fetching the result takes much longer than the model prediction itself: 0.139s vs 0.007s!

I searched for related questions, and I understand that this is the time consumed by GPU synchronization. However, I defined the same network in TensorFlow and can get the result easily; there seems to be no time consumed by synchronization. Is there a mistake in the way I am using PyTorch, or is it a problem with PyTorch's strategy?
After all, the 0.139 seconds spent fetching the result is far more than the 0.007 seconds spent on the model prediction. Here's my code.
def load_model(self):
    network.load_net(self.model_path, self.net)
    if self.cuda:
        self.net.cuda()
    self.net.eval()

def predict(self, data):
    # load an image (flag 0 = grayscale)
    img = cv2.imread(data, 0)
    img = img.astype(np.float32, copy=False)
    img = img.reshape((1, 1, img.shape[0], img.shape[1]))

    # inference
    t1 = time.time()
    density_map = self.net(img)
    t2 = time.time()

    # torch.cuda.synchronize()

    t3 = time.time()
    density_map = density_map.data.cpu().numpy()
    t4 = time.time()
    print(t2 - t1)
    print(t4 - t3)

    et_count = np.sum(density_map)

0.00782918930053711
0.13913774490356445
Thank you for your advice!

Your thinking is quite off.

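GPU synchronization just waits for all of the queued work to finish. CUDA calls are asynchronous, so t2 - t1 only measures the time to queue the forward pass; the .cpu() call then has to wait for that work to finish before it can copy the result, which is why the time shows up in t4 - t3.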

Regardless of the framework, the time it takes should be about the same.
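
To get timings that reflect the real cost of each step, you can synchronize before starting and before stopping the clock. Here is a minimal sketch of that idea, not a definitive benchmark harness: `timed` and `cuda_sync` are helper names made up for this example, and the small `Conv2d` layer and random input are hypothetical stand-ins for `self.net` and the loaded image.

```python
import time

try:
    import torch
except ImportError:  # lets the helper be used on machines without PyTorch
    torch = None


def cuda_sync():
    """Block until all queued CUDA work has finished (no-op without a GPU)."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def timed(fn, *args, sync=cuda_sync):
    """Run fn(*args) and return (result, seconds), synchronizing before and after."""
    sync()  # drain work queued earlier so it is not billed to fn
    start = time.perf_counter()
    result = fn(*args)
    sync()  # wait for the kernels launched by fn to actually finish
    return result, time.perf_counter() - start


if __name__ == "__main__" and torch is not None:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Hypothetical stand-in for self.net; any nn.Module would do.
    net = torch.nn.Conv2d(1, 8, kernel_size=9, padding=4).to(device).eval()
    img = torch.rand(1, 1, 480, 640, device=device)

    with torch.no_grad():
        timed(net, img)  # warm-up: the first call also pays for CUDA initialization
        out, t_forward = timed(net, img)
        _, t_copy = timed(lambda: out.cpu().numpy())

    print(f"forward: {t_forward:.4f}s  copy to host: {t_copy:.4f}s")
```

Measured this way, the forward pass should account for most of the time and the copy to the host should be comparatively cheap. The total stays roughly the same; only the attribution between the two steps changes.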