Hello, I’m trying to run inference with my model (deeplabv3-resnet101) inside a Docker container. I tested the inference code both inside and outside the container and got a significant time difference at the .cpu() call on the output tensor.
Here’s the basic code snippet:
import numpy as np
import torch

model = torch.load('path_to_model.pt')
model.eval()

for image in images:
    # NCHW layout expected by the model
    image = image.reshape(1, 3, 640, 640)
    # move to the GPU, scale to [0, 1], and run the forward pass
    res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255.0)
    # copy the output back to the CPU
    pred = res['out'].cpu().detach().numpy()[0][0]
    # do other stuff...
I made sure model inference happens on cuda:0; the inference time itself is almost the same inside and outside Docker.
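For reference, one quick way to confirm this (assuming all parameters live on the same device):

print(next(model.parameters()).device)  # expected: cuda:0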
Calling res['out'].cpu() (on a [1, 1, 640, 640] tensor) takes 0.08s outside the container and 10s inside it. I tried adding torch.cuda.synchronize() as suggested here, but it didn’t make a difference.
What could be causing this difference, and how do I solve it? Thanks!
Could you share the profiling code, please?
You’ve already mentioned that you are synchronizing the code, but I would like to make sure you are synchronizing before both starting and stopping the timer (not only the latter), as otherwise the reported timings would be wrong.
import time

import numpy as np
import torch

model = torch.load('path_to_model.pt')
model.eval()

for image in images:
    image = image.reshape(1, 3, 640, 640)

    start = time.time()
    res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255.0)
    end = time.time()
    print(f'Run time: {end - start}')  # About 0.013s

    t0 = time.time()
    torch.cuda.synchronize()
    t1 = time.time()
    pred = res['out'].cpu().detach().numpy()[0][0]
    t2 = time.time()
    torch.cuda.synchronize()
    t3 = time.time()

    print(f'First sync time: {t1 - t0}')   # About 10s inside container, 0.08s outside
    print(f'CPU copy time: {t2 - t1}')     # About 6e-4s both in/out
    print(f'Second sync time: {t3 - t2}')  # About 3e-5s both in/out
Looks like I made a mistake and used torch.cuda.synchronize() in the wrong place previously. Now it seems CPU copy doesn’t take much time, but the first synchronization call takes very long inside the container.
Thanks for the code. You are not synchronizing before each timer, so I would assume your profile might still be wrong.
Synchronize before each timer via a pattern like this (inputs below is a placeholder for your preprocessed batch):
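torch.cuda.synchronize()   # wait for all pending GPU work before starting the timer
t0 = time.time()
res = model(inputs)        # the kernel launches return asynchronously
torch.cuda.synchronize()   # wait for the forward pass to actually finish
t1 = time.time()
print(f'Model runtime: {t1 - t0}')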
Both profiles show the same model runtime of ~10s, don’t they?
Note that mid in the second example cannot be used, as you are again not synchronizing before it.
I’m not sure if I understand the question correctly, but the synchronization “takes a long time” because the previously launched GPU operations haven’t finished yet; that is exactly why you have to synchronize before stopping the timer to profile the code properly.
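To illustrate: CUDA calls are asynchronous, so the Python line returns as soon as the kernel is queued, and the compute cost only becomes visible at the synchronization point. A minimal, self-contained sketch (the matrix size is arbitrary):

import time
import torch

x = torch.randn(4096, 4096, device='cuda')

t0 = time.time()
y = x @ x                  # launches the kernel and returns immediately
t1 = time.time()
torch.cuda.synchronize()   # blocks until the matmul has actually finished
t2 = time.time()

print(f'Launch time: {t1 - t0:.6f}s')  # tiny: only queues the work
print(f'Sync time:   {t2 - t1:.6f}s')  # the real compute time appears here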
Hi, sorry for the late reply. After a discussion with the team members I decided to ditch Docker, so this problem is no longer a concern. Thanks for your help!