Torchvision model speed differs inside/outside Docker environment

Hello, I’m trying to run inference with my model (deeplabv3-resnet101) inside a Docker container. I tested the inference code both inside and outside the container, but got a significant time difference at the .cpu() call on the model output.
Here’s the basic code snippet:

import torch

model = torch.load('path_to_model.pt')   # model is already on cuda:0
model.eval()

for image in images:                      # images are numpy arrays
    image = image.reshape(1, 3, 640, 640)
    # move the input to the GPU, normalize, and run the forward pass
    res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255.0)
    pred = res['out'].cpu().detach().numpy()[0][0]
    # do other stuff...
  • Docker base image: pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime
  • GPU: GeForce RTX 2080 Ti
  • I made sure the model inference happens on cuda:0; the inference time itself is almost the same inside and outside Docker.
  • Calling res['out'].cpu() (a [1, 1, 640, 640] tensor) takes 0.08s outside the container and 10s inside it. I tried adding torch.cuda.synchronize() as suggested here, but it didn’t make a difference.

So what could be the reason causing this difference and how do I solve it? Thanks!

Could you share the profiling code, please?
You’ve already mentioned that you are synchronizing the code, but I would like to make sure you are synchronizing before starting and stopping the timer (not only in the latter case), as the profile would otherwise report wrong timings.
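
For reference, the pattern I mean looks roughly like this (a minimal sketch with a dummy matmul instead of your model):

import time
import torch

x = torch.randn(1024, 1024, device='cuda')

torch.cuda.synchronize()    # wait for all pending GPU work before starting the timer
start = time.perf_counter()

y = x @ x                   # the operation you want to time

torch.cuda.synchronize()    # wait for the operation to finish before stopping the timer
end = time.perf_counter()
print(f'Op time: {end - start}')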

Thanks for your reply. Here’s the profiling code:

import time

import torch

model = torch.load('path_to_model.pt')
model.eval()

for image in images:
    image = image.reshape(1, 3, 640, 640)
    start = time.time()
    res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255)
    end = time.time()
    print(f'Run time: {end - start}')     # About 0.013s

    t0 = time.time()
    torch.cuda.synchronize()
    t1 = time.time()
    pred = res['out'].cpu().detach().numpy()[0][0]
    t2 = time.time()
    torch.cuda.synchronize()
    t3 = time.time()
    print(f'First sync time: {t1 - t0}')     # About 10s inside container, 0.08s outside
    print(f'CPU copy time: {t2 - t1}')       # About 6e-4s both in/out
    print(f'Second sync time: {t3 - t2}')    # About 3e-5s both in/out

Looks like I made a mistake and used torch.cuda.synchronize() in the wrong place previously. Now it seems the CPU copy doesn’t take much time, but the first synchronization call takes a very long time inside the container.

Thanks for the code. You are not synchronizing before each timer, so I would assume your profile might still be wrong.
Synchronize before each timer via:

torch.cuda.synchronize()
tX = time.perf_counter()

Thanks for the suggestion. I’ve changed the profiling code to the following:

### Profiling model run time

torch.cuda.synchronize()
start = time.perf_counter()

res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255)

torch.cuda.synchronize()
end = time.perf_counter()
print(f'Run time: {end - start}')

### Profiling GPU-CPU copy time
torch.cuda.synchronize()
t1 = time.perf_counter()

pred = res['out'].cpu().detach().numpy()[0][0]
                    
torch.cuda.synchronize()
t2 = time.perf_counter()
print(f'CPU copy time: {t2 - t1}')

The result (inside Docker) is:
Run time: 10.947372228954919
CPU copy time: 0.0007527889683842659

I did another profiling run with the following code to see how long the synchronization itself takes:

torch.cuda.synchronize()
start = time.perf_counter()

res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255)

mid = time.perf_counter()

torch.cuda.synchronize()
end = time.perf_counter()
print(f'Run time: {end - start}')
print(f'Second sync call time: {end - mid}')

And got:
Second sync call time: 10.835095343994908

From my perspective, the second torch.cuda.synchronize() is taking a long time inside the Docker container. Do you have any suggestions? Thanks.

Both profiles show the same model runtime of ~10s, don’t they?
Note that mid in the second example cannot be used, as you are again not synchronizing.
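
If you need an intermediate timestamp without adding more synchronizations, one alternative is CUDA events; roughly like this sketch (assuming model and an already prepared CUDA input inp):

start_evt = torch.cuda.Event(enable_timing=True)
mid_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
res = model(inp)                          # forward pass
mid_evt.record()
pred = res['out'].cpu().detach().numpy()  # device-to-host copy
end_evt.record()

torch.cuda.synchronize()                  # single sync at the end
print(f'Forward time: {start_evt.elapsed_time(mid_evt)} ms')
print(f'Copy time: {mid_evt.elapsed_time(end_evt)} ms')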

I’m not sure if I understand the question correctly, but the synchronization “takes a long time” because the previous GPU operations haven’t finished yet, which is exactly why you have to synchronize before stopping the timer in order to profile the code properly.
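
A standalone example (just a large matmul, unrelated to your model) illustrates the effect:

import time
import torch

x = torch.randn(4096, 4096, device='cuda')
torch.cuda.synchronize()     # make sure the setup work is done

t0 = time.perf_counter()
y = x @ x                    # the kernel is only launched here ...
t1 = time.perf_counter()
torch.cuda.synchronize()     # ... and this call blocks until it has actually finished
t2 = time.perf_counter()

print(f'Launch time: {t1 - t0}')    # tiny, the call returns almost immediately
print(f'Sync time:   {t2 - t1}')    # roughly the actual kernel execution time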

Yes, I agree with you. As you mentioned in some other posts, GPU calls are asynchronous, so synchronizing forces the CPU to wait for them to finish.

But back to my first post: why does this 10-second delay happen only in the Docker container? :thinking:

Could you post a code snippet, including the synchronizations, which shows the 10s execution inside Docker and the faster one outside, please?

Hi, sorry for the late reply. After a discussion with the team members, I decided to ditch Docker, so this problem is not a concern anymore. Thanks for your help!