Torchvision model speed differs inside/outside Docker environment

Hello, I’m trying to run inference with my model (deeplabv3-resnet101) inside a Docker container. I tested the inference code both inside and outside the container, but got a significant time difference at the .cpu() call on the model output.
Here’s the basic code snippet:

import torch

model = torch.load('path_to_model.pt')   # model is already on cuda:0
model.eval()

for image in images:                      # images are numpy arrays
    image = image.reshape(1, 3, 640, 640)
    # move the input to the GPU, normalize, and run the forward pass
    res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255.0)
    pred = res['out'].cpu().detach().numpy()[0][0]
    # do other stuff...
  • Docker base image: pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime
  • GPU: GeForce RTX 2080 Ti
  • I made sure the model inference happens on cuda:0; the inference time itself is almost the same inside and outside Docker.
  • Calling res['out'].cpu() (a [1, 1, 640, 640] tensor) takes 0.08s outside the container and 10s inside it. I tried adding torch.cuda.synchronize() as suggested here, but it didn’t make a difference.

So what could be the reason causing this difference and how do I solve it? Thanks!

Could you share the profiling code, please?
You’ve already mentioned that you are synchronizing the code, but I would like to make sure you are synchronizing before starting and stopping the timer (not only in the latter case), as the profile would otherwise report wrong timings.
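
For reference, the pattern I mean looks roughly like this (a minimal sketch with a dummy matmul instead of your model):

import time
import torch

x = torch.randn(1024, 1024, device='cuda')

torch.cuda.synchronize()    # wait for all pending GPU work before starting the timer
start = time.perf_counter()

y = x @ x                   # the operation you want to time

torch.cuda.synchronize()    # wait for the operation to finish before stopping the timer
end = time.perf_counter()
print(f'Op time: {end - start}')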

Thanks for your reply. Here’s the profiling code:

import time

import torch

model = torch.load('path_to_model.pt')
model.eval()

for image in images:
    image = image.reshape(1, 3, 640, 640)
    start = time.time()
    res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255)
    end = time.time()
    print(f'Run time: {end - start}')     # About 0.013s

    t0 = time.time()
    torch.cuda.synchronize()
    t1 = time.time()
    pred = res['out'].cpu().detach().numpy()[0][0]
    t2 = time.time()
    torch.cuda.synchronize()
    t3 = time.time()
    print(f'First sync time: {t1 - t0}')     # About 10s inside container, 0.08s outside
    print(f'CPU copy time: {t2 - t1}')       # About 6e-4s both in/out
    print(f'Second sync time: {t3 - t2}')    # About 3e-5s both in/out

Looks like I made a mistake and used torch.cuda.synchronize() in the wrong place previously. Now it seems the CPU copy doesn’t take much time, but the first synchronization call takes a very long time inside the container.

Thanks for the code. You are not synchronizing before each timer, so I would assume your profile might still be wrong.
Synchronize before each timer via:

torch.cuda.synchronize()
tX = time.perf_counter()

Thanks for the suggestion. I’ve changed the profiling code to the following:

### Profiling model run time

torch.cuda.synchronize()
start = time.perf_counter()

res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255)

torch.cuda.synchronize()
end = time.perf_counter()
print(f'Run time: {end - start}')

### Profiling GPU-CPU copy time
torch.cuda.synchronize()
t1 = time.perf_counter()

pred = res['out'].cpu().detach().numpy()[0][0]
                    
torch.cuda.synchronize()
t2 = time.perf_counter()
print(f'CPU copy time: {t2 - t1}')

The result (inside Docker) is:
Run time: 10.947372228954919
CPU copy time: 0.0007527889683842659

I did another profiling run with the following code to see how long the synchronization itself takes:

torch.cuda.synchronize()
start = time.perf_counter()

res = model(torch.from_numpy(image).type(torch.cuda.FloatTensor) / 255)

mid = time.perf_counter()

torch.cuda.synchronize()
end = time.perf_counter()
print(f'Run time: {end - start}')
print(f'Second sync call time: {end - mid}')

And got:
Second sync call time: 10.835095343994908

From my perspective, the second torch.cuda.synchronize() is taking a long time inside the Docker container. Do you have any suggestions? Thanks.

Both profiles show the same model runtime of ~10s, don’t they?
Note that mid in the second example cannot be used, as you are again not synchronizing.
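
If you need an intermediate timestamp without adding more synchronizations, one alternative is CUDA events; roughly like this sketch (assuming model and an already prepared CUDA input inp):

start_evt = torch.cuda.Event(enable_timing=True)
mid_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
res = model(inp)                          # forward pass
mid_evt.record()
pred = res['out'].cpu().detach().numpy()  # device-to-host copy
end_evt.record()

torch.cuda.synchronize()                  # single sync at the end
print(f'Forward time: {start_evt.elapsed_time(mid_evt)} ms')
print(f'Copy time: {mid_evt.elapsed_time(end_evt)} ms')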

I’m not sure if I understand the question correctly, but the synchronization “takes a long time” because the previous GPU operations haven’t finished yet, which is exactly why you have to synchronize before stopping the timer in order to profile the code properly.
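
A standalone example (just a large matmul, unrelated to your model) illustrates the effect:

import time
import torch

x = torch.randn(4096, 4096, device='cuda')
torch.cuda.synchronize()     # make sure the setup work is done

t0 = time.perf_counter()
y = x @ x                    # the kernel is only launched here ...
t1 = time.perf_counter()
torch.cuda.synchronize()     # ... and this call blocks until it has actually finished
t2 = time.perf_counter()

print(f'Launch time: {t1 - t0}')    # tiny, the call returns almost immediately
print(f'Sync time:   {t2 - t1}')    # roughly the actual kernel execution time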

Yes, I agree with you. As you mentioned in some other posts, GPU calls are asynchronous, so synchronizing forces the CPU to wait for them to finish.

But back to my first post: why does this 10-second delay happen only in the Docker container? :thinking:

Could you post a code snippet, including the synchronizations, which shows the 10s execution inside Docker and the faster one outside, please?

Hi, sorry for the late reply. After a discussion with the team members, I decided to ditch Docker, so this problem is not a concern anymore. Thanks for your help!