Asynchronous execution of ResNet but not DenseNet

I’m having difficulty getting a DenseNet-like architecture to run asynchronously with some CPU tasks (e.g. a data loader). I subsequently noticed that even the PyTorch-provided DenseNet model has the same problem (code and results below).

The main differences between the layers in the two architectures are the amount of residual connectivity and the transition layers: ResNet downsamples with strided convolutions, whereas DenseNet uses pooling. So I’m wondering whether the pooling is causing the issue.
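For reference, the swap I describe in the edit below looks roughly like this: replacing the average pooling in a DenseNet-style transition block with a stride-2 convolution. The channel counts here are illustrative, not the actual torchvision internals.

```python
import torch
import torch.nn as nn

# A DenseNet-style transition block: 1x1 conv, then 2x spatial downsampling.
# torchvision's transition uses AvgPool2d; the variant below swaps it for a
# strided conv (channel counts are illustrative).
pool_transition = nn.Sequential(
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 128, kernel_size=1, bias=False),
    nn.AvgPool2d(kernel_size=2, stride=2),
)

conv_transition = nn.Sequential(
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 128, kernel_size=1, bias=False),
    nn.Conv2d(128, 128, kernel_size=2, stride=2, bias=False),  # replaces the pooling
)

x = torch.rand(2, 256, 32, 32)
print(pool_transition(x).shape)  # torch.Size([2, 128, 16, 16])
print(conv_transition(x).shape)  # torch.Size([2, 128, 16, 16]) -- same downsampling
```

Both blocks halve the spatial resolution, so the swap is drop-in shape-wise; as noted in the edit, it made no difference to the asynchrony problem.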

In my example below I’ve rigged the iteration counts so that the CPU job finishes in about the same time as the GPU job, so if the two run concurrently, adding the CPU work should barely affect the epoch time. That holds for ResNet, but the DenseNet epoch takes roughly the CPU time plus the GPU time, implying it is not running asynchronously.
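The asynchrony assumption can also be checked directly on a single call: a CUDA launch should return control to Python almost immediately, with torch.cuda.synchronize() absorbing the actual compute time. A rough sketch of that measurement (my own helper, not part of any library; it needs a CUDA device and skips otherwise):

```python
import time
import torch

def launch_vs_sync_time(fn, warmup=3):
    """Time how long fn() takes to *enqueue* work vs. how long the GPU
    takes to finish it. total >> launch means the work is running
    asynchronously from the Python thread."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.time()
    fn()
    launch = time.time() - t0   # time until control returns to Python
    torch.cuda.synchronize()
    total = time.time() - t0    # time until the GPU has actually finished
    return launch, total

if torch.cuda.is_available():
    a = torch.rand(4096, 4096, device='cuda')
    launch, total = launch_vs_sync_time(lambda: a @ a)
    print('launch %.4f s, total %.4f s' % (launch, total))
else:
    print('no CUDA device; skipping')
```

For a single large matmul the launch time is microseconds; for a whole forward pass the launch side also includes all the Python/dispatch overhead of stepping through the model, which is where a many-small-layers network can differ from a few-big-layers one.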

Edit: I tried replacing the pooling in my network with a strided convolution and it did not help.

import torch
import time as tm

batchSize = 48
device = 'cuda:0'
rank = 0
img = torch.rand(batchSize, 3, 260, 192, device=device)
n = 3500
B = torch.rand(n, n, device='cpu')  # operand for the CPU-side matmul

model = torch.hub.load('pytorch/vision:v0.6.0', 'resnet50', pretrained=True)
model.to(device)
model.train()

ext = ''
for epoch in range(10):
    epochTime = tm.time()
    for i in range(45):  # iteration counts adjusted so each job takes approx 10 s per epoch
        for q in range(4):
            out = model(img)
        if epoch > 5:  # after 5 epochs, start doing CPU activity alongside the GPU work
            ext = ' with CPU'
            for q in range(2):
                D = torch.mm(B, B)
    torch.cuda.synchronize()
    print('ResNET Rank %i train epoch %i total time is %f %s' % (rank, epoch + 1, tm.time() - epochTime, ext))

model = torch.hub.load('pytorch/vision:v0.6.0', 'densenet121', pretrained=True)
model.to(device)
model.train()

ext = ''
for epoch in range(10):
    epochTime = tm.time()
    for i in range(45):
        for q in range(4):
            out = model(img)
        if epoch > 5:
            ext = ' with CPU'
            for q in range(2):
                D = torch.mm(B, B)
    torch.cuda.synchronize()
    print('DenseNET Rank %i train epoch %i total time is %f %s' % (rank, epoch + 1, tm.time() - epochTime, ext))

ResNET Rank 0 train epoch 1 total time is 11.131998
ResNET Rank 0 train epoch 2 total time is 9.522481
ResNET Rank 0 train epoch 3 total time is 9.522765
ResNET Rank 0 train epoch 4 total time is 9.518584
ResNET Rank 0 train epoch 5 total time is 9.519885
ResNET Rank 0 train epoch 6 total time is 9.525411
ResNET Rank 0 train epoch 7 total time is 10.564333 with CPU
ResNET Rank 0 train epoch 8 total time is 11.599154 with CPU
ResNET Rank 0 train epoch 9 total time is 10.451248 with CPU
ResNET Rank 0 train epoch 10 total time is 10.926621 with CPU

DenseNET Rank 0 train epoch 1 total time is 8.739759
DenseNET Rank 0 train epoch 2 total time is 8.741329
DenseNET Rank 0 train epoch 3 total time is 8.739949
DenseNET Rank 0 train epoch 4 total time is 8.737826
DenseNET Rank 0 train epoch 5 total time is 8.736841
DenseNET Rank 0 train epoch 6 total time is 8.725820
DenseNET Rank 0 train epoch 7 total time is 15.354975 with CPU
DenseNET Rank 0 train epoch 8 total time is 15.407053 with CPU
DenseNET Rank 0 train epoch 9 total time is 15.394043 with CPU
DenseNET Rank 0 train epoch 10 total time is 15.323986 with CPU
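One way I can think of to narrow this down further is to time the stream itself with CUDA events. The elapsed time between two recorded events covers everything queued between them, including any idle gaps where the stream is starved of kernel launches. So if DenseNet's event time grows when the CPU matmul runs alongside it, the slowdown would be on the launch side rather than in the GPU work itself. A sketch of that measurement (my own helper; needs a CUDA device, skips otherwise):

```python
import time
import torch

def stream_and_wall_time(fn):
    """Return (stream seconds, wall seconds) for one call to fn().
    Stream time is measured with CUDA events and includes idle gaps
    on the stream, e.g. when kernel launches are delayed."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    t0 = time.time()
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0, time.time() - t0

if torch.cuda.is_available():
    a = torch.rand(4096, 4096, device='cuda')
    stream_s, wall_s = stream_and_wall_time(lambda: a @ a)
    print('stream %.4f s, wall %.4f s' % (stream_s, wall_s))
else:
    print('no CUDA device; skipping')
```

Comparing the stream time of a DenseNet epoch with and without the concurrent CPU job should show whether the extra ~6.6 s per epoch is spent inside the stream (launch starvation) or outside it.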