Slow training speed with dynamic network

I am trying to reproduce a paper in which the idea is to train four sub-networks of different widths in each iteration (the minimum width, the maximum width, and two randomly sampled widths) so that the full network can execute at any width. However, I found that the forward and backward passes of the randomly sampled sub-networks are very slow. The training code is shown below:

import numpy as np

min_width, max_width = 0.25, 1.0  # min width 0.25x, max width 1.0x
# two extra widths, sampled anew in every iteration
sampled_width = list(np.random.uniform(min_width, max_width, 2))
width_mult_list = [min_width, max_width] + sampled_width
for width_mult in sorted(width_mult_list, reverse=True):
    # set the active width on every module before the forward pass
    model.apply(lambda m: setattr(m, 'width_mult', width_mult))
    output = model(input)
    loss = criterion(output, target)
    loss.backward()  # gradients accumulate over the four widths

The issue is that the speed at min_width and max_width is normal, but the two random widths are very slow (3-4 times slower).
However, if I fix the widths to some arbitrary constants, say sampled_width=[0.25, 0.6543, 0.76354, 1.0], the speed is normal.
This is strange to me, since I would expect the speed to be similar in both cases. Is this caused by the dynamic graph in PyTorch?
My PyTorch version is 1.0.1.
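
One difference between the two setups is how many distinct tensor shapes the model sees, and this may or may not be the cause here. The sketch below counts the distinct channel counts produced by fixed widths and by widths resampled in every iteration. It uses only the standard library. The rounding rule in `active_channels`, the base of 64 channels and the helper names are hypothetical and not taken from the paper or from my model. If `torch.backends.cudnn.benchmark` is enabled, cuDNN benchmarks convolution algorithms for each new input configuration, so a constant stream of new shapes could plausibly explain part of the slowdown.

```python
import random

MIN_WIDTH, MAX_WIDTH = 0.25, 1.0
rng = random.Random(0)  # seeded so the run is repeatable


def active_channels(base_channels, width_mult):
    # Hypothetical rule: scale the channel count and round to an integer.
    return max(1, int(round(base_channels * width_mult)))


def fixed_widths():
    # The constant widths from the post.
    return [0.25, 0.6543, 0.76354, 1.0]


def random_widths():
    # Min and max widths plus two widths resampled on every call.
    return [MIN_WIDTH, MAX_WIDTH] + [rng.uniform(MIN_WIDTH, MAX_WIDTH) for _ in range(2)]


def distinct_shapes(width_sampler, iterations, base_channels=64):
    # Collect every channel count that appears over the given iterations.
    seen = set()
    for _ in range(iterations):
        for width_mult in width_sampler():
            seen.add(active_channels(base_channels, width_mult))
    return len(seen)


fixed_count = distinct_shapes(fixed_widths, 100)
random_count = distinct_shapes(random_widths, 100)
print("distinct channel counts, fixed widths: ", fixed_count)
print("distinct channel counts, random widths:", random_count)
```

With fixed widths the set of shapes stops growing after the first iteration, while with resampled widths new shapes keep appearing. One way to test this would be to set `torch.backends.cudnn.benchmark = False` and time the random widths again.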