I am trying to reproduce the paper https://arxiv.org/abs/1903.05134, where the idea is to train 4 random-width sub-networks in each iteration so that the full network can execute at any width. However, I found that the forward and backward passes of the random-width sub-networks are very slow. The training code is shown below:
```python
min_width, max_width = 0.25, 1.0  # min width 0.25x, max width 1.0x
width_mult_list = [min_width, max_width]
# sample two random widths from the allowed range
# (note: the low and high bounds should be range[0] and range[1],
#  not the same value twice)
sampled_width = list(np.random.uniform(
    FLAGS.width_mult_range[0], FLAGS.width_mult_range[1], 2))
width_mult_list.extend(sampled_width)
for width_mult in sorted(width_mult_list, reverse=True):
    model.apply(lambda m: setattr(m, 'width_mult', width_mult))
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
```
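For reference, the width sampling on its own behaves as expected; a minimal standalone sketch of the sandwich rule (the function name and defaults are mine, not from the paper's code):

```python
import numpy as np

def sample_widths(width_range=(0.25, 1.0), n_random=2):
    """Sandwich rule: always train the min and max widths,
    plus n_random widths drawn uniformly from the range."""
    min_w, max_w = width_range
    widths = [min_w, max_w]
    widths.extend(np.random.uniform(min_w, max_w, n_random).tolist())
    # widest sub-network first, matching the training loop above
    return sorted(widths, reverse=True)

print(sample_widths())
```

Each call returns 4 width multipliers in descending order, always including 0.25x and 1.0x.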
The issue is that min_width and max_width run at normal speed, but the two random widths are very slow (3-4 times slower).
However, if I fix sampled_width to some constant values, say sampled_width = [0.25, 0.6543, 0.76354, 1.0], the speed is normal.
This is strange to me, since I would expect the speeds to be similar. Is this caused by the dynamic graph in PyTorch?
My PyTorch version is 1.0.1.