Runtime: DenseNet vs ResNet

hi guys,

One of the most compelling new vision architectures is DenseNet. According to the authors, DenseNets reach the same performance as ResNets with far fewer FLOPs. If you look at Figure 3 in the paper, ResNet-50 and DenseNet-169 get roughly the same ImageNet validation error, but DenseNet-169 uses 0.6 * 10^10 FLOPs vs 0.8 * 10^10 for ResNet-50, a 25% reduction!
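Just to sanity-check that 25% figure from the FLOP counts read off Figure 3:

```python
# FLOP counts read off Figure 3 of the DenseNet paper
resnet50_flops = 0.8e10
densenet169_flops = 0.6e10

# relative FLOP reduction at the same validation error
reduction = (resnet50_flops - densenet169_flops) / resnet50_flops
print(f"{reduction:.0%}")  # → 25%
```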

Using the built-in PyTorch (torchvision) models, I tried to verify this by running the forward pass on ImageNet on a single TITAN X GPU. The commands are simply:

CUDA_VISIBLE_DEVICES=2 python main.py /data/imagenet/ --evaluate --arch resnet50
CUDA_VISIBLE_DEVICES=2 python main.py /data/imagenet/ --evaluate --arch densenet169

However, running this gives me pretty much identical runtimes: about 0.5 seconds average batch time for both (the last line of the ResNet output is below).

Test: [190/196]. Time 0.433 (0.557) Loss 26.8679 (30.0608) Prec@1 0.000 (0.104) Prec@5 0.000 (0.497)

Can anyone explain what is going on? I can think of several explanations, the most likely being that not all FLOPs are created equal. Maybe ResNets require more FLOPs, but those FLOPs sit inside large convolutions for which there are highly optimized CUDA kernels.
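For reference, this is roughly how I'd time the forward pass in isolation. This is only a sketch: the model constructors and batch size are placeholders, and the `torch.cuda.synchronize()` calls matter because CUDA kernel launches are asynchronous, so naive wall-clock timing without them mostly measures launch overhead.

```python
import time
import torch

def benchmark(model, x, n_warmup=5, n_iters=20):
    """Average forward-pass time in seconds over n_iters runs."""
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):       # warm up caches / cudnn autotuner
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()    # drain queued kernels before timing
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()    # wait for the last kernels to finish
        return (time.perf_counter() - start) / n_iters

if __name__ == "__main__":
    # assumes torchvision is installed; weights are randomly initialized,
    # which is fine since we only care about runtime, not accuracy
    from torchvision.models import resnet50, densenet169
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(16, 3, 224, 224, device=device)
    for name, ctor in [("resnet50", resnet50), ("densenet169", densenet169)]:
        print(name, benchmark(ctor().to(device), x))
```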

thanks a bunch!

FLOPs != runtime. While the theoretical FLOP count is lower, actual runtime depends on GPU occupancy, on whether DenseNets can get anywhere near the GPU's theoretical peak FLOPs, etc. Given that, I'd expect a 25% theoretical improvement to be easily masked by DenseNet's more complicated kernel launch pattern.
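A toy illustration of the "FLOPs != runtime" point (NumPy on CPU, so the per-call overhead here comes from Python and BLAS dispatch rather than CUDA kernel launches, but the effect is analogous): the same number of multiply-adds split across many small matrix products runs at a different speed than one big product, even though the FLOP counts are identical.

```python
import time
import numpy as np

def matmul_flops(n):
    # one n x n matrix product costs roughly 2 * n^3 multiply-adds
    return 2 * n ** 3

big = np.random.rand(1024, 1024)
small = np.random.rand(128, 128)

# 512 products of 128x128 matrices match the FLOP count of a
# single 1024x1024 product: 512 * 2*128^3 == 2*1024^3
n_small = matmul_flops(1024) // matmul_flops(128)   # 512

t0 = time.perf_counter()
big @ big
t_big = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(n_small):
    small @ small
t_small = time.perf_counter() - t0

print(f"one big matmul:    {t_big * 1e3:.1f} ms")
print(f"{n_small} small matmuls: {t_small * 1e3:.1f} ms")
```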


What happens if you scale them both up or down? What is the GPU utilization?