DataParallel: why only parallelize the `model.features` part of the VGG architecture?

I found that the ImageNet classification example here has a special branch to handle VGG and AlexNet, which have large dense fully connected layers.
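
From memory, the branch looks roughly like this (the exact variable names may differ between versions of the example):

```python
import torch
import torchvision.models as models

arch = 'vgg16'  # e.g. from args.arch
model = models.__dict__[arch]()

if arch.startswith('alexnet') or arch.startswith('vgg'):
    # replicate only the convolutional feature extractor across GPUs;
    # the fully connected classifier stays on a single GPU
    model.features = torch.nn.DataParallel(model.features)
    model.cuda()
else:
    # for other architectures, replicate the whole model
    model = torch.nn.DataParallel(model).cuda()
```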

After a simple benchmark on VGGNet, I found that parallelizing ONLY the `model.features` part is noticeably faster than parallelizing the whole model.
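
For reference, a minimal sketch of such a benchmark, assuming a multi-GPU machine (`bench` is just a hypothetical forward-pass timing helper, not from the example):

```python
import time
import torch
import torchvision.models as models

def bench(model, x, iters=20):
    # hypothetical timing helper: warm up, then average the forward time
    with torch.no_grad():
        for _ in range(3):
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return (time.time() - start) / iters

x = torch.randn(64, 3, 224, 224).cuda()

# variant 1: replicate the whole model across GPUs
whole = torch.nn.DataParallel(models.vgg16()).cuda()

# variant 2: replicate only the conv part; FC layers stay on one GPU
hybrid = models.vgg16()
hybrid.features = torch.nn.DataParallel(hybrid.features)
hybrid.cuda()

print(f"whole model parallel:   {bench(whole, x) * 1000:.1f} ms/iter")
print(f"features-only parallel: {bench(hybrid, x) * 1000:.1f} ms/iter")
```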

So what’s the theory behind this behaviour?

Are fully connected layers not friendly to parallelism?

I think the computation required for the fully connected layers is so small that the overhead of parallelization makes them slower if you parallelize them.
Parallelization is only useful when there is heavy computation to wait for; otherwise the scatter/gather overhead dominates.

Okay, I got it.

So the computational cost of the FC layers is not that heavy, but the number of parameters is extremely large, so a lot of time would be spent on parameter/gradient synchronization across GPUs.
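
A quick sanity check with torchvision's VGG16 illustrates the imbalance (the figures in the comments are approximate):

```python
import torchvision.models as models

vgg = models.vgg16()

feat_params = sum(p.numel() for p in vgg.features.parameters())
fc_params = sum(p.numel() for p in vgg.classifier.parameters())

# the conv layers do most of the compute but hold few parameters;
# the FC layers are cheap to compute but hold most of the weights
print(f"features (conv): {feat_params / 1e6:6.1f}M params")  # ~14.7M
print(f"classifier (FC): {fc_params / 1e6:6.1f}M params")    # ~123.6M
```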
