Why not give the whole model to DataParallel in the imagenet example?

I have seen this question asked many times in different threads about DataParallel, but no one has given an explicit answer. So I am asking it again in a new topic, in the hope that someone can answer it. Does this operation serve any special purpose?
The question below is quoted from @trypag: https://discuss.pytorch.org/t/dataparallel-and-cuda-with-multiple-inputs/272/3

Could anyone explain this code, taken from the imagenet example:

if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
    model.features = torch.nn.DataParallel(model.features)
    model.cuda()
else:
    model = torch.nn.DataParallel(model).cuda()

Is there a specific reason to separate the classifier and the features in the alexnet and vgg models?
Why not give the whole model to DataParallel, as is done for the resnet model?


The answer is in "One weird trick for parallelizing convolutional neural networks" by Alex Krizhevsky.


Oh! Thank you very much!

I have read the paper, but I still feel confused.
Does that mean "model.features = torch.nn.DataParallel(model.features)" corresponds to model parallelism, and "torch.nn.DataParallel(model)" corresponds to data parallelism?
But why is it that when "args.arch.startswith('alexnet') or args.arch.startswith('vgg')" is true the code does model parallelism, and otherwise does data parallelism?
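
For anyone following along, here is a minimal sketch of the two wrapping patterns being asked about, runnable without the imagenet script. TinyNet, wrap_features_only and wrap_whole_model are hypothetical names made up for illustration; they are not part of the imagenet example or of torchvision. One point that may help with the confusion: torch.nn.DataParallel always splits the input batch across devices, so as far as DataParallel itself is concerned both patterns are data parallelism. The difference is only which submodules get replicated. With the features-only wrapping, the classifier is not replicated and runs once on a single device.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for alexnet/vgg: conv `features` plus dense `classifier`."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(8 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)


def wrap_features_only(model):
    # Pattern from the alexnet/vgg branch: only the conv part is wrapped,
    # so only `features` is replicated; `classifier` stays a plain module.
    model.features = nn.DataParallel(model.features)
    return model


def wrap_whole_model(model):
    # Pattern used for the other architectures: the entire model is wrapped.
    return nn.DataParallel(model)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = torch.randn(4, 3, 16, 16, device=device)

features_only = wrap_features_only(TinyNet()).to(device)
whole = wrap_whole_model(TinyNet()).to(device)

out_a = features_only(batch)  # batch is split only inside `features`
out_b = whole(batch)          # batch is split for the full forward pass

print(type(features_only).__name__, type(features_only.features).__name__)
print(type(whole).__name__, type(whole.module.classifier).__name__)
```

As far as I know, on a machine with no GPU recent PyTorch versions make DataParallel call the wrapped module directly, so the sketch should still run there; it just will not show any actual splitting.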