DataParallel: will modifications to `model.module` broadcast to all GPUs?

In this case I made some modifications to the parameters of a DataParallel-wrapped model.

For example:

import torch.nn as nn
from torchvision.models import resnet50

model = resnet50()
model = nn.DataParallel(model).cuda()
# several parameter reinitialization tricks here, e.g.
model.module.conv1.weight.data = xxx  # xxx stands for some new weight tensor

In my understanding, there is a copy of the original model on each GPU device.

My question is: will the aforementioned modifications to the nn.DataParallel-wrapped model be broadcast to all GPUs?

And what if I change the architecture of the model (for example, replacing a 3x3 convolution in the original model with a dilated convolution)?


If the modifications won't be broadcast to all other GPUs, how can I broadcast them to all other devices instantly?

Would it be possible to apply your manipulations to the model before wrapping it in DataParallel?
I think that would be the cleaner approach.
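
Something along these lines, as a rough sketch (the normal_ reinitialization is just a placeholder for whatever tricks you have in mind):

import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50()
# apply the manipulations on the plain model first ...
with torch.no_grad():
    model.conv1.weight.normal_(0.0, 0.01)  # placeholder reinitialization
# ... and wrap it afterwards, so the replicas are built from the modified model
model = nn.DataParallel(model).cuda()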

Because I will dynamically change the module to perform a kind of online network pruning, it's impossible to apply the manipulations before wrapping in DataParallel.

Would re-wrapping into nn.DataParallel work in this case? Or are you manipulating the model during training?

Yes, I'm manipulating the model during training (after each epoch).

So I think re-wrapping the model in nn.DataParallel would be okay:

model = nn.DataParallel(model)
# unwrap to manipulate the underlying model
model = model.module
model.conv1 = xxxxx  # xxxxx stands for the replacement layer
# re-wrap
model = nn.DataParallel(model)

Is this right?

Re-wrapping after the epoch should be alright.
However, I would recommend creating a small dummy example to make sure the manipulation and re-wrapping really work, e.g. set all parameters to zero and check the parameters in the next iteration for these values.
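
A rough sketch of such a check might look like this (the all-zero parameters are only for the test, of course):

import torch
import torch.nn as nn
from torchvision.models import resnet50

model = nn.DataParallel(resnet50()).cuda()

# "manipulate" the model: unwrap, zero out all parameters, re-wrap
core = model.module
with torch.no_grad():
    for p in core.parameters():
        p.zero_()
model = nn.DataParallel(core).cuda()

# run one iteration and verify the zeros survived the re-wrapping
x = torch.randn(8, 3, 224, 224, device='cuda')
out = model(x)
print(all((p == 0).all() for p in model.parameters()))  # expect True before any optimizer step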

Does the re-wrapping technique work well for you? I'd imagine it's slow, since wrapping with DataParallel each time copies ALL of the weights to the other GPUs, rather than copying just the ones that are needed when pruning the weights.

I have a similar question: I am changing requires_grad on module parameters after wrapping with DataParallel. DistributedDataParallel, which is now recommended instead of DataParallel, has a warning that says "don't do it!", but I do not see the same warning for DataParallel. I agree that the proper way is to re-wrap the model after the change, but that is a little more code.
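
For reference, a minimal sketch of the re-wrapping approach for the requires_grad case (which layers to freeze is just an example here):

import torch
import torch.nn as nn
from torchvision.models import resnet50

model = nn.DataParallel(resnet50()).cuda()

# change requires_grad on the underlying module, then re-wrap
core = model.module
for p in core.layer4.parameters():  # example: freeze the last residual stage
    p.requires_grad = False
model = nn.DataParallel(core).cuda()

# optionally rebuild the optimizer so it only tracks the trainable parameters
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.1)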