Can someone point out the advantages of this implementation of DropConnect over a simpler method like this:
for i in range(num_batches):
    orig_params = []
    for n, p in model.named_parameters():
        orig_params.append(p.clone())
        p.data = F.dropout(p.data, p=drop_prob) * (1 - drop_prob)
    output = model(input)
    for orig_p, (n, p) in zip(orig_params, model.named_parameters()):
        p.data = orig_p.data
    loss = nn.CrossEntropyLoss()(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Don't use .data these days! It's bad for you, really! with torch.no_grad(): will do fine.
In fact, you probably should copy back the original values between the backward and the step. For a reasonably complicated model (more than one layer), your gradients might be off because of this, and if you had stayed clear of .data, PyTorch would have told you.
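A rough sketch of the loop with those two changes (no .data, restore after backward), reusing the names from your snippet and keeping the dropout-and-scale line as posted:

for i in range(num_batches):
    orig_params = []
    with torch.no_grad():
        for n, p in model.named_parameters():
            orig_params.append(p.clone())
            # mask the weights in place, outside of autograd
            p.copy_(F.dropout(p, p=drop_prob) * (1 - drop_prob))
    output = model(input)
    loss = nn.CrossEntropyLoss()(output, label)
    optimizer.zero_grad()
    loss.backward()
    # restore the original weights only after backward, so the gradients are
    # computed against the masked weights that were actually used in the forward pass
    with torch.no_grad():
        for orig_p, (n, p) in zip(orig_params, model.named_parameters()):
            p.copy_(orig_p)
    optimizer.step()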
You multiply with 1 - drop_prob, which seems unusual.
The convenience of a wrapper: any single instance is easily done manually, but now suppose you want to apply this to some weights rather than all. Is it still as straightforward?
The safety of a wrapper: Having a well-tested wrapper saves you from implementation mistakes. (See above.)
I'm forced to use p.data, because if I replace p.data with p, the weights won't be modified. For example, if you print the p values before and after the dropout line, you will see that only the p.data version actually works (zeros out some weights). Unless you mean something else?
Good point about moving the weight restore after the backward call! Thank you.
I have to multiply by 1 - drop_prob because dropout scales its input internally by 1 / (1 - drop_prob), and if I don't do this the accuracy drops sharply: with drop_prob=0.05 it does not even converge if I don't scale the weights back. I'm not sure what's going on, but I suspect it might have something to do with batchnorm. Any ideas?
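For reference, this is the behaviour I mean (F.dropout in its default training mode scales the kept entries by 1 / (1 - p)):

w = torch.ones(5)
out = F.dropout(w, p=0.2)
# kept entries come out as 1 / (1 - 0.2) = 1.25 and dropped entries as 0,
# so multiplying by (1 - 0.2) turns this back into a plain 0/1 mask of w
out = out * (1 - 0.2)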
What do you mean by "is it still straightforward?" With my method it's much easier to apply DropConnect selectively: I don't have to create wrappers for every single layer type, and I don't have to modify my model's forward function. I agree with your point about safety.
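For example, restricting it to a subset of parameters is just a filter inside the loop; something like this sketch (the name/shape check is only an illustration, and the restore loop would need the same filter):

with torch.no_grad():
    for n, p in model.named_parameters():
        # illustrative filter: only drop weight matrices, skip biases and batchnorm
        if p.dim() < 2 or 'bn' in n:
            continue
        orig_params.append(p.clone())
        p.copy_(F.dropout(p, p=drop_prob) * (1 - drop_prob))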
No. p = something just doesn't overwrite the elements of p; it assigns a new thing to the name p instead. That's inherent in how Python works; you want p.copy_(...).
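A tiny illustration of the difference:

t = torch.zeros(3)
p = t
p = torch.ones(3)        # rebinds the name p to a new tensor; t is unchanged
p = t
p.copy_(torch.ones(3))   # writes into the existing tensor; t is now all ones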
There are extremely few reasons to use p.data, and chances are you're doing it wrong if you're using it. (And people are getting serious about removing it properly, so hopefully it'll go away soon.)
For the scaling, I don't know. From a cursory look at the Gal and Ghahramani paper, maybe they also use the plain Bernoulli. I'd probably multiply with torch.bernoulli(weight, 1 - drop_prob) instead of using dropout and scaling.
Ok, it makes sense. I replaced all "p.data =" with "p.copy_" and added a no_grad() context. No difference in performance that I can see, but if it's safer, so be it.
I ran a few experiments with scaling, and yes, it seems like scaling is necessary, otherwise batchnorm will screw things up during inference. Recomputing the batch statistics during inference also fixes the issue (same good accuracy with or without scaling), but that's obviously not a solution. Not sure how to use torch.bernoulli; did you mean binomial? I tried generating binomial masks, but I don't see a good way to generate them quickly on the GPU. I could only do mask = binomial_distr.sample(p.size()).cuda() and this is very slow.
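One way that might avoid the slow sample-and-copy is to build the mask directly on the device; torch.bernoulli takes a tensor of probabilities, so something like this sketch should stay on the GPU:

with torch.no_grad():
    keep_prob = 1 - drop_prob
    # fill a tensor with the keep probability; it is created on p's device,
    # so no host-to-device copy is needed
    mask = torch.bernoulli(torch.full_like(p, keep_prob))
    # an equivalent alternative: mask = (torch.rand_like(p) < keep_prob).float()
    p.mul_(mask)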