You should set the requires_grad attribute of all parameters of modelA and modelB to False and leave it as True for the final classifier.
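As a minimal sketch of that freezing step: the two nn.Linear stand-ins, the feature sizes, and the SGD optimizer below are assumptions for illustration, not part of the original ensemble.

```python
import torch
import torch.nn as nn


class MyEnsemble(nn.Module):
    """Two sub-models feeding one linear classifier (illustrative sizes)."""

    def __init__(self, modelA, modelB):
        super().__init__()
        self.modelA = modelA
        self.modelB = modelB
        self.classifier = nn.Linear(4, 2)

    def forward(self, x1, x2):
        x = torch.cat((self.modelA(x1), self.modelB(x2)), dim=1)
        return self.classifier(torch.relu(x))


# Stand-in sub-models (assumption: each maps 10 features to 2 outputs)
model = MyEnsemble(nn.Linear(10, 2), nn.Linear(10, 2))

# Freeze both sub-models; the classifier keeps requires_grad=True by default
for param in model.modelA.parameters():
    param.requires_grad_(False)
for param in model.modelB.parameters():
    param.requires_grad_(False)

# Pass only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

out = model(torch.randn(3, 10), torch.randn(3, 10))
out.sum().backward()
optimizer.step()
```

After backward(), only the classifier parameters carry a .grad; the frozen sub-models are left untouched.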
Hello.
Is it possible to save such an ensemble model using JIT (provided that probably all the models are in JIT)?
Thank you.
Hello @ptrblck,
import torch
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    def __init__(self, encoded_image_size=14):
        super(Encoder, self).__init__()
        self.enc_image_size = encoded_image_size
        resnet = torchvision.models.resnet101(pretrained=True)
        # drop the average pooling and fc layers
        modules = list(resnet.children())[:-2]
        self.resnet = nn.Sequential(*modules)
        densenet = torchvision.models.densenet121(pretrained=True)
        # DenseNet has only two children (features, classifier), so drop just the classifier
        modules_1 = list(densenet.children())[:-1]
        self.densenet = nn.Sequential(*modules_1)
        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

    def forward(self, images):
        out = self.resnet(images)
        out1 = self.densenet(images)
        # after tweaking dimensions...
        x = torch.cat((out, out1), dim=1)
        z = self.adaptive_pool(x)
        return z
Does this code snippet make sense?
The code looks generally OK, but I wouldn’t recommend creating new models by passing the child modules to nn.Sequential.
With this approach each submodule is called sequentially, so you would lose all functional calls that were used in the original forward.
E.g. for DenseNet you would lose the functional calls applied after the feature extractor (the ReLU, the adaptive average pooling, and the flatten).
If you just want to remove the last layer, replace it with nn.Identity.
On the other hand, if you want to manipulate more layers or the forward pass in general, I would recommend creating a custom model that derives from the corresponding torchvision model and changing its forward appropriately.
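To illustrate the difference, here is a small sketch with a toy model. TinyNet is a hypothetical stand-in for a torchvision model whose forward uses functional calls; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNet(nn.Module):
    """Toy stand-in for a model whose forward uses functional calls."""

    def __init__(self):
        super().__init__()
        self.features = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.classifier = nn.Linear(8, 5)

    def forward(self, x):
        x = self.features(x)
        x = F.relu(x)                    # functional call, not a child module
        x = F.adaptive_avg_pool2d(x, 1)  # functional call
        x = torch.flatten(x, 1)          # functional call
        return self.classifier(x)


x = torch.randn(2, 3, 16, 16)

# Rebuilding from children keeps only the registered modules,
# so the relu, pooling and flatten steps silently disappear
rebuilt = nn.Sequential(*list(TinyNet().children())[:-1])
rebuilt_out = rebuilt(x)

# Replacing the last layer keeps the original forward intact
model = TinyNet()
model.classifier = nn.Identity()
identity_out = model(x)

print(rebuilt_out.shape)   # raw conv feature map
print(identity_out.shape)  # pooled, flattened features
```

The nn.Sequential version returns the unpooled 4D feature map, while the nn.Identity version returns the flattened features the classifier would normally receive.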
@ptrblck, can you please clarify how backward() behaves here, i.e. will the error flow back into both modelA and modelB, or will it be limited to the ensemble model only?
One more thing, please: if I save the ensemble model, will it save the weights of modelA and modelB as well, or only the linear layer of the ensemble model?
The gradients will be calculated for the parameters of all submodules as well as the final classifier, if you didn’t disable them via .requires_grad=False.
Also, state_dict() will return all parameters of all submodules and the final classifier.
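A quick way to check both points yourself, assuming two nn.Linear stand-ins for modelA and modelB (the sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MyEnsemble(nn.Module):
    def __init__(self, modelA, modelB):
        super().__init__()
        self.modelA = modelA
        self.modelB = modelB
        self.classifier = nn.Linear(4, 2)

    def forward(self, x1, x2):
        x = torch.cat((self.modelA(x1), self.modelB(x2)), dim=1)
        return self.classifier(F.relu(x))


# Stand-in sub-models (assumption for illustration)
model = MyEnsemble(nn.Linear(10, 2), nn.Linear(10, 2))

out = model(torch.randn(3, 10), torch.randn(3, 10))
out.sum().backward()

# Every parameter, including those of the sub-models, received a gradient
grads_present = {name: p.grad is not None for name, p in model.named_parameters()}

# The state_dict contains the sub-model parameters under their attribute names
keys = list(model.state_dict().keys())
print(grads_present)
print(keys)
```

Saving model.state_dict() therefore stores modelA, modelB, and the classifier together.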
So, if I want to use modelA in inference mode only and train modelB and the linear layer of the ensemble model, can I do it this way, or will gradients flow into modelA as well just because it is part of the forward pass of the ensemble model?
class MyEnsemble(nn.Module):
    def __init__(self, modelA, modelB):
        super(MyEnsemble, self).__init__()
        self.modelA = modelA
        self.modelB = modelB
        self.classifier = nn.Linear(4, 2)

    def forward(self, x1, x2):
        with torch.no_grad():
            x1 = self.modelA(x1)
        x2 = self.modelB(x2)
        x = torch.cat((x1, x2), dim=1)
        x = self.classifier(F.relu(x))
        return x
This code would not calculate any gradients for modelA. If that’s your use case, then you should be fine.
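This can be verified with a short sketch that mirrors the forward pass above outside of a class. The nn.Linear stand-ins and input sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the two sub-models and the final classifier
modelA = nn.Linear(10, 2)
modelB = nn.Linear(10, 2)
classifier = nn.Linear(4, 2)

x1, x2 = torch.randn(3, 10), torch.randn(3, 10)

# modelA runs without building a graph, so nothing can flow back into it
with torch.no_grad():
    a = modelA(x1)
# modelB runs normally and stays trainable
b = modelB(x2)

out = classifier(F.relu(torch.cat((a, b), dim=1)))
out.sum().backward()

print(a.requires_grad)                 # False
print(modelA.weight.grad is None)      # True
print(modelB.weight.grad is not None)  # True
```

Note that torch.no_grad() only disables gradient tracking; it does not switch modelA to eval mode, so layers such as dropout or batchnorm would still need modelA.eval().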
@ptrblck Hello, I have a question about the MyEnsemble class: when loss.backward() is called, will the weights of MyModelA and MyModelB (the layers inside them) be updated? Thanks in advance.
Yes, in my example both sub-models would be updated.
These models are treated as any other layer and you could imagine swapping them for e.g. nn.Linear.
Thanks a lot for replying!
Let’s say I load a ModelA with pre-trained weights, and then add some additional layers operating on the outputs of ModelA layers. I only want to train these new layers (at least for starters).
I understand that I need to set requires_grad to False for ModelA, and to True for the new layers.
But should I also set ModelA to eval mode during training, so that the BatchNorm layers in it behave as they would during inference? Or will they do that anyway with requires_grad=False?
Hi, I have a model A which was trained on 50 classes, and now I want to add 10 more classes. How can I train my model without affecting the previous results?
In the end I want to predict 60 classes, and I don’t want to retrain my model from the very beginning.
Thanks
Usually you would replace the last linear layer to increase the number of classes.
Even if you copy the pretrained (partial) parameters to this layer, you would still retrain all class outputs, if you don’t manually zero out the gradients of the pretrained weights.
If you don’t want to “touch” the previously trained outputs at all, you could probably add a completely new linear layer for the new classes and use the penultimate features as its input.
Afterwards you could concatenate the “old” and “new” linear layer outputs to calculate the loss.
This would be similar to the aforementioned approach, but would avoid having to zero out the gradients etc. manually and might be a bit easier to implement.
Let me know, if this would work for you.
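A minimal sketch of the second approach. ExtendedClassifier, the toy backbone, and the feature sizes are hypothetical; freezing the backbone along with the old head is my assumption, since otherwise the old outputs would still shift during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExtendedClassifier(nn.Module):
    """Frozen old head plus a new head for the added classes (illustrative)."""

    def __init__(self, backbone, old_head, num_new_classes):
        super().__init__()
        self.backbone = backbone
        self.old_head = old_head
        # New layer consumes the same penultimate features as the old head
        self.new_head = nn.Linear(old_head.in_features, num_new_classes)
        # Assumption: freeze everything that was already trained
        for param in self.backbone.parameters():
            param.requires_grad_(False)
        for param in self.old_head.parameters():
            param.requires_grad_(False)

    def forward(self, x):
        feats = self.backbone(x)  # penultimate features
        # Concatenate "old" and "new" outputs into one logit vector
        return torch.cat((self.old_head(feats), self.new_head(feats)), dim=1)


# Toy stand-ins for the pretrained parts (50 old classes, 10 new ones)
backbone = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
old_head = nn.Linear(16, 50)
model = ExtendedClassifier(backbone, old_head, num_new_classes=10)

logits = model(torch.randn(4, 32))
targets = torch.randint(0, 60, (4,))
loss = F.cross_entropy(logits, targets)
loss.backward()
```

Only new_head receives gradients here, so the 50 pretrained outputs stay exactly as they were.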
Hi, were you able to find an answer to your question?
I did not, no. But I decided to set the frozen parts of the network (where requires_grad is False) to eval mode. And I overloaded the train(..) method to keep it that way when calling train() or eval() on the network from the training script.
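For reference, this kind of override could look roughly like the sketch below. PartlyFrozenNet and its layers are hypothetical. It is needed because requires_grad=False alone does not stop BatchNorm layers from updating their running statistics while they are in train mode.

```python
import torch
import torch.nn as nn


class PartlyFrozenNet(nn.Module):
    """Frozen feature extractor with BatchNorm plus a trainable head."""

    def __init__(self):
        super().__init__()
        self.frozen = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))
        self.head = nn.Linear(8, 2)
        for param in self.frozen.parameters():
            param.requires_grad_(False)
        self.frozen.eval()

    def train(self, mode=True):
        # Switch the whole model as usual, then force the frozen part back to eval
        super().train(mode)
        self.frozen.eval()
        return self

    def forward(self, x):
        return self.head(self.frozen(x))


model = PartlyFrozenNet()
model.train()  # the head is in train mode, the frozen part stays in eval mode

bn = model.frozen[1]
before = bn.running_mean.clone()
model(torch.randn(16, 8))
unchanged = torch.equal(before, bn.running_mean)
```

Since eval() internally calls train(False), overriding train() covers both calls from the training script.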
Thanks,
I am using Faster RCNN pretrained model and I am trying to add new layer using this code:
for p in model.parameters():
    p.requires_grad = False
model.roi_heads.box_head.fc8 = torch.nn.Linear(in_features, 1024, bias=True)
I am adding an 8th layer and trying to freeze the parameters of all previous layers, but I am getting this error:
Element 0 of tensors does not require grad and does not have a grad_fn
Please explain this along with sample code. Thanks in advance.
model.roi_heads.box_head returns the TwoMLPHead module, which flattens the input and applies two nn.Linear layers with a relu non-linearity.
You won’t be able to just add a new attribute (fc8) to this module without changing the forward method as well.
The proper workaround would be to create a custom TwoMLPHead layer with 3 linear layers and replace box_head with it.
A hacky workaround would be to redefine fc7 as an nn.Sequential container and add the new linear layer to it:
model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.roi_heads.box_head.fc7 = nn.Sequential(
    model.roi_heads.box_head.fc7,
    nn.ReLU(),
    nn.Linear(model.roi_heads.box_head.fc7.in_features, model.roi_heads.box_head.fc7.out_features)
)
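The "proper" variant could be sketched as follows. ThreeMLPHead is a hypothetical name, and the sizes (256 channels at 7x7 resolution, representation size 1024) are assumptions matching the usual fasterrcnn_resnet50_fpn defaults. The error in the question most likely comes from fc8 never being used in forward: with all other parameters frozen, the loss does not depend on any trainable parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ThreeMLPHead(nn.Module):
    """Box head with three linear layers, modeled after TwoMLPHead (sketch)."""

    def __init__(self, in_channels, representation_size):
        super().__init__()
        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)
        self.fc8 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        x = F.relu(self.fc8(x))  # the new layer is actually used in forward
        return x


# Assumed sizes: 256 feature channels pooled to 7x7, representation size 1024
head = ThreeMLPHead(in_channels=256 * 7 * 7, representation_size=1024)
out = head(torch.randn(2, 256, 7, 7))
print(out.shape)
```

You would then assign it via model.roi_heads.box_head = head after freezing the other parameters, so that fc8 stays trainable. The pretrained fc6 and fc7 weights would have to be copied into the new head separately.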
Could you extend this small example to show how to apply soft-parameter sharing across modelA and modelB?