Hi, I’m working on image classification using PyTorch.
I’m wondering if there is a way to average the weight parameters of different models.
I’m doing cross-validation with k-fold = 5, but the problem is that I have a 9-hour training time limit, so I can only train one fold at a time and end up with 5 different ‘model.pt’ files.
Is there a way to load the model.pt files and average them? I think this is the best approach I can take in this environment.
Not sure if taking a mean is the nicest solution.
But here is a function that does this, if I understood your question correctly.
import torch

def mean_models(paths):
    the_net = model()  # model() builds the same architecture the checkpoints were saved from
    the_sd = {}
    for path in paths:
        net = model()
        net.load_state_dict(torch.load(path, map_location='cpu'))
        some_sd = net.state_dict()
        for k in some_sd.keys():
            if k in the_sd:
                the_sd[k] += some_sd[k]
            else:
                the_sd[k] = some_sd[k].clone()  # copy so we don't modify the loaded tensors in place
    for k in the_sd.keys():
        if the_sd[k].is_floating_point():  # skip integer buffers (e.g. BatchNorm's num_batches_tracked), which can't be divided in place
            the_sd[k] /= len(paths)
    the_net.load_state_dict(the_sd)  # state_dict().update(...) would not write the averaged weights back into the model
    return the_net
Haven’t run this, so just fix the bugs here and there, but the logic works.
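For example, if your five fold checkpoints were saved as fold0.pt … fold4.pt (the file names here are just placeholders for your own), you could use it like this:

    paths = ['fold0.pt', 'fold1.pt', 'fold2.pt', 'fold3.pt', 'fold4.pt']  # placeholder file names
    avg_net = mean_models(paths)
    torch.save(avg_net.state_dict(), 'mean_model.pt')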
Thanks for the reply! Right now I have trained the model on 5 different folds (I split the dataset taking the class imbalance into account). So isn’t this k-fold cross-validation if I average these outputs? Thanks!
I’m not sure I’m saying the right thing, but can you help me understand the k-fold strategy?
Right now I’m working on the Kaggle iMet competition. The dataset’s class distribution is very imbalanced, so I split the dataset into 5 folds considering the distribution, then trained using 80% of the dataset and used 20% for validation.
Every time I train the model and validate on a different fold I get a different model, so I end up with 5 models. I’m trying to average these…
Am I understanding this correctly?
Don’t average the weights of the networks! Then nothing makes sense anymore. For example, say you have two convolution kernels at the start of the network; they have been trained and fine-tuned to detect specific features. Averaging them will most likely not correspond to anything meaningful that the network has learned to deal with.
From the very little I’ve read about the iMet challenge, it is multi-class, multi-label classification, right? Then I would suggest averaging the outputs of the models instead, if you really need to do it that way. If the models end in linear layers, just take the sum of their outputs, divide by the number of models you have, and try with that.
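Something like this rough sketch is what I mean by averaging the outputs (again untested; model() and the checkpoint paths are placeholders for your own setup):

    import torch

    def ensemble_outputs(paths, x):
        # Sum the raw outputs of each model for a batch x, then divide by the number of models.
        total = None
        with torch.no_grad():
            for path in paths:
                net = model()
                net.load_state_dict(torch.load(path, map_location='cpu'))
                net.eval()
                out = net(x)
                total = out if total is None else total + out
        return total / len(paths)

For multi-label classification you could also average the probabilities after a sigmoid instead of the raw outputs.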
Another approach, which is better IMHO, is to train for 9h and save a checkpoint just before the session stops (you need to save the model’s state dict and the optimizer’s as well, plus maybe other things you need for training, for example which step of the dataset you last iterated on). Then you start from that point in the next session, instead of starting from scratch to train another suboptimal model.
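A minimal checkpointing sketch (untested; model(), the file name and the epoch/step bookkeeping are just assumptions about your training loop):

    import torch

    net = model()                                   # same architecture as before
    optimizer = torch.optim.Adam(net.parameters())  # whatever optimizer you actually use
    epoch, step = 0, 0                              # your training loop's own bookkeeping

    # ... train until just before the 9h limit, then save everything needed to resume:
    torch.save({
        'model': net.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch,
        'step': step,                               # where you last iterated in the dataset
    }, 'checkpoint.pt')

    # Next session: load the checkpoint and continue instead of starting from scratch.
    ckpt = torch.load('checkpoint.pt', map_location='cpu')
    net.load_state_dict(ckpt['model'])
    optimizer.load_state_dict(ckpt['optimizer'])
    start_epoch, start_step = ckpt['epoch'], ckpt['step']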
Ensembles work, but I think they work better if the models have already reached a performance plateau.
OMG thanks for the kind reply
I was confusing the model outputs with the model weights… they are totally different things, I get it now.
I will then try to get the best performance using one model and move on to the ensemble only at the very end.
thanks so much!