Training multiple identical models causes the loss to become NaN

My model is VGG16; part of the model is placed on GPU0 and the rest on GPU1.
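For context, the split looks roughly like the sketch below (the exact split point and the use of torchvision's vgg16 are illustrative, not my actual code):

import torch
import torch.nn as nn
from torchvision.models import vgg16

class SplitVGG16(nn.Module):
    """VGG16 with the convolutional part on cuda:0 and the classifier on cuda:1."""
    def __init__(self):
        super().__init__()
        base = vgg16()
        self.features = base.features.to('cuda:0')
        self.avgpool = base.avgpool.to('cuda:0')
        self.classifier = base.classifier.to('cuda:1')

    def forward(self, x):
        x = self.features(x.to('cuda:0'))
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        # move activations across GPUs before the classifier
        return self.classifier(x.to('cuda:1'))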

Then I create three models from this model definition and do the following:

for i in range(10):
    model_weighVersion1_train()
    model_weighVersion2_train()
    model_weighVersion3_train()

I find that if only one model is being trained, everything is fine. But if the three models are trained in turns, the loss becomes NaN after some iterations.

So I wonder: when I create multiple identical models, do these models share memory/storage?
Or do you know what leads to this NaN?

Could you post the code for your models?
If you didn’t implement some weight sharing, your models should be completely independent.
Also, could you share your training code, if that’s possible?
Is this behavior reproducible?
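If it helps, a quick way to confirm that independently constructed models do not share storage is to compare parameter data pointers. A minimal sketch (assuming three plain instantiations of torchvision's vgg16, not your actual models):

import torch
from torchvision.models import vgg16

# three independent instantiations of the same architecture
model1, model2, model3 = vgg16(), vgg16(), vgg16()

p1 = next(model1.parameters())
p2 = next(model2.parameters())
print(p1.data_ptr() == p2.data_ptr())  # False: each model owns its own storage
print(torch.equal(p1, p2))             # generally False: independent random init

# explicit weight sharing would instead assign the same Parameter object
model2.features[0].weight = model1.features[0].weight
print(model1.features[0].weight.data_ptr() == model2.features[0].weight.data_ptr())  # True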

Thanks, I found the bug: in my model definition I use different CUDA streams to overlap computation and data transfer, but the streams actually run in an uncontrolled way. I know TensorFlow puts computation and data transfer on different streams by default to try to overlap them, but it seems this overlapping is not easy to do in PyTorch.
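For reference, getting this kind of overlap right in PyTorch requires explicit synchronization between streams. A minimal sketch using a side copy stream, pinned memory, wait_stream, and record_stream (not my actual model code; the Conv2d is just a stand-in):

import torch

copy_stream = torch.cuda.Stream()                        # side stream for host-to-device copies
batch_cpu = torch.randn(64, 3, 224, 224).pin_memory()    # pinned memory enables async copies
model = torch.nn.Conv2d(3, 64, 3).cuda()                 # stand-in for the real model

with torch.cuda.stream(copy_stream):
    batch_gpu = batch_cpu.to('cuda', non_blocking=True)

# make the default (compute) stream wait until the copy has finished
torch.cuda.current_stream().wait_stream(copy_stream)
out = model(batch_gpu)

# tell the caching allocator the tensor is used on the current stream,
# so its memory is not reused before this work completes
batch_gpu.record_stream(torch.cuda.current_stream())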
Anyway, thanks for your response.