Is torch.jit.trace only for inference, not training?

Hi PyTorch Team,

I am currently struggling to figure out what the problem is.
I built a custom VGG16 model in Python and exported it with torch.jit.trace (without any training).
I then tried to train this model in C++, but the accuracy does not match what I get in Python.

Why?

※ Also, I found that only this specific model fails to train in C++. Another custom model works well, and only the custom VGG16 model's accuracy looks wrong. I used exactly the same training loop in C++ for both.

Python result :
Epoch: 0001 cost = 0.988920689 acc = 0.840617180
Epoch: 0002 cost = 0.837770224 acc = 0.981513083

Process finished with exit code -1

C++ result :

gpu enabled
current epoch = 0
average accuracy : 0.264888
average cost : 1.38596
current epoch = 1
average accuracy : 0.268539
average cost : 1.38558
current epoch = 2
average accuracy : 0.267135
average cost : 1.38528
current epoch = 3

Vgg16 Model (Model Structure)

TrainingScript(Include save model)

The C++ repository is currently private, so if you would like to see the C++ code as well,
I can share the source code privately.

Thanks.

torch.jit.trace uses the provided input to record all operations as they are executed, and is thus unable to capture any data-dependent control flow etc.
In your particular case all dropout layers would be “fixed”, which is most likely one of the reasons the training fails with the traced model. torch.jit.script, on the other hand, should be able to track these operations as well.
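To make the difference concrete, here is a minimal sketch of how tracing and scripting treat a data-dependent branch. The function `double_or_negate` is a hypothetical toy, not part of the VGG16 model in question; tracing it emits a TracerWarning because the branch is resolved once, using the example input.

```python
import torch


def double_or_negate(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent branch: the path taken depends on the values in x.
    if x.sum() > 0:
        return x * 2
    return x * -1


positive = torch.ones(3)
negative = -torch.ones(3)

# Tracing runs the function once with `positive` and records only the
# operations that were executed (the `x * 2` path).
traced = torch.jit.trace(double_or_negate, positive)

# Scripting compiles the Python source, so the `if` itself is preserved.
scripted = torch.jit.script(double_or_negate)

traced_out = traced(negative)      # replays the recorded `x * 2` path
scripted_out = scripted(negative)  # re-evaluates the branch, takes `x * -1`

print("traced:  ", traced_out)
print("scripted:", scripted_out)
```

With a negative input the traced function still multiplies by 2, while the scripted one takes the other branch.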

Thanks for the helpful comments.

I just figured out that if I train the model for at least one epoch before saving it,
the training failure disappears.

My guess is that, by training for one epoch,
torch.jit.trace was able to trace the training process.
Is this guess correct?

Also, is training for one epoch before saving the model likely to cause problems in the future?
If there are no potential issues, I am fine with using this method for now.

Thanks.

No, I don’t think training the model for one epoch should change anything, and I cannot explain why it seems to work now. trace would still keep the layers in their traced “state”, i.e. dropout would keep behaving as it did while the trace was recorded, regardless of later train() or eval() calls, which would still be concerning.
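A small sketch of this behavior, assuming a hypothetical two-layer stand-in rather than the actual VGG16 model: the module is traced in eval mode, and the torch.jit.trace documentation states that mode-dependent operations then keep behaving as they did during tracing. The layer sizes and the variable names are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a classifier head with dropout.
net = nn.Sequential(nn.Linear(8, 64), nn.Dropout(p=0.5))
x = torch.ones(4, 8)

net.eval()  # dropout is inactive while the trace is recorded
traced = torch.jit.trace(net, x)

traced.train()  # ask the traced module to switch to training mode
out_a = traced(x)
out_b = traced(x)

# If dropout were active, the two outputs would differ and contain zeros.
# Here both are compared against the plain linear layer without dropout.
reference = net[0](x)
traced_ignores_train = bool(
    torch.allclose(out_a, out_b) and torch.allclose(out_a, reference)
)
print("traced module ignores train():", traced_ignores_train)
```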

Hmm, if you’re right, I won’t use dropout in the future. However, apart from the dropout issue, the model saved after training for one epoch seems to train smoothly in C++. I will post related screenshots.

The fix would be to use torch.jit.script instead of torch.jit.trace, so I wouldn’t give up on dropout in the future.

I think I misunderstood what you said earlier. I’m sorry.
Is there a way to save the model using torch.jit.script?

Yes, just replace torch.jit.trace with torch.jit.script in your Python script and save the model afterwards. This section of the tutorial might be interesting to take a look at.
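As a rough sketch of that replacement: `TinyNet` and the file name `tinynet_scripted.pt` below are hypothetical placeholders for the custom VGG16 model and its export path. The example scripts the module, saves it, reloads it, and checks that dropout follows the train/eval mode of the loaded module.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the custom VGG16 model."""

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(8, 64)
        self.drop = nn.Dropout(p=0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.drop(self.fc(x))


model = TinyNet()

# Scripting compiles the module's code, so no example input is needed.
scripted = torch.jit.script(model)

# Hypothetical output path; a temporary directory keeps the sketch tidy.
path = os.path.join(tempfile.mkdtemp(), "tinynet_scripted.pt")
scripted.save(path)

loaded = torch.jit.load(path)
x = torch.ones(4, 8)

loaded.eval()  # dropout off: repeated calls give the same output
eval_is_deterministic = bool(torch.allclose(loaded(x), loaded(x)))

loaded.train()  # dropout on: some activations are zeroed
train_drops_units = bool((loaded(x) == 0).any())

print("eval deterministic:", eval_is_deterministic)
print("train applies dropout:", train_drops_units)
```

The saved file can then be loaded on the C++ side with torch::jit::load, where the module's train() and eval() methods should switch dropout in the same way.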

Thanks @ptrblck, it's working!!