Is torch.jit.trace only for inference, not training?

BongHoe_Koo · March 31, 2022, 1:45am

Hi PyTorch Team,

I am currently struggling to figure out what the problem is.
I built a custom VGG16 model in Python and exported it with torch.jit.trace (without training it first).
I then tried to train this model in C++, but the accuracy is not the same as in the Python environment.

Why?

※ I also found that only this specific model fails to train in C++. Another custom model works well, and only the custom VGG16 model's accuracy looks wrong. Of course, I used the exact same training loop in C++.

Python result :
Epoch: 0001 cost = 0.988920689 acc = 0.840617180
Epoch: 0002 cost = 0.837770224 acc = 0.981513083

Process finished with exit code -1

C++ result :

gpu enabled
current epoch = 0
average accuracy : 0.264888
average cost : 1.38596
current epoch = 1
average accuracy : 0.268539
average cost : 1.38558
current epoch = 2
average accuracy : 0.267135
average cost : 1.38528
current epoch = 3

Vgg16 Model (Model Structure)

github.com

gellston/DeepLearningStudy/blob/main/torch/model/VGG16FC.py

import torch

class VGG16FC(torch.nn.Module):

    def __init__(self, class_num=5):
        super(VGG16FC, self).__init__()
        self.drop_rate = 0.3
        self.class_num = class_num

        self.layer1 = torch.nn.Sequential(torch.nn.Conv2d(3, 64, kernel_size=3, stride=1, padding='same'),
                                          torch.nn.BatchNorm2d(64),
                                          torch.nn.ReLU(),
                                          torch.nn.Conv2d(64, 64, kernel_size=3, stride=1, padding='same'),
                                          torch.nn.BatchNorm2d(64),
                                          torch.nn.ReLU(),
                                          torch.nn.MaxPool2d(kernel_size=2, stride=2))

        self.layer2 = torch.nn.Sequential(torch.nn.Conv2d(64, 128, kernel_size=3, stride=1, padding='same'),
                                          torch.nn.BatchNorm2d(128),
                                          torch.nn.ReLU(),

(This file has been truncated.)

Training script (includes saving the model)

github.com

gellston/DeepLearningStudy/blob/main/torch/torch_fiat_classification_food.py

import torch
import torch.nn as nn
import random
import cv2
import numpy as np

from torchsummary import summary
from torch.utils.data import DataLoader

from model.VGG16FC import VGG16FC
from util.FIATClassificationDataset import FIATClassificationDataset


USE_CUDA = torch.cuda.is_available() # True if a GPU is available, otherwise False
device = torch.device("cuda" if USE_CUDA else "cpu") # Use the GPU if available, otherwise the CPU
print("Training on device:", device)


# for reproducibility
random.seed(777)

(This file has been truncated.)

The C++ repository is private right now, so if you would like to see the C++ code as well,
I can share the source privately.

Thanks.

ptrblck · March 31, 2022, 5:41am

torch.jit.trace will use the provided input to record all functions as they were executed and is thus unable to use any data-dependent control flow etc.
In your particular case all dropout layers would be “fixed” which is most likely one of the reasons the training fails using the traced model. torch.jit.script on the other hand should be able to track these operations as well.
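The data-dependent control flow limitation can be reproduced with a small sketch. The Gate module below is a hypothetical toy, not part of the VGG16FC model from this thread; it only illustrates that tracing records the branch taken for the example input, while scripting keeps the `if`.

```python
import warnings

import torch


class Gate(torch.nn.Module):
    """Hypothetical toy module whose output depends on the input values."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:
            return x * 2
        return x + 10


module = Gate()
positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing a Python `if` on a tensor value emits a TracerWarning.
    warnings.simplefilter("ignore")
    # Only the `x * 2` branch is recorded, because `positive` takes that path.
    traced = torch.jit.trace(module, positive)

# Scripting compiles the source, so both branches are preserved.
scripted = torch.jit.script(module)

eager_out = module(negative)
traced_out = traced(negative)      # replays `x * 2` even though the sum is negative
scripted_out = scripted(negative)  # takes the `x + 10` branch, like eager mode

print("eager   :", eager_out)
print("traced  :", traced_out)
print("scripted:", scripted_out)
```

For the negative input, the traced module disagrees with eager mode while the scripted one matches it.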

BongHoe_Koo · March 31, 2022, 6:30am

Thanks for the helpful comments.

I just figured something out: when I trained for at least one epoch before saving the model, the training failure disappeared.

My guess is that, because of the one epoch of training,
torch.jit.trace was able to trace the training process.
Is this guess correct?

Also, is training for one epoch before saving the model likely to cause problems in the future?
If there are no potential issues, I have no problem using this method for now.

Thanks.

ptrblck · March 31, 2022, 6:35am

No, I don’t think training the model for one epoch should change anything and cannot explain why it seems to work now. trace would still keep the layers in their traced “state”, i.e. it would still keep the same dropout mask etc., which would still be concerning.
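One way the traced "state" shows up, as described in the torch.jit.trace documentation, is that operations with different train/eval behavior keep the mode that was active at trace time. Below is a minimal sketch of that, assuming a hypothetical TinyNet with a single dropout layer (not the VGG16FC model from this thread). Whether this is what happened in the original C++ run is not confirmed here.

```python
import torch


class TinyNet(torch.nn.Module):
    """Hypothetical stand-in model containing only a dropout layer."""

    def __init__(self) -> None:
        super().__init__()
        self.drop = torch.nn.Dropout(p=0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.drop(x)


torch.manual_seed(0)
x = torch.ones(1000)

# Assumption for this sketch: the model is in eval mode when it is exported.
model = TinyNet().eval()
traced = torch.jit.trace(model, x)
scripted = torch.jit.script(model)

# Switch both exported modules to training mode afterwards.
traced.train()
scripted.train()

# Count how many elements dropout zeroed out in each case.
traced_zeros = int((traced(x) == 0).sum())
scripted_zeros = int((scripted(x) == 0).sum())

print("zeros from traced module  :", traced_zeros)
print("zeros from scripted module:", scripted_zeros)
```

In this sketch the traced module keeps behaving as it did at trace time (dropout inactive), while the scripted module reacts to train().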

BongHoe_Koo · March 31, 2022, 6:40am

Hmm, if you’re right, I won’t be using dropout in the future. However, apart from the dropout issue, the model saved after training for 1 epoch seems to train smoothly in C++. I will post related screenshots.

ptrblck · March 31, 2022, 6:44am

The fix would be to use torch.jit.script instead of torch.jit.trace, so I wouldn’t disable dropout in the future.

BongHoe_Koo · March 31, 2022, 6:46am

I think I misunderstood what you said earlier. I’m sorry.
Is there a way to save the model using torch.jit.script?

ptrblck · March 31, 2022, 7:14am

Yes, just replace torch.jit.trace with torch.jit.script in your Python script and save the model afterwards. This section of the tutorial might be interesting to take a look at.
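A minimal sketch of that replacement, assuming a small hypothetical model (SmallClassifier stands in for VGG16FC, and the file name is made up for illustration):

```python
import os
import tempfile

import torch


class SmallClassifier(torch.nn.Module):
    """Hypothetical stand-in for the VGG16FC model from this thread."""

    def __init__(self, class_num: int = 5) -> None:
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
            torch.nn.BatchNorm2d(8),
            torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1),
        )
        self.drop = torch.nn.Dropout(p=0.3)
        self.fc = torch.nn.Linear(8, class_num)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(1)
        return self.fc(self.drop(x))


model = SmallClassifier()
model.train()

# torch.jit.script compiles the module source; no example input is needed.
scripted = torch.jit.script(model)

# Hypothetical output path; any writable location works.
path = os.path.join(tempfile.mkdtemp(), "small_classifier_script.pt")
scripted.save(path)

# Sanity check: reload the file and run a forward pass on random data.
reloaded = torch.jit.load(path)
logits = reloaded(torch.randn(4, 3, 32, 32))
print(logits.shape)
```

The saved file can then be loaded on the C++ side with torch::jit::load, where the module can be switched between training and evaluation mode before running the training loop.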

BongHoe_Koo · March 31, 2022, 7:52am

Thanks @ptrblck, it's working!!