Train vs eval mode

I have a basic question.

I’m trying to learn a visuomotor policy for a robot application, where the robot takes a series of images as input and the policy network outputs motor velocities to perform a target task.

I imported the pretrained ResNet18 model from torchvision for feature extraction,
and its features are fed into the fully connected layers that follow it.

During training, the imported ResNet18 runs in .train() mode, and during testing it runs in .eval() mode.

What I want to know is: if I use ResNet18 for feature extraction only (no fine-tuning), should I set it to .eval() mode during training as well? Or is my current approach correct?
(I know that the two modes differ in how batch norm and dropout behave.)

Or is it enough to keep .train() mode and freeze all the layers?
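
The two options I am comparing could be checked with something like the sketch below. It uses a tiny stand-in module with one BatchNorm layer instead of the real ResNet18, and `freeze` is a helper name I made up; by "freezing" I mean setting `requires_grad = False` on every parameter. The idea is to see whether the BatchNorm running statistics still change in each mode after freezing.

```python
import torch
import torch.nn as nn


def freeze(module: nn.Module) -> None:
    """Hypothetical helper: stop gradients for every parameter of the module."""
    for param in module.parameters():
        param.requires_grad = False


# Small stand-in for the backbone (assumption: the real one is ResNet18).
backbone = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8), nn.ReLU())
freeze(backbone)
bn = backbone[1]
x = torch.randn(4, 3, 16, 16)

# Option A: frozen parameters, module left in .train() mode.
backbone.train()
before = bn.running_mean.clone()
with torch.no_grad():
    backbone(x)
changed_in_train = not torch.equal(before, bn.running_mean)

# Option B: frozen parameters, module switched to .eval() mode.
backbone.eval()
before = bn.running_mean.clone()
with torch.no_grad():
    backbone(x)
changed_in_eval = not torch.equal(before, bn.running_mean)

print("running stats changed in train mode:", changed_in_train)
print("running stats changed in eval mode:", changed_in_eval)
```

A check like this separates two things that I may be mixing up: whether the parameters receive gradients, and whether the BatchNorm running statistics are updated.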

There are a lot of topics about train vs eval mode, but the details are still a little confusing to me…