The ‘aux’ layer is used only for training. At inference time, you get only the output of the final layer.

Please correct me if I’m wrong: there’s no need to do the mandatory normalization (“The images have to be loaded in to a range of [0, 1] and then normalized using mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]”), since it’s already included in the model (as long as transform_input is set to True).

For those who are still stuck on this issue (from here and then):

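if isinstance(outputs, tuple):
    # Training mode: the model returns a tuple (main output, aux output); sum both losses
    loss = sum(criterion(o, labels) for o in outputs)
else:
    # Eval mode: the model returns a single output
    loss = criterion(outputs, labels)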
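
A variant of the snippet above that some training scripts use is to give the auxiliary loss a smaller weight than the main loss. The sketch below assumes the first element of the tuple is the main output; the helper name `combined_loss`, the `aux_weight` parameter and its default of 0.4 are illustrative choices, not something stated in this thread. Plain floats and a toy criterion stand in for tensors so the snippet runs without any dependencies.

```python
def combined_loss(outputs, labels, criterion, aux_weight=0.4):
    """Return the loss for a model that may also emit auxiliary outputs.

    outputs is either a single prediction (eval mode) or a tuple whose
    first element is the main prediction and whose remaining elements
    are auxiliary predictions (training mode).
    """
    if isinstance(outputs, tuple):
        main, *aux = outputs
        # Main head at full weight, auxiliary heads down-weighted.
        aux_loss = sum(criterion(a, labels) for a in aux)
        return criterion(main, labels) + aux_weight * aux_loss
    return criterion(outputs, labels)


# Hypothetical criterion on plain floats, only for demonstration.
def abs_error(pred, target):
    return abs(pred - target)


train_loss = combined_loss((2.0, 3.0), 1.0, abs_error)  # tuple: main + aux
eval_loss = combined_loss(2.0, 1.0, abs_error)          # single output
```

With a real model you would pass the model output and something like `torch.nn.CrossEntropyLoss()` as `criterion`; setting `aux_weight=1.0` reproduces the unweighted sum.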

If this is true, then the master documentation needs to be changed. It states: “All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]”.
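
One way to check this is to work through the arithmetic for a single pixel. The sketch below assumes that transform_input=True applies the per-channel map `x * (std / 0.5) + (mean - 0.5) / 0.5`, which is my reading of the torchvision Inception v3 source; please verify it against your installed version. The function names are made up for illustration.

```python
# ImageNet statistics quoted in the documentation.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def imagenet_normalize(pixel, channel):
    """Normalize a pixel value in [0, 1] with the ImageNet mean and std."""
    return (pixel - MEAN[channel]) / STD[channel]


def transform_input(value, channel):
    """Assumed per-channel map applied by the model when transform_input=True."""
    return value * (STD[channel] / 0.5) + (MEAN[channel] - 0.5) / 0.5


pixel = 0.8  # arbitrary raw intensity in [0, 1]
for c in range(3):
    rescaled = transform_input(imagenet_normalize(pixel, c), c)
    # Composing the two steps gives (pixel - 0.5) / 0.5 for every channel.
    assert abs(rescaled - (pixel - 0.5) / 0.5) < 1e-9
```

If that reading is right, transform_input=True does not replace the normalization: it expects input that is already ImageNet-normalized and rescales it to the [-1, 1] range, so the step described in the documentation would still be needed.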

