Faster R-CNN transformed input

Torchvision’s model uses ResNet50+FPN as a feature extractor.

I usually transform images by first converting them to a tensor, normalizing, and then multiplying by 255 again:

    t_ = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize(img_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.407, 0.457, 0.485],
                             std=[1, 1, 1])])

    img = 255 * t_(img)

When I did that with an image input to Faster R-CNN, the result was None, but when I removed the multiplication, it seemed to work fine. Is that a ResNet thing?
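For reference, the arithmetic of the transform above can be checked by hand. The sketch below uses NumPy in place of the torchvision transforms (an assumption made only to keep it self-contained) and a made-up 2x2 image. It suggests that scaling to [0, 1], subtracting per-channel means with std 1, and then multiplying by 255 is the same as subtracting 255*mean from the raw 0-255 pixels, which is the Caffe-style mean subtraction.

```python
import numpy as np

# Hypothetical 2x2 image with 3 channels, uint8 in [0, 255], layout H x W x C.
# (ToTensor would also move channels first; the layout does not affect the arithmetic.)
raw = np.array([[[0, 128, 255], [10, 20, 30]],
                [[200, 100, 50], [255, 255, 255]]], dtype=np.uint8)

mean = np.array([0.407, 0.457, 0.485])  # per-channel means from the post above

# What ToTensor + Normalize(mean, std=[1, 1, 1]) + "* 255" computes per pixel.
scaled = raw.astype(np.float64) / 255.0    # ToTensor: [0, 255] -> [0, 1]
via_transform = 255.0 * (scaled - mean)    # Normalize with std 1, then * 255

# Plain mean subtraction on the original 0-255 scale.
via_mean_subtraction = raw.astype(np.float64) - 255.0 * mean

same = np.allclose(via_transform, via_mean_subtraction)
print(same)
print(via_transform.min(), via_transform.max())
```

So this pipeline feeds the network values roughly between -124 and 152 rather than values near zero, which only makes sense for a model trained on that scale.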

Could you post a code snippet that reproduces this issue?

This:

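    import matplotlib.pyplot as plt
    import torch
    from matplotlib.patches import Rectangle
    from PIL import Image as PILImage
    from torchvision import transforms

    # frcnn_model is assumed to be the pretrained torchvision Faster R-CNN in eval mode;
    # im is the path to the image file.
    threshold = 0.75
    im = PILImage.open(im)
    img = transforms.ToTensor()(im)  # float tensor in [0, 1], shape (C, H, W)
    print(img.size())
    with torch.no_grad():
        out = frcnn_model([img])
    print(out)
    scores = out[0]['scores']
    bboxes = out[0]['boxes']
    classes = out[0]['labels']
    # boolean mask over the detections that pass the score threshold
    keep = scores > threshold
    best_bboxes = bboxes[keep]
    best_classes = classes[keep]
    if len(best_bboxes) > 0:
        plt.imshow(im)
        ax = plt.gca()
        for b in best_bboxes.tolist():
            rect = Rectangle((b[0], b[1]), b[2] - b[0], b[3] - b[1], linewidth=2, edgecolor='r', facecolor='none')
            ax.add_patch(rect)
        plt.show()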
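The thresholding step can also be sketched in isolation with a boolean mask. Note that np.where(cond) returns a tuple of index arrays, so its len reflects the number of dimensions, not the number of detections that passed. The scores, boxes and labels below are made up and stand in for the model output.

```python
import numpy as np

# Made-up detections standing in for out[0]['scores'], ['boxes'] and ['labels'].
scores = np.array([0.95, 0.80, 0.40, 0.10])
boxes = np.array([[10, 10, 50, 60],
                  [30, 40, 90, 120],
                  [0, 0, 20, 20],
                  [5, 5, 15, 15]], dtype=np.float32)
labels = np.array([17, 18, 1, 3])

threshold = 0.75
keep = scores > threshold      # boolean mask, one entry per detection

best_boxes = boxes[keep]
best_labels = labels[keep]     # index the labels, not the boxes
num_kept = int(keep.sum())

# np.where on a 1-D array returns a 1-tuple, so its length is 1 regardless of hits.
idx_tuple = np.where(scores > 2.0)  # nothing passes this threshold
print(num_kept, len(idx_tuple), idx_tuple[0].size)
```

Checking the number of kept boxes (or idx_tuple[0].size) is what tells you whether anything passed the threshold.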

Also, it seems that Faster R-CNN requires an RGB input, not BGR, at least judging from the normalization step:


        if image_mean is None:
            image_mean = [0.485, 0.456, 0.406]
        if image_std is None:
            image_std = [0.229, 0.224, 0.225]
        transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)

Your code snippet doesn’t seem to use either the defined transformation or the multiplication after it.

The pretrained torchvision models should use RGB inputs by default.
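If an image was loaded with OpenCV, which returns BGR, the channel order can be flipped before it goes to the model. A minimal NumPy sketch with made-up pixels follows; cv2.cvtColor(img, cv2.COLOR_BGR2RGB) is the usual OpenCV way to do the same thing.

```python
import numpy as np

# Made-up 1x2 image in BGR order (H x W x C), as cv2.imread would return it.
bgr = np.array([[[255, 0, 0], [10, 20, 30]]], dtype=np.uint8)  # first pixel: pure blue

# Reverse the last axis to get RGB; copy() makes the result contiguous,
# which some array-to-tensor conversions require.
rgb = bgr[..., ::-1].copy()

print(rgb[0, 0].tolist(), rgb[0, 1].tolist())
```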

I get empty output if I use the multiplication by 255 in the transform method.

In the tutorials I’ve found, the input image pixels are between 0 and 1, as returned by the ToTensor() transform. So what should they be?

We would still need a code snippet to further debug the issue.

I would recommend sticking to the tutorial and using the training transformation. Note that the normalization is done inside the model (see the GeneralizedRCNNTransform lines quoted above).
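To see why an extra factor of 255 hurts here, the per-channel normalization can be reproduced by hand with the image_mean and image_std defaults quoted earlier. This is a NumPy sketch of the arithmetic only, with a made-up pixel, not the model's actual code path.

```python
import numpy as np

image_mean = np.array([0.485, 0.456, 0.406])
image_std = np.array([0.229, 0.224, 0.225])

def normalize(pixels):
    """Per-channel (x - mean) / std, the arithmetic of the normalization step."""
    return (pixels - image_mean) / image_std

# One made-up RGB pixel, first in [0, 1] (as ToTensor returns it), then scaled by 255.
pixel_01 = np.array([0.5, 0.5, 0.5])
pixel_255 = 255.0 * pixel_01

ok = normalize(pixel_01)     # small values, within a few units of zero
bad = normalize(pixel_255)   # hundreds of units away from zero

print(ok.round(3), bad.round(1))
```

Inputs that far outside the range seen during training are a plausible explanation for the empty detections.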

I was referring to the same lines of code as you.
Obviously the model takes inputs between 0 and 1.
That’s in contrast to the VGG16 model, which takes inputs between 0 and 255.

VGG16 takes a normalized input, like all torchvision classification models:

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. You can use the following transform to normalize: transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

OK, it’s just for the back-transform then.

Thanks!

OK, I’m at a loss. I tried FCN8s: without multiplying by 255 I get all 0s, but with 255 I get a good result:

    import cv2
    import matplotlib.pyplot as plt
    import numpy as np
    import torch
    from PIL import Image as PILImage
    from torchvision import transforms

    # fcn_models, pascal_object_categories and device are defined elsewhere
    fcn8s = fcn_models.FCN8s(n_class=len(pascal_object_categories))
    # load_state_dict expects a state dict, not a file path
    fcn8s.load_state_dict(torch.load("fcn8s_from_caffe.pth"))
    fcn8s.to(device).eval()
    print(fcn8s)

    # evaluate the pretrained FCN8s model on one image
    def deploy_fcn_model(im_path):
        img = np.array(PILImage.open(im_path))
        # these mean values are for RGB
        t_ = transforms.Compose([
            transforms.ToPILImage(),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.457, 0.407],
                                 std=[1, 1, 1])])

        # multiply by 255 for the network input
        img = 255 * t_(img)
        img = img.unsqueeze(0).to(device)
        # get the output from the model
        with torch.no_grad():
            output = fcn8s(img)
        # per-pixel class ids as a NumPy array of shape (H, W)
        out = output.argmax(1).squeeze(0).cpu().numpy()
        plt.imshow(out)
        plt.show()
        # load the image again with OpenCV and convert BGR -> RGB for plotting
        rgb_img = cv2.cvtColor(cv2.imread(im_path), cv2.COLOR_BGR2RGB)
        # convert FCN8s pixelwise predictions to a color array
        color_array = np.zeros([out.shape[0], out.shape[1], 3], dtype=np.uint8)
        for class_id in np.unique(out):
            print(class_id)
            if class_id == 8:
                color_array[out == class_id] = [255, 0, 0]
            elif class_id == 12:
                color_array[out == class_id] = [0, 255, 0]

        # overlay images
        added_image = cv2.addWeighted(rgb_img, 0.5, color_array, 0.6, 0)
        # plot
        plt.imshow(added_image)
        plt.show()


    deploy_fcn_model("dogcat1.jpg")

Why is this happening?

Could you post the model definition or the repository?
Maybe this model was trained with unnormalized values?

For FCN I used the weights from https://github.com/wkentaro/pytorch-fcn
Is squeezing inputs between 0 and 1 specific to PyTorch?

No, it’s common to normalize the inputs for a lot of machine learning models, as this can accelerate and stabilize training.
Some methods, e.g. random forest classifiers, are not sensitive to the input range, while others, e.g. neural networks, are.

You would have to check the dataset creation (or just get a single sample) and check the range of the inputs the model was trained on.
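A minimal sketch of such a range check, assuming a training sample can be pulled out as an array. describe_range is a hypothetical helper, the samples are made up, and the thresholds are rough heuristics rather than a rule.

```python
import numpy as np

def describe_range(sample):
    """Guess the input convention from a sample's min and max (rough heuristic)."""
    lo, hi = float(sample.min()), float(sample.max())
    if lo >= 0.0 and hi <= 1.0:
        guess = "scaled to [0, 1]"
    elif lo >= 0.0 and hi <= 255.0:
        guess = "raw [0, 255]"
    elif lo < 0.0 and hi > 10.0:
        guess = "mean-subtracted on the 0-255 scale"
    else:
        guess = "standardized (mean/std normalized)"
    return lo, hi, guess

# Made-up samples standing in for one item of the training dataset (C x H x W).
unit = np.linspace(0.0, 1.0, 48).reshape(3, 4, 4)
caffe_like = np.linspace(0.0, 255.0, 48).reshape(3, 4, 4) - 110.0

print(describe_range(unit)[2])
print(describe_range(caffe_like)[2])
```

Running the same check on the tensor you feed at inference time shows quickly whether the two ranges match.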