Combining image and scalar value as inputs

I have been training a resnet34 convolutional network using fastai which classifies images of rooms of a house as 'kitchen', 'bathroom', 'bedroom', etc. I am trying to predict the real estate selling price of the property from the photos. I have a separate NN which does reasonably well using other advertisement data, but I want to see how the photos can add to this.

Hence, I can provide the initial price estimate as a scalar value in parallel with the photo of the room, using my previously trained convolutional NN as a starting point (transfer learning).

I'm new to PyTorch and trying to get a better grasp of it as I go deeper than what fastai provides. How would I define my nn.Module class to receive an image and a scalar value as inputs, then forward the scalar value past the convolutional resnet34 so it can be combined with its top layer in one or two final layers before producing an improved final price prediction as a scalar output?

Great question - covariates can indeed be useful, but they are also tricky to implement correctly. I assume you have an nn.Module class that includes resnet34 as one of its members along with a few linear layers. Something like:

self.resnet = resnet34()
self.regression = nn.Linear(1000, 1)

to predict (I'm guessing) price (the output of self.regression).

It's actually quite easy in principle to do what you want, but the implementation may take some tuning. Generally, you need to pass both the input image (as a 4D tensor) and the price (as a 2D tensor) to the forward method of your module class. Resnet outputs 1000 linear features, so after this step you can concatenate the price as an extra feature to that output.
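To make the concatenation step concrete, here's a minimal shape check. The tensors are random placeholders standing in for the resnet output and a per-sample price column:

```python
import torch

# Assumed shapes: resnet features (N, 1000), price (N, 1)
features = torch.randn(4, 1000)   # batch of 4 image feature vectors
price = torch.randn(4, 1)         # one scalar price per sample

# Concatenate along the feature dimension (dim=1)
combined = torch.cat([features, price], dim=1)
print(combined.shape)  # torch.Size([4, 1001])
```

Note that the price tensor must be 2D with shape (N, 1); a 1D tensor of shape (N,) would fail to concatenate along dim=1.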

There are some considerations, though. For example, it will probably be best to reduce the dimensionality a bit before adding the price into the final layers. Additionally, you need to think about normalization of the price feature: is it in [-1, 1] or [0, 1]? Or perhaps something more involved, such as a one-hot encoded price bucket, or a few bits; it depends on your use case. That said, since you already have a NN output, you can try just passing it straight through first and see what happens.
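As a quick sketch of two common normalization choices (the price values here are made up, and in practice the statistics should be computed on the training set only):

```python
import torch

# Hypothetical raw prices in dollars, shape (N, 1)
prices = torch.tensor([[250_000.], [480_000.], [1_200_000.]])

# Min-max scaling to [0, 1]
p_min, p_max = prices.min(), prices.max()
minmax = (prices - p_min) / (p_max - p_min)

# Or standardization: zero mean, unit variance
standardized = (prices - prices.mean()) / prices.std()
```

Either keeps the price feature on a scale comparable to the (ReLU-activated) image features, so it isn't drowned out or dominant in the concatenated vector.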

Here's a general scheme:

# Your resnet module
self.resnet = resnet34()

# Reduce dimension from resnet output 
self.reduce = nn.Linear(1000,100)

# Some final regression layer
self.regression = nn.Linear(101,1)

def forward(self, input_image, price):
    # Output from resnet has shape (N, 1000)
    out = self.resnet(input_image)

    # Reduce dimension, now has shape (N, 100)
    out = F.relu(self.reduce(out))
    
    # Concat price along feature dimension, shape (N, 100+1)
    out = torch.cat([out, price], dim=1)
    
    # Some final regression 
    out = self.regression(out)
    return out