A single float as second input to an image classifier

partysaurus · June 13, 2020, 10:26am

I would like to train an image classifier that takes a positive floating point number (1-dimensional data of size 1) as a second input during training and inference. I am not an expert on deep learning, but from the literature I gather that what I want to do is probably called “early fusion”, i.e. combining two different types of data (or modalities) when extracting features from an image.

The purpose of supplying the float is to help improve classification accuracy; the number on its own is not enough to make a prediction. A similar example would perhaps be providing the gender of a patient whose medical image is used to train a tumor classifier. The only difference is that gender takes only a few discrete values, whereas my float is continuous.

Examples I’ve seen in this forum appear to be more complicated than mine. For example, the second input might contain enough information to extract features from, or even to train a second classifier on (such as an MLP) and then merge the predictions [1, 2, 3, 4]. Therefore, the only solution I could think of is to concatenate the float y to the flattened image features, like so:

# subclassed torchvision.models resnet50
def forward(self, x, y):
    ...
    x = self.avgpool(x)
    x = torch.flatten(x, 1)
    x = torch.cat((x, y), dim=1)
    x = self.fc(x)  # nn.Linear(in_features + 1, num_classes)
    return x

However, doing this has actually caused a drop in classification accuracy after training, when compared with a resnet50 model that accepts only image input. My intuition also tells me that the 1D float input should somehow be normalized.

I would be grateful for suggestions on how to approach this problem when writing the forward method. I have already written my own dataset with a custom __getitem__ method.

cskarthik7 · June 13, 2020, 10:47am

import cv2
from torch.utils.data import Dataset

class Data(Dataset):
    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels  # The floating point values for the images you want to return.
        # Encode the labels and then keep the training images and labels together.
        # E.g. images[0] = path of image 0 and labels[0] = label for that particular image
        self.transform = transform

    def __getitem__(self, idx):
        image_name = self.images[idx]
        image = cv2.imread(image_name)
        # OpenCV loads images as BGR; convert to RGB
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        if self.transform:
            image = self.transform(image)
        return image, self.labels[idx]

    def __len__(self):
        return len(self.images)

So instead of passing the floating point values through the network architecture, you can simply return them from the Dataset, and hence from the DataLoader, itself.

partysaurus · June 13, 2020, 11:12am

Thanks for your reply. I think I’ve failed to communicate that the extra float is not the label. The image dataset is split into a few distinct categories and each image will have a positive float associated with it. So my dataset is something like this:

import torch
from torchvision.datasets import ImageFolder

class CustomImageDataset(ImageFolder):
    def __getitem__(self, index):
        image, labels = super().__getitem__(index)
        # logic here to obtain the float (extra_float) from the csv file
        extra_float = torch.tensor([extra_float], dtype=torch.float32)  # shape (1,)
        return image, extra_float, labels

The question is how to correctly use extra_float in the forward method during training so that it improves classification accuracy of objects split into a few categories.
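To make the data flow concrete, here is a minimal sketch of a training loop that consumes (image, extra_float, label) triples like the ones returned above. The default DataLoader collation stacks the per-sample shape-(1,) tensors into a batch of shape (N, 1), which is what `torch.cat(..., dim=1)` in the forward method expects. `TinyTwoInputNet`, `train_one_epoch`, and the random stand-in data are hypothetical and exist only to keep the example self-contained; the real model would be the modified resnet50.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class TinyTwoInputNet(nn.Module):
    """Toy stand-in for the real model: forward takes (image, extra_float)."""

    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8 + 1, num_classes)

    def forward(self, x, y):
        x = self.features(x)  # (N, 8)
        return self.fc(torch.cat((x, y), dim=1))


def train_one_epoch(model, loader, optimizer, criterion):
    """Run one pass over the loader and return the mean loss per sample."""
    model.train()
    total = 0.0
    for image, extra_float, label in loader:
        optimizer.zero_grad()
        logits = model(image, extra_float)  # both inputs go to forward
        loss = criterion(logits, label)
        loss.backward()
        optimizer.step()
        total += loss.item() * image.size(0)
    return total / len(loader.dataset)


# Random stand-in data with the same (image, extra_float, label) layout
torch.manual_seed(0)
dataset = TensorDataset(
    torch.randn(8, 3, 32, 32),    # images
    torch.rand(8, 1),             # one float per image
    torch.randint(0, 3, (8,)),    # class labels
)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
model = TinyTwoInputNet(num_classes=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epoch_loss = train_one_epoch(model, loader, optimizer, nn.CrossEntropyLoss())
```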

cskarthik7 · June 13, 2020, 11:18am

Okay, sorry for misunderstanding your question. Why do you want to integrate this float value into the image classification model? I doubt it will help. Concatenating a float value at the last layer may degrade the features that were extracted from the image.
Early fusion will be useful only if you concatenate the image with a texture or some other feature.

partysaurus · June 13, 2020, 11:38am

The extra float is the magnification level of an image obtained using a microscope. Objects in each category can have different sizes in pixels, depending on the magnification level. At the moment, small objects are sometimes incorrectly identified in images taken at a low magnification level, although in reality they would be too small to be seen there.

Knowing the magnification level definitely helps when a human is performing the same task.
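Since apparent object size in pixels scales with magnification, one plausible way to normalize this float before feeding it to the network is to take its logarithm and standardize it with statistics from the training set. This is a sketch under that assumption, not a verified fix; the function names and the example magnification values (4, 10, 40, 100) are hypothetical.

```python
import math
import statistics


def fit_log_scaler(train_values):
    """Return (mean, std) of log(value), computed over the training set only."""
    logs = [math.log(v) for v in train_values]
    mean = statistics.fmean(logs)
    std = statistics.pstdev(logs)
    if std == 0:
        std = 1.0  # all values identical: avoid division by zero
    return mean, std


def normalize_magnification(value, mean, std):
    """Map a positive magnification to a roughly zero-mean, unit-variance number."""
    return (math.log(value) - mean) / std


# Hypothetical magnification levels seen in the training set
train_magnifications = [4.0, 10.0, 40.0, 100.0]
mean, std = fit_log_scaler(train_magnifications)
scaled = [normalize_magnification(m, mean, std) for m in train_magnifications]
```

The scaled value would then be wrapped in a tensor inside `__getitem__`, using the same `mean` and `std` at inference time as during training.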