Theoretical: Multimodal training with missing data

Hello everyone:

I have a medical dataset with three different modalities:

  • Images
  • Tabular data
  • Genetic data (needs to be handled separately from tabular)

As much as I would like to change this, I have a lot of missing data. The percentage of patients with all three modalities is very small; for example, I only have images for about 10% of the patients.

In a typical multimodal approach, I would have 3 different “heads” of the network, one ingesting each modality, and somewhere in the middle I would concatenate their outputs (or apply some other fusion strategy) and send the result through a series of FC layers up to the output layer.
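For reference, here is a minimal sketch (PyTorch) of the kind of architecture I mean. All layer sizes and modality dimensions are placeholders, and the image head is reduced to a linear layer over pre-extracted features just to keep the example short:

```python
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    def __init__(self, img_dim=512, tab_dim=32, gen_dim=1000,
                 hidden=128, n_classes=2):
        super().__init__()
        # One "head" (encoder) per modality
        self.img_head = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.tab_head = nn.Sequential(nn.Linear(tab_dim, hidden), nn.ReLU())
        self.gen_head = nn.Sequential(nn.Linear(gen_dim, hidden), nn.ReLU())
        # Shared part: concatenate the three embeddings, then FC layers
        self.shared = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img, tab, gen):
        z = torch.cat([self.img_head(img),
                       self.tab_head(tab),
                       self.gen_head(gen)], dim=1)
        return self.shared(z)
```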

If I had no missing data, I would just do a forward pass, compute the error, and backpropagate it to update all of the model’s weights. With so many missing values, though, if I fed in a placeholder, for example an all-black image whenever the image modality is missing, I would be biasing my model.

I came up with the idea of having a different optimizer for each “head” of the network, plus maybe another one that updates the weights of the shared architecture. I would switch the heads on/off depending on which modalities are available, and after the forward pass I would call optimizer_n.step() only for the modalities that were actually present, thus updating only the weights of the neurons that were used.
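Concretely, what I have in mind looks roughly like the snippet below. It assumes the hypothetical MultimodalNet from the earlier sketch, and that the dataloader always supplies placeholder tensors for absent modalities together with a `present` dict of flags; all names are illustrative, not a finished recipe:

```python
import torch
import torch.nn as nn

model = MultimodalNet()
criterion = nn.CrossEntropyLoss()
optimizers = {
    'img':    torch.optim.Adam(model.img_head.parameters(), lr=1e-4),
    'tab':    torch.optim.Adam(model.tab_head.parameters(), lr=1e-4),
    'gen':    torch.optim.Adam(model.gen_head.parameters(), lr=1e-4),
    'shared': torch.optim.Adam(model.shared.parameters(), lr=1e-4),
}

def training_step(batch, present):
    # present: e.g. {'img': False, 'tab': True, 'gen': True}
    for opt in optimizers.values():
        opt.zero_grad()

    # Placeholder tensors for missing modalities keep the forward pass valid
    out = model(batch['img'], batch['tab'], batch['gen'])
    loss = criterion(out, batch['label'])
    loss.backward()

    # Step the shared trunk plus only the heads whose modality is present;
    # heads that only saw placeholder inputs get gradients but are not updated.
    optimizers['shared'].step()
    for name in ('img', 'tab', 'gen'):
        if present[name]:
            optimizers[name].step()
    return loss.item()
```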

Is this a reasonable strategy? I don’t have much experience working with multiple optimizers or multiple modalities, so any advice is greatly welcome.

Training with missing modalities is NOT a trivial problem; it is a cutting-edge research topic. You might want to check out the recent multimodal VAE or meta-learning papers (e.g. Wu & Goodman 2018; Ma et al. 2021; Joy et al. 2022) for possible directions.

Thank you very much for your answer. I easily found the papers by Wu & Goodman (2018) and Ma et al. (2021), but I am having trouble finding the last one. Would you mind sharing the paper title?

It’s titled “Learning Multimodal VAEs Through Mutual Supervision”. If you’re working with biomedical data, I do believe it will be a lot noisier than the image-text, image-label, or image-image data people usually benchmark with.